This change uses two 1 MiB buffers: while one buffer is being filled from the OS, the other is hashed with `update_rayon`. This is around twice as fast as plain `update_reader` for 1 GiB files on my machine (Ryzen 2600), and about half as fast as using mmap.
The code also handles small files: if a file is under 1 MiB, it falls back to `update_reader`. Because that cutoff overshoots the point where `update_rayon` actually becomes faster, the change is always at least performance-neutral; we never hit a case where it is slower.
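The double-buffering scheme described above could be sketched roughly as below. This is a simplified, self-contained illustration, not the PR's actual code: the `consume` closure stands in for `Hasher::update_rayon`, and the buffers circulate between a reader thread and the hashing thread over a pair of channels.

```rust
use std::io::{self, Read};
use std::sync::mpsc;

const BUF_SIZE: usize = 1 << 20; // 1 MiB per buffer

/// Double-buffered read loop: a reader thread fills one 1 MiB buffer from the
/// OS while the main thread consumes the other. Two buffers circulate between
/// the threads over a pair of bounded channels. In the PR, `consume` would be
/// `Hasher::update_rayon`; here it is a generic closure to keep the sketch
/// self-contained.
fn double_buffered<R: Read + Send>(
    mut input: R,
    mut consume: impl FnMut(&[u8]),
) -> io::Result<()> {
    std::thread::scope(|s| {
        let (full_tx, full_rx) = mpsc::sync_channel::<(Vec<u8>, usize)>(1);
        let (empty_tx, empty_rx) = mpsc::sync_channel::<Vec<u8>>(2);
        // Seed the cycle with the two buffers.
        empty_tx.send(vec![0u8; BUF_SIZE]).unwrap();
        empty_tx.send(vec![0u8; BUF_SIZE]).unwrap();

        let reader = s.spawn(move || -> io::Result<()> {
            while let Ok(mut buf) = empty_rx.recv() {
                // Fill the buffer as completely as possible; it only comes up
                // short at EOF. Retries on EINTR.
                let mut filled = 0;
                while filled < buf.len() {
                    match input.read(&mut buf[filled..]) {
                        Ok(0) => break, // EOF
                        Ok(n) => filled += n,
                        Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                        Err(e) => return Err(e),
                    }
                }
                if filled == 0 || full_tx.send((buf, filled)).is_err() {
                    break;
                }
            }
            Ok(())
        });

        // Consume full buffers while the reader refills the other one.
        while let Ok((buf, n)) = full_rx.recv() {
            consume(&buf[..n]);
            let _ = empty_tx.send(buf); // hand the buffer back for reuse
        }
        reader.join().unwrap()
    })
}
```

The bounded channels double as back-pressure: the reader can never get more than one buffer ahead of the hasher, so memory stays fixed at 2 MiB regardless of file size.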
Currently the code uses the `read_chunks` crate, which I wrote to handle EINTR and to fill the read buffer as completely as possible. If this is approved for merging, I would rather copy the function it calls into this project somewhere than add an extra dependency.
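The helper in question amounts to something like the following (a sketch of the behavior described, not the crate's exact code): loop until the buffer is full or EOF, and retry reads that fail with EINTR.

```rust
use std::io::{self, Read};

/// Fill `buf` as completely as possible from `reader`, retrying reads that
/// are interrupted by a signal (EINTR) and continuing after short reads.
/// Returns the number of bytes read; the count is less than `buf.len()`
/// only at end of file.
fn read_full(reader: &mut impl Read, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break, // EOF
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue, // EINTR
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}
```

This matters for the double-buffer scheme because each hashed chunk should be a full 1 MiB whenever possible; plain `read` can legally return short, which would otherwise shrink the work handed to `update_rayon`.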
Some crude benchmarks below, hashing a gibibyte of random data (b3sum 1.5.0 vs 03e0949d13cebe3c04e1c908d25cf1e22bc71623):
```
# this PR
[b3sum]$ time ./target/release/b3sum --no-mmap gigafile
303966b0ba3c0766247f911d8f7dd172cffa1952bf1106f801fcf7e1455ce5c0  gigafile

real    0m0.253s
user    0m1.234s
sys     0m0.501s

# unmodified binary
[b3sum]$ time b3sum --no-mmap gigafile
303966b0ba3c0766247f911d8f7dd172cffa1952bf1106f801fcf7e1455ce5c0  gigafile

real    0m0.570s
user    0m0.477s
sys     0m0.091s

# unmodified binary, with mmap enabled
[b3sum]$ time b3sum gigafile
303966b0ba3c0766247f911d8f7dd172cffa1952bf1106f801fcf7e1455ce5c0  gigafile

real    0m0.126s
user    0m1.067s
sys     0m0.103s
```