casey / intermodal

A command-line utility for BitTorrent torrent file creation, verification, and more
https://imdl.io
Creative Commons Zero v1.0 Universal

Improve hashing and verification performance #26

Open casey opened 4 years ago

casey commented 4 years ago

The speed of torrent creation and verification could use some work. I picked SHA1 and MD5 implementations basically at random, and there are lots of potential performance bottlenecks in how I hash and verify pieces.

Also, #440 reports performance problems when linking against musl on Linux. If allocation, I/O, or threading is slow with musl, then general performance improvements that reduce allocation and/or I/O (I don't do any threading) might also lessen the performance impact of linking against musl.

The first thing will be to set up repeatable benchmarking, so there's a stable baseline to compare against. After that, there are a bunch of things worth trying.

quietvoid commented 4 years ago

I've tried replacing the SHA1 loop and the performance improved a lot.

master 27,83s user 0,44s system 99% cpu 28,300 total

Info Hash 88ea2b80ae3dba07ee1c1fa89a10f395809e612a Piece Size 4 MiB Piece Count 1003

new loop 5,93s user 0,34s system 99% cpu 6,279 total

Info Hash 88ea2b80ae3dba07ee1c1fa89a10f395809e612a Piece Size 4 MiB Piece Count 1003

The I/O rate is messed up (goes to 953MB/s), but at least the hash is identical :smile:

Actually I removed some parts that were used in finish(). So this bit of code seems to make the tests pass: https://0x0.st/ipvh.rs

It works because reading into the buffer either fills it entirely to the piece length, or only partly (e.g. the last piece), and finish() cleans up the rest.
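Very roughly, the shape of that loop is something like the following (a minimal single-file sketch, not the actual ipvh.rs code; the RustCrypto sha1 crate is assumed as the SHA1 implementation):

```rust
// Chunked hashing sketch: fill a piece-sized buffer and hash it with one
// update() call per piece, instead of pushing one byte at a time.
use std::io::Read;

use sha1::{Digest, Sha1};

fn hash_pieces<R: Read>(mut reader: R, piece_len: usize) -> std::io::Result<Vec<[u8; 20]>> {
    let mut pieces = Vec::new();
    let mut buf = vec![0u8; piece_len];

    loop {
        // Fill the buffer completely unless EOF is hit first.
        let mut filled = 0;
        while filled < piece_len {
            let n = reader.read(&mut buf[filled..])?;
            if n == 0 {
                break;
            }
            filled += n;
        }
        if filled == 0 {
            break;
        }

        // One update per piece.
        let mut hasher = Sha1::new();
        hasher.update(&buf[..filled]);
        pieces.push(hasher.finalize().into());

        if filled < piece_len {
            break; // short buffer at EOF: this was the final, partial piece
        }
    }

    Ok(pieces)
}
```

(This only covers the single-file case; the multi-file wrinkle is what the following comments are about.)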

casey commented 4 years ago

Sweet, it's good to see that there's a lot of room for optimization.

I'm surprised that the tests pass! I suspect that adding some test cases where files don't line up with the pieces might show some issues. I think that a file read could potentially overshoot a piece boundary, in which case a piece wouldn't be added to the piece list. The current version checks for a piece boundary at every byte, so it can never miss a boundary and thus never fails to add a piece to the piece list, but that's obviously super slow. I'll need to make it figure out where the next piece boundary is in advance, so it never overshoots.

(0x0.st looks like a sweet file hosting service btw, I've never encountered it before.)

quietvoid commented 4 years ago

I don't think overshooting is possible: you either read the entire buffer length or whatever's left before EOF, which is necessarily less than the piece length. I did edit it late, but isn't what you're considering just the finish function?

It could however be relevant to compare against other tools like mktorrent.

casey commented 4 years ago

I might be missing something, but I think that there are multiple scenarios in which the code might overshoot.

One scenario is if a read does not read an entire piece, which I think is possible. (And would be expected at very large piece sizes.)

If a file is 8kib in length and the piece size is 4kib, we want two 4kib pieces:

  1. First read is 3kib, total != piece length (3kib != 4kib), no piece is hashed.
  2. Second read is 2kib, total != piece length (5kib != 4kib), no piece is hashed.
  3. Third read is 3kib, total != piece length (8kib != 4kib), no piece is hashed.
  4. The final piece is hashed during finalization, but it contains 8kib of data.

The second is if we hash multiple files that don't align to piece boundaries. So, for a 3kib file and a 5kib file with a 4kib piece size, we want two 4kib pieces: one covering the 3kib of the first file and 1kib of the second file, and one covering the last 4kib of the second file:

  1. First read gets the entire 3kib file, total != piece length (3kib != 4kib), no piece is hashed.
  2. Second read gets 4kib of the second file, total != piece length (7kib != 4kib), no piece is hashed.
  3. Third read gets the last 1kib of the second file, total != piece length (8kib != 4kib), no piece is hashed.
  4. The final piece is hashed during finalization, but it contains 8kib of data.

casey commented 4 years ago

I just landed #442, which tries to read up to the next piece boundary and hashes everything it gets back, instead of hashing a byte at a time. This was a 4x performance improvement, going from ~120MiB/s on my machine to ~480MiB/s. (Terrible code == easy performance wins!)
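For anyone following along, the approach is roughly this (a hedged sketch of the idea, not the actual #442 code; the RustCrypto sha1 crate stands in for whatever imdl uses):

```rust
// Streaming piece hasher: each read is capped at the number of bytes left in
// the current piece, so a read can never straddle a piece boundary, and
// whatever comes back is hashed in a single update() call.
use std::io::Read;

use sha1::{Digest, Sha1};

struct PieceHasher {
    piece_len: usize,
    in_piece: usize, // bytes hashed into the current piece so far
    hasher: Sha1,
    pieces: Vec<[u8; 20]>,
}

impl PieceHasher {
    fn new(piece_len: usize) -> Self {
        Self { piece_len, in_piece: 0, hasher: Sha1::new(), pieces: Vec::new() }
    }

    // Files are fed back to back, so pieces may span file boundaries without
    // any special casing.
    fn add_reader<R: Read>(&mut self, mut reader: R) -> std::io::Result<()> {
        let mut buf = vec![0u8; self.piece_len];
        loop {
            // Never ask for more than the rest of the current piece.
            let want = self.piece_len - self.in_piece;
            let n = reader.read(&mut buf[..want])?;
            if n == 0 {
                return Ok(()); // EOF for this file
            }
            self.hasher.update(&buf[..n]);
            self.in_piece += n;
            if self.in_piece == self.piece_len {
                let hasher = std::mem::replace(&mut self.hasher, Sha1::new());
                self.pieces.push(hasher.finalize().into());
                self.in_piece = 0;
            }
        }
    }

    fn finish(mut self) -> Vec<[u8; 20]> {
        if self.in_piece > 0 {
            self.pieces.push(self.hasher.finalize().into());
        }
        self.pieces
    }
}
```

Short reads and files that end mid-piece just leave in_piece where it is for the next reader, which is what makes the uneven-file scenarios above come out right.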

I wanted to go ahead and land it, but I still want to add some tests that exercise incomplete reads and uneven files, and probably some known-good infohashes of large files and directories of files. (They can be generated with an algorithm, so they don't need to be saved in the repo.)
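One cheap way to exercise the incomplete-read case might be a wrapper reader that never returns more than a few bytes per call (hypothetical test helper, not something in the repo):

```rust
use std::io::Read;

// Caps every read at 3 bytes, forcing the code under test to handle reads
// that stop short of both the buffer and the piece boundary.
struct ShortReads<R>(R);

impl<R: Read> Read for ShortReads<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        let cap = buf.len().min(3);
        self.0.read(&mut buf[..cap])
    }
}
```

Hashing the same input through the wrapper and through a plain reader should produce identical piece lists.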

I'm curious whether this helps with the performance of musl binaries. I'd be a little surprised if it did, since this was a pure CPU optimization, and it would make sense if any musl performance hit were at the libc/syscall layer, but it might be worth a shot. I don't have a Linux box, so I'll have to set up a benchmarking setup that lets me test musl vs glibc, but if you're up for it, I'd definitely be curious how the master branch does performance-wise vs before.

quietvoid commented 4 years ago

I hadn't considered the multiple-files situation; then you'd have to handle different buffer sizes depending on the previously read bytes (which seems to be what was done in #442).

I tested it with both targets:

stable-x86_64-unknown-linux-gnu 5,18s user 0,45s system 99% cpu 5,626 total

x86_64-unknown-linux-musl 5,87s user 0,30s system 99% cpu 6,176 total

So it does affect the binary built against musl too, which is great.

As for the device I/O, the results above are on an SSD. On a Debian 10 system with some HDDs:

master real 0m8.924s user 0m8.300s sys 0m0.624s

v0.1.7 real 0m36.361s user 0m35.701s sys 0m0.660s

Very good gains overall.

The hashing rate is always stuck at 953 MB/s (Arch machine) or 476 MB/s (Debian machine). I haven't looked further at the code, but it's something worth fixing.

About the benchmarks, I still can't manage to run criterion (had the same issue trying to use the imdl crate).

use imdl::bench::{Bench, HasherBench};
          ^^^^^ could not find `bench` in `imdl`

casey commented 4 years ago

> I tested it with both targets:
>
> stable-x86_64-unknown-linux-gnu 5,18s user 0,45s system 99% cpu 5,626 total
>
> x86_64-unknown-linux-musl 5,87s user 0,30s system 99% cpu 6,176 total
>
> So it does affect the binary built against musl too, which is great.

That's so interesting! I should do proper profiling, and figure out what's faster on musl. I'm very surprised that it made an impact there.

It looks like musl is still slower, but they're very close. I'm considering sticking with musl for now, due to compatibility issues, but I think that I should quantify the difference with some measurements.

> As for the device I/O, the results above are on an SSD. On a Debian 10 system with some HDDs:
>
> master real 0m8.924s user 0m8.300s sys 0m0.624s
>
> v0.1.7 real 0m36.361s user 0m35.701s sys 0m0.660s
>
> Very good gains overall.

Thank you for measuring! That's awesome. I'll cut a new release with the improvements soon.

> The hashing rate is always stuck at 953 MB/s (Arch machine) or 476 MB/s (Debian machine). I haven't looked further at the code, but it's something worth fixing.

That's really interesting! When you say it's stuck, does it vary at all? Is it the same number across runs? And is it way off from what it should be?

> About the benchmarks, I still can't manage to run criterion (had the same issue trying to use the imdl crate).

Ahh, sorry. The benchmarking interface is only exposed when the bench feature is enabled. You can run it with cargo bench --features bench. I'll have to document that.
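For reference, a self-contained criterion benchmark of the hashing hot path could look roughly like this (a sketch that hashes a synthetic 4 MiB piece with the RustCrypto sha1 crate directly, not imdl's actual Bench/HasherBench API):

```rust
// benches/hasher.rs (hypothetical): measures SHA-1 throughput on one 4 MiB
// piece, giving a stable baseline for changes to the hashing loop.
use criterion::{criterion_group, criterion_main, Criterion, Throughput};
use sha1::{Digest, Sha1};

fn bench_piece_hashing(c: &mut Criterion) {
    const PIECE: usize = 4 * 1024 * 1024;
    let data = vec![0u8; PIECE];

    let mut group = c.benchmark_group("piece-hashing");
    // Report throughput in bytes/second so runs are comparable to the CLI's rate.
    group.throughput(Throughput::Bytes(PIECE as u64));
    group.bench_function("sha1-4mib-piece", |b| {
        b.iter(|| {
            let mut hasher = Sha1::new();
            hasher.update(&data);
            hasher.finalize()
        })
    });
    group.finish();
}

criterion_group!(benches, bench_piece_hashing);
criterion_main!(benches);
```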

casey commented 4 years ago

Also, if the CPU is pegged at 99%, then multithreading will certainly help performance.

I'm curious if the current performance is good enough, in the sense that it's fast enough that nobody will care if it gets much faster with multithreading.

Multi-threaded hashing is somewhat complex, because the data that goes into pieces comes from multiple files, so some piece hashes cover data from two files. However, maybe you could just get the size of every file at the beginning and then farm out pieces to worker threads. You might wind up opening and closing files more times than a single pass would, and buffered file reads might mean you read some data off disk or out of some cache more than once, but it would probably still be way faster.

Some work-stealing scheme might also work. Initially, you assign different threads to hash different files, but if one thread finishes early (which will be common, since files differ greatly in size), it just steals a piece range of work from another thread.
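A very rough sketch of the farm-out idea, purely to show the shape (hypothetical code, unwraps and all): collect file sizes up front, map each piece index to a global byte range, and let workers claim piece indices from a shared counter, which also gives the load balancing that the work-stealing scheme is after.

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

use sha1::{Digest, Sha1};

// `files` is (path, length) in torrent order; pieces are numbered over the
// concatenation of all files.
fn hash_pieces(files: Vec<(PathBuf, u64)>, piece_len: u64) -> Vec<[u8; 20]> {
    let total: u64 = files.iter().map(|(_, len)| *len).sum();
    let piece_count = ((total + piece_len - 1) / piece_len) as usize;

    let files = Arc::new(files);
    let next = Arc::new(AtomicUsize::new(0));
    let threads = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);

    let mut handles = Vec::new();
    for _ in 0..threads {
        let files = Arc::clone(&files);
        let next = Arc::clone(&next);
        handles.push(thread::spawn(move || {
            let mut done: Vec<(usize, [u8; 20])> = Vec::new();
            loop {
                // Claim the next unhashed piece; the shared counter doubles
                // as a simple load balancer.
                let piece = next.fetch_add(1, Ordering::Relaxed);
                if piece >= piece_count {
                    return done;
                }
                let start = piece as u64 * piece_len;
                let end = (start + piece_len).min(total);

                let mut hasher = Sha1::new();
                let mut offset = 0u64; // global offset of the current file's first byte
                for (path, len) in files.iter() {
                    let file_end = offset + len;
                    if file_end > start && offset < end {
                        // This file overlaps the piece: read just the overlap.
                        let from = start.max(offset) - offset;
                        let to = end.min(file_end) - offset;
                        let mut file = File::open(path).unwrap();
                        file.seek(SeekFrom::Start(from)).unwrap();
                        let mut buf = vec![0u8; (to - from) as usize];
                        file.read_exact(&mut buf).unwrap();
                        hasher.update(&buf);
                    }
                    offset = file_end;
                }
                done.push((piece, hasher.finalize().into()));
            }
        }));
    }

    let mut pieces: Vec<(usize, [u8; 20])> =
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect();
    pieces.sort_by_key(|&(i, _)| i);
    pieces.into_iter().map(|(_, hash)| hash).collect()
}
```

The downside mentioned above is visible here: each claimed piece reopens and seeks into whatever files it touches, so the OS cache gets hit more than a single sequential pass would.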

quietvoid commented 4 years ago

> The hashing rate is always stuck at 953 MB/s (Arch machine) or 476 MB/s (Debian machine). I haven't looked further at the code, but it's something worth fixing.
>
> That's really interesting! When you say it's stuck, does it vary at all? Is it the same number across runs? And is it way off from what it should be?

It behaves like v0.1.7, which sat fixed at 119.21 MiB/s (as in the example images in the readme). Now it either sits fixed at 953.67 MiB/s or jumps in intervals of ~50 MiB/s.

My SSD definitely does not read at that rate, and the time spent hashing actually reflects the speed/size better.

At this point the hashing rate is still mostly I/O bound; there might be optimizations possible, but it would require profiling.

casey commented 4 years ago

> It behaves like v0.1.7, which sat fixed at 119.21 MiB/s (as in the example images in the readme). Now it either sits fixed at 953.67 MiB/s or jumps in intervals of ~50 MiB/s.

Really interesting; this could be due to some issue with the progress bar.

> At this point the hashing rate is still mostly I/O bound; there might be optimizations possible, but it would require profiling.

Good to know! Profiling should be relatively easy to set up, so I'll probably do that, even if I don't wind up doing much optimizing.

quietvoid commented 4 years ago

By the way, I have played with different ways to increment the progress bar, but the result is always the same. It's probably something upstream in indicatif that loses precision when averaging the progress every second.

casey commented 4 years ago

Hmm, interesting. I should check out how indicatif is calculating progress. There might also be a setting to get it to refresh more often.
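If it helps, with indicatif 0.17 the setup would look roughly like this (a sketch, not imdl's actual progress code): the rate shown by {bytes_per_sec} comes from indicatif's own estimate of position over time, so both how often inc() is called and how often the bar redraws can affect what gets displayed.

```rust
use std::time::Duration;

use indicatif::{ProgressBar, ProgressStyle};

fn make_bar(total_bytes: u64) -> ProgressBar {
    let bar = ProgressBar::new(total_bytes);
    bar.set_style(
        // {bytes_per_sec} is computed by indicatif from the bar's position
        // over time; the hashing code only reports positions via inc().
        ProgressStyle::with_template("{bytes}/{total_bytes} {bytes_per_sec} {wide_bar} {eta}")
            .expect("valid template"),
    );
    // Redraw on a timer as well as on inc(), so the displayed rate keeps
    // updating even when individual reads are large.
    bar.enable_steady_tick(Duration::from_millis(100));
    bar
}

// In the hashing loop: bar.inc(bytes_read as u64), then bar.finish() at the end.
```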