Could you provide some instructions on how you run your benches? I just copied the `.bin` files from the last release, but to no avail - I assume the structure of the `.bin`s changed as well.
```
Benchmarking load tokenizer: Warming up for 3.0000 s
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Serialization(Io(Custom { kind: UnexpectedEof, error: "failed to fill whole buffer" }))', nlprule/benches/load.rs:6:76
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
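(For reference, the load bench essentially just deserializes the two binaries. A minimal sketch of the failing path, assuming nlprule's path-based `Tokenizer::new`/`Rules::new` constructors and placeholder `.bin` paths:)

```rust
use nlprule::{Rules, Tokenizer};

fn main() {
    // Deserializing `.bin` files produced by an older release fails with
    // `UnexpectedEof`, because the serialized format has since changed.
    let _tokenizer = Tokenizer::new("storage/en_tokenizer.bin").unwrap();
    let _rules = Rules::new("storage/en_rules.bin").unwrap();
}
```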
Oh, I forgot that the binaries have changed. You can rebuild them with:
```
./scripts/build_and_test.sh en xxx
```
Alright, it was a bit more than that, but it's running now.
```
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 36.1s, or reduce sample count to 10.
load tokenizer    time: [350.09 ms 350.94 ms 351.92 ms]
load rules        time: [32.718 ms 32.819 ms 32.929 ms]
```
A few observations (again, from the cargo-spellcheck PoV):
- Numbers look really good; we're getting closer to the point where I would stop thinking about launching another process just to avoid re-parsing - so this is a huge :+1:
- Since the data is constant, a 3.0 second warmup does not represent what a common user of cargo-spellcheck will face (talking mostly about caches here; `black_box` will only prevent the compiler from prematurely optimizing, but the actual CPU doesn't care) - see more details in https://github.com/bminixhofer/nlprule/pull/71. EDIT: even for the case of a long-running service, you are mostly interested in the one-off execution; both rules and tokenizer loading are not common hot paths, but rather one-off operations.
- criterion has options for this, so I would recommend not much of a warmup, so that the worst case represents what CLI programs will see (see the sketch below).
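For illustration, here is roughly how the warmup could be dialed down via criterion's builder (a minimal sketch; the bench name, binary path, and concrete durations are placeholders, not taken from nlprule's actual bench):

```rust
use std::time::Duration;

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use nlprule::Tokenizer;

fn bench_load(c: &mut Criterion) {
    c.bench_function("load tokenizer", |b| {
        // `black_box` only stops the compiler from optimizing the load
        // away; it does nothing about warm OS page/CPU caches.
        b.iter(|| black_box(Tokenizer::new("storage/en_tokenizer.bin").unwrap()))
    });
}

criterion_group! {
    name = benches;
    // Keep the warmup short so the numbers stay close to the cold,
    // one-off execution a CLI user actually sees.
    config = Criterion::default()
        .warm_up_time(Duration::from_millis(500))
        .sample_size(10);
    targets = bench_load
}
criterion_main!(benches);
```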
Thanks for working on the perf :heart:
Benches were all done with:
```
vendor_id  : AuthenticAMD
cpu family : 23
model      : 113
model name : AMD Ryzen 7 3700X 8-Core Processor
stepping   : 0
microcode  : 0x8701013
cpu MHz    : 2200.000
cache size : 512 KB
...
```
(for perspective).
> Alright, it was a bit more than that

I should document that, I guess you had to download the build dirs too.
Thanks for checking the perf! And thanks for the PR. I agree on the point about the warmup, I'll merge it.
Turns out the `Chunker` deserialization also contributed a significant amount now that the `Tokenizer` is faster. There was a fairly easy fix: storing the parameters in one large vector and accessing them with `(offset, length)` tuples instead of having one vector for each feature, which shaves off another ~12% of loading time on the `Tokenizer` (relative to the already improved speed above).
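In spirit, the layout change is something like the following (an illustrative sketch; `FlatFeatures` and its fields are made-up names, not nlprule's actual types):

```rust
/// One flat buffer plus `(offset, length)` index tuples, instead of
/// one separately-allocated (and separately-deserialized) vector per feature.
struct FlatFeatures {
    data: Vec<f32>,             // all feature parameters, concatenated
    spans: Vec<(usize, usize)>, // (offset, length) into `data`, one per feature
}

impl FlatFeatures {
    fn feature(&self, i: usize) -> &[f32] {
        let (offset, len) = self.spans[i];
        &self.data[offset..offset + len]
    }
}

fn main() {
    let features = FlatFeatures {
        data: vec![0.1, 0.2, 0.3, 0.4, 0.5],
        spans: vec![(0, 2), (2, 3)], // feature 0 -> [0.1, 0.2], feature 1 -> [0.3, 0.4, 0.5]
    };
    assert_eq!(features.feature(1), &[0.3, 0.4, 0.5]);
}
```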
Unfortunately I think that's it for the reasonably low-hanging speed improvements :wink:
I had another look at the tagger today. This PR:
- changes the `get_tags_*` methods to return iterators instead of `Vec`s.
- removes the `groups`. These were only used by nlprule in the `PosReplacer`, which wasn't used anywhere as it is not fully implemented. Some of the currently unimplemented rules might need the `groups` in some form though, but we can probably get away with search + caching since the groups are only needed if a rule actually matches there.
- accesses the `tags` without synchronization.
- replaces the `HashMap` with a `Vec`, since the IDs go from zero to `n_words` anyway, so we don't need to hash anything (see the sketch below).

I see another ~30% speedup in loading the `Tokenizer`. This could also have a positive impact on rule checking speed, but there's some weird behavior in the local benchmark on my PC, so I have to double check.
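Roughly what the last two points amount to, as a hypothetical sketch (the `WordIdMap`/`Tag` names and method shapes are illustrative, not the actual tagger API):

```rust
struct Tag(u16);

/// Word IDs are dense (0..n_words), so a `Vec` indexed by ID can replace
/// a `HashMap` keyed by word ID: no hashing, just an array index.
struct WordIdMap {
    tags: Vec<Vec<Tag>>, // outer index = word ID
}

impl WordIdMap {
    /// Returning a borrowing iterator instead of an owned `Vec`
    /// avoids an allocation on every lookup.
    fn get_tags(&self, word_id: usize) -> impl Iterator<Item = &Tag> + '_ {
        self.tags[word_id].iter()
    }
}

fn main() {
    let map = WordIdMap {
        tags: vec![vec![Tag(0), Tag(3)], vec![Tag(1)]],
    };
    for tag in map.get_tags(0) {
        println!("tag id: {}", tag.0);
    }
}
```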
@drahnr you might be interested in this PR. It would also be great if you could double check the speedup.