Hey @drahnr I've had a go at speeding up loading the Tokenizer today.
I did two things:
- Replaced the `IndexMap` with `Vec<(WordIdInt, PosIdInt)>` as discussed in #56. This makes the most difference.
- Changed `new` -> `with_capacity` by storing the lengths; this makes a very small but measurable difference (a couple of percent).
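A minimal sketch of what the two changes amount to. The type aliases and the `load` helper are hypothetical stand-ins for the crate's actual id types and loading loop; only the `Vec<(WordIdInt, PosIdInt)>` shape and the `with_capacity` pre-allocation come from the changes described above:

```rust
// Hypothetical stand-ins for the crate's id types.
type WordIdInt = u32;
type PosIdInt = u16;

// Before: an IndexMap built with new() and filled entry by entry,
// paying for hashing plus repeated growth reallocations.
// After: a flat Vec, pre-sized with the stored length so the backing
// buffer is allocated exactly once.
fn load(entries: &[(WordIdInt, PosIdInt)], stored_len: usize) -> Vec<(WordIdInt, PosIdInt)> {
    let mut ids = Vec::with_capacity(stored_len);
    for &(word, pos) in entries {
        ids.push((word, pos));
    }
    ids
}
```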
Overall I get a 25% speedup, which is something at least. I experimented a bit with parallelization, particularly setting some "anchor" points in the FST and splitting the work into chunks, where each chunk iterates from one anchor point to the next, but it seems the speedup from that is nullified by the merge we have to do afterwards.
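For reference, the chunked approach I tried looks roughly like this sketch. Everything here is illustrative (the real anchors are keys in the FST, not slice offsets, and `parallel_load` is a made-up name); the point is the shape: scan chunks in parallel, then pay for a sequential merge at the end:

```rust
use std::thread;

// Hypothetical sketch: "anchor" points split the key space into chunks,
// each thread scans its own range, and the per-chunk results are merged
// afterwards. The merge (allocation + copy) is what ate the parallel
// speedup in practice.
fn parallel_load(keys: &[u32], num_chunks: usize) -> Vec<u32> {
    let chunk_len = ((keys.len() + num_chunks - 1) / num_chunks).max(1);
    let partials: Vec<Vec<u32>> = thread::scope(|s| {
        keys.chunks(chunk_len)
            // Each chunk is processed independently (here just copied,
            // standing in for the real per-entry decoding work).
            .map(|chunk| s.spawn(move || chunk.to_vec()))
            .collect::<Vec<_>>()
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .collect()
    });
    // Sequential merge of the per-chunk results.
    partials.into_iter().flatten().collect()
}
```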
Maybe there are smarter ways to speed this up further, but I couldn't think of anything.