Hey @drahnr I've had a go at speeding up loading the Tokenizer today.
I did two things:
- Replaced the `IndexMap` with `Vec<(WordIdInt, PosIdInt)>` as discussed in #56. This makes the most difference.
- Changed `new` -> `with_capacity` by storing the lengths; this makes a very small but measurable difference (a couple of percent).
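A minimal sketch of what the two changes amount to. The type aliases and the `load` helper are hypothetical stand-ins for the crate's actual id types and loading loop; only the `Vec<(WordIdInt, PosIdInt)>` shape and the `with_capacity` pre-allocation come from the changes described above:

```rust
// Hypothetical stand-ins for the crate's id types.
type WordIdInt = u32;
type PosIdInt = u16;

// Before: an IndexMap built with new() and filled entry by entry,
// paying for hashing plus repeated growth reallocations.
// After: a flat Vec, pre-sized with the stored length so the backing
// buffer is allocated exactly once.
fn load(entries: &[(WordIdInt, PosIdInt)], stored_len: usize) -> Vec<(WordIdInt, PosIdInt)> {
    let mut ids = Vec::with_capacity(stored_len);
    for &(word, pos) in entries {
        ids.push((word, pos));
    }
    ids
}
```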
Overall I get a 25% speedup, which is something at least. I experimented a bit with parallelization, particularly setting some "anchor" points in the FST and splitting the work into chunks, where each chunk iterates from one anchor point to the next, but it seems the speedup from that is nullified by the merge we have to do afterwards.
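For reference, the chunked approach I tried looks roughly like this sketch. Everything here is illustrative (the real anchors are keys in the FST, not slice offsets, and `parallel_load` is a made-up name); the point is the shape: scan chunks in parallel, then pay for a sequential merge at the end:

```rust
use std::thread;

// Hypothetical sketch: "anchor" points split the key space into chunks,
// each thread scans its own range, and the per-chunk results are merged
// afterwards. The merge (allocation + copy) is what ate the parallel
// speedup in practice.
fn parallel_load(keys: &[u32], num_chunks: usize) -> Vec<u32> {
    let chunk_len = ((keys.len() + num_chunks - 1) / num_chunks).max(1);
    let partials: Vec<Vec<u32>> = thread::scope(|s| {
        keys.chunks(chunk_len)
            // Each chunk is processed independently (here just copied,
            // standing in for the real per-entry decoding work).
            .map(|chunk| s.spawn(move || chunk.to_vec()))
            .collect::<Vec<_>>()
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .collect()
    });
    // Sequential merge of the per-chunk results.
    partials.into_iter().flatten().collect()
}
```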
Maybe there are smarter ways to speed this up further, but I couldn't think of anything.