turboderp closed this pull request 2 months ago.
@turboderp Thanks for moving the regex compilation out; I think 0.7s is good enough for now.
@turboderp is probably busy with his life (which is completely normal), and since I can edit this PR, I will probably just apply my review changes and merge this in the next few days.
I'm mostly busy with code. There's a bunch of other stuff going on with tensor-parallel implementations and samplers and chasing bugs and so on. I will add features to the filters soon to allow either sets, lists, or tensor return values; it just hasn't made it to the front of the queue yet.
As for this PR, feel free to make whatever changes you feel are appropriate. Or just close the PR and do something similar on the main fork if you want; I don't mind.
Changes made in the faster vocab initialization branch.
Just a couple of small changes for your consideration. Compiling a regex in the `_multiple_replace` function per token is redundant and very much a hotspot when profiling. By constructing it only once, `get_original_characters` goes from about 9.5 seconds down to about 0.7 (for a Llama 3 vocabulary, TR 7960X). `_multiple_replace` still uses up 70% of that time, so it might be worth adding an extension function for that. If the substitutions are all between 1- and 2-byte substrings it should be nearly instant in C, but I don't know if you'd want to complicate the library that way, or whether you'd prefer a Rust utility function in the KBNF library instead.
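A minimal sketch of the hoisting, for reference. Only the names `_multiple_replace` and `get_original_characters` come from this PR; the signatures and the replacement table are assumptions:

```python
import re
import typing


def _compile_replacement_pattern(replacements: typing.Dict[bytes, bytes]) -> "re.Pattern[bytes]":
    # Build one alternation over all keys and compile it a single time,
    # instead of recompiling inside _multiple_replace for every token.
    return re.compile(b"|".join(map(re.escape, replacements)))


def _multiple_replace(replacements: typing.Dict[bytes, bytes],
                      pattern: "re.Pattern[bytes]",
                      text: bytes) -> bytes:
    # With the pattern precompiled, the per-token work is one substitution pass.
    return pattern.sub(lambda m: replacements[m.group(0)], text)


def get_original_characters(vocab: typing.Dict[str, int]) -> typing.Dict[bytes, int]:
    # Hypothetical replacement table; the real one maps tokenizer byte
    # escapes (e.g. the "Ġ" space marker) back to raw bytes.
    replacements = {"Ġ".encode(): b" ", "Ċ".encode(): b"\n"}
    pattern = _compile_replacement_pattern(replacements)  # compiled once
    return {
        _multiple_replace(replacements, pattern, token.encode()): token_id
        for token, token_id in vocab.items()
    }
```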
Also, specific to the ExLlamaV2 integration, I can't see a downside to caching `create_engine_vocabulary` so initialization only has to happen once when creating multiple formatters for the same model.
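A minimal sketch of that caching, assuming a memoizing wrapper at the integration layer; the import path and the choice of keying on the tokenizer instance are assumptions, only `create_engine_vocabulary` itself comes from the PR:

```python
# Import path is a guess; only the function name appears in this PR.
from formatron.integrations.exllamav2 import create_engine_vocabulary

_vocab_cache = {}  # tokenizer id -> engine vocabulary


def cached_engine_vocabulary(tokenizer):
    # Key on the tokenizer instance; it lives as long as the model does,
    # so id() is stable for this purpose.
    key = id(tokenizer)
    vocab = _vocab_cache.get(key)
    if vocab is None:
        vocab = create_engine_vocabulary(tokenizer)
        _vocab_cache[key] = vocab
    return vocab
```

Formatters built for the same model would then share one vocabulary, so only the first construction pays the cost.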