turboderp closed this pull request 2 months ago.
@turboderp Thanks for moving the regex compilation out; I think 0.7s is good enough for now.
@turboderp is probably busy with his life (which is completely normal), and since I can edit this PR, I will probably just apply my review changes and merge this in the next few days.
I'm mostly busy with code. There's a bunch of other stuff going on with tensor-parallel implementations and samplers and chasing bugs and so on. I will add features to the filters soon to allow either sets, lists, or tensor return values; it just hasn't made it to the front of the queue yet.
As for this PR, feel free to make whatever changes you feel are appropriate. Or just close the PR and do something similar on the main fork if you want; I don't mind.
Changes made in the faster vocab initialization branch.
Just a couple of small changes for your consideration. Compiling a regex in the `_multiple_replace` function per token is redundant and very much a hotspot when profiling. By constructing it only once, `get_original_characters` goes from about 9.5 seconds down to about 0.7 (for a Llama 3 vocabulary, TR 7960X). `_multiple_replace` still uses up 70% of that time, so it might be worth adding an extension function for that. If the substitutions are all between 1- and 2-byte substrings it should be nearly instant in C, but I don't know if you'd want to complicate the library that way, or whether you'd prefer a Rust utility function in the KBNF library instead.
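A minimal sketch of the hoisting, for reference. Only the names `_multiple_replace` and `get_original_characters` come from this PR; the signatures and the replacement table are assumptions:

```python
import re
import typing


def _compile_replacement_pattern(replacements: typing.Dict[bytes, bytes]) -> "re.Pattern[bytes]":
    # Build one alternation over all keys and compile it a single time,
    # instead of recompiling inside _multiple_replace for every token.
    return re.compile(b"|".join(map(re.escape, replacements)))


def _multiple_replace(replacements: typing.Dict[bytes, bytes],
                      pattern: "re.Pattern[bytes]",
                      text: bytes) -> bytes:
    # With the pattern precompiled, the per-token work is one substitution pass.
    return pattern.sub(lambda m: replacements[m.group(0)], text)


def get_original_characters(vocab: typing.Dict[str, int]) -> typing.Dict[bytes, int]:
    # Hypothetical replacement table; the real one maps tokenizer byte
    # escapes (e.g. the "Ġ" space marker) back to raw bytes.
    replacements = {"Ġ".encode(): b" ", "Ċ".encode(): b"\n"}
    pattern = _compile_replacement_pattern(replacements)  # compiled once
    return {
        _multiple_replace(replacements, pattern, token.encode()): token_id
        for token, token_id in vocab.items()
    }
```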
Also, specific to the ExLlamaV2 integration, I can't see a downside to caching `create_engine_vocabulary` so initialization only has to happen once when creating multiple formatters for the same model.
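A minimal sketch of that caching, assuming a memoizing wrapper at the integration layer; the import path and the choice of keying on the tokenizer instance are assumptions, only `create_engine_vocabulary` itself comes from the PR:

```python
# Import path is a guess; only the function name appears in this PR.
from formatron.integrations.exllamav2 import create_engine_vocabulary

_vocab_cache = {}  # tokenizer id -> engine vocabulary


def cached_engine_vocabulary(tokenizer):
    # Key on the tokenizer instance; it lives as long as the model does,
    # so id() is stable for this purpose.
    key = id(tokenizer)
    vocab = _vocab_cache.get(key)
    if vocab is None:
        vocab = create_engine_vocabulary(tokenizer)
        _vocab_cache[key] = vocab
    return vocab
```

Formatters built for the same model would then share one vocabulary, so only the first construction pays the cost.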