Dan-wanna-M / formatron

Formatron empowers everyone to control the format of language models' output with minimal overhead.
MIT License

Add Benchmarks #1

Open · Dan-wanna-M opened this issue 3 months ago

turboderp commented 3 months ago

As a note re ExLlamaV2, recent versions have some amount of warmup time, and really need to generate about 250 tokens at a given batch size before all the kernels are tuned and the graphs are built and so on. It wouldn't directly impact the application of filters, but you might end up skewing the results if the backend gets 30% faster in between two tests.

Dan-wanna-M commented 3 months ago

> As a note re ExLlamaV2, recent versions have some amount of warmup time, and really need to generate about 250 tokens at a given batch size before all the kernels are tuned and the graphs are built and so on. It wouldn't directly impact the application of filters, but you might end up skewing the results if the backend gets 30% faster in between two tests.

Got it, will add an exllamav2-specific warmup phase. By the way, I've noticed that creating a large Python set (or clearing and refilling one) every time in the filter can take a few ms, especially for regexes like JSON strings where almost all tokens are allowed. Would it be possible to offer alternative interfaces in exllamav2?
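
A minimal sketch of what such a warmup phase could look like (the `generate` callable here is a stand-in, not an actual exllamav2 API; the 250-token figure comes from the comment above):

```python
import time

WARMUP_TOKENS = 250  # per the comment above: kernels/graphs settle within ~250 tokens

def throughput(generate, prompt, n_tokens=1000):
    """Tokens/s after an untimed warmup run at the same batch size.

    `generate(prompt, max_new_tokens)` stands in for whatever call the
    benchmark actually makes into the backend.
    """
    generate(prompt, WARMUP_TOKENS)  # untimed: let kernels tune and graphs build
    start = time.perf_counter()
    generate(prompt, n_tokens)       # measured run
    return n_tokens / (time.perf_counter() - start)
```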

turboderp commented 3 months ago

I can definitely add something like that. I think sets are more efficient when you have to interact with certain other filters, but I could allow ExLlamaV2Filter.next() to return a list rather than a set, and then deal with it in the sampler if it actually needs to be converted to a set for whatever reason.

Would a list be ideal in that case? Or some other structure, maybe a Torch tensor?

Of course, at some point I'll just move the filter evaluation to the end of the forward pass so it ends up overlapping with the CUDA queue, and then it probably won't matter either way.
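
A rough sketch of the shape that change could take (the names below are illustrative, not the actual ExLlamaV2Filter interface):

```python
# Illustrative only -- not the real exllamav2 API.
class ListFilter:
    """A filter whose next() hands back a plain list of allowed token IDs."""

    def __init__(self, allowed):
        self.allowed = allowed  # list[int], possibly covering most of the vocab

    def next(self):
        # No per-step set construction; callers convert only if they must.
        return self.allowed

def combine(filters):
    """Sampler-side merge: pay the set-conversion cost only with 2+ filters."""
    results = [f.next() for f in filters]
    if len(results) == 1:
        return results[0]  # fast path: use the list as-is
    merged = set(results[0])
    for r in results[1:]:
        merged &= set(r)
    return merged
```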

Dan-wanna-M commented 3 months ago

> I can definitely add something like that. I think sets are more efficient when you have to interact with certain other filters, but I could allow ExLlamaV2Filter.next() to return a list rather than a set, and then deal with it in the sampler if it actually needs to be converted to a set for whatever reason.
>
> Would a list be ideal in that case? Or some other structure, maybe a Torch tensor?
>
> Of course, at some point I'll just move the filter evaluation to the end of the forward pass so it ends up overlapping with the CUDA queue, and then it probably won't matter either way.

I think a list is fine; I actually used a list to mask logits in huggingface's integration and the overhead per token is negligible.
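
For reference, a minimal sketch of what masking logits with a plain list looks like in PyTorch (illustrative, not the actual Formatron integration code):

```python
import torch

def mask_logits(logits, allowed):
    """Keep only the `allowed` token IDs; everything else becomes -inf."""
    masked = torch.full_like(logits, float("-inf"))
    masked[allowed] = logits[allowed]  # fancy indexing with a plain Python list
    return masked

# e.g. an 8-token vocab where only tokens 1, 4, and 5 are allowed
print(mask_logits(torch.randn(8), [1, 4, 5]))
```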