LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

token generation speed decreases when using grammar #810

Open · abeiro opened 5 months ago

abeiro commented 5 months ago

When a set of GBNF expressions is specified via the 'grammar' parameter through the API, token generation performance drops noticeably.

For instance, I'm using this file: https://github.com/ggerganov/llama.cpp/blob/master/grammars/json.gbnf to ensure that the LLM always generates valid JSON content. On a Tesla V100-SXM2-16GB, the token-per-second rate drops from 70 to 30 compared to plain text output. Through testing, I also noticed that an RTX 3080 generates tokens at the same speed as an RTX 3060 (both at 30 tokens per second) when grammar sampling is enabled. I find it curious that the speed is roughly the same across these three cards whenever grammar is used.

When the parameter is deactivated (no grammar), the generation speeds do differ between the cards. This strikes me as odd. Is this normal? Is there another way to ensure that the LLM always returns valid JSON objects?
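
For reference, here is a minimal sketch of how I'm sending the grammar. It assumes a local koboldcpp instance on the default port (5001) and the usual KoboldAI-style `/api/v1/generate` payload; the prompt and sampler values are just placeholders.

```python
import requests

# Load the raw GBNF text (the json.gbnf file linked above).
with open("json.gbnf", "r", encoding="utf-8") as f:
    json_grammar = f.read()

payload = {
    "prompt": "Return the user profile as a JSON object:\n",
    "max_length": 200,
    "temperature": 0.7,
    "grammar": json_grammar,  # omit this key to compare plain-text generation speed
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(resp.json()["results"][0]["text"])
```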

LostRuins commented 5 months ago

Yes, that is expected. Grammar sampling is rather expensive to do.

If the model is well tuned, it should be able to produce valid JSON without grammar, so an easier way would be to attempt generation and then try to parse the result, retrying on failure.
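
A minimal sketch of that retry approach, using the same assumed endpoint and payload fields as above; `generate_json` is a hypothetical helper, not part of the koboldcpp API.

```python
import json
import requests

def generate_json(prompt, retries=3):
    """Generate without grammar, then parse; re-request if the output is not valid JSON."""
    for _ in range(retries):
        resp = requests.post(
            "http://localhost:5001/api/v1/generate",
            json={"prompt": prompt, "max_length": 200, "temperature": 0.7},
        )
        text = resp.json()["results"][0]["text"]
        try:
            return json.loads(text)  # success: valid JSON at full generation speed
        except json.JSONDecodeError:
            continue  # malformed output, try again
    raise ValueError("model did not return valid JSON after retries")
```

This keeps every generation at the unconstrained token rate and only pays the cost of an extra request when the model occasionally emits malformed JSON.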