LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

token generation speed decreases when using grammar #810

Open · abeiro opened 5 months ago

abeiro commented 5 months ago

When a set of GBNF expressions is specified via the 'grammar' parameter through the API, token generation performance drops noticeably.

For instance, I'm using this file: https://github.com/ggerganov/llama.cpp/blob/master/grammars/json.gbnf to ensure that the LLM always generates valid JSON content. On a Tesla V100-SXM2-16GB, the token-per-second rate drops from 70 to 30 compared to plain text output. Through testing, I also noticed that an RTX 3080 generates tokens at the same speed as an RTX 3060 (both at 30 tokens per second) when grammar sampling is enabled. I find it curious that the speed is roughly the same across these three cards whenever grammar is used.

When the parameter is deactivated (no grammar), the generation speeds do differ between the cards. This strikes me as odd. Is this normal? Is there another way to ensure that the LLM always returns valid JSON objects?
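
For reference, here is a minimal sketch of how I'm sending the grammar. It assumes a local koboldcpp instance on the default port (5001) and the usual KoboldAI-style `/api/v1/generate` payload; the prompt and sampler values are just placeholders.

```python
import requests

# Load the raw GBNF text (the json.gbnf file linked above).
with open("json.gbnf", "r", encoding="utf-8") as f:
    json_grammar = f.read()

payload = {
    "prompt": "Return the user profile as a JSON object:\n",
    "max_length": 200,
    "temperature": 0.7,
    "grammar": json_grammar,  # omit this key to compare plain-text generation speed
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(resp.json()["results"][0]["text"])
```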

LostRuins commented 5 months ago

Yes, that is expected. Grammar sampling is rather expensive to do.

If the model is well tuned, it should be able to produce valid JSON without grammar, so an easier way would be to attempt generation and then try to parse the result, retrying on failure.
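
A minimal sketch of that retry approach, using the same assumed endpoint and payload fields as above; `generate_json` is a hypothetical helper, not part of the koboldcpp API.

```python
import json
import requests

def generate_json(prompt, retries=3):
    """Generate without grammar, then parse; re-request if the output is not valid JSON."""
    for _ in range(retries):
        resp = requests.post(
            "http://localhost:5001/api/v1/generate",
            json={"prompt": prompt, "max_length": 200, "temperature": 0.7},
        )
        text = resp.json()["results"][0]["text"]
        try:
            return json.loads(text)  # success: valid JSON at full generation speed
        except json.JSONDecodeError:
            continue  # malformed output, try again
    raise ValueError("model did not return valid JSON after retries")
```

This keeps every generation at the unconstrained token rate and only pays the cost of an extra request when the model occasionally emits malformed JSON.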