Currently the implementation computes the full next-allowed-token mask at each decoding step according to the grammar and the prefix.
However, in many cases the model's most likely token is already grammar-valid, so computing the entire mask is wasted work. It would probably be better to iteratively validate tokens in descending order of likelihood and stop at the first one the grammar accepts.
This would need some refactoring in the logit processor class.
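The lazy-validation idea could be sketched roughly as follows. This is a minimal illustration, not the actual logit processor: `grammar_accepts` is a hypothetical callback standing in for whatever incremental grammar check the real implementation exposes.

```python
import numpy as np

def select_token_lazy(logits, grammar_accepts, max_checks=None):
    """Try tokens in descending logit order and return the first
    grammar-valid one, instead of masking the full vocabulary.

    grammar_accepts: hypothetical predicate telling whether a token id
    is a valid continuation of the current prefix under the grammar.
    """
    order = np.argsort(logits)[::-1]  # most likely token first
    if max_checks is not None:
        order = order[:max_checks]
    for token_id in order:
        if grammar_accepts(int(token_id)):
            return int(token_id)
    return None  # give up and fall back to computing the full mask

# Toy example: pretend the grammar only accepts even token ids.
logits = np.array([0.1, 2.0, 0.5, 3.0])
print(select_token_lazy(logits, lambda t: t % 2 == 0))  # -> 2
```

Note that this shortcut fits greedy decoding; with sampling, the full mask (or at least a validated top-k) would still be needed to renormalize the distribution correctly.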