eth-sri / lmql

A language for constraint-guided and efficient LLM programming.
https://lmql.ai
Apache License 2.0
3.7k stars 200 forks

determining the speed of generation based on model's underlying quality #270

Closed mcapizzi-cohere closed 1 year ago

mcapizzi-cohere commented 1 year ago

This is both (1) a very naive question and (2) probably best suited for another forum, but I'll ask it anyway.

Is there a relationship between (1) the underlying quality of the LLM used and (2) the time it takes to complete the LMQL query? Take two models for example:

  1. model one -> a very good model that, during greedy decoding, could produce the desired output even without any LMQL constraints
  2. model two -> a very "poor" model that can only produce the desired output with LMQL constraints

Will one of those models complete the generation faster?

This question reveals my lack of detailed understanding of both (1) greedy decoding in general and (2) LMQL's implementation, but I'd appreciate some more intuition on the question, as it will help us decide whether it's "worth" using a stronger (most likely larger, in terms of parameter count) model in our application.

lbeurerkellner commented 1 year ago

The quality of a model does not directly affect inference speed. However, smaller models (in terms of parameter count) are typically much faster, while often also less capable. Thus, it could be that you are simply observing that a smaller/less capable model is faster and a bigger/more capable model is slower.
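To build intuition, here is a minimal toy sketch (not LMQL's actual implementation) of constrained greedy decoding. The `toy_model`, its vocabulary, and the constraint set are all hypothetical. The point it illustrates: constraints are applied as a cheap mask over the logits between forward passes, so the per-token cost is dominated by the model call itself, which scales with parameter count, not with model quality.

```python
# Toy illustration of constrained greedy decoding. All names here
# (toy_model, VOCAB, the allowed set) are made up for this sketch.

VOCAB = ["yes", "no", "maybe", "<eos>"]

def toy_model(prefix):
    # Stand-in for an LLM forward pass: one logit per vocab token.
    # In a real model, this is the expensive step, and its cost depends
    # on parameter count, not on how "good" the model is.
    return [len(prefix) % 3, 1.0, 2.0, float(len(prefix) >= 2)]

def greedy_decode(allowed=None, max_tokens=4):
    prefix = []
    for _ in range(max_tokens):
        logits = toy_model(prefix)          # expensive: the forward pass
        if allowed is not None:             # cheap: mask disallowed tokens
            logits = [l if t in allowed else float("-inf")
                      for t, l in zip(VOCAB, logits)]
        token = VOCAB[max(range(len(VOCAB)), key=logits.__getitem__)]
        if token == "<eos>":
            break
        prefix.append(token)
    return prefix
```

With and without the `allowed` set, the loop performs the same number of model calls per emitted token; the masking only changes *which* token is picked. (Constraints can still change total wall-clock time indirectly, e.g. by causing generation to stop earlier or later.)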