eth-sri / lmql

A language for constraint-guided and efficient LLM programming.
https://lmql.ai
Apache License 2.0

Question about amount of API calls #2

Closed pruksmhc closed 1 year ago

pruksmhc commented 1 year ago

Thank you for the great work! Consider a query like the sunscreen example:

```
"A list of things not to forget when going to the sea (not travelling): \n"
"- [THING]"
```

From a quick skim of the paper, my understanding is that there will be an API call for each THING in each line (two calls). Is that correct?

Also, it seems like the OpenAI API already gives some control over the decoding strategy (especially with the introduction of logit bias, which can be used to upsample or downsample certain tokens). Is constraints the main additional value this library provides over that?
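For context, logit bias as I understand it simply adds a fixed offset to the logits of selected tokens before sampling. A minimal sketch of that mechanism (the token IDs and logit values here are made up for illustration; OpenAI accepts bias values in roughly [-100, 100]):

```python
import math

def apply_logit_bias(logits, bias):
    """Add a per-token bias (as with OpenAI's logit_bias parameter) to raw logits."""
    return {tok: logit + bias.get(tok, 0.0) for tok, logit in logits.items()}

def softmax(logits):
    m = max(logits.values())
    exps = {tok: math.exp(l - m) for tok, l in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

# Hypothetical logits over three token IDs; a bias of -100 effectively bans
# token 11, while +5 strongly upsamples token 42.
logits = {7: 1.0, 11: 2.0, 42: 0.5}
biased = apply_logit_bias(logits, {11: -100.0, 42: 5.0})
probs = softmax(biased)
```

So as a user I can already ban or boost a fixed set of tokens per call, which is why I am asking what constraints add beyond that.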

lbeurerkellner commented 1 year ago

Thank you for your interest in this.

Indeed, for OpenAI models we have to issue multiple subsequent API calls in the given example. However, we do implement speculative generation, which means that in some cases some of these calls can be saved. This is an unfortunate limitation of the OpenAI API and not the case with local models (e.g. via the HuggingFace Transformers backend): there, LMQL runs directly as part of the model's decoding loop, applying token masks eagerly and inserting prompt tokens where necessary, so only one decoder call is needed.
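To illustrate why a local backend only needs one decoder run, here is a toy greedy decoding loop with an eagerly applied token mask. This is a simplified sketch, not LMQL's actual implementation; the vocabulary, scorer, and mask are hypothetical:

```python
def decode_with_mask(score_fn, mask_fn, max_steps=10):
    """Toy greedy decoding loop. At each step, only tokens permitted by
    mask_fn(generated_so_far) are considered, so constraints are enforced
    inside the loop itself rather than via repeated API calls."""
    out = []
    for _ in range(max_steps):
        allowed = mask_fn(out)
        if not allowed:  # empty mask signals the end of generation
            break
        scores = score_fn(out)
        out.append(max(allowed, key=lambda t: scores.get(t, float("-inf"))))
    return out

# Hypothetical scorer: prefers "travel" first, then a newline.
def score_fn(prefix):
    if prefix:
        return {"sunscreen": 0.1, "towel": 0.1, "\n": 2.0}
    return {"sunscreen": 2.0, "travel": 3.0, "towel": 1.0, "\n": 0.5}

# Constraint mask: "travel" is excluded eagerly, and generation stops after "\n".
def mask_fn(prefix):
    if prefix and prefix[-1] == "\n":
        return set()
    return {"sunscreen", "towel", "\n"}

result = decode_with_mask(score_fn, mask_fn)
```

With an API-only model, the mask cannot be injected mid-decoding like this, which is why per-variable API calls (plus logit_bias workarounds) become necessary.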

To learn more about the limitations of the OpenAI API, see https://docs.lmql.ai/en/latest/language/models.html#openai-api-limitations.

To see how many API calls and tokens LMQL consumes to execute a query, check the query statistics and cost estimates in the LMQL playground, which reports them in real time during execution.

When it comes to the benefits of LMQL over plain API use, it depends on what you are looking for. So far, we have found LMQL most useful for multi-part prompting schemes (e.g. with dynamic insertions in between completions): you don't have to manage multiple API calls yourself, and your code operates at a higher level overall. LMQL also lets you run hundreds of queries in parallel, automatically bundling and batching API calls across all of them. Finally, constraints are a big benefit and a gain in expressiveness, as they go beyond simple filter lists for specific tokens.
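To make the batching point concrete, here is a toy sketch of the idea: many concurrent queries enqueue their requests, and a single worker flushes them as one bundled "API call". This is a simplified illustration (the `Batcher` class and the simulated completion are invented for this example, not LMQL's real internals):

```python
import asyncio

class Batcher:
    """Toy request bundler: concurrent queries enqueue prompts, and one
    background worker serves them in a single batched (here: simulated) call."""
    def __init__(self):
        self.pending = []  # list of (prompt, future) pairs awaiting a flush

    async def complete(self, prompt):
        fut = asyncio.get_running_loop().create_future()
        self.pending.append((prompt, fut))
        return await fut

    async def flush_loop(self):
        while True:
            await asyncio.sleep(0.01)  # collect requests over a short window
            if self.pending:
                batch, self.pending = self.pending, []
                # one simulated batched call serving all queued prompts at once
                for prompt, fut in batch:
                    fut.set_result(prompt.upper())

async def main():
    b = Batcher()
    worker = asyncio.ensure_future(b.flush_loop())
    # five "queries" issued in parallel, served by a single batched flush
    results = await asyncio.gather(*(b.complete(f"query {i}") for i in range(5)))
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return results

results = asyncio.run(main())
```

The benefit is the same as in LMQL's parallel execution: the number of round trips grows with the number of flushes, not with the number of queries.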

I hope this answers your question.