BerriAI / litellm

Python SDK, Proxy Server to call 100+ LLM APIs using the OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Feature]: Support N Choices In Langfuse #4964

Open alexanderepstein opened 1 month ago

alexanderepstein commented 1 month ago

The Feature

I want any generation made with N choices to log all N choices to Langfuse, not just the first.

Motivation, pitch

When logging to Langfuse while using n choices, only the first choice is logged. This can hide how the generation is actually being used. For instance, I want to implement chain of thought + max voting for some structured output; right now I won't see all of the chains of thought or all of the final outputs.

The alternative is to make multiple separate invocations instead of using n choices. The downside is that the prompt's attention matrices are recomputed for every call rather than computed once and reused to produce n generations, which has a cost in both dollars and time.
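
For reference, this is roughly the call pattern in question (a minimal sketch; the model name and prompt are placeholders, and Langfuse credentials are assumed to be set via environment variables):

```python
import litellm

# Log successful calls to Langfuse via litellm's callback mechanism.
litellm.success_callback = ["langfuse"]

response = litellm.completion(
    model="gpt-3.5-turbo",  # placeholder model
    messages=[{"role": "user", "content": "Think step by step: is 391 prime?"}],
    n=3,  # sample three completions from a single prompt pass
)

# All three choices come back on one response object...
for choice in response.choices:
    print(choice.message.content)
# ...but today only response.choices[0] ends up traced in Langfuse.
```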

Twitter / LinkedIn details

No response

Manouchehri commented 1 month ago

Duplicate of #3273 I think? :)

krrishdholakia commented 1 month ago

so log each n as a separate generation? @Manouchehri @alexanderepstein

thinking aloud - where is usage logged? As I understand it, usage is for the entire response object, whereas there would be multiple Langfuse generations (one for each 'n')

alexanderepstein commented 1 month ago

@Manouchehri I don't think it's exactly a duplicate. While I agree Langfuse could seek to handle N choices, even if they did, the implementation in litellm would still only trace the first choice, see here.

@krrishdholakia So my proposal to work around it is to use the span mechanism to group the n generations, making it clear they come from the choices of a single call. So yes, from the litellm perspective it would mean logging the n different generations with the parent observation id set to the span; you can view the image here to see how Langfuse supports grouping other observations under a span.
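
A rough sketch of what that grouping could look like if done by hand with the Langfuse low-level SDK (assuming the v2-style trace/span/generation API; the trace and span names are placeholders, and this illustrates the proposal rather than current litellm behavior):

```python
import litellm
from langfuse import Langfuse  # assumes the Langfuse v2 low-level SDK

messages = [{"role": "user", "content": "Think step by step: is 391 prime?"}]
response = litellm.completion(model="gpt-3.5-turbo", messages=messages, n=3)

langfuse = Langfuse()

# One trace for the request; one span for the single LLM invocation that
# produced all n choices.
trace = langfuse.trace(name="cot-max-voting")
span = trace.span(name="chat-completion", input=messages)

# Each choice becomes a child generation under that span, so every chain of
# thought and final answer is visible in the Langfuse UI.
for i, choice in enumerate(response.choices):
    span.generation(
        name=f"choice-{i}",
        model=response.model,
        input=messages,
        output=choice.message.content,
    )

span.end()
langfuse.flush()
```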

In theory, though, the cost of processing the prompt is only meaningful at the span level, since the prompt is processed once for all of the choices, while the cost of each choice depends on the number of tokens in that particular generation. Of course, the APIs only return total usage, so if you had a way of determining the number of tokens per generation and then tying that to the model used, you could attribute that cost as well. In practice this feels a little heavyweight for logging, so maybe it's best to only apply costs at the span level rather than trying to show them for each individual choice generation.
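
For what it's worth, per-choice cost attribution could in principle be computed along these lines (a sketch assuming litellm's `token_counter` and `cost_per_token` utilities; it also shows why this feels heavyweight compared to just putting cost on the span):

```python
import litellm

messages = [{"role": "user", "content": "Think step by step: is 391 prime?"}]
response = litellm.completion(model="gpt-3.5-turbo", messages=messages, n=3)

# The prompt is processed once for all n choices, so its cost belongs at the
# span level (derived here from the aggregate usage on the response).
prompt_cost, _ = litellm.cost_per_token(
    model="gpt-3.5-turbo",
    prompt_tokens=response.usage.prompt_tokens,
    completion_tokens=0,
)
print(f"span (prompt): ~${prompt_cost:.6f}")

# Each choice's cost depends on its own completion tokens.
for i, choice in enumerate(response.choices):
    completion_tokens = litellm.token_counter(
        model="gpt-3.5-turbo", text=choice.message.content
    )
    _, choice_cost = litellm.cost_per_token(
        model="gpt-3.5-turbo",
        prompt_tokens=0,
        completion_tokens=completion_tokens,
    )
    print(f"choice-{i} (completion): ~${choice_cost:.6f}")
```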