Open alexanderepstein opened 1 month ago
Duplicate of #3273 I think? :)
so log each n as a separate generation? @Manouchehri @alexanderepstein
Thinking aloud - where would usage be logged? As I understand it, usage applies to the entire response object, whereas there would be multiple Langfuse generations (one for each 'n').
@Manouchehri I don't think it's exactly a duplicate. While I agree Langfuse could seek to handle N choices, even if it did, the implementation in litellm would still only trace the first choice; see here.
@krrishdholakia My proposal to work around it is to use the span mechanism to group the n generations, so it's clear they come from the various choices. So yes, from the litellm perspective it would be like logging n separate generations, each with the parent observation id set to the span. You can view the image here to see how Langfuse supports grouping other observations under a span.
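A minimal sketch of the proposed structure, using plain dicts rather than the real Langfuse SDK (the field names like `parent_observation_id` mirror Langfuse's concepts but are assumptions here):

```python
import uuid

def group_choices_as_observations(response):
    """Sketch: one parent span per request, one child generation
    per choice, each pointing back to the span via its parent id.
    (Hypothetical dict model, not the actual Langfuse SDK.)"""
    span_id = str(uuid.uuid4())
    span = {"id": span_id, "type": "SPAN", "name": "n-choices"}
    generations = [
        {
            "id": str(uuid.uuid4()),
            "type": "GENERATION",
            "parent_observation_id": span_id,  # ties each choice to the span
            "output": choice["message"]["content"],
        }
        for choice in response["choices"]
    ]
    return span, generations
```

With this shape, the Langfuse UI grouping would fall out of the parent/child relationship rather than anything choice-specific.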
In theory, though, the cost of processing the prompt only belongs at the span level, since it happens once for all of the choices, while the cost of each choice depends on the number of tokens in that particular generation. Of course, the APIs only return the total cost, so if you had a way of determining the number of tokens per generation and tying that to the model used, you could derive each choice's cost as well. In practice this feels a little heavyweight for logging, so maybe it's best to only apply costs at the span level rather than trying to show them for each individual choice generation.
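To make the split concrete, a small sketch of how the costs could be attributed, assuming you know per-token prices for the model (the prices and token counts below are made up for illustration):

```python
def split_costs(prompt_tokens, completion_tokens_per_choice,
                prompt_price_per_token, completion_price_per_token):
    """Sketch: charge the prompt once at the span level, and charge
    each choice only for its own completion tokens."""
    span_cost = prompt_tokens * prompt_price_per_token
    choice_costs = [tokens * completion_price_per_token
                    for tokens in completion_tokens_per_choice]
    return span_cost, choice_costs
```

For example, a 100-token prompt with two choices of 10 and 20 completion tokens would put the prompt cost on the span and a distinct cost on each generation.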
The Feature
I want any generation that has N choices:
redacted-by-litellm
Motivation, pitch
When logging to Langfuse with n choices, only the first choice is logged, which may hide how the generation is actually used. For instance, I want to implement chain of thought + max voting for some structured output. Right now I can't see all of the chains of thought and all of the final outputs.
The alternative here is to use multiple invocations rather than n choices. The downside is having to recompute the attention matrices multiple times rather than only once while producing n different generations, which has a cost in both dollars and time.
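The billing difference above can be sketched with simple token arithmetic, under the assumption that with n choices the prompt is processed and billed once, while separate calls bill the prompt each time (and assuming equal completion length per generation for simplicity):

```python
def billed_tokens(prompt_tokens, completion_tokens, n, separate_calls):
    """Sketch of the cost tradeoff: n choices share one prompt pass,
    separate calls pay for the prompt on every invocation."""
    if separate_calls:
        return n * (prompt_tokens + completion_tokens)
    return prompt_tokens + n * completion_tokens
```

With a 1000-token prompt, 100-token completions, and n=3, the n-choices route bills 1300 tokens versus 3300 for three separate calls, which is why losing the other choices in the trace is a real cost, not just an inconvenience.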
Twitter / LinkedIn details
No response