Closed: bmanturner closed this issue 8 months ago
Hey @bmanturner, you can add the callback handler on the LLM itself; is that still problematic because you don't know which LLM call to assign the score to?
Correct. We have different LLM calls in each chain
@bmanturner Hmm, I hear that this is a bit confusing. We have been thinking about introducing an abstraction for a group, which would allow you to arbitrarily group LLM calls. If we did that, do you think it would make more sense to attach a score to the parent group, or to provide tools for assigning the same score to all of the LLM calls within a group?
For now, if possible you can just put the score onto one of the LLM calls (maybe the last one for example).
But curious to hear if a group abstraction would be worthwhile for you and what you think the most logical UX for scoring in a group would be.
Our idea for a workaround (although a tedious one) is to generate a UUID per chain call, assign that UUID as metadata to each LLM call within the chain, and then gather the list of request IDs and submit feedback for all of them that way.
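The workaround above could be sketched roughly like this. Note this is a minimal, self-contained illustration: `start_chain_run`, `record_request`, `score_chain_run`, and the `submit_feedback` argument are all hypothetical names, and the feedback call is a stand-in for whatever PromptLayer scoring endpoint you actually use; only the UUID-grouping bookkeeping is the point.

```python
import uuid
from collections import defaultdict

# Map each chain run's UUID to the PromptLayer request IDs it produced.
group_requests: dict[str, list[int]] = defaultdict(list)

def start_chain_run() -> str:
    """Generate one UUID per chain invocation; attach it as metadata to each LLM call."""
    return str(uuid.uuid4())

def record_request(run_id: str, pl_request_id: int) -> None:
    """Collect a request ID from the per-LLM-call callback under the run's UUID."""
    group_requests[run_id].append(pl_request_id)

def score_chain_run(run_id: str, score: int, submit_feedback) -> list[int]:
    """Fan the end-user's score out to every LLM request gathered for this run.

    `submit_feedback` is a hypothetical stand-in for the real scoring call.
    """
    ids = group_requests[run_id]
    for request_id in ids:
        submit_feedback(request_id, score)
    return ids

# Usage: simulate a chain that made three LLM calls, then score the whole run.
run_id = start_chain_run()
for rid in (101, 102, 103):
    record_request(run_id, rid)

scored = []
score_chain_run(run_id, 85, lambda rid, s: scored.append((rid, s)))
print(scored)  # [(101, 85), (102, 85), (103, 85)]
```

The tedious part in practice is wiring `record_request` into every LLM call's callback so the request IDs actually land under the right UUID.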
To your question: for the most granularity, I think feedback on the group as well as on the individual calls would be good, because maybe the chain failed overall but there's still a good example of an LLM call within it that would be useful for training. Maybe by default feedback on the group applies to all the calls within it, but it could be overridden? I'm not certain. Or maybe feedback gets applied to the group only, since it would require more investigation by a user to determine which calls within the group were relevant.
It would be nice to have all the LLM requests in a chain automatically grouped together. We're looking to implement a UI for end-users to score the results of a chain, and in some cases it's difficult to determine which request ID to use for the feedback.
I assumed the PromptLayer Callback Handler could be passed to a chain, but that doesn't seem to be the case.
What do you recommend here?