MagnivOrg / prompt-layer-library

🍰 PromptLayer - Maintain a log of your prompts and OpenAI API requests. Track, debug, and replay old completions.
https://www.promptlayer.com
Apache License 2.0
479 stars 42 forks

[Question] LangChain Chain Callback Handler #57

Closed bmanturner closed 8 months ago

bmanturner commented 11 months ago

It would be nice to have all the LLM requests in a chain automatically grouped together. We're looking to implement a UI for end-users to score the results of a chain, and in some cases it's difficult to determine which request ID to use for the feedback.

I assumed the PromptLayer Callback Handler could be passed to a chain, but that doesn't seem to be the case.

What do you recommend here?

Jped commented 11 months ago

Hey @bmanturner, you can add the callback handler on the LLM itself. Is that still problematic because you don't know which LLM call to give the score to?

bmanturner commented 11 months ago

Correct. We have different LLM calls in each chain.

Jped commented 11 months ago

@bmanturner Hmm, I hear you, this is a bit confusing. We have been thinking about introducing a group abstraction that would let you arbitrarily group LLM calls. If we did that, do you think it would make sense to attach a score to the parent group, or just to have tools that assign the same score to all of the LLM calls within a group?

For now, if possible, you can just put the score onto one of the LLM calls (the last one, for example).

But curious to hear whether a group abstraction would be worthwhile for you, and what you think the most logical UX for scoring a group would be.
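As a rough sketch of that interim suggestion, scoring only the last recorded call might look like this. `submit_score` stands in for whatever actually sends the score to PromptLayer (e.g. a thin wrapper around the Python client's score-tracking call); the function and parameter names here are illustrative, not an existing API:

```python
def score_last_call(request_ids, score, submit_score):
    """Score only the most recent LLM request in a chain run.

    request_ids: PromptLayer request IDs collected during the chain,
    in the order the calls completed.
    submit_score: a callable that sends the score to PromptLayer
    (hypothetical wiring; adapt it to the client you actually use).
    """
    if not request_ids:
        raise ValueError("no LLM requests were recorded for this chain run")
    # Only the final call in the chain gets the user's feedback.
    submit_score(request_id=request_ids[-1], score=score)
```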

bmanturner commented 11 months ago

Our idea for a workaround (although a tedious one) would be to generate a UUID per chain call, assign that UUID as metadata on each LLM call within the chain, and then gather the list of all the request IDs and submit feedback that way.
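That workaround can be sketched roughly like this: one shared UUID per chain run, attached as metadata to every LLM call, with the PromptLayer request IDs collected as they come back. `submit_feedback` is a stand-in for PromptLayer's score endpoint; the class and metadata key are illustrative, not part of any library:

```python
import uuid

class ChainRunGroup:
    """Groups the PromptLayer request IDs produced by one chain call."""

    def __init__(self):
        self.group_id = str(uuid.uuid4())  # shared across the whole chain run
        self.request_ids = []

    def metadata(self):
        # Attach this to every LLM call so requests can be correlated later.
        return {"chain_group_id": self.group_id}

    def record(self, request_id):
        # Call this after each LLM request in the chain completes.
        self.request_ids.append(request_id)

    def score_all(self, score, submit_feedback):
        # Fan the same user score out to every request in the group.
        for rid in self.request_ids:
            submit_feedback(request_id=rid, score=score)
```

The tedious part is threading `record` and `metadata` through every LLM call in the chain by hand, which is exactly what a built-in group abstraction would remove.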

To your question: for the most granularity, I think feedback on both the group and the individual calls would be good, because maybe the chain failed overall but there's still a good example of an LLM call within it that would be useful for training. Maybe by default feedback on the group applies to all the calls within it, but could be overridden per call? I'm not certain. Or maybe feedback gets applied to the group only, since it would require more investigation by a user to determine which calls within the group were relevant.
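One way the "group score applies to all calls, but can be overridden per call" default could work, sketched under the same assumption that `submit_score` wraps whatever real call sends scores to PromptLayer (all names here are illustrative):

```python
def apply_group_feedback(group_score, request_ids, submit_score, overrides=None):
    """Fan a group-level score out to every request in the group.

    overrides: maps request_id -> score for calls the user rated
    individually; those win over the group default.
    """
    overrides = overrides or {}
    for rid in request_ids:
        submit_score(request_id=rid, score=overrides.get(rid, group_score))
```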