confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/
Apache License 2.0

Pre-configuring Multiple Responses for a Conversation #140

Closed: dvirginz closed this issue 6 months ago

dvirginz commented 11 months ago

Hello deepeval maintainers and community,

I am currently working on a project where I am building a chatbot to assist users in buying a product. I want to be able to evaluate the bot's responses in various conversation flows, and I was wondering if the deepeval library supports such a use case.

Here are the three specific flows I'd like to test:

  1. Positive Flow: The user agrees with everything and always responds positively, essentially saying 'yes' to all prompts.
  2. Inquisitive Flow: The user asks 1 to 3 out of 5 possible questions during the interaction.
  3. Human Representative Request: Throughout the conversation, the user consistently asks to speak to a human representative.

Ideally, I would like to set up a mock "user" (which could be another bot) to communicate with the bot we're aiming to send to production. This would simulate these three scenarios and allow us to test our bot's responses.
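
In rough pseudocode, the harness I have in mind looks something like the sketch below. To be clear, `production_bot` and `user_bot` are placeholders for our two LLM-backed callables, not existing deepeval APIs:

```python
# Rough sketch only; `production_bot` and `user_bot` are placeholders for our
# two LLM-backed callables, not existing deepeval APIs.
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (speaker, message) pairs

def simulate_conversation(
    production_bot: Callable[[Transcript], str],
    user_bot: Callable[[Transcript], str],
    max_turns: int = 10,
) -> Transcript:
    """Drive a conversation between the production bot and a mock user bot."""
    transcript: Transcript = []
    for _ in range(max_turns):
        transcript.append(("bot", production_bot(transcript)))
        user_msg = user_bot(transcript)
        transcript.append(("user", user_msg))
        if "goodbye" in user_msg.lower():  # illustrative stop condition
            break
    return transcript
```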

Questions:

  1. Does the deepeval library support this use case of pre-configuring multiple conversation flows?
  2. If yes, could you provide any pointers or documentation links on how to set it up?
  3. If no, do you have any plans in the future roadmap to support such functionality or do you know of any other tools/libraries that might assist with this?

Thank you for your time and looking forward to your response!

ColabDog commented 11 months ago

We currently don't have a way of pre-configuring multiple conversation flows, but that sounds a lot like having an intent classifier for each flow. I find this use case compelling: chatbot applications are one of the main things people are building with LLMs, and I think the way you broke the problem down is quite instructive.

Off the top of my head, here is how I think we could develop a way to test for this:

Would this method of e2e testing be appropriate? If so, I can start sketching out a design.

This isn't yet on our roadmap (our focus has instead been on growing the number of metrics), but I believe it should be! Happy to book a call to discuss further :) https://calendly.com/jacky-twilix/30min

dvirginz commented 11 months ago

Hi Jacky,

Thank you for your thorough response!

Just to clarify, after reading your message several times, it seems you might partially support this feature. Do you have a "user-bot" that can mimic the 'client' for multiple messages?

If so, we can evaluate whether the conversation ended successfully on our own. We just don’t want to enter a "client-bot" development cycle and connect it to a CI/CD pipeline if observability platforms like yours already support this.

ColabDog commented 11 months ago

Unfortunately we don't. It's something we're thinking about adding to our roadmap; I will create a ticket that goes into this in more detail.

dvirginz commented 11 months ago

Thank you very much.

When we think of CI/CD for LLMs (at least for our use case), this is the only way we see ourselves validating our bots.

ColabDog commented 11 months ago

@dvirginz That sounds interesting. Can you share a few details about the chatbot stack (LangChain/LlamaIndex/etc.)? Would love to learn more.

dvirginz commented 11 months ago

Yes: LangChain + Flask, integrated with GitHub Actions.

It's important to mention the web framework and the CI/CD tool, as our ultimate goal is to find a CI/CD tool for our LLMs.

Currently, our most pressing observability need is ensuring that when we push new PRs, we don't break chatbot logic that has been previously validated. This validation is currently being done manually (QA).
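
For concreteness, the kind of regression test we would have GitHub Actions run (e.g. via pytest) might look like the sketch below; `build_chatbot_reply` is a placeholder for our real Flask/LangChain handler:

```python
# Illustrative regression test run by CI (e.g. `pytest tests/`);
# `build_chatbot_reply` is a placeholder for our real Flask/LangChain handler.
def build_chatbot_reply(message: str) -> str:
    # In the real suite this calls the LangChain chain behind our Flask route.
    return "Sure, let me connect you to a human agent."

def test_human_handoff_is_offered():
    """Previously QA-validated behaviour: asking for a human triggers a handoff."""
    reply = build_chatbot_reply("I want to talk to a human representative")
    assert "human" in reply.lower() or "agent" in reply.lower()
```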

ColabDog commented 11 months ago

Got it - I think I have a fairly neat solution for you :)

You can consider the following:

  1. Positive Flow: The user agrees with everything and always responds positively, essentially saying 'yes' to all prompts. You can ensure this by running a factual consistency check for "yes" on every answer.

  2. Inquisitive Flow: The user asks 1 to 3 out of 5 possible questions during the interaction. You can ensure this with a classifier that detects whether a sentence is a question; it should be fairly simple to train one.

  3. Human Representative Request: Throughout the conversation, the user consistently asks to speak to a human representative. You can ensure this by running a factual consistency check of whether the user wants to speak to a customer support agent, to see whether the flow is breaking.

Here is the documentation for factual consistency: https://docs.confident-ai.com/docs/measuring_llm_performance/factual_consistency
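
For reference, a minimal check along those lines might look like the sketch below. Import paths and argument names are assumptions based on that docs page and may differ across deepeval versions:

```python
# Sketch only: import paths and argument names are assumptions based on the
# docs page above and may differ across deepeval versions.
from deepeval.metrics.factual_consistency import FactualConsistencyMetric
from deepeval.run_test import assert_test
from deepeval.test_case import LLMTestCase

metric = FactualConsistencyMetric(minimum_score=0.7)
test_case = LLMTestCase(
    query="Can I speak to a human representative?",
    output="Of course, let me connect you with one of our agents.",
    context="The user wants to speak to a customer support agent.",
)
assert_test(test_case, [metric])
```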

dvirginz commented 11 months ago

Hi Friends,

Thank you once again for your responsiveness.

While the direction you've suggested is promising, our requirements are a tad more intricate, much like life often is. Let's consider we're building a fashion chatbot. This bot would ask users about their size, style, and preferred color. In such a context, a straightforward response might be "38, casual, red", and the sequence doesn't necessarily matter.

A more complex query might be, "Which style would best fit a 25-year-old with a size 40?"

Getting in touch with a representative is straightforward.
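
To make the "sequence doesn't matter" point concrete, here is an illustrative order-independent check; the naive `extract_slots` below is just a stand-in for whatever parsing the client bot actually does:

```python
# Illustrative order-independent slot check for the fashion example; the naive
# extract_slots below is a stand-in for whatever parsing the client bot does.
import re

def extract_slots(reply: str) -> dict:
    slots = {}
    if (size := re.search(r"\b(\d{2})\b", reply)):
        slots["size"] = size.group(1)
    for style in ("casual", "formal", "sporty"):
        if style in reply.lower():
            slots["style"] = style
    for color in ("red", "blue", "black"):
        if color in reply.lower():
            slots["color"] = color
    return slots

# The same slots should be recovered regardless of answer order.
assert extract_slots("38, casual, red") == extract_slots("red, casual, 38")
```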

For scenarios like the ones mentioned above, the client would also need to be a bot to handle these interactions. Building the client flow isn't challenging; in fact, we've already accomplished that. However, crafting the boilerplate that drives the communication between the two bots, and then concludes with another agent checking that the conversation met expectations, is a task we assumed an observability tool would handle.
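
Roughly the kind of boilerplate we mean, sketched below; `llm` is a placeholder for any chat-completion call (we happen to use LangChain):

```python
# Roughly the boilerplate we mean; `llm` is a placeholder for any
# chat-completion call (we happen to use LangChain).
from typing import Callable, List, Tuple

def judge_conversation(
    llm: Callable[[str], str],
    transcript: List[Tuple[str, str]],
    expectation: str,
) -> bool:
    """Ask a third 'judge' agent whether a finished conversation met expectations."""
    rendered = "\n".join(f"{speaker}: {msg}" for speaker, msg in transcript)
    verdict = llm(
        f"Conversation:\n{rendered}\n\n"
        f"Did this conversation satisfy the expectation '{expectation}'? Answer YES or NO."
    )
    return verdict.strip().upper().startswith("YES")
```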

If incorporating this isn't on your roadmap, I think we might take the initiative and develop it ourselves.

ColabDog commented 11 months ago

@dvirginz

Thanks for explaining that. We are currently ironing out our chatbot roadmap due to a few requests in this area. Would love to share with you what we're thinking and see how it works for your use case too. What's the best way to get in touch?

My email is jacky@confident-ai.com if you would like to continue the conversation there.

dvirginz commented 11 months ago

Super nice of you to be interested in our use case like that.

I've reached out through email. :)
