dotnet / ai-samples


LLMEval - Project to run evaluations on an LLM #49

Open · elbruno opened this issue 2 months ago

elbruno commented 2 months ago

The [llm-eval] folder includes:

jmatthiesen commented 2 months ago

Thanks, @elbruno! Am I correct in thinking that for now this focuses on evaluating results with different models, but not necessarily evaluating custom prompts with these models?

One common flow is to take a set of source data that will be used with RAG in an application and generate an initial set of ground truth data from it, to help speed up getting started with curating ground truth data. How would I do that with this solution? For example, if I have insurance benefit documentation that I want to use in my app, how could I use this solution to create some initial ground truth data based on that documentation?

Related - how do you think we'd use this to help evaluate custom prompts/chat backends?
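To make that flow concrete, here's roughly the kind of helper I picture: chunk the source documents, then ask a model to draft a question/answer pair per chunk as a starting point for curation. This is only a sketch; none of these type or member names come from this repo, and the model call is abstracted behind a delegate rather than any particular SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Illustrative only: these names are placeholders, not part of the LLMEval project.
public record GroundTruthItem(string Question, string Answer, string SourceChunk);

public class GroundTruthGenerator
{
    // Abstracts the chat model call (Semantic Kernel, OpenAI SDK, Azure OpenAI, etc.).
    private readonly Func<string, Task<string>> _completeAsync;

    public GroundTruthGenerator(Func<string, Task<string>> completeAsync)
        => _completeAsync = completeAsync;

    public async Task<List<GroundTruthItem>> GenerateAsync(IEnumerable<string> documentChunks)
    {
        var items = new List<GroundTruthItem>();
        foreach (string chunk in documentChunks)
        {
            // Ask the model to draft one Q/A pair grounded in this chunk.
            string prompt =
                "From the following insurance benefit text, write one question a member " +
                "might ask and the correct answer, separated by a '|' character.\n\n" + chunk;

            string raw = await _completeAsync(prompt);
            string[] parts = raw.Split('|', 2);
            if (parts.Length == 2)
                items.Add(new GroundTruthItem(parts[0].Trim(), parts[1].Trim(), chunk));
        }
        return items; // review and curate these by hand before treating them as ground truth
    }
}
```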

elbruno commented 1 month ago

> Thanks, @elbruno! Am I correct in thinking that for now this focuses on evaluating results with different models, but not necessarily evaluating custom prompts with these models?
>
> One common flow is to take a set of source data that will be used with RAG in an application and generate an initial set of ground truth data from it, to help speed up getting started with curating ground truth data. How would I do that with this solution? For example, if I have insurance benefit documentation that I want to use in my app, how could I use this solution to create some initial ground truth data based on that documentation?
>
> Related - how do you think we'd use this to help evaluate custom prompts/chat backends?

Got the point, and yes, for a detailed scenario or industry a tweak is needed, similar to what @mahomedalid is doing now. A possible approach would be to add a two-paragraph section on "How to create a new custom evaluator", including a custom prompt and a couple of lines of C# code to use it.
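As a rough sketch of what that section could show (the names below are placeholders, not the actual LLMEval types), a custom evaluator is basically a scoring prompt plus a call to the model:

```csharp
using System;
using System.Threading.Tasks;

// Illustrative only: class and member names are made up for this comment.
public class CustomGroundednessEvaluator
{
    // The scoring prompt; the criteria would be tweaked for the specific
    // scenario or industry (e.g. insurance benefit documentation).
    private const string PromptTemplate =
        @"You are an evaluator. Given a QUESTION, a CONTEXT and an ANSWER,
score from 1 to 5 how well the ANSWER is grounded in the CONTEXT.
Reply with the number only.

QUESTION: {0}
CONTEXT: {1}
ANSWER: {2}";

    // Abstracts the chat model call so the sketch stays independent of any SDK.
    private readonly Func<string, Task<string>> _completeAsync;

    public CustomGroundednessEvaluator(Func<string, Task<string>> completeAsync)
        => _completeAsync = completeAsync;

    public async Task<int> ScoreAsync(string question, string context, string answer)
    {
        string prompt = string.Format(PromptTemplate, question, context, answer);
        string raw = await _completeAsync(prompt);

        // Be defensive: models sometimes wrap the number in extra text.
        return int.TryParse(raw.Trim().Split(' ')[0].TrimEnd('.'), out int score) ? score : 0;
    }
}
```

The two paragraphs would then explain how to adjust the prompt criteria for a given domain and how to plug the evaluator into the rest of the sample.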

Thoughts on that?