elbruno opened 2 months ago
Thanks, @elbruno! Am I correct in thinking that for now this focuses on evaluating results with different models, but not necessarily evaluating custom prompts with these models?
One common flow is to have a set of source data that will be used with RAG in an application, and to generate an initial set of ground truth data from it - to help speed up getting started with curating ground truth data. How would I do that with this solution? For example, if I have insurance benefit documentation that I want to use in my app, how could I use this solution to create some initial ground truth data based on that documentation?
Related - how do you think we'd use this to help evaluate custom prompts/chat backends?
Got the point, and yes, for a detailed scenario or industry a tweak is needed, similar to what @mahomedalid is doing now. A possible workaround would be to add a two-paragraph section on "How to create a new custom evaluator", which would include a custom prompt and a couple of C# lines of code.
Thoughts on that?
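As a rough illustration of what that section could cover, a custom evaluator for the insurance-benefits scenario might pair a domain-specific prompt with a small C# class. Everything below is a hypothetical sketch: `IChatClient`, `CompleteAsync`, and the class name are placeholders, not the sample's actual API.

```csharp
// Hypothetical custom evaluator sketch - names and interfaces are
// illustrative placeholders, not part of the llm-eval sample's real API.
public class InsuranceBenefitsEvaluator
{
    // Custom prompt tailored to the insurance-benefits domain.
    private const string EvaluatorPrompt = """
        You are an evaluator for an insurance-benefits assistant.
        Given a QUESTION, the expected ANSWER, and the model RESPONSE,
        score the RESPONSE from 1 (incorrect) to 5 (fully correct and
        grounded in the benefits documentation). Reply with the number only.
        """;

    private readonly IChatClient _chatClient; // assumed chat abstraction

    public InsuranceBenefitsEvaluator(IChatClient chatClient)
        => _chatClient = chatClient;

    public async Task<int> EvaluateAsync(
        string question, string groundTruth, string response)
    {
        var userMessage =
            $"QUESTION: {question}\nANSWER: {groundTruth}\nRESPONSE: {response}";

        // Send the evaluator prompt plus the case under test to the model.
        var result = await _chatClient.CompleteAsync(EvaluatorPrompt, userMessage);

        // Parse the model's numeric score; fall back to 0 if unparseable.
        return int.TryParse(result.Trim(), out var score) ? score : 0;
    }
}
```

The same shape could serve the ground-truth generation flow discussed above: swap the scoring prompt for one that asks the model to propose question/answer pairs from a chunk of the benefits documentation, then have a human curate the output.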
The [llm-eval] folder includes: