Closed by dearden 2 months ago
Overview
We've researched several GenAI evaluation tools (#38), but the scripts written so far have only been toy examples.
We should write a script that compares prompts on representative data and that we can reuse to evaluate new prompts and new models.
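As a starting point for discussion, here is a minimal sketch of what such a comparison script could look like. Everything in it is an assumption for illustration: the names `compare_prompts`, `exact_match`, and `stub_model` are hypothetical, the dataset is a two-row placeholder rather than representative data, and the model is a stub that should be swapped for a real model call. It does not use any of the tools looked at in #38.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical data shape: (input text, expected output) pairs.
Example = Tuple[str, str]


def compare_prompts(
    prompts: Dict[str, str],
    dataset: List[Example],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return the mean score of each prompt template over the dataset.

    Each template must contain an `{input}` placeholder.
    """
    results: Dict[str, float] = {}
    for name, template in prompts.items():
        scores = [
            score(generate(template.format(input=text)), expected)
            for text, expected in dataset
        ]
        results[name] = mean(scores)
    return results


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def stub_model(prompt: str) -> str:
    """Placeholder for a real model call: returns the last line of the prompt."""
    return prompt.splitlines()[-1]


# Placeholder data and prompts, for illustration only.
dataset: List[Example] = [("paris", "paris"), ("rome", "rome")]
prompts = {
    "echo": "Repeat the text.\n{input}",
    "verbose": "{input}\nRepeat the text above.",
}

results = compare_prompts(prompts, dataset, stub_model, exact_match)
print(results)
```

Passing `generate` and `score` in as functions is what would make the script reusable: a new model means a new `generate`, a new metric means a new `score`, and a new prompt is just another entry in `prompts`.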
Requirements