OpenFn / apollo

GNU Lesser General Public License v2.1

job chat: Add a prompt testing process #108

Open hanna-paasivirta opened 1 week ago

hanna-paasivirta commented 1 week ago

New prompts should be tested to evaluate their performance and minimise unexpected issues in production. This will likely involve accumulating generated test datasets targeting different issues, as well as using LLM-based evaluation to check whether each test passed (true/false) and produce an overall score.
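A minimal sketch of the loop described above: run each test case through the prompt under test, ask a judge to return pass/fail, and average the results into a score. The `generate` and `judge` functions here are hypothetical stubs; in practice both would call an LLM (the judge performing the T/F evaluation), and the test cases would come from the accumulated datasets.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PromptTest:
    input: str      # input fed to the prompt under test
    criterion: str  # what the judge should check for in the output


def evaluate_prompt(
    tests: List[PromptTest],
    generate: Callable[[str], str],
    judge: Callable[[PromptTest, str], bool],
) -> float:
    """Run every test case and return the fraction that passed."""
    results = [judge(t, generate(t.input)) for t in tests]
    return sum(results) / len(results)


# Hypothetical stand-ins for LLM calls, for illustration only.
tests = [
    PromptTest("fetch patient records", "mentions get()"),
    PromptTest("upsert patient records", "mentions upsert()"),
]
generate = lambda prompt: f"use get() for: {prompt}"
# Placeholder judge: naive keyword check instead of an LLM T/F evaluation.
judge = lambda test, output: test.criterion.split()[-1] in output

score = evaluate_prompt(tests, generate, judge)
print(score)  # 0.5
```

The score gives a single number to track across prompt revisions, so regressions show up as a drop even before anything reaches production.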

josephjclark commented 1 week ago

For the record, I would be happy with a manual test process along these lines:

We may also need to factor in drift from the LLM side itself: as, e.g., Anthropic updates its models, I don't know how tightly we can version-lock, so we may see some natural variance in results.