Right now some of the tests are flaky; data extraction is a good example. Instead of running each test once, we should have a best-of-N testing setup for any AI response: run the same test N times and score each response at several evaluation points.
We can store the individual results (for example, the details extraction returned name and gender but not age) as well as the overall success rates.
We can't expect a 100% success rate, but having a benchmark or gradient to test against would make prompt engineering viable and far more efficient.
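
To make the proposal concrete, here is a minimal sketch of what such a harness could look like. All of the names (`evaluate_response`, `run_trials`, `scripted_extract`) and the sample data are hypothetical, and the scripted stand-in replaces the real AI call so that the example is deterministic. It treats each expected field as one evaluation point and reports both per-field and overall success rates.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Dict, List

Response = Dict[str, object]


def evaluate_response(response: Response, expected: Response) -> Dict[str, bool]:
    """Score one response: one evaluation point per expected field."""
    return {field: response.get(field) == value for field, value in expected.items()}


def run_trials(extract: Callable[[], Response], expected: Response, n: int) -> dict:
    """Call `extract` n times and aggregate per-field and overall success rates."""
    trials = [evaluate_response(extract(), expected) for _ in range(n)]
    passes: Counter = Counter()
    for trial in trials:
        for field, ok in trial.items():
            passes[field] += int(ok)
    return {
        "trials": trials,  # individual results, kept for later inspection
        "field_rates": {field: passes[field] / n for field in expected},
        "overall_rate": sum(all(t.values()) for t in trials) / n,
    }


def scripted_extract(responses: List[Response]) -> Callable[[], Response]:
    """Hypothetical stand-in for the AI call: replays canned responses in order."""
    replay = cycle(responses)
    return lambda: next(replay)


# Assumed sample data: the third response is missing the age field.
expected = {"name": "Alice", "gender": "female", "age": 34}
canned = [
    {"name": "Alice", "gender": "female", "age": 34},
    {"name": "Alice", "gender": "female", "age": 34},
    {"name": "Alice", "gender": "female"},
    {"name": "Alice", "gender": "female", "age": 34},
]

report = run_trials(scripted_extract(canned), expected, n=4)
print(report["field_rates"])   # {'name': 1.0, 'gender': 1.0, 'age': 0.75}
print(report["overall_rate"])  # 0.75
```

A test could then assert that `report["overall_rate"]` stays above an agreed threshold instead of requiring every single run to pass, and the stored `trials` would show which fields regress when a prompt changes.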