JoinTheAlliance / bgent

Flexible, scalable and customizable agents to do your bidding.
https://bgent.org

Threshold testing #3

Closed lalalune closed 4 months ago

lalalune commented 4 months ago

Right now some of the tests are flaky; data extraction is a good example. Instead of a single pass/fail run, we should have a best-of-N testing setup for any test that depends on AI responses, with several evaluation points per response.

We can store the result of each individual response (for example, the details extraction returns name and gender but not age) as well as overall success rates.
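A minimal sketch of what that could look like, in Python for brevity. The names `run_trials`, `generate`, and `checks` are hypothetical and not part of bgent; `generate` stands in for whatever call produces one AI response, and each check is one evaluation point (name, gender, age in the details example).

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Dict


def run_trials(
    generate: Callable[[], dict],
    checks: Dict[str, Callable[[dict], bool]],
    n: int,
) -> dict:
    """Call `generate` n times and score every response against every check.

    Hypothetical helper: returns the per-trial outcomes plus per-check and
    overall pass rates, so partial successes are not lost.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")

    passes = Counter()  # how many trials each check passed
    all_passed = 0      # how many trials passed every check
    records = []        # per-trial outcome, e.g. {"name": True, "age": False}

    for _ in range(n):
        response = generate()
        outcome = {name: bool(check(response)) for name, check in checks.items()}
        records.append(outcome)
        passes.update(name for name, ok in outcome.items() if ok)
        all_passed += all(outcome.values())

    return {
        "trials": n,
        "per_check": {name: passes[name] / n for name in checks},
        "overall": all_passed / n,
        "records": records,
    }


# Stand-in for the model: canned responses, one of which is missing the age.
fake_responses = cycle([
    {"name": "Ada", "gender": "female", "age": 36},
    {"name": "Ada", "gender": "female", "age": None},
    {"name": "Ada", "gender": "female", "age": 36},
    {"name": "Ada", "gender": "female", "age": 36},
])

report = run_trials(
    generate=lambda: next(fake_responses),
    checks={
        "name": lambda r: bool(r.get("name")),
        "gender": lambda r: bool(r.get("gender")),
        "age": lambda r: r.get("age") is not None,
    },
    n=4,
)
print(report["per_check"], report["overall"])
```

With the canned responses above, the name and gender checks pass in every trial while the age check passes in three of four, so the report separates "age is the weak spot" from "the whole extraction is broken".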

We can't expect a 100% success rate, but having some benchmark or gradient to test against would make prompt engineering viable and far more efficient.
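One way to turn success rates into a pass/fail signal is to compare each evaluation point against a minimum acceptable rate. This is only a sketch: `failing_checks` is a hypothetical helper, and the rates and thresholds below are illustrative numbers, not measurements from bgent.

```python
from typing import Dict, Tuple


def failing_checks(
    rates: Dict[str, float],
    thresholds: Dict[str, float],
) -> Dict[str, Tuple[float, float]]:
    """Return {check: (observed_rate, required_rate)} for checks below threshold.

    A check with no observed rate is treated as 0.0, so it fails any
    positive threshold.
    """
    return {
        name: (rates.get(name, 0.0), minimum)
        for name, minimum in thresholds.items()
        if rates.get(name, 0.0) < minimum
    }


# Illustrative numbers only.
observed = {"name": 1.0, "gender": 0.9, "age": 0.6}
required = {"name": 0.9, "gender": 0.9, "age": 0.8}

failures = failing_checks(observed, required)
print(failures)
```

A test would then assert that `failures` is empty. Tracking the observed rates over time gives the gradient: a prompt change that moves the age rate from 0.6 to 0.7 is visible progress even though the threshold is still not met.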