Benchmark: failure mode assessment

biocypher / biochatter

Backend library for conversational AI in biomedicine

http://biochatter.org/

MIT License

Benchmark: failure mode assessment #149

Open slobentanzer opened 2 months ago

slobentanzer commented 2 months ago

For cases of bad performance in particular, it would be good to have an automated way of getting a rough idea of the failure mode: were the instructions not understood, was the system prompt not followed, or was an answer attempted but wrong?

This could be assessed by a secondary LLM.
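
A minimal sketch of how a secondary LLM could assign one of the failure modes named above. Nothing here is existing biochatter API: `FailureMode`, `build_judge_prompt`, `parse_label`, and `assess_failure` are hypothetical names, the prompt wording is an assumption, and the judge model is passed in as a plain callable (prompt string in, reply string out) so that any backend could be plugged in.

```python
from enum import Enum
from typing import Callable


class FailureMode(Enum):
    """Hypothetical labels, taken from the three cases named in this issue."""
    INSTRUCTIONS_NOT_UNDERSTOOD = "instructions_not_understood"
    SYSTEM_PROMPT_NOT_FOLLOWED = "system_prompt_not_followed"
    ATTEMPTED_BUT_WRONG = "attempted_but_wrong"
    UNCLEAR = "unclear"  # fallback when the judge reply cannot be parsed


# Assumed prompt wording; a real benchmark would likely need to tune this.
JUDGE_TEMPLATE = (
    "A model failed a benchmark case. Classify the failure.\n"
    "System prompt given to the model:\n{system_prompt}\n\n"
    "Question:\n{question}\n\n"
    "Expected answer:\n{expected}\n\n"
    "Model response:\n{response}\n\n"
    "Reply with exactly one label from: {labels}"
)


def build_judge_prompt(system_prompt: str, question: str,
                       expected: str, response: str) -> str:
    """Fill the judge template with one failed benchmark case."""
    labels = ", ".join(mode.value for mode in FailureMode)
    return JUDGE_TEMPLATE.format(
        system_prompt=system_prompt,
        question=question,
        expected=expected,
        response=response,
        labels=labels,
    )


def parse_label(raw: str) -> FailureMode:
    """Map the judge's free-text reply to a label, tolerating case and padding."""
    cleaned = raw.strip().lower()
    for mode in FailureMode:
        if mode.value in cleaned:
            return mode
    return FailureMode.UNCLEAR


def assess_failure(judge: Callable[[str], str], system_prompt: str,
                   question: str, expected: str, response: str) -> FailureMode:
    """Ask the secondary LLM (any callable: prompt -> reply) for a failure mode."""
    prompt = build_judge_prompt(system_prompt, question, expected, response)
    return parse_label(judge(prompt))
```

Keeping the judge as an injected callable means the classification logic can be tested with a stub, and the `unclear` fallback keeps unparseable judge replies visible in the results instead of forcing them into one of the three categories.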