Open slobentanzer opened 2 months ago
For cases of bad performance in particular, it would be good to have an automated way of getting a rough idea of the failure mode: were the instructions not understood, was the system prompt not followed, or was the answer attempted but wrong?
This could be assessed by a secondary LLM.
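A rough sketch of what this could look like, to make the idea concrete. Everything here is hypothetical and not part of any existing API: the `FailureMode` labels, `build_judge_prompt`, `classify_failure`, and the `judge` callable (assumed to wrap whatever secondary LLM is used and return its text reply) are illustrative names only.

```python
from enum import Enum
from typing import Callable


class FailureMode(Enum):
    # Hypothetical labels mirroring the three failure modes named above.
    INSTRUCTIONS_NOT_UNDERSTOOD = "instructions_not_understood"
    SYSTEM_PROMPT_NOT_FOLLOWED = "system_prompt_not_followed"
    ATTEMPTED_BUT_WRONG = "attempted_but_wrong"
    UNKNOWN = "unknown"  # fallback when the judge reply cannot be parsed


def build_judge_prompt(system_prompt: str, question: str,
                       expected: str, response: str) -> str:
    """Assemble the prompt sent to the secondary (judge) LLM."""
    labels = ", ".join(
        m.value for m in FailureMode if m is not FailureMode.UNKNOWN
    )
    return (
        "A model answered a benchmark question incorrectly.\n"
        f"System prompt: {system_prompt}\n"
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Model response: {response}\n"
        f"Reply with exactly one of these labels: {labels}"
    )


def classify_failure(judge: Callable[[str], str], system_prompt: str,
                     question: str, expected: str,
                     response: str) -> FailureMode:
    """Ask the judge for a label and map its reply onto FailureMode."""
    prompt = build_judge_prompt(system_prompt, question, expected, response)
    raw = judge(prompt).strip().lower()
    for mode in FailureMode:
        if mode.value in raw:
            return mode
    return FailureMode.UNKNOWN
```

Passing the judge in as a plain callable keeps the sketch independent of any particular LLM client, and the `UNKNOWN` fallback means an unparseable judge reply is recorded rather than silently mislabeled. Running this only on failed test cases would give a rough per-model tally of failure modes.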