Requires setting up good testing benchmarks for each of the agents, implemented as pytest tests.
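
A minimal sketch of what one such benchmark could look like, assuming a hypothetical `run_agent` entry point and a `search_agent` to exercise; the module path, agent name, and keyword checks are placeholders, not the actual API:

```python
import pytest

# Hypothetical entry point for invoking a single agent; replace with the
# project's real interface.
from myproject.agents import run_agent  # assumption: placeholder module/function

# Each case pairs a prompt with a keyword we expect in a correct answer.
BENCHMARK_CASES = [
    ("What is the capital of France?", "paris"),
    ("Summarize the latest release notes.", "release"),
]

@pytest.mark.parametrize("prompt,expected_keyword", BENCHMARK_CASES)
def test_search_agent_benchmark(prompt, expected_keyword):
    """Smoke-level quality check for a single agent."""
    response = run_agent("search_agent", prompt)
    assert expected_keyword in response.lower()
```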
Then we can set up a GitHub Actions workflow to automatically run these benchmarks on PRs.
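
A sketch of that workflow, assuming the benchmarks live under `tests/benchmarks/`, are marked with a `benchmark` marker, and that the project installs with test extras; paths, versions, and the secret name are placeholders:

```yaml
# .github/workflows/agent-benchmarks.yml
name: Agent benchmarks

on:
  pull_request:

jobs:
  benchmarks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # assumption: test extras pull in pytest and the agent dependencies
      - run: pip install -e ".[test]"
      # assumption: benchmark tests are tagged with @pytest.mark.benchmark
      - run: pytest tests/benchmarks -m benchmark
        env:
          # assumption: name of the LLM provider key depends on the model in use
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```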
Will ensure that any changes we make, whether to the prompts, the LLM model, or anything else, don't degrade the overall experience and that the LLM can still delegate to the appropriate agents.
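
For the delegation side specifically, a routing test could assert which sub-agent the orchestrator picks rather than grading the final answer; `route_request` and the agent names below are hypothetical:

```python
import pytest

# Hypothetical orchestrator hook that returns the name of the agent the LLM
# delegated to; the real project may expose this differently.
from myproject.orchestrator import route_request  # placeholder import

ROUTING_CASES = [
    ("Find recent papers on retrieval-augmented generation", "search_agent"),
    ("Refactor this function to use async IO", "code_agent"),
]

@pytest.mark.parametrize("prompt,expected_agent", ROUTING_CASES)
def test_llm_delegates_to_expected_agent(prompt, expected_agent):
    """Guards against prompt or model changes breaking delegation."""
    chosen_agent = route_request(prompt)
    assert chosen_agent == expected_agent
```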