Closed tadamcz closed 3 weeks ago
Yeah, we don't run all the tests in CI because it costs $ to use those APIs and the slow tests are really slow so undesirably hinder responsiveness of the CI pipeline. We do have a mockllm class that we use to exercise most all of the code without requiring the $. I think a couple of things we could do to improve things are:
(1) See if there are tests currently not using mockllm that could. (2) See if there are tests using openai b/c we need a "real" model, and switch those to something free like ollma.
As far as what that leaves us with, we don't take an absolutist position that "if it doesn't run in CI its not very useful". The provider specific tests are extremely useful even though we don't run them on every CI pass as they give us a line of defense against breakage in those packages/services.
Completely agree with your reasons not to run these tests in CI as written :)
See if there are tests currently not using mockllm that could.
I think mocking is one option, but another is testing smaller units of code.
For a small example inspired by https://github.com/UKGovernmentBEIS/inspect_ai/issues/695, in addition to the end-to-end test tests.tools.test_tools.check_tools_calls
which would require API access to be tested with Google models, you could have a test for the function inspect_ai.model._providers.google.chat_tools
, which doesn't require any API access to be called.
Your codebase almost never uses classes that have a significant amount of state. It uses lots of small functions. This approach has pros and cons (which I don't want to discuss here), but one huge pro is that such code is much more easily testable 🙂. It would be great to take more advantage of that! Much of the code in providers and models would be completely testable without API keys.
This isn't something you probably can or want to change in one day, but I think it might be a good direction to go in over time.
Yeah, we could definitely write more unit tests (we tend to rely a lot on scenario tests w/ more of an emphasis on unit tests for code that does complicated transformations and/or could surprise in other ways).
We have now used mockllm in place of openai wherever possible: https://github.com/UKGovernmentBEIS/inspect_ai/pull/734
Now the only tests skipped are the "slow" ones and ones that depend on having an API token for a commercial service.
Thanks for https://github.com/UKGovernmentBEIS/inspect_ai/pull/734! What about the following:
Could (some of) the tests that use these use mockllm as well?
No unfortunately they can't. Those are all tests of the actual service interface/interactions.
The reason we had a bunch of OpenAI tests left over that could be mocked was that mockllm didn't exist at the inception of the project, so our stand-in LLM where we needed generate called as part of the test was OpenAI. I thought most of these had already been migrated but turns out there were some holdovers that we missed.
Note that even though we don't run those with every commit on CI the main developers do run them quite frequently (I probably run the full suite locally at least 2 or 3 times a day,). We just don't want CI to take > 10 minutes (--runslow
) and we don't want to pay $ for tokens with every commit (--runapi
). Note also that --runapi
was a user contribution b/c they found it quite displeasing and surprising that we were using tokens by default on local tests (i.e. they just ran the test suite and we spent their $, not cool!). So we don't run against APIs by default either locally or on CI (for different reasons) but again we still get plenty of mileage under these tests.
I noticed large numbers of tests are not run in CI. For example, on this latest run:
Some examples of tests that are skipped:
pytest.mark.slow
tests/tools/test_tools.py
) are skipped because API keys would be required to run themtests/model
) and provider (tests/provider
) tests are skipped because API keys would be required to run themHowever, I think it's perfectly possible to test this code, with well-written tests. The solution is to write more targeted tests that test smaller units of code (in addition to end to end tests). These tests wouldn't depend on anything external and could run in CI, fast.
In my opinion, tests that aren't systematically run in CI are of very reduced usefulness.