mrocklin opened this issue 8 months ago
I'd love to get things down to root level, but that requires being careful in a way that I think is unlikely in the next couple of days. With that in mind I'll suggest the following changes:

- Rename `tests` to `benchmarks`
- Rename `tests/benchmarks` to `benchmarks/general`
- Remove `tests/runtime` (unless people actively look at this)
- I'm not sure what to do with `tests/workflows`. Do we use these today?

These seem inoffensive to me and like a good small step in this direction. I may be underestimating the costs of small steps though. In particular I'm not sure how expensive it is to construct alembic migrations.
> I'm not sure what to do with `tests/workflows`. Do we use these today?
I consider those rather integration tests than anything else. For instance, we run a snowflake thing on coiled. If this lights up for whatever reason, we should get notified.
From that POV, other items in that folder are less interesting and I'd be fine with nuking them and moving the snowflake test to `stability` instead (or `integration`).
> In particular I'm not sure how expensive it is to construct alembic migrations.
It's cheap and not difficult, just a little annoying. We also don't have any tests for the migrations, but if we mess this up, we can "restore" the database from an earlier GH Actions run (assuming we notice in time; FWIW we never had to do this).
```shell
alembic revision -m "Rename foo to bar"
```

generates a new file in `alembic/versions` with the appropriate revisions already set. We then just have to write some simple SQL that describes the changes.
For example, we moved the H2O tests once from one directory to another. The SQL for that was
```sql
update test_run
set path = 'benchmarks/test_h2o.py'
where path = 'benchmarks/h2o/test_h2o_benchmarks.py';
```
https://github.com/coiled/benchmarks/blob/main/alembic/versions/9d6f8ea24ee1_move_h2o_tests.py
> From that POV, other items in that folder are less interesting and I'd be fine with nuking them and moving the snowflake test to `stability` instead (or `integration`)
+1 on bundling `stability` and `workflows` in `integration`.
I'm in favor of running the test suite nightly instead of on every push. I'm just wondering how hard it would be to run only the tests in the files that were touched in a commit on every push. My guess is that it's non-trivial, so we shouldn't invest time into this, but maybe @crusaderky has a trick up his sleeve?
> We also want to make it easy to run benchmarks when we are in other repos. This is the main value of the A/B test infrastructure. That is the real use and power of this repository, not testing within itself. With this in mind, we may also want to verify the correctness of the benchmark that we just touched. If there is a way to target specific benchmarks that would be useful (but I don't know if that's hard).
I'm not sure if I'm understanding you correctly, but you can specify the individual tests you'd like to run here: https://github.com/coiled/benchmarks/blob/ded7daccadbbc2f993d7292da758cfacbc926602/AB_environments/config.yaml#L13-L14C3
That being said, the hierarchical structure of this repository makes it quite cumbersome to run related-but-not-grouped tests, e.g., all tests using P2P-shuffling.
> That being said, the hierarchical structure of this repository makes it quite cumbersome to run related-but-not-grouped tests, e.g., all tests using P2P-shuffling.
To deal with this, we should have an easy way to target specific tests via flags/marks.
> So a directory tree might look something like the following:
+1
> Now, obviously if we're screwing around with benchmarking infrastructure (fixtures, CI, etc.) then we'll certainly want to run the benchmarks to verify that we didn't screw things up,
Instead, we could also do this automatically.
https://github.com/anapaulagomes/pytest-picked might be worth checking out. IIUC, with this plugin `pytest --picked --mode=branch` will only run test files that were modified, which is almost exactly what we need.

To run everything when we change something in the `ci` directory we would need something like this:
```shell
# Use pytest-picked unless something under ci/ changed,
# in which case run the full suite
if [ "$(git diff --name-only | grep -c '^ci/')" -eq 0 ]
then
    PYTEST_ARGS="--picked"
else
    PYTEST_ARGS=""
fi
pytest $PYTEST_ARGS ...
```
Of course, if the bot thing is easier, this manual step is also fine I think
@crusaderky any thoughts/concerns on what's here?
> I'm in favor of running the test suite nightly instead of on every push. I'm just wondering how hard it would be to run only the tests in the files that were touched in a commit on every push. My guess is that it's non-trivial, so we shouldn't invest time into this, but maybe @crusaderky has a trick up his sleeve?
This sounds like some git voodoo is required. Feasible? Definitely. Worth the effort? If pytest-picked does the trick, maybe. We need to find a way to trigger a full test suite if conftest.py, utils_test.py, or the env files change. Which is to say, somewhat frequently. I suspect the effort may not be worth the benefit.
> I'm not sure what to do with `tests/workflows`. Do we use these today?
For pass/fail, yes - under the assumption that someone actually monitors the bot tickets. This has not been the case recently due to high amounts of noise. The purpose of those tests is to get an alert if there's something that breaks stuff for xgboost etc. They should stay. As of today, they run exclusively nightly (not in A/B, not in PRs unless you explicitly add a tag for it) and I think it's what we want.
Same for stability. They currently run on every PR but I'd be fine if they ran only overnight.
> Are there larger things we care about running weekly? Maybe the entire TPCH benchmark suite rather than just Dask? If so, I'll propose a `@pytest.mark.weekly` marker that gets skipped unless it's Sunday or something.
This is reasonable.
> Rename `tests/` to `benchmarks/`

This feels inconsequential to me. Are there marketing reasons underlying this? I thought this repo, albeit public, was used exclusively internally.
> Remove `coiled_runtime` (it's empty anyway)
Indeed. There's also a lot of additional cruft left over from the runtime era that should be cleaned up.
> I'm not sure if I'm understanding you correctly, but you can specify the individual tests you'd like to run here: https://github.com/coiled/benchmarks/blob/ded7daccadbbc2f993d7292da758cfacbc926602/AB_environments/config.yaml#L13-L14C3
You can also use the same file to specify mark filters. This is however undocumented and untested. Something easy to fix.
What I personally would like to see:
- a `@pytest.mark.nightly` mark, which is skipped in A/B and PRs unless explicitly stated otherwise (through `AB_environments/config.yaml` or a PR tag, respectively)
- a `@pytest.mark.weekly` mark, same as above. This would be used to label non-dask tpch tests.
- `@pytest.mark.shuffle_p2p`
- `@pytest.mark.shuffle_tasks`
- `@pytest.mark.dataframe`
- `@pytest.mark.array`
- `@pytest.mark.tpch` (dask + non-dask)

These marks should all be enabled by default but can be cherry-picked through `AB_environments/config.yaml`.
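For mark-based filtering to work cleanly, marks like these would also need registering; a hypothetical `pytest.ini` fragment (mark names taken from the list above, descriptions are mine):

```ini
[pytest]
markers =
    nightly: skipped in A/B and PRs unless explicitly enabled
    weekly: as nightly, but weekly-only (e.g. non-dask TPC-H)
    shuffle_p2p: benchmarks exercising P2P shuffling
    shuffle_tasks: benchmarks exercising task-based shuffling
    dataframe: dataframe benchmarks
    array: array benchmarks
    tpch: TPC-H benchmarks (dask + non-dask)
```

Tests could then be cherry-picked locally with e.g. `pytest -m "tpch and not weekly"`.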
> > Rename `tests/` to `benchmarks/`
>
> This feels inconsequential to me. Are there marketing reasons underlying this? I thought this repo, albeit public, was used exclusively internally.
I'm advertising it for TPC-H stuff. I can also move TPC-H elsewhere to its own repo, but in principle I like the idea of having other efforts benchmarked here in the open. For example maybe Pangeo people like the effort and engage with a `benchmarks/pangeo` directory.
I'm open to either path (separate repo or shared). There's mild work either direction.
> This sounds like some git voodoo is required. Feasible? Definitely. Worth the effort? If pytest-picked does the trick, maybe. We need to find a way to trigger a full test suite if conftest.py, utils_test.py, or the env files change. Which is to say, somewhat frequently. I suspect the effort may not be worth the benefit.
This repo costs us $1000+ each month. If it's a week of work to do this then it's worth the benefit I think.
I said $1000k+ each month, but I meant $1000+. My apologies for the typo.
To give more motivation here, I find myself not pushing to this repository because I know that every time I do so it costs me some money (I'm actually curious how much, if anyone has an answer). I find myself pushing once every day or two rather than on every commit.
> I'm actually curious how much, if anyone has an answer
I think regular branch runs are ~$10
A/B runs are potentially much more expensive:
(that's for all of October)
> A/B runs are potentially much more expensive
If a regular PR is ~$9, then a single A/B test is $9 × repeat × number of envs. In a typical config with `repeat=5`, `test_null_hypothesis=True`, and a single experimental environment, that will be $9 × 5 × (1 (AB_Baseline) + 1 (AB_null_hypothesis) + 1 (AB_yourbranch)) = $9 × 5 × 3 = ~$135.
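As a sanity check, the arithmetic above can be expressed as a tiny helper (the $9 figure and the env accounting are just the estimates from this thread, not measured values):

```python
def ab_cost(cost_per_run=9, repeat=5, n_experimental_envs=1,
            test_null_hypothesis=True):
    """Rough A/B test cost: one baseline env, optionally a
    null-hypothesis env, plus each experimental env, with each
    env run `repeat` times."""
    n_envs = 1 + int(test_null_hypothesis) + n_experimental_envs
    return cost_per_run * repeat * n_envs

print(ab_cost())  # 9 * 5 * 3 = 135
```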
FWIW I'm totally fine paying for stuff if it helps us make good decisions more quickly (you all are very expensive).
Mostly I get annoyed when we spend money for things that don't provide any value. My sense is that running all tests on every commit isn't helpful.
Edit: I've rewritten most of the original post. My apologies to people who like history.
I propose the following changes. I propose them with only mild confidence because I don't have all the context here. Hopefully some of them are good ideas.
Theme: this repository is weird because the contents of the repository look like tests, and so we treat them as tests. For example we name them `tests` and we run them on every push to the repository to make sure that they still run. But this may not be quite correct. These benchmarks don't actually test anything in this repository, and so are maybe more like content.

Directory Structure
These seem mostly uncontroversial to me, but we'll see.
- Rename `tests/` to `benchmarks/`
- Rename `tests/benchmarks` or split it apart
- Remove `coiled_runtime` (it's empty anyway)

So a directory tree might look something like the following:
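The tree itself didn't survive in this thread; a hypothetical layout consistent with the renames discussed above (any directory not named elsewhere in the thread is a guess) might be:

```
benchmarks/
    general/
    tpch/
integration/
    stability/
    workflows/
AB_environments/
alembic/
```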
When to run benchmarks
These seem more controversial
Should we run the benchmarks on every push to this repo? Every night? Every push to Dask and upstream repos? Every release?
Proposal: Nightly, not every push
Our default operating mode as programmers is to run things on every push to the repo. This is because we're used to writing code and having tests that verify the behavior of that code in the same repo. Change the code, run the tests.
But these aren't testing code in this repo. These test things in other repos. For example, if I change a benchmark in `stability` it has no impact on the benchmarks in `test_join.py`. There's no reason to run them again. These tests don't verify the behavior of the thing we're working on, they *are* the thing we're working on, and they have no relationship to each other.

Now, obviously if we're screwing around with benchmarking infrastructure (fixtures, CI, etc.) then we'll certainly want to run the benchmarks to verify that we didn't screw things up, but this seems like the exception rather than the rule. If it's easy, I'll propose a way to run benchmarks, like saying "Hey @coiled/ci-bot, run benchmarks".
We also want to make it easy to run benchmarks when we are in other repos. This is the main value of the A/B test infrastructure. That is the real use and power of this repository, not testing within itself. With this in mind, we may also want to verify the correctness of the benchmark that we just touched. If there is a way to target specific benchmarks that would be useful (but I don't know if that's hard).
Proposal: And Weekly?
Are there larger things we care about running weekly? Maybe the entire TPCH benchmark suite rather than just Dask?

If so, I'll propose a `@pytest.mark.weekly` marker that gets skipped unless it's Sunday or something.
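A minimal sketch of such a marker, assuming it lives in a root `conftest.py`; the helper name and the Sunday cutoff are illustrative, not an agreed design:

```python
import datetime

import pytest


def is_weekly_day(today=None):
    """True on Sundays (date.weekday() == 6), the hypothetical weekly slot."""
    today = today or datetime.date.today()
    return today.weekday() == 6


def pytest_collection_modifyitems(config, items):
    # Skip anything tagged @pytest.mark.weekly except on the weekly day
    if not is_weekly_day():
        skip = pytest.mark.skip(reason="weekly benchmarks only run on Sundays")
        for item in items:
            if "weekly" in item.keywords:
                item.add_marker(skip)
```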