coiled / benchmarks


Reorganize Benchmarks #1099

Open mrocklin opened 8 months ago

mrocklin commented 8 months ago

Edit: I've rewritten most of the original post. My apologies to people who like history.

I propose the following changes. I propose them with only mild confidence because I don't have all the context here. Hopefully some of them are good ideas.

Theme: this repository is weird because the content of the repository looks like tests, and so we treat them as tests. For example, we name them tests and we run them on every push to the repository to make sure that they still run. But this may not be quite correct. These benchmarks don't actually test anything in this repository, and so are maybe more like content.

Directory Structure

These seem mostly uncontroversial to me, but we'll see

  1. Rename tests/ to benchmarks/
  2. This probably requires us to rename tests/benchmarks or split it apart
  3. Remove coiled_runtime (it's empty anyway)
  4. Keep stability and workflows for now (but I'm curious if people watch these)

So a directory tree might look something like the following:

benchmarks:~$ tree .
.
├── README.md
├── benchmark.db
├── benchmark_schema.py
└── benchmarks
    ├── arrays
    ├── conftest.py
    ├── dataframes
    ├── general
    ├── h2o
    ├── stability
    ├── tpch
    └── workflows

When to run benchmarks

These seem more controversial

Should we run the benchmarks on every push to this repo? Every night? Every push to Dask and upstream repos? Every release?

Proposal: Nightly, not every push

Our default operating mode as programmers is to run things on every push to the repo. This is because we're used to writing code and having tests that verify the behavior of that code in the same repo. Change the code, run the tests.

But these aren't testing code in this repo. They test things in other repos. For example, if I change a benchmark in stability, it has no impact on the benchmarks in test_join.py. There's no reason to run them again. These tests don't verify the behavior of the thing we're working on; they are the thing we're working on, and they have no relationship to each other.

Now, obviously if we're screwing around with benchmarking infrastructure (fixtures, CI, etc.) then we'll certainly want to run the benchmarks to verify that we didn't screw things up, but this seems like the exception rather than the rule. If it's easy, I'll propose a way to run benchmarks, like saying "Hey @coiled/ci-bot, run benchmarks".

We also want to make it easy to run benchmarks when we are in other repos. This is the main value of the A/B test infrastructure. That is the real use and power of this repository, not testing within itself. With this in mind, we may also want to verify the correctness of the benchmark that we just touched. If there is a way to target specific benchmarks, that would be useful (but I don't know if that's hard).

Proposal: And Weekly?

Are there larger things we care about running weekly? Maybe the entire TPCH benchmark suite rather than just Dask?

If so, I'll propose a @pytest.mark.weekly marker that gets skipped unless it's Sunday or something.
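
A minimal sketch of how that marker could work, assuming we register it in conftest.py and treat Sunday as the weekly run day (the marker name and the chosen day are placeholders):

# conftest.py (sketch; the "weekly" marker and the Sunday check are placeholders)
import datetime

import pytest


def pytest_configure(config):
    # Register the marker so pytest doesn't warn about an unknown mark
    config.addinivalue_line("markers", "weekly: expensive benchmarks that only run once a week")


def pytest_collection_modifyitems(config, items):
    # datetime.date.weekday(): Monday == 0, ..., Sunday == 6
    if datetime.date.today().weekday() == 6:
        return  # it's Sunday, run everything including weekly benchmarks
    skip_weekly = pytest.mark.skip(reason="weekly benchmarks only run on Sundays")
    for item in items:
        if "weekly" in item.keywords:
            item.add_marker(skip_weekly)

Individual benchmarks would then just get a @pytest.mark.weekly decorator.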

mrocklin commented 8 months ago

I'd love to get things down to root level, but that requires being careful in a way that I think is unlikely in the next couple of days. With that in mind I'll suggest the following changes:

These seem inoffensive to me and like a good small step in this direction. I may be underestimating the costs of small steps, though. In particular, I'm not sure how expensive it is to construct alembic migrations.

fjetter commented 8 months ago

I'm not sure what to do with tests/workflows. Do we use these today?

I consider those integration tests more than anything else. For instance, we run a Snowflake thing on Coiled. If this lights up for whatever reason, we should get notified. From that POV, other items in that folder are less interesting and I'd be fine with nuking them and moving the Snowflake test to stability instead (or integration).

fjetter commented 8 months ago

In particular I'm not sure how expensive it is to construct alembic migrations.

It's cheap and not difficult, just a little annoying. We also don't have any tests for the migrations, but if we mess this up, we can "restore" the database from an earlier GH Actions run (assuming we notice in time; FWIW, we've never had to do this).

alembic revision -m "Rename foo to bar"

generates a new file in alembic/versions with the appropriate revision identifiers already set. We then just have to write some simple SQL that describes the changes.

For example, we moved the H2O tests once from one directory to another. The SQL for that was:

        update test_run
        set path = 'benchmarks/test_h2o.py'
        where path = 'benchmarks/h2o/test_h2o_benchmarks.py';

https://github.com/coiled/benchmarks/blob/main/alembic/versions/9d6f8ea24ee1_move_h2o_tests.py
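
For reference, the generated revision file is a small Python module with upgrade/downgrade functions. A sketch of what a tests/ to benchmarks/ rename migration could look like (the revision identifiers and the exact SQL here are illustrative, not the real migration):

"""Rename tests/ to benchmarks/ in recorded paths (illustrative sketch)."""
from alembic import op

# Revision identifiers; the real values are generated by `alembic revision`
revision = "0123abcd4567"  # placeholder
down_revision = "9d6f8ea24ee1"  # placeholder: whatever the previous revision in the chain is
branch_labels = None
depends_on = None


def upgrade():
    op.execute(
        """
        update test_run
        set path = replace(path, 'tests/', 'benchmarks/')
        where path like 'tests/%';
        """
    )


def downgrade():
    op.execute(
        """
        update test_run
        set path = replace(path, 'benchmarks/', 'tests/')
        where path like 'benchmarks/%';
        """
    )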

hendrikmakait commented 8 months ago

From that POV, other items in that folder are less interesting and I'd be fine with nuking them and move the snowflake test to stability instead (or integration)

+1 on bundling stability and workflows in integration.

hendrikmakait commented 8 months ago

I'm in favor of running the test suite nightly instead of on every push. I'm just wondering how hard it would be to run only the tests in the files that were touched in a commit on every push. My guess is that it's non-trivial, so we shouldn't invest time into this, but maybe @crusaderky has a trick up his sleeve?

hendrikmakait commented 8 months ago

We also want to make it easy to run benchmarks when we are in other repos. This is the main value of the A/B test infrastructure. That is the real use and power of this repository, not testing within itself. With this in mind, we may also want to verify the correctness of the benchmark that we just touched. If there is a way to target specific benchmarks, that would be useful (but I don't know if that's hard).

I'm not sure if I'm understanding you correctly, but you can specify the individual tests you'd like to run here: https://github.com/coiled/benchmarks/blob/ded7daccadbbc2f993d7292da758cfacbc926602/AB_environments/config.yaml#L13-L14C3

That being said, the hierarchical structure of this repository makes it quite cumbersome to run related-but-not-grouped tests, e.g., all tests using P2P-shuffling.

hendrikmakait commented 8 months ago

That being said, the hierarchical structure of this repository makes it quite cumbersome to run related-but-not-grouped tests, e.g., all tests using P2P-shuffling.

To deal with this, we should have an easy way to target specific tests via flags/marks.
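
For example, a sketch of what that could look like (the shuffle_p2p marker and the test names are hypothetical):

# sketch: group benchmarks by feature with marks, independent of directory layout
# ("shuffle_p2p" and these test names are hypothetical)
import pytest


@pytest.mark.shuffle_p2p
def test_join_big():
    ...


@pytest.mark.shuffle_p2p
def test_set_index():
    ...

Then pytest -m shuffle_p2p would select all P2P-shuffling benchmarks regardless of which directory they live in, and the same mark expression could presumably be passed through AB_environments/config.yaml once mark filters there are documented.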

fjetter commented 8 months ago

So a directory tree might look something like the following:

+1

Now, obviously if we're screwing around with benchmarking infrastructure (fixtures, CI, etc.) then we'll certainly want to run the benchmarks to verify that we didn't screw things up,

Instead, we could also do this automatically.

https://github.com/anapaulagomes/pytest-picked might be worth checking out. IIUC, with this plugin, pytest --picked --mode=branch will only run test files that were modified, which is almost exactly what we need.

To run everything when we change something in the ci directory, we would need something like this:

if [ "$(git diff --name-status | grep -c ci)" -eq 0 ]
then
    PYTEST_ARGS="--picked"
else
    PYTEST_ARGS=""
fi
pytest $PYTEST_ARGS ....

Of course, if the bot thing is easier, this manual step is also fine, I think.

mrocklin commented 8 months ago

@crusaderky any thoughts/concerns on what's here?

crusaderky commented 8 months ago

I'm in favor of running the test suite nightly instead of on every push. I'm just wondering how hard it would be to run only the tests in the files that were touched in a commit on every push. My guess is that it's non-trivial, so we shouldn't invest time into this, but maybe @crusaderky has a trick up his sleeve?

This sounds like some git voodoo is required. Feasible? Definitely. Worth the effort? If pytest-picked does the trick, maybe. We need to find a way to trigger a full test suite if conftest.py, utils_test.py, or the env files change. Which is to say, somewhat frequently. I suspect the effort may not be worth the benefit.

I'm not sure what to do with tests/workflows. Do we use these today?

For pass/fail, yes - under the assumption that someone actually monitors the bot tickets. This has not been the case recently due to the high amount of noise. The purpose of those tests is to get an alert if something breaks stuff for xgboost etc. They should stay. As of today, they run exclusively nightly (not in A/B, not in PRs unless you explicitly add a tag for it), and I think that's what we want.

Same for stability. They currently run on every PR but I'd be fine if they ran only overnight.

Are there larger things we care about running weekly? Maybe the entire TPCH benchmark suite rather than just Dask? If so, I'll propose a @pytest.mark.weekly marker that gets skipped unless it's Sunday or something.

This is reasonable.

Rename tests/ to benchmarks/

This feels inconsequential to me. Are there marketing reasons underlying this? I thought this repo, albeit public, was used exclusively internally.

Remove coiled_runtime (it's empty anyway)

Indeed. There's also a lot of additional cruft left over from the runtime era that should be cleaned up.

I'm not sure if I'm understanding you correctly, but you can specify the individual tests you'd like to run here: https://github.com/coiled/benchmarks/blob/ded7daccadbbc2f993d7292da758cfacbc926602/AB_environments/config.yaml#L13-L14C3

You can also use the same file to specify mark filters. This is, however, undocumented and untested. Something easy to fix.

What I personally would like to see:

These marks should all be enabled by default but can be cherry-picked through AB_environments/config.yaml.

mrocklin commented 8 months ago

Rename tests/ to benchmarks/

This feels inconsequential to me. Are there marketing reasons underlying this? I thought this repo, albeit public, was used exclusively internally.

I'm advertising it for TPC-H stuff. I can also move TPC-H elsewhere to its own repo, but in principle I like the idea of having other efforts benchmarked here in the open. For example maybe Pangeo people like the effort and engage with a benchmarks/pangeo directory.

I'm open to either path (separate repo or shared). There's mild work in either direction.

mrocklin commented 8 months ago

This sounds like some git voodoo is required. Feasible? Definitely. Worth the effort? If pytest-picked does the trick, maybe. We need to find a way to trigger a full test suite if conftest.py, utils_test.py, or the env files change. Which is to say, somewhat frequently. I suspect the effort may not be worth the benefit.

This repo costs us $1000+ each month. If it's a week of work to do this then it's worth the benefit I think.

mrocklin commented 8 months ago

I said $1000k+ each month, but I meant $1000+. My apologies for the typo.

mrocklin commented 8 months ago

To give more motivation here, I find myself not pushing to this repository because I know that every time I do so it costs me some money (I'm actually curious how much, if anyone has an answer). I find myself pushing once every day or two rather than on every commit.

ntabris commented 8 months ago

I'm actually curious how much, if anyone has an answer

I think regular branch runs are ~$10

[screenshot: cost breakdown for regular branch runs]

A/B runs are potentially much more expensive:

[screenshot: cost breakdown for A/B runs]

(that's for all of October)

crusaderky commented 7 months ago

A/B runs are potentially much more expensive

if a regular PR is ~$9, then a single A/B test is $9 × repeat × number of envs. In a typical config with repeat=5, test_null_hypothesis=True, and a single experimental environment, that will be $9 × 5 × (1 (AB_Baseline) + 1 (AB_null_hypothesis) + 1 (AB_yourbranch)) = $9 × 5 × 3 = ~$135

mrocklin commented 7 months ago

FWIW I'm totally fine paying for stuff if it helps us make good decisions more quickly (you all are very expensive).

Mostly I get annoyed when we spend money for things that don't provide any value. My sense is that running all tests on every commit isn't helpful.