coiled / benchmarks

BSD 3-Clause "New" or "Revised" License
34 stars 17 forks source link

TPCH - Separate out environment from normal benchmarks #1069

Open mrocklin opened 1 year ago

mrocklin commented 1 year ago

Right now we're putting everything in one environment.yml file. Maybe this isn't wise? I'm thinking of creating separate environment files for TPC-H vs normal benchmarks.

More broadly, I think that we put TPC-H in this repository for convenience (all the CI infrastructure is here). Do we also want these queries for normal benchmarks, or should they stay separate?

Putting my marketing hat on, I like the idea of them being separate. I hope to point people at some tpch directory with its own very simple instructions and have them understand it easily. Seeing a bunch of other stuff in there might confuse things.

I think I want feedback from @fjetter and @crusaderky here, who I think have the most context.

crusaderky commented 1 year ago

We're already doing this for snowflake.

  1. make sure test_tpch.py files starts with pytest.importorskip to avoid crashing the test suite when your dependency isn't installed
  2. add a delta environment file to the ci directory. See for example ci/environment-snowflake.yml
  3. add a block to the include section of .github/workflows/tests.yml: https://github.com/coiled/benchmarks/blob/ebffe0fabe50a9ed6aa514705a01ec482a72934e/.github/workflows/tests.yml#L59-L62
crusaderky commented 1 year ago

This said - what are the dependencies of tpch? I'm only seeing dask-expr which is a small dependency-free package, isn't it? What's the benefit of splitting it out?

EDIT - nvm, I've now seen the most recent test pack and its dependencies

crusaderky commented 1 year ago

Using a monolithic environment.yml is chiefly very convenient for A/B tests. Otherwise you'd have to prepare and run a separate suite of A/B tests for each set of base environments.