coiled / benchmarks

BSD 3-Clause "New" or "Revised" License

Don't use clusters for quick tests #420

Open gjoseph92 opened 2 years ago

gjoseph92 commented 2 years ago

Some tests have been added recently (and some have existed for a while) that run extremely quickly (<10s). Though they may be valuable tests, I don't think they should run against a coiled cluster. There's too much variance to get a useful signal from them.

These tests should either:

  1. just use a local cluster
  2. repeat many times

https://github.com/coiled/coiled-runtime/blob/f81cf53151c7fb8e1bbcb12e781053f7ca94e9ef/tests/benchmarks/test_futures.py#L6-L8

https://github.com/coiled/coiled-runtime/blob/f81cf53151c7fb8e1bbcb12e781053f7ca94e9ef/tests/benchmarks/test_array.py#L161-L177

https://github.com/coiled/coiled-runtime/blob/f81cf53151c7fb8e1bbcb12e781053f7ca94e9ef/tests/benchmarks/test_zarr.py#L14-L17
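Option 1 above could look something like the following minimal sketch, assuming `distributed` is installed; the `inc` helper and cluster parameters are illustrative, not taken from the linked tests:

```python
# Sketch of option 1: run the same tiny benchmark body against a throwaway
# local cluster instead of a Coiled cluster, removing internet latency as a
# source of variance. Helper and parameters here are illustrative.
from distributed import Client, LocalCluster

def inc(x):
    return x + 1

with LocalCluster(n_workers=2, threads_per_worker=1, processes=False) as cluster:
    with Client(cluster) as client:
        # The same submit/result round trip, but entirely in-process.
        result = client.submit(inc, 1).result()

assert result == 2
```

The benchmark body is unchanged; only the cluster backend differs, so a timing regression in the client/scheduler path would still be visible.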

ntabris commented 2 years ago

> Though they may be valuable tests, I don't think they should run against a coiled cluster. There's too much variance to get a useful signal from them.

Variance because of Coiled? My understanding is that test time doesn't include time to spin up cluster, so this would surprise me a little.

Could we move the super fast tests into the same file so (I think) they'd all run on the same cluster?

jrbourbeau commented 2 years ago

Yeah, I'd also like to have a better understanding of what variance @gjoseph92 is talking about

gjoseph92 commented 2 years ago

On a recent AB test with 4 repeats, the standard deviation within a test case was larger than the differences between test cases:

[Three screenshots of A/B test timing results, 2022-10-04]

Compare that to a long-running test like test_shuffle, where the effect size is much larger than the noise within a test case: [screenshot]

gjoseph92 commented 2 years ago

> Variance because of Coiled? My understanding is that test time doesn't include time to spin up cluster, so this would surprise me a little.

Probably not "because of Coiled", as in not Coiled's fault. There's just way more complexity and natural variance involved in connecting to another machine over the internet which then connects to more machines. (There's still plenty of variance in a local cluster too; maybe even too much. I just bet there's less.)

I'm not too concerned about the variance and don't think we should put much effort into investigating it right now.

My point is just that a single iteration of a really fast test is not very accurate. For the same reason that %timeit runs your microbenchmarks 1000s of times and averages them, we should also run these sorts of micro-benchmarks many times. The size of the effect we're trying to measure is going to be small relative to the noise in a single measurement.

Hence why I say we shouldn't use clusters for fast tests. Instead, we should either:

  1. not use a cluster (to remove a source of variance)
  2. make the tests not fast (loop them enough times that the effect size would become visible)
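The %timeit analogy above can be sketched with the stdlib alone; `fast_operation` is a made-up stand-in for a fast test body:

```python
# Sketch of option 2: amortize per-call noise by timing many repetitions,
# in the spirit of %timeit. The workload here is a placeholder.
import timeit

def fast_operation():
    # Stand-in for a sub-second test body.
    return sum(range(1000))

# A single call is dominated by scheduling noise; averaging over many
# repetitions recovers a stable per-call estimate.
n = 1000
total = timeit.timeit(fast_operation, number=n)
per_call = total / n
```

With enough repetitions, the aggregate time grows large enough that a real effect stands out above the per-iteration jitter.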
fjetter commented 2 years ago

I think test_single_future is maybe the best example. This would be great in an asv suite like https://github.com/dask/dask-benchmarks However, I also see a point in testing this on a coiled cluster to make sure we're not introducing artificial latencies. A simple fix would be to annotate the test and run the test logic 100 times.
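The "run the test logic 100 times" idea might look like this, assuming `distributed` is installed; the local cluster here is only a stand-in for whatever cluster the benchmark actually targets, and `inc` is illustrative:

```python
# Hedged sketch: loop the single-future round trip so the measured time is
# the sum of 100 latencies rather than one noisy sample. The cluster backend
# and helper function are placeholders.
import time
from distributed import Client, LocalCluster

def inc(x):
    return x + 1

with LocalCluster(n_workers=1, processes=False) as cluster, Client(cluster) as client:
    start = time.perf_counter()
    for i in range(100):
        # pure=False forces a fresh task each iteration rather than a cache hit.
        assert client.submit(inc, i, pure=False).result() == i + 1
    elapsed = time.perf_counter() - start
```

`elapsed` now aggregates 100 round trips, so an artificial latency regression would show up well above the per-call noise.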

I think another interesting example is https://github.com/coiled/coiled-runtime/issues/339 which is measuring performance reading a single parquet file. There is no real reason why we'd need to measure this on a coiled cluster.

My concern with not dispatching this to a coiled cluster is mostly that I don't trust GH actions runners to have reliable performance. Ideally, we'd have a simple VM to run these tests in isolation.