gjoseph92 opened this issue 2 years ago
Though they may be valuable tests, I don't think they should run against a coiled cluster. There's too much variance to get a useful signal from them.
Variance because of Coiled? My understanding is that test time doesn't include time to spin up cluster, so this would surprise me a little.
Could we move the super fast tests into the same file so (I think) they'd all run on the same cluster?
Yeah, I'd also like to have a better understanding of what variance @gjoseph92 is talking about
On a recent A/B test with 4 repeats, the standard deviation within a test case was larger than the differences between test cases:

[chart: fast-test timings across repeats; within-case spread exceeds between-case differences]

Compare that to a long-running test like test_shuffle, where the effect size is much larger than the noise within a test case:

[chart: test_shuffle timings across repeats; clear separation between test cases]
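To make the signal-to-noise comparison concrete, here's a minimal sketch of the statistic being compared (the numbers are made up; only numpy is assumed):

```python
import numpy as np

# Hypothetical wall-clock times (seconds) from 4 repeats of a fast test,
# one array per branch of the A/B test.
baseline = np.array([1.8, 3.1, 2.2, 2.9])
candidate = np.array([2.0, 2.7, 3.3, 1.9])

# Effect size: the difference between test cases.
effect = abs(candidate.mean() - baseline.mean())
# Noise: the spread within a single test case.
noise = max(baseline.std(ddof=1), candidate.std(ddof=1))

# For a <10s test on a real cluster, the within-case standard deviation
# can easily exceed the between-case difference, so one run per case
# says little about which branch is actually faster.
print(f"effect: {effect:.2f}s  within-case std: {noise:.2f}s")
```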
> Variance because of Coiled? My understanding is that test time doesn't include time to spin up cluster, so this would surprise me a little.
Probably not "because of Coiled", as in not Coiled's fault. There's just way more complexity and natural variance involved in connecting to another machine over the internet, which then connects to more machines. (There's still plenty of variance in a local cluster too; maybe even too much. I just bet there's less.)
I'm not too concerned about the variance and don't think we should put much effort into investigating it right now.
My point is just that a single iteration of a really fast test is not very accurate. For the same reason that %timeit runs your microbenchmarks thousands of times and averages them, we should run these sorts of micro-benchmarks many times too. The size of the effect we're trying to measure is going to be small compared to the noise in our measurement.
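That's exactly what the stdlib `timeit` module does; something like this is the pattern (plain Python, nothing repo-specific):

```python
import timeit

# Time the statement 10,000 times per sample, over 5 samples, instead of once.
samples = timeit.repeat("sorted(range(1000))", repeat=5, number=10_000)
# The minimum sample is the conventional low-noise estimate of per-call cost.
print(f"{min(samples) / 10_000 * 1e6:.1f} µs per call")
```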
That's why I say we shouldn't use clusters for fast tests. Instead, we should either run the test logic many times per measurement, or move these benchmarks off Coiled clusters entirely.
I think test_single_future is maybe the best example. This would be great in an asv suite like https://github.com/dask/dask-benchmarks
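As a rough sketch of what that could look like (the class layout and fixture choices here are illustrative, not actual dask-benchmarks code; asv calls `time_*` methods repeatedly and aggregates the timings itself):

```python
from dask.distributed import Client, LocalCluster


def inc(x):
    return x + 1


class SingleFutureSuite:
    """Round-trip latency of a single future on a local cluster."""

    def setup(self):
        # Cluster startup stays outside the timed region.
        self.cluster = LocalCluster(n_workers=1, processes=False)
        self.client = Client(self.cluster)

    def teardown(self):
        self.client.close()
        self.cluster.close()

    def time_single_future(self):
        # asv runs this many times and reports summary statistics.
        self.client.submit(inc, 1, pure=False).result()
```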
However, I also see a point in testing this on a coiled cluster to make sure we're not introducing artificial latencies. A simple fix would be to annotate the test and run the test logic 100 times.
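Something along these lines, where `small_client` stands in for whatever cluster fixture the test already uses and the marker name is made up:

```python
import pytest


def inc(x):
    return x + 1


@pytest.mark.microbenchmark  # hypothetical marker for "fast" tests
def test_single_future(small_client):
    # Amortize measurement noise by repeating the fast operation,
    # the same way %timeit does; pure=False defeats result caching
    # so each submit is a real round trip.
    for _ in range(100):
        small_client.submit(inc, 1, pure=False).result()
```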
I think another interesting example is https://github.com/coiled/coiled-runtime/issues/339, which measures the performance of reading a single parquet file. There is no real reason why we'd need to measure this on a coiled cluster.
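For instance, a local-only version could be as simple as this sketch (the file path and scheduler choice are assumptions):

```python
import time

import dask.dataframe as dd

# Reading one parquet file exercises dask's I/O path; no cluster needed.
start = time.perf_counter()
dd.read_parquet("data/single-file.parquet").compute(scheduler="threads")
print(f"read took {time.perf_counter() - start:.3f}s")
```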
My concern with not dispatching this to a coiled cluster is mostly that I don't trust GitHub Actions runners to have reliable performance. Ideally, we'd have a simple VM to run these tests in isolation.
Some tests have been added recently (and some have existed for a while) that run extremely quickly (<10s).
These tests should either run their core logic many times per measurement, or not run against a coiled cluster at all. Some examples:
https://github.com/coiled/coiled-runtime/blob/f81cf53151c7fb8e1bbcb12e781053f7ca94e9ef/tests/benchmarks/test_futures.py#L6-L8
https://github.com/coiled/coiled-runtime/blob/f81cf53151c7fb8e1bbcb12e781053f7ca94e9ef/tests/benchmarks/test_array.py#L161-L177
https://github.com/coiled/coiled-runtime/blob/f81cf53151c7fb8e1bbcb12e781053f7ca94e9ef/tests/benchmarks/test_zarr.py#L14-L17