cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: predictive test selection #119630

Closed srosenberg closed 3 months ago

srosenberg commented 8 months ago

Problem

In an ongoing effort to reduce (billing) spend on the test infrastructure and to improve resource utilization, a number of roachtests can be selectively skipped without sacrificing coverage. Today, our nightlies execute the same (static) test suite even though a large number of roachtests rarely fail. Over time, the test suite has grown to hundreds of correctness and performance tests of varying (cluster) sizes and (test) durations. Owing to this inefficient allocation of resources, our GCE nightlies take ~20 hours to execute. Thus, the overarching goal is to minimize nightly test spend, subject to the constraint that the selection strategy does not increase the risk of missed coverage.

Note that the constraint is hard to formalize, especially for randomized tests. While running a test n times vs. k times, for n >> k, may not necessarily yield higher coverage on any single run, the total explored state space is expected to be larger. Thus, we should consider excluding some randomized tests from predictive test selection.

On the other end of the spectrum are the performance tests. Our tolerance for skipping them is generally higher than for correctness tests. First, a number of performance tests are correlated (i.e., same benchmark, different parameters), adding redundancy wrt detecting a performance regression. Second, our change-point detection system requires a minimum of ~7 runs. Hence, we are effectively trading (test) run frequency for a delay in detecting a potential regression.
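To make the frequency-vs-delay trade-off concrete, here is a back-of-the-envelope sketch (not the actual detector's logic): if the change-point detector needs ~7 runs, a benchmark that runs only every k-th nightly takes roughly 7·k nights to accumulate them.

```go
package main

import "fmt"

// detectionDelayNights estimates how many nights it takes to accumulate
// enough runs for change-point detection when a benchmark runs only every
// `interval`-th nightly. minRuns (~7 per the discussion above) is the
// minimum number of runs the detector needs. This is illustrative
// arithmetic, not the detector's actual implementation.
func detectionDelayNights(minRuns, interval int) int {
	return minRuns * interval
}

func main() {
	// Running every nightly: a regression surfaces after ~7 nights.
	fmt.Println(detectionDelayNights(7, 1)) // 7
	// Running every 3rd nightly stretches that to ~21 nights.
	fmt.Println(detectionDelayNights(7, 3)) // 21
}
```

In other words, skipping a performance test two nights out of three roughly triples the worst-case detection delay.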

Suggested Approach

Given the above nuances, it's imperative to start with a simple heuristic. The history of each roachtest execution (in CI) is exported to Snowflake. Using this historical data, we can derive a pass/fail probability for each test. Naturally, a sufficiently high probability of passing would yield the initial candidates for test selection, subject to the aforementioned constraints.
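A minimal sketch of that heuristic might look like the following. The `TestRun` record is shaped loosely like the history exported to Snowflake; the field names, the threshold, and the minimum-run cutoff are all assumptions for illustration, not the actual schema or chosen parameters.

```go
package main

import (
	"fmt"
	"sort"
)

// TestRun is a hypothetical record of one roachtest execution.
type TestRun struct {
	Name   string
	Passed bool
}

// skipCandidates derives a per-test pass probability from history and
// returns tests whose probability meets `threshold` and which have at
// least `minRuns` executions. Randomized tests are excluded outright,
// per the constraint discussed above.
func skipCandidates(runs []TestRun, randomized map[string]bool, threshold float64, minRuns int) []string {
	total := map[string]int{}
	passed := map[string]int{}
	for _, r := range runs {
		total[r.Name]++
		if r.Passed {
			passed[r.Name]++
		}
	}
	var out []string
	for name, n := range total {
		if randomized[name] || n < minRuns {
			continue
		}
		if float64(passed[name])/float64(n) >= threshold {
			out = append(out, name)
		}
	}
	sort.Strings(out) // deterministic output
	return out
}

func main() {
	// Hypothetical test names and history, for illustration only.
	history := []TestRun{
		{"backup/small", true}, {"backup/small", true}, {"backup/small", true},
		{"kv/flaky", true}, {"kv/flaky", false}, {"kv/flaky", true},
		{"schemachange/random", true}, {"schemachange/random", true}, {"schemachange/random", true},
	}
	randomized := map[string]bool{"schemachange/random": true}
	fmt.Println(skipCandidates(history, randomized, 0.99, 3)) // [backup/small]
}
```

A consistently passing, non-randomized test becomes a skip candidate; a flaky test or a randomized test does not, regardless of its pass rate.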

Related Work

The general problem of predictive test selection is an open research problem. The primary focus seems to be on unit tests, which tend to be much flakier than e2e tests. Since roachtests fall into the latter category, they are amenable to simpler selection strategies. Nevertheless, it's instructive to review related work. The papers from Google [1] and Meta [2] are interesting reads. Both consider "edit distance" wrt code changes as a feature for test selection. While this may work for some unit tests, it seems less effective for e2e (and integration) tests, owing to their "global" footprint; i.e., e2e tests typically touch a large surface area of the entire system.

[1] https://research.google.com/pubs/archive/45861.pdf
[2] https://research.facebook.com/publications/predictive-test-selection/

Jira issue: CRDB-36219

blathers-crl[bot] commented 8 months ago

cc @cockroachdb/test-eng

renatolabs commented 8 months ago

This all makes sense to me, a couple of things to consider:

we should consider excluding some randomized tests from predictive test selection

Maybe we should consider excluding all randomized tests from predictive selection? We do see cases of randomized tests (especially SQL randomization) uncovering issues, but it can take months of continuous nightly runs. Most of our tests are not randomized, so most tests would still benefit from predictive selection.

The history for each roachtest execution (in CI) is exported into Snowflake

This is definitely the way to go; one caveat to keep in mind is that Snowflake only exposes test status as reported by TeamCity. This means that TC failures that we wouldn't consider "real" failures would still be counted (e.g., SSH flakes, cluster creation errors, and, more recently, VM preemptions). We could choose not to deal with this problem in the first iteration, but I suspect it will add quite a bit of noise to the analysis.

A couple of options to deal with this: we could stop reporting such failures as TeamCity test failures (while still reporting them in GitHub). Alternatively, we could extend the Snowflake schema to include the test failure error message (or some kind of failure classification). The latter approach would involve changes on the data ingestion side of things, which I'm less familiar with.
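If we did extend the schema with a failure classification, a coarse classifier could look something like the sketch below. The matched substrings are illustrative assumptions, not the exact messages roachtest or TeamCity emit; the point is only that infra flakes get bucketed separately from real test failures before computing pass/fail statistics.

```go
package main

import (
	"fmt"
	"strings"
)

// failureKind maps a failure message to a coarse category so that infra
// flakes (SSH flakes, cluster creation errors, VM preemptions) can be
// excluded from pass/fail statistics. Hypothetical patterns only.
func failureKind(msg string) string {
	m := strings.ToLower(msg)
	switch {
	case strings.Contains(m, "ssh"):
		return "ssh_flake"
	case strings.Contains(m, "cluster creation"):
		return "infra_cluster_create"
	case strings.Contains(m, "preempted"):
		return "infra_vm_preemption"
	default:
		return "test_failure"
	}
}

func main() {
	fmt.Println(failureKind("ssh: connection reset by peer"))   // ssh_flake
	fmt.Println(failureKind("cluster creation failed: quota"))  // infra_cluster_create
	fmt.Println(failureKind("VM was preempted during run"))     // infra_vm_preemption
	fmt.Println(failureKind("assertion failed: rows mismatch")) // test_failure
}
```

Only the last category would count against a test's pass probability.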