aicoe-aiops / ocp-ci-analysis

Developing AI tools for developers by leveraging the data made openly available by OpenShift and Kubernetes CI platforms.
https://old.operate-first.cloud/data-science/ai4ci/
GNU General Public License v3.0

Infra Flake Detection Built on Wrong Assumptions #447

Open antter opened 2 years ago

antter commented 2 years ago

Is your feature request related to a problem? Please describe. Currently, we are trying to detect "infra flakes" by looking for a waterfall pattern. This may not be as well-motivated as we originally thought: the tests are ordered by most recent failure, meaning that a waterfall pattern could only occur at the beginning, and an infra flake in the middle would look like random tests failing.

Describe the solution you'd like An updated model that uses some sort of statistical technique to find infra flakes without relying on the order in which the tests come in.

Additional context A detailed explanation of infra flakes is here: https://github.com/aicoe-aiops/ocp-ci-analysis/issues/1

antter commented 2 years ago

Explanation of a possible solution I'm interested in exploring:

Basically, an infra flake happens when a handful of tests fail at close to the same time, unexpectedly. The issue here lies in the word "unexpectedly". If a test is failing one in every 5 runs, no single failure of it could be considered "unexpected". I feel we can only deduce what is "unexpected" by looking at a single test's history. That history becomes a time series, and I am thinking of fitting an autoregressive or moving-average model to capture the fact that more recent failures make another failure more likely for that test. This way we can roughly quantify unexpectedness.
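One way to sketch this "quantified unexpectedness" idea, without committing to a full AR/MA fit, is an exponentially weighted moving average of a test's past outcomes used as its predicted failure probability, so recent failures count more. This is a minimal sketch: the function names (`ewma_failure_prob`, `surprise`) and the `alpha`/`prior` values are hypothetical choices for illustration, not anything from the repo.

```python
def ewma_failure_prob(history, alpha=0.3, prior=0.5):
    """Exponentially weighted estimate of a test's failure probability.

    history: one test's outcomes, oldest first (1 = failure, 0 = pass).
    A larger alpha weights recent runs more heavily, a crude stand-in
    for the autoregressive "recent failures matter more" idea.
    """
    p = prior
    for outcome in history:
        p = alpha * outcome + (1 - alpha) * p
    return p


def surprise(history, failed_now, alpha=0.3):
    """How unexpected is the latest outcome?

    A failure is surprising in proportion to how confidently the model
    predicted a pass; a pass is never treated as surprising.
    """
    p = ewma_failure_prob(history, alpha)
    return (1.0 - p) if failed_now else 0.0
```

Under this scheme a test that almost never fails yields a surprise score near 1.0 when it suddenly does, while a chronically flaky test yields a low score for the very same failure.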

If we have some sort of baseline for when a test failure is "unexpected", then all that would be left is to analyze how well this baseline works and to find a way to identify several unexpected failures happening at once.
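Given some per-test surprise score (however it ends up being computed), the "several unexpected failures at once" check could be as simple as counting how many tests in a run exceed a cutoff. The function name and both thresholds below are made-up placeholders for illustration:

```python
def looks_like_infra_flake(per_test_surprise, threshold=0.8, min_tests=3):
    """Flag a run as a possible infra flake when several tests fail
    "unexpectedly" at the same time.

    per_test_surprise: {test_name: surprise score in [0, 1]} for one run.
    Returns (flagged, list of the unexpectedly failing tests).
    """
    unexpected = [name for name, s in per_test_surprise.items() if s >= threshold]
    return len(unexpected) >= min_tests, unexpected
```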

All of the above has a decent chance of totally failing though, this is a tough dataset.

One issue that keeps coming up while pondering how to classify infra flakes is that it is hard to decide whether a test fails as a direct result of another test failing, or whether both tests fail as a result of an infra flake. The distinction is tough, and maybe not possible with this type of dataset. I'm going to ignore this problem for now.

antter commented 2 years ago

Also, it may not be necessary to build any kind of time series model. I think we could get decent results by simply taking # failures / # attempts as a first metric. But I do think I'll end up trying both, building off the simple model first. The time series model definitely has the potential to be a lot stronger, so I'll have to do some sort of comparison at the end.
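That simple baseline could look like this: score a failure by how reliable the test has historically been, using plain # failures / # attempts with no recency weighting. Again a hypothetical sketch; the name and `prior` default are illustrative only:

```python
def unexpectedness_simple(history, failed_now, prior=0.5):
    """Baseline: a failure is as unexpected as the test is usually reliable.

    history: all recorded outcomes for one test (1 = failure, 0 = pass).
    Unlike a time-series model, this ignores *when* the failures
    happened, so a test that was flaky long ago but is stable now
    still looks flaky.
    """
    rate = sum(history) / len(history) if history else prior
    return (1.0 - rate) if failed_now else 0.0
```

The comparison with the time-series version then comes down to whether recency weighting actually separates infra flakes from ordinary flaky tests better than this flat rate does.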

antter commented 2 years ago

And FWIW, I don't believe any left-to-right pattern is necessary for an infra flake. Infra flakes seem to happen because infrastructure is flaky in a dynamic way: a test can pass and then fail an hour later because of it. However, it is also possible for the infrastructure to have an issue for just one hour, failing many tests, with everything fine by the next time tests come around.