catapult-project / catapult

Deprecated Catapult GitHub. Please use the "Speed>Benchmarks" component on http://crbug.com for bugs, and https://chromium.googlesource.com/catapult for downloading and editing source code.
https://chromium.googlesource.com/catapult
BSD 3-Clause "New" or "Revised" License

Perf Dashboard: Alerting when behavior is the same on the ref build #1574

Closed fmeawad closed 5 years ago

fmeawad commented 9 years ago

Chrome version: 45.0.2454.101 (Linux) URL: https://chromeperf.appspot.com/group_report?keys=agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgIDAmei5ogsM

Please copy and paste any errors from JavaScript console (Ctrl+Shift+J to open): None

Please describe the problem: The ref build shows the same regressions, so this should not have been alerted on. The alert is captured in https://code.google.com/p/chromium/issues/detail?id=538745

qyearsley-zz commented 9 years ago

Note: occasionally an alert gets through when the same step is present on the ref build series but smaller; the current rule is to not alert when an alert would also have been fired on the ref build series at the same revision.

But I haven't verified that this is the case here, since the /debug_alert page appears to be broken; I'll file a separate bug for that: #1583.

qyearsley-zz commented 9 years ago

Another possible case to check once debug_alert is back up: https://chromeperf.appspot.com/group_report?keys=agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgIDAufS7rAoM

qyearsley-zz commented 8 years ago

Sheriffing today I found 50+ alerts that shouldn't have been fired because the reference build changed at the same time (see crbug.com/626031), so it looks like this isn't just one or two isolated cases.

The code that's supposed to filter out alerts where the ref build moved at the same time is at https://github.com/catapult-project/catapult/blob/9fde332/dashboard/dashboard/find_anomalies.py#L61; maybe the bug can be found by extending the unit test?
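Roughly, the rule described above might look like the following sketch. This is not the actual find_anomalies.py code; the ChangePoint stand-in and the revision-based matching are assumptions for illustration.

from collections import namedtuple

# Hypothetical stand-in for the dashboard's change point objects.
ChangePoint = namedtuple('ChangePoint', ['x_value', 'median_before', 'median_after'])

def filter_alerts_also_on_ref(non_ref_points, ref_points):
  # Keep only change points that do NOT also occur on the ref series,
  # matching by the revision (x_value) at which the step happens.
  ref_revisions = {p.x_value for p in ref_points}
  return [p for p in non_ref_points if p.x_value not in ref_revisions]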

anniesullie commented 8 years ago

@eakuefner Can you take the data from one of the alerts that fired when ref build jumped at the same time as ToT and send it into the unit test?

eakuefner commented 7 years ago

Unassigning myself, since I'm not actively working on this right now.

Might be of interest to @jessimb

zeptonaut commented 7 years ago

I'm going to take a look at this.

zeptonaut commented 7 years ago

A few recent examples:

zeptonaut commented 7 years ago

@anniesullie and I spent a few minutes investigating this just now, and the unit test finds zero change points in the non-ref data even when we feed in data taken directly from the recent examples above. Our best guess is that the unit tests use a different config, with less sensitive change detection, than the live site does. Going to continue investigating.

zeptonaut commented 7 years ago

I've verified that we're feeding the data into the changepoint detection code correctly. @anniesullie suggested that we may be using a different change detection config in the test code than we are in the live dashboard, which would make sense. I dumped the config in the test code and just got {} (an empty dictionary). I'm going to see what it is in the live code.
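If the test harness simply merges a stored per-test config over the detector defaults, then {} would just mean "all defaults", which could silently differ from whatever overrides the live dashboard has stored for this series. A minimal sketch of that merge, where the default values are assumptions (only min_segment_size = 6 and min_steppiness = 0.5 are confirmed later in this thread):

# Assumed defaults; only min_segment_size (6) and min_steppiness (0.5)
# are confirmed elsewhere in this thread.
DEFAULT_CONFIG = {
  'max_window_size': 50,
  'min_segment_size': 6,
  'min_steppiness': 0.5,
}

def effective_config(stored_config):
  # Merge a stored per-test config (possibly {}) over the defaults.
  config = dict(DEFAULT_CONFIG)
  config.update(stored_config or {})
  return config

assert effective_config({}) == DEFAULT_CONFIG  # an empty config means all defaults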

zeptonaut commented 7 years ago

This seems to be the relevant "debug alert" page for the data series that I'm currently looking at.

It's worth noting that if you change the "number of points before" value from 10 to 40, the alert disappears. My theory is that the tool only finds the first change point in the data series it's pointed at. @anniesullie, can you confirm that this is the case?

anniesullie commented 7 years ago

Actually the tool does find multiple changepoints; here is a random example.

The algorithm that finds change points is sensitive to the set of points sent into it. Most commonly this shows up when it is looking for a sustained regression before alerting, but additional data points can also cause it to decide that a change point is within the noise, or doesn't meet the threshold for "steppiness".
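To make that window sensitivity concrete, here is a self-contained toy detector (my own simplification, not the dashboard's algorithm): it scores a candidate step by the mean shift divided by the estimated noise, so widening the window to include an older, noisier stretch of history raises the noise estimate and suppresses the detection.

import statistics

def step_score(values, split):
  # Mean shift across the split, in units of the within-segment noise.
  left, right = values[:split], values[split:]
  shift = abs(statistics.mean(right) - statistics.mean(left))
  noise = (statistics.pstdev(left) + statistics.pstdev(right)) / 2
  return shift / noise if noise else float('inf')

quiet = [10.0, 10.1, 9.9, 10.0, 10.1, 9.9, 10.0, 10.1, 9.9, 10.0]
noisy = [8.0, 12.0, 9.0, 11.0, 8.5, 11.5, 9.5, 10.5, 8.0, 12.0] * 3
after = [13.0, 13.1, 12.9, 13.0, 13.1, 12.9, 13.0, 13.1, 12.9, 13.0]

print(step_score(quiet + after, len(quiet)))                  # ~39: clear step
print(step_score(noisy + quiet + after, len(noisy + quiet)))  # ~3.8: diluted

With any alert threshold between those two scores, the 10-points-before window fires and the 40-points-before window doesn't, even though the underlying step is identical.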

zeptonaut commented 7 years ago

Good to know: thanks!

On a positive note, I did manage to cause the algorithm to find some change points by hard-coding a config of

config = {
  'max_window_size': 5,
}

EDIT: Looks like I was wrong. I was looking at the wrong thing. A max window size of 5 still doesn't cause any change points to be found.

zeptonaut commented 7 years ago

One thing that I find confusing: I stripped down the test data to only 21 data points for each dataset (ref/non-ref) for this regression, with 10 points before and 10 points after the regression that happened at 440685. The algorithm still didn't find any change points using the default config.

This pokes a hole in our theory that the reason it isn't finding a change point in that dataset is because additional data points after the regression took place showed that the dataset had additional noise, and thus increased the threshold for what qualifies as a regression.

Going to keep playing around with this.

EDIT: I realized after the fact that it's not clear why this is surprising. Looking at this anomaly detection graph, it appears that 10 points before and 10 points after that regression should result in an anomaly being found.

zeptonaut commented 7 years ago

Looking more into this, find_change_points._FindSplit() is correctly identifying 440685 as the most "interesting" split in the data. However, it's failing the "steppiness" test: this change has a steppiness of 0.4922, whereas the default steppiness threshold is 0.5. Wow! That's close to the cutoff.

Diving in now to understand steppiness better.
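For intuition while digging in, here is one plausible way to define a steppiness score in [0, 1]. This is my reconstruction for illustration, not necessarily the dashboard's exact formula: compare the error of a two-level step fit against the error of fitting a single mean, so a clean step scores near 1, pure noise near 0, and a linear ramp lands right around 0.5.

import math
import statistics

def steppiness(values, split):
  # Assumed formula: 1 - RMSE(two-level step fit) / RMSE(single-mean fit).
  left_mean = statistics.mean(values[:split])
  right_mean = statistics.mean(values[split:])
  overall_mean = statistics.mean(values)
  fitted = [left_mean] * split + [right_mean] * (len(values) - split)
  rmse_step = math.sqrt(sum((v - f) ** 2 for v, f in zip(values, fitted)) / len(values))
  rmse_mean = math.sqrt(sum((v - overall_mean) ** 2 for v in values) / len(values))
  return 1 - rmse_step / rmse_mean if rmse_mean else 0.0

clean_step = [10.0] * 5 + [13.0] * 5
ramp = [10.0 + 0.5 * i for i in range(10)]
print(steppiness(clean_step, 5))  # 1.0
print(steppiness(ramp, 5))        # ~0.51, right at the default 0.5 cutoff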

zeptonaut commented 7 years ago

I've also now verified that using a config of:

config = {
  'min_steppiness': 0.4,
}

results in the change point being found at 440685, as expected.

zeptonaut commented 7 years ago

A couple of interesting things to note here based on an offline conversation that @anniesullie and I just had:

find_change_points.FindChangePoints() has a somewhat strange interface: it only finds the first change point in the series it's passed, not all change points. If the series contains one or more change points, it returns a single-item list holding the first one; otherwise, it returns an empty list.

find_anomalies.ProcessTest() runs find_change_points.FindChangePoints() on both the non-ref series and the ref series, and nixes the non-ref change point if that same change point was also found in the ref series.

But what if the ref series contains a spurious step at the start of its window? FindChangePoints() will keep returning that step for as long as it's in the window. So even if there's a later, second step in which the non-ref and ref builds move together, FindChangePoints() on the ref build will still find the early spurious step while FindChangePoints() on the non-ref build finds only the later step, resulting in an alert even though the ref and non-ref builds moved together.

In practice, I don't think this is likely to happen with the default min segment size (6): I can't imagine there are many "spurious" 6-point steps on the ref build that don't also occur on the non-ref build.

What I think is more likely to be problematic is a regression that's just over the change point threshold on the non-ref build but just under it on the ref build. Our algorithm will call this an anomaly even though both series changed, because the ref build didn't change quite enough to be considered a change point.

I think that, at least in the series that I've been testing, this is what happened. If we look at the non-ref build series with a max window size of 7, there's a change that's detected. If we look at the same graph on the ref build, no such change point is detected, even though you can see pretty clearly that it's close to being detected.

What would happen if we lower the change detection threshold on the ref build, though? This should be okay, because we're not actually alerting on the ref build: we're just using it to identify whether corresponding regressions on the non-ref build are real. If we lower the "min steppiness" threshold on the ref build from 0.5 to 0.3, then our non-change-point becomes a change point.

I think that this, or some variation on it, is what we'll need to do to fix the broader problem: it should take a little less for something to qualify as a change point on the ref build than on the non-ref build. However, to keep the first-change-point problem above from biting us, we probably want to start returning all change points in the series instead of just the first, and then do a set subtraction, non_ref_change_points - ref_change_points, to find the real non-ref anomalies.
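A minimal sketch of that proposal, using a toy jump detector in place of the real steppiness-based one (the thresholds, function names, and revision-based matching here are illustrative assumptions):

def find_all_change_points(values, revisions, min_step):
  # Toy stand-in for a detector that returns ALL change points (as
  # revisions), not just the first: a point qualifies when the value
  # jumps by at least min_step relative to the previous point.
  return {r for prev, curr, r in zip(values, values[1:], revisions[1:])
          if abs(curr - prev) >= min_step}

def real_non_ref_anomalies(non_ref, ref, revisions):
  # The proposed fix: subtract ref change points from non-ref change
  # points, with a laxer threshold on the ref side so a step that
  # barely misses the cutoff there still suppresses the alert.
  non_ref_cps = find_all_change_points(non_ref, revisions, min_step=0.5)
  ref_cps = find_all_change_points(ref, revisions, min_step=0.3)
  return non_ref_cps - ref_cps

revisions = [440681, 440682, 440683, 440684, 440685, 440686]
non_ref = [10.0, 10.1, 10.0, 10.1, 13.0, 13.1]  # clear step at 440685
ref = [20.0, 20.1, 20.0, 20.1, 20.5, 20.6]      # smaller step, same revision
print(real_non_ref_anomalies(non_ref, ref, revisions))  # set(): alert suppressed

With a symmetric 0.5 threshold, the smaller ref step (0.4) would go undetected and the alert would fire; the laxer ref-side threshold is what suppresses it.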

CalebRouleau commented 6 years ago

Ping on this bug! Thanks for looking into it, Charlie!

Another example is https://chromeperf.appspot.com/group_report?bug_id=781310 (I made bug https://bugs.chromium.org/p/chromium/issues/detail?id=781310 for it)

benshayden commented 5 years ago

Let's continue this work in crbug.com/781310