dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev
BSD 3-Clause "New" or "Revised" License

High priority: Investigate flakiness on dart2js-windows #28955

Open mkustermann opened 7 years ago

mkustermann commented 7 years ago

There are a lot of flaky timeouts on all dart2js-windows builders, and they do not seem to be test-specific.

The underlying problem might be an infrastructure issue in test.dart's browser controller in connection with IE10/IE11. Our buildbots don't surface the debug log information from test.dart so it's really hard to diagnose.

I think it should be a high priority to look into those flakes and fix the underlying issue.

mkustermann commented 7 years ago

This issue has been going on for a long time. We should consider moving all of these builders to FYI until they are fixed.

whesse commented 7 years ago

I'm rebooting these machines, because IE 10 doesn't seem to be working well, even in a remote desktop session. IE 10 is old and unsupported, so we should switch to testing its successor, Edge, on Windows 10.

whesse commented 7 years ago

Looking at just one builder, the ie11 shard 2 of 4, and collecting all of the timeouts from the 10 failing ie11 test steps in the last 100 runs, we found two tests that timed out repeatedly, and several more that timed out only once.

dart2js-ie11 release_x64 html/custom/constructor_calls_created_synchronously_test failed 3 times
FAILED: dart2js-ie11 release_x64 pkg/testing/test/hello_test failed 4 times

These each failed once:
FAILED: dart2js-ie11 release_x64 html/custom_elements_test/preregister
FAILED: dart2js-ie11 release_x64 html/js_typed_interop_test/avoid leaks on dart:core
FAILED: dart2js-ie11 release_x64 html/xsltprocessor_test/supported
FAILED: dart2js-ie11 release_x64 pkg/front_end/test/scanner_test
FAILED: dart2js-ie11 release_x64 pkg/front_end/test/src/async_dependency_walker_test

All failures are timeouts of tests that don't normally time out.

More investigation of the tests that are timing out on all windows browser tests can help us figure out what the problem is.
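
As a rough illustration of that kind of tally, here is a minimal Python sketch that counts how often each test shows up in FAILED lines across a set of downloaded logs. It assumes the logs contain lines in the "FAILED: <configuration> <mode> <test>" shape quoted above; the function names are made up for this example, and it does not check that a given failure really was a timeout.

```python
import re
from collections import Counter

# Matches summary lines such as
# "FAILED: dart2js-ie11 release_x64 html/xsltprocessor_test/supported".
# The third group is greedy so test names containing spaces are kept whole.
FAILED_LINE = re.compile(r"^FAILED: (\S+) (\S+) (.+)$")


def count_failures(log_texts):
    """Count how often each test name appears in FAILED lines across logs."""
    counts = Counter()
    for text in log_texts:
        for line in text.splitlines():
            match = FAILED_LINE.match(line.strip())
            if match:
                counts[match.group(3)] += 1
    return counts


def split_repeated(counts, threshold=2):
    """Separate tests seen at least `threshold` times from one-off failures."""
    repeated = {test: n for test, n in counts.items() if n >= threshold}
    one_off = sorted(test for test, n in counts.items() if n < threshold)
    return repeated, one_off
```

Running count_failures over the text of every fetched log and then split_repeated on the result separates repeat offenders from tests that failed once.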

whesse commented 7 years ago

@peter-ahe-google @sigmundch @sortie

sigmundch commented 7 years ago

/cc @stereotype441 I'd skip for now the pkg/front_end/* tests - I don't think there is much value in running these tests in ie11 at the moment. We do want to test that we can run the frontend in chrome (the main use case is to one day run ddc as part of the chrome-debugger), but I'm OK skipping all other browsers for now.

peter-ahe-google commented 7 years ago

I want to be sure that there's no confusion here: running pkg/front_end/* tests on the Dart VM is absolutely critical, but yes, we can skip them on browsers if that's helpful. However, it might also be valuable to check these tests for asynchronous pitfalls.

whesse commented 7 years ago

The work that needs to be done first, I think, is: using a script, get the logs with the flaky timeouts from the dart2js-windows columns in the buildbot. Each column can be done separately, since every test runs on only one of the shards, and which shard it runs on is deterministic. Each step can also be done separately: most steps don't have any failures, so they can be skipped, and the co19 steps can be handled separately from the other steps.

The logs that are held on the buildbot (about 2 weeks' worth) can be fetched using the stdio/text URLs. For example, I just ran this command line:

for i in 4121 4112 4110 4107 4100 4090 4062 4061 4045; do curl -o log$i https://build.chromium.org/p/client.dart/builders/dart2js-win7-ie11ff-2-4-be/builds/$i/steps/dart2js%20ie11%20tests/logs/stdio/text; done
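
For anyone who would rather script this than use a shell loop, a minimal Python sketch of the same fetch follows. The URL layout is copied from the curl command; the helper names log_url and fetch_logs are invented for this example, and it assumes the buildbot server still serves these URLs without authentication.

```python
import urllib.parse
import urllib.request

# Base of the buildbot URLs used in the curl command above.
BASE = "https://build.chromium.org/p/client.dart/builders"


def log_url(builder, build, step):
    """Build the stdio/text URL for one step of one build."""
    return "{}/{}/builds/{}/steps/{}/logs/stdio/text".format(
        BASE, builder, build, urllib.parse.quote(step))


def fetch_logs(builder, builds, step):
    """Download each build's log to a local file named log<build number>."""
    for build in builds:
        with urllib.request.urlopen(log_url(builder, build, step)) as response:
            data = response.read()
        with open("log{}".format(build), "wb") as out:
            out.write(data)
```

Calling fetch_logs("dart2js-win7-ie11ff-2-4-be", [4121, 4112, 4110], "dart2js ie11 tests") would write log4121, log4112 and log4110 to the current directory.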

There is also a permanent record of older logs in cloud storage, stored in logdog, viewable by the links in square brackets in a build, which use the logdog viewer: https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fclient.dart%2Fdart2js-win7-ie11ff-3-4-be%2F4076%2F%2B%2Frecipes%2Fsteps%2Fdart2js_ff_observatory_ui_tests%2F0%2Fstdout

There is also a different way of seeing all the runs in a column, using milo, the replacement for buildbot. This can also go back arbitrarily long, rather than dropping runs more than a month old: https://luci-milo.appspot.com/buildbot/client.dart/dart2js-win7-ie11ff-4-4-be/ https://luci-milo.appspot.com/

There is a command-line tool, called logdog, for fetching the file as a direct text download: https://github.com/luci/luci-go/blob/master/logdog/client/cmd/logdog/README.md I will be finding out how we get these command-line tools other than downloading and building them ourselves.

floitschG commented 7 years ago

I just committed a small library that allows us to download logs via logdog: https://github.com/dart-lang/sdk/blob/master/tools/gardening/lib/src/logdog.dart

This should allow us to fetch logs from a longer period, thus showing us which tests fail the most often.

efortuna commented 7 years ago

Hi Bill, what's the latest on this bug? Last I heard, the Chrome people were re-imaging the Windows machines. That has been done, and now it looks like we're just back to spurious flaky timeout failures (as opposed to the return status failures of last week that required the re-imaging). What's the current status of handling flaky tests? It looks like in some cases we're re-running, but I don't see any re-running happening for these tests that timed out: https://uberchromegw.corp.google.com/i/client.dart/builders/dart2js-win7-ie11ff-3-4-be/builds/75/steps/dart2js%20ie11%20tests/logs/stdio

mkustermann commented 7 years ago

During my gardening shift I saw this build failure today.

One very interesting thing is that the number of failures equals precisely the number of cores this machine has (which corresponds to the number of parallel tests we run).

This leads me to suspect that Internet Explorer might get into a bad state, which will affect all open tabs / windows, thereby affecting all concurrently running tests.

One situation in which this could happen is if Internet Explorer (or the system) pops up a modal dialog (e.g. "this script is no longer responding, do you want to wait for it? y/n"). Similar issues have occurred in the past.

Since I'm the gardener today, I'll make a far-fetched attempt: taking a screenshot before killing the browsers and uploading it to cloud storage (see cl).

whesse commented 7 years ago

Right now, IE11 is mainly timing out in the first run of IE on a build. I have watched this with a window open: the IE window is open, trying to load the driver page, and nothing is happening. The option to open developer tools is grayed out in the system menu, so I can't see the network traffic.

This timing-out browser is killed, and a new one is opened, and it times out too, while loading the driver page. Only on the third attempt to open the browser is the connection made.

The strange thing is that this only happens in the first IE11 step in a build, so it seems like there could be some IE11 state that is initialized, and kept warm, even when the browsers are killed.

whesse commented 7 years ago

The debug log for IE11 bots shows that the 60 second timeout for fetching a test is being hit when starting ie11 for the first time in a build. The browser is then killed, and a new instance is started, which again takes more than 60 seconds to start up. This cycle continues until the max number of failures is reached, and then the test run is stopped.

I am increasing the time allowed for a browser to fetch a test to 120 seconds. This should fix the problem on IE. It does not increase the timeout for tests that take too long, which is controlled by a different timer.
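
To make the kill-and-restart cycle concrete, here is a small Python simulation of it. This is only a sketch of the behaviour described in this comment, not the test.dart browser controller code: the function name, the list of startup times, and the max_failures default of 3 are all assumptions made for illustration.

```python
def run_browser_startup(startup_times, fetch_timeout=60, max_failures=3):
    """Simulate repeated browser launches against a fetch timeout.

    startup_times: seconds each successive launch needs before it fetches
    its first test (hypothetical input for this simulation).
    Returns (connected, failures).
    """
    failures = 0
    for needed in startup_times:
        if needed <= fetch_timeout:
            # The browser fetched a test before the fetch timer expired.
            return True, failures
        # Fetch timer expired: the browser is killed and a new one started.
        failures += 1
        if failures >= max_failures:
            # Too many failed launches: the test run is stopped.
            return False, failures
    return False, failures
```

With launches that each need 90 seconds, run_browser_startup([90, 90, 90]) gives up after three failures, while the same launches with fetch_timeout=120 connect on the first attempt.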

whesse commented 7 years ago

The commit https://github.com/dart-lang/sdk/commit/d6ca1a5defcbe3834d08a67511abc23bb902f073 seems not to be working correctly, so I am introducing a new CL to catch the timeouts more robustly. If this works, then we know there was a problem with the previous attempt.

whesse commented 7 years ago

https://codereview.chromium.org/2938813002, which should stop ie11 timeouts from being reported as errors, has landed as d98d32b63c46b884fcdea3be08cfd55c5f35165d

efortuna commented 6 years ago

I feel like these bots have been a lot more stable. Am I mistaken? Can we close this issue?

whesse commented 6 years ago

Actually, we were just thinking about reverting the change that ignores Windows IE timeouts. The builders have been stable because up to 5 or 10 timeouts are ignored, for each run of test.py. But now I realize that the fix by @mraleph for Windows probably won't help these bots, because they are timing out in IE. So if we revert the hack, and they become unstable, we would recommit the hack (that ignores up to 10 IE timeouts).
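
The effect of the hack can be sketched in a few lines of Python. This is an illustration of the idea only, not the code in test.py: the function name, the outcome strings, and the exact threshold handling are assumptions.

```python
def run_status(outcomes, max_ignored_timeouts=10):
    """Return 'green' or 'red' for one simulated test.py run.

    outcomes: one string per test: 'pass', 'fail', or 'timeout'.
    Timeouts are ignored as long as there are no more than
    max_ignored_timeouts of them (hypothetical threshold).
    """
    if outcomes.count("fail") > 0:
        return "red"  # real failures are never ignored
    if outcomes.count("timeout") > max_ignored_timeouts:
        return "red"  # too many timeouts to hide
    return "green"  # any remaining timeouts are hidden, not fixed
```

A run with 10 timeouts and no other failures reports green under this sketch, which is why a green builder here does not mean the timeouts are gone.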

I would not close the issue, because it is not fixed, just hidden.

whesse commented 5 years ago

After the ie11 builders were moved from the golo lab (VMs permanently assigned to them) to swarming (Windows GCE VMs taken randomly from a pool), the flakiness and the real timeouts have increased to more than 10 per shard.

To investigate this, and turn it green, we are dropping the code that ignores the timeouts. Real timeouts will show up as errors, and flaky ones will enter the flakiness system, and eventually be forgiven. @sortie

whesse commented 5 years ago

The high number of ignore results could also come from the other dart2js Windows condition that returns ignore: a dart2js hang. We are leaving that code in place, to verify that the ignores that came with the move to swarming are ie11 timeouts. The issue tracking dart2js hangs is #26060