dart-lang / co19

A Dart language and library conformance test suite

Some async tests are missing `asyncStart`/`asyncEnd` #2398

Open osa1 opened 9 months ago

osa1 commented 9 months ago

While debugging https://github.com/dart-lang/sdk/issues/54140, @mkustermann realized that some of the async tests in co19 are missing asyncStart and asyncEnd:

When testing in the browser, without asyncStart and asyncEnd the test driver does not wait for the test to finish and declares success even when the test actually fails.

I quickly searched for all the files with an async main and all the files with an asyncStart, and comparing the outputs shows dozens of files where main is async but the file doesn't have asyncStart. For example:

I used ag -Q 'main() async' -c to find tests with async main and ag -Q 'asyncStart' -c to find the tests with asyncStart. (I think it might make sense to search for await instead of main() async).
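
To make the pattern concrete, here is a minimal sketch (not an actual co19 test) of an async test wrapped in asyncStart/asyncEnd. The import path follows co19's usual Utils/expect.dart convention (the exact relative path depends on where the test lives), and the checked condition is hypothetical:

import '../../Utils/expect.dart';

main() async {
  // Without asyncStart, a browser test controller may treat the test as done
  // as soon as main() suspends, before the awaited work (and any failing
  // Expect) has actually run.
  asyncStart();
  await Future.delayed(Duration(milliseconds: 10));
  Expect.isTrue(conditionUnderTest()); // hypothetical check
  asyncEnd(); // tells the controller the asynchronous part has completed
}

bool conditionUnderTest() => true; // stands in for whatever the real test verifies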

sgrekhov commented 9 months ago

@osa1 thank you! I added the missing asyncStart/End() calls. Not every async test is supposed to have asyncStart/End(). Tests like

main() async {
  await test1();
  ...
  await test2();
}

don't need asyncStart/End(). But I did find a lot of tests with unnecessary async.
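
For contrast, a sketch of the kind of test that does need the guards: the check runs in a callback that main() never awaits, so asyncStart/asyncEnd are the only signal telling the test controller to keep waiting (the import path and the checked condition are schematic):

import '../../Utils/expect.dart';

main() {
  asyncStart();
  Future.delayed(Duration(milliseconds: 10)).then((_) {
    Expect.isTrue(true); // hypothetical check performed inside the callback
    asyncEnd();
  });
}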

nshahan commented 2 weeks ago

@sgrekhov I've noticed that these tests are flaky for dart2js and DDC when running on Firefox and Safari even though they all use the TouchEvent constructor which is not present in those browsers.

LibTest/html/Element/onTouchCancel_A01_t01
LibTest/html/Element/onTouchEnd_A01_t01
LibTest/html/Element/onTouchEnter_A01_t01
LibTest/html/Element/onTouchLeave_A01_t01
LibTest/html/Element/onTouchMove_A01_t01
LibTest/html/Element/onTouchStart_A01_t01
LibTest/html/Element/touchCancelEvent_A01_t01
LibTest/html/Element/touchEndEvent_A01_t01
LibTest/html/Element/touchEnterEvent_A01_t01
LibTest/html/Element/touchLeaveEvent_A01_t01
LibTest/html/Element/touchMoveEvent_A01_t01
LibTest/html/Element/touchStartEvent_A01_t01
LibTest/html/IFrameElement/onTouchCancel_A01_t01
LibTest/html/IFrameElement/onTouchEnd_A01_t01
LibTest/html/IFrameElement/onTouchEnter_A01_t01
LibTest/html/IFrameElement/onTouchLeave_A01_t01
LibTest/html/IFrameElement/onTouchMove_A01_t01
LibTest/html/IFrameElement/onTouchStart_A01_t01

They should consistently fail but the fact that they can sometimes pass leads me to believe that they might complete without running the code or recognizing the failure. It makes me question the reliability of the asyncStart/End guards. Can you think of other explanations?

Since the API simply doesn't exist I'm going to start skipping these tests on those browsers but I wanted to call out this observation.

sgrekhov commented 2 weeks ago

@nshahan where do you see these flaky results? According to https://dart-current-results.web.app/#/filter=co19/LibTest/html/Element/,co19/LibTest/html/IFrameElement/ all of these tests are not flaky but consistently fail with a "TouchEvent is not defined" error. Maybe it makes sense to change the expected result in this case? If we have no event to test, then the test should succeed. It's not an error if we have nothing to test. I'm going to update the tests this way.
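
One possible shape for that change, sketched under the assumption that the dart:html feature-detection getter TouchEvent.supported is used as the guard; the body of the supported branch is left schematic:

import 'dart:html';

main() {
  if (!TouchEvent.supported) {
    // Nothing to test on this browser; the test passes vacuously.
    return;
  }
  // ... the real onTouchCancel / touchCancelEvent checks go here ...
}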

mkustermann commented 2 weeks ago

I've noticed that these tests are flaky for dart2js and DDC when running on Firefox and Safari even though they all use the TouchEvent constructor which is not present in those browsers.

How did you determine they are flaky? Did they actually flip between Pass and Fail/RuntimeError?

They should consistently fail but the fact that they can sometimes pass leads me to believe that they might complete without running the code or recognizing the failure. It makes me question the reliability of the asyncStart/End guards. Can you think of other explanations?

The go/dart-current-results page seems to indicate they consistently pass.

When querying the test results from the dart-ci.results.results_60d table with SQL for test names matching co19/LibTest/html/Element/touch%, it seems that they are flaking between Timeout and RuntimeError.

=> @nshahan Can it be that they are actually flakily timing out instead of failing?

/cc @athomas

The asyncStart & asyncEnd mechanism should be working correctly. It's a collaboration between the test itself, the test controller (running in the browser), and tools/test.py: once an asyncStart has been issued, the test is not considered done until an error occurs or an asyncEnd is issued.
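
As a rough illustration of that contract (a simplified sketch, not the actual implementation in co19's Utils or the SDK's async_helper), asyncStart/asyncEnd can be thought of as a counter of outstanding asynchronous operations, with the completion signal only sent once every start has been matched by an end:

int _outstanding = 0;

void asyncStart() {
  if (_outstanding == 0) {
    // In the real machinery, this is where the test controller is told to
    // keep waiting instead of treating the test as done when main() returns.
  }
  _outstanding++;
}

void asyncEnd() {
  if (_outstanding <= 0) {
    throw StateError('asyncEnd() called without a matching asyncStart()');
  }
  _outstanding--;
  if (_outstanding == 0) {
    // In the real machinery, this is where completion is signalled to the
    // test controller and, through it, to tools/test.py.
  }
}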

nshahan commented 2 weeks ago

How did you determine they are flaky?

I keep seeing them turn our bots red from time to time when they change from flaky -> RuntimeError.

Did they actually flip between Pass and Fail/RuntimeError?

Oh, good intuition! They are never actually passing; sorry about that assumption. Sometimes they get the RuntimeError I would expect, and sometimes they time out.

You can see an example of oscillating between flaky and RuntimeError here: https://dart-ci.web.app/#showLatestFailures=false&configurations=ddc-linux-firefox&test=co19/LibTest/html/Element/touchEndEvent_A01_t01

sgrekhov commented 1 week ago

With this change, these tests should start passing.

mkustermann commented 1 week ago

@nshahan @sgrekhov The asyncStart / asyncEnd machinery works perfectly fine here. The fact that the tests fail on some browsers should also be perfectly fine (so we shouldn't change the tests because of it). I think the root cause may be something else entirely: for example, it could be that the test runner starts too many browsers (it runs N tests in parallel) and that one test consumes a lot of CPU or RAM and causes other browser tabs to crash, ...

sgrekhov commented 1 week ago

If there is nothing to test in some browsers, why should the tests fail? And the change (already committed, but I can revert it if needed) should fix the flakiness. IMHO it's fine.

mkustermann commented 1 week ago

If there is nothing to test in some browsers, why should the tests fail? And the change (already committed, but I can revert it if needed) should fix the flakiness. IMHO it's fine.

In general our tests should test the behavior we want/expect. So if an implementation doesn't adhere to this behavior/spec, the test should fail.

This way we can go to our test results database and check: are touch events, for example, working on all browsers? We'll then see all the browsers where they work and those where they don't (instead of seeing all touch event tests passing, which would give us the illusion that this works everywhere, even though on some browsers the API throws).

There's a gray area when APIs are optional (e.g. not in the official W3C spec, ...). Since it's web-specific, I'll leave this to the web team @nshahan @sigmundch.

eernstg commented 1 week ago

We have several tests where a specific behavior is expected based on the number semantics: they make the distinction that some configurations have web numbers and others have native numbers, and the expectations differ accordingly. So perhaps we need to make more than one distinction:

1. Configurations where behavior differs by design (as with web vs. native numbers), so each configuration has its own well-defined expected outcome.
2. Configurations where a feature is known to be broken or absent, i.e., a known deficiency of a specific platform or tool.

In the former case we just need a run-time test that determines which kind of configuration we're currently running, or we may have several tests that are only executed with specific configurations (using status files, no less ;-).
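
For the number-semantics example, the run-time check can be as small as the following sketch: under web numbers, ints and doubles share one representation, so identical(1, 1.0) is true on web backends and false on native ones (the branch bodies are placeholders):

const bool hasWebNumbers = identical(1, 1.0);

main() {
  if (hasWebNumbers) {
    // expectations for web (JavaScript) number semantics
  } else {
    // expectations for native (VM) number semantics
  }
}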

If we have a situation like the latter then we could have a test that expects the known outcome (e.g., it could expect that there's a timeout, or the browser crashes with a bus error, or whatever), but the test runner doesn't (perhaps can't) support that. So even though item 2 looks like a special case of item 1, we may not be able to test it in the usual way.

Next, even if we can arrange for every test to succeed on every configuration (that is, it works, or we expect that it crashes and it actually crashes, etc.), the information about known deficiencies of specific platforms/tools is still implicit.

It seems likely that we'd want some kind of provenance for this. E.g., it could be useful to be able to query things like "which features are known to be broken with this or that specific browser", in particular if the status changes and a bunch of tests need to be updated. A bit like the approval associations, where a test failure should allow us to look up the issue where that failure is being addressed.

In any case, I'm not convinced that we should consider a test failure to be a normal behavior that nobody should ever bother to look into.