Transient test failures when running web_test_suite

data-enabler commented 1 year ago

Looking for some guidance/help troubleshooting flakiness that we've been observing. Every now and then, we'll get an error that looks like this when running our tests:

2023/06/28 01:52:57 launching HTTP server at: teamcity-agent-39.teamcity-agent.teamcity.svc.cluster.local:37775
2023/06/28 01:52:58.142365 Listening on :46115
2023/06/28 01:52:58 [WSL Service]: context deadline exceeded waiting for healthy: [WSL Service]: error getting http://localhost:46115/healthz: Get "http://localhost:46115/healthz": context deadline exceeded

We're currently on commit 48b51502432fe804d7b73dd8d4724a4d5056026c of rules_webtesting (but also observed this behavior when we were previously on 34c659ab3e78f41ebe6453bee6201a69aef90f56), and are declaring our test suites like so inside of our test macro:

web_test_suite(
    name = name,
    data = data + [server_name],
    test = "//tools/js/tests:webtest",
    args = [
        "--debug",
        "--test_url",
        "$(location %s)" % html,
        "--server_target",
        server_name,
    ],
    browsers = browsers,
    # https://github.com/bazelbuild/rules_webtesting/issues/315
    tags = ["no-sandbox", "native"],
    visibility = visibility,
    web_test_template = "//tools/js/tests:web_test.sh.template",
    **kwargs
)

We're loading browsers from @io_bazel_rules_webtesting//web/versioned:browsers-0.3.3.bzl and using a custom_browser for Chromium with the following metadata:

{
  "capabilities" : {
    "goog:chromeOptions" : {
      "args" : [
        "--headless",
        "--no-sandbox",
        "--use-gl=swiftshader-webgl",
        "--disable-dev-shm-usage"
      ]
    }
  }
}

In case it's relevant, our webtest binary is mostly copy-pasted from https://github.com/googlecodelabs/tools/tree/main/codelab-elements/tools

We have a build that runs a collection of about 80 of these test suites as part of our automated testing in TeamCity (Linux), and we'll fairly regularly have one of the test suites in this build fail. I've also been able to trigger this error locally (macOS) by running these test suites in a loop, generally after a few hundred test suite runs (so, a few runs of the full build).

Is it possible that this issue is due to running multiple test suites in parallel? While digging around for examples, I've seen tests fail like this while running a collection of only 3 web_test_suites. Is there anything I should look into as a potential cause, or something we can use to mitigate the issue beyond using the flaky flag?

joshbruning commented 1 year ago

In capabilities, you could try:

"google:wslConfig": { "timeout": 60 }

Depending on how many resources the tests use and how many are available (keep in mind there are a lot of processes: browsers, drivers, WTL, WSL, virtual displays, and potentially test servers), I've seen this take longer than 5s to start.
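
For illustration, here is a rough sketch of how that setting might sit alongside the custom_browser metadata from the original post. It assumes the key goes directly under "capabilities" next to "goog:chromeOptions", and that the value 60 is interpreted in seconds; neither is confirmed here, so check both against the rules_webtesting WSL documentation.

```json
{
  "capabilities" : {
    "google:wslConfig" : {
      "timeout" : 60
    },
    "goog:chromeOptions" : {
      "args" : [
        "--headless",
        "--no-sandbox",
        "--use-gl=swiftshader-webgl",
        "--disable-dev-shm-usage"
      ]
    }
  }
}
```

The Chromium arguments are unchanged from the original metadata; only the "google:wslConfig" entry is new.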

data-enabler commented 1 year ago

Thanks! Seems to be working well so far in my own local testing. Hopefully the same holds once this is running in our CI system.

bazelbuild / rules_webtesting

Transient test failures when running web_test_suite #457