clj-commons / etaoin

Pure Clojure Webdriver protocol implementation
https://cljdoc.org/d/etaoin
Eclipse Public License 1.0
908 stars 95 forks source link

ci: tests on Windows can sporadically fail #465

Open lread opened 2 years ago

lread commented 2 years ago

Currently

It is not unusual for jobs on Windows to fail on GitHub Actions. A re-run of any failed jobs usually succeeds.

Impact

It is a bit annoying to have to rerun these sporadic failed jobs.

Typical Symptom

The typical symptom is a SocketTimeoutException exception, for example:

Testing etaoin.api-test
Testing with browsers: [:firefox]
Error: Exception in thread "main" java.net.SocketTimeoutException: Read timed out
    at java.base/java.net.SocketInputStream.socketRead0(Native Method)
    at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
    at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
    at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at clj_http.core$request.invokeStatic(core.clj:621)
    at clj_http.core$request.invoke(core.clj:546)
    at clj_http.core$request.invokeStatic(core.clj:547)
    at clj_http.core$request.invoke(core.clj:546)
    at clojure.lang.Var.invoke(Var.java:381)
    at clj_http.client$wrap_request_timing$fn__5700.invoke(client.clj:1083)
    at clj_http.headers$wrap_header_map$fn__4561.invoke(headers.clj:152)
    at clj_http.client$wrap_query_params$fn__5580.invoke(client.clj:817)
    at clj_http.client$wrap_basic_auth$fn__5586.invoke(client.clj:840)
    at clj_http.client$wrap_oauth$fn__5591.invoke(client.clj:857)
    at clj_http.client$wrap_user_info$fn__5600.invoke(client.clj:877)
    at clj_http.client$wrap_url$fn__5682.invoke(client.clj:1035)
    at clj_http.client$wrap_decompression$fn__5327.invoke(client.clj:416)
    at clj_http.client$wrap_input_coercion$fn__5492.invoke(client.clj:630)
    at clj_http.client$wrap_additional_header_parsing$fn__5517.invoke(client.clj:685)
    at clj_http.client$wrap_output_coercion$fn__5479.invoke(client.clj:574)
    at clj_http.client$wrap_exceptions$fn__5274.invoke(client.clj:250)
    at clj_http.client$wrap_accept$fn__5532.invoke(client.clj:728)
    at clj_http.client$wrap_accept_encoding$fn__5539.invoke(client.clj:750)
    at clj_http.client$wrap_content_type$fn__5526.invoke(client.clj:711)
    at clj_http.client$wrap_form_params$fn__5647.invoke(client.clj:959)
    at clj_http.client$wrap_nested_params$fn__5668.invoke(client.clj:993)
    at clj_http.client$wrap_flatten_nested_params$fn__5677.invoke(client.clj:1017)
    at clj_http.client$wrap_method$fn__5605.invoke(client.clj:893)
    at clj_http.cookies$wrap_cookies$fn__3434.invoke(cookies.clj:128)
    at clj_http.links$wrap_links$fn__4955.invoke(links.clj:63)
    at clj_http.client$wrap_unknown_host$fn__5685.invoke(client.clj:1046)
    at etaoin.impl.client$call.invokeStatic(client.cljc:101)
    at etaoin.impl.client$call.invoke(client.cljc:77)
    at etaoin.api$execute.invokeStatic(api.clj:228)
    at etaoin.api$execute.invoke(api.clj:202)
    at etaoin.api$create_session.invokeStatic(api.clj:255)
    at etaoin.api$create_session.doInvoke(api.clj:245)
    at clojure.lang.RestFn.invoke(RestFn.java:423)
    at etaoin.api$_connect_driver.invokeStatic(api.clj:3569)
    at etaoin.api$_connect_driver.doInvoke(api.clj:3484)
    at clojure.lang.RestFn.invoke(RestFn.java:423)
    at etaoin.api$boot_driver.invokeStatic(api.clj:3609)
    at etaoin.api$boot_driver.invoke(api.clj:3593)
    at etaoin.api_test$fixture_browsers$fn__8111.invoke(api_test.clj:56)
    at etaoin.api_test$fixture_browsers.invokeStatic(api_test.clj:56)
    at etaoin.api_test$fixture_browsers.invoke(api_test.clj:53)
    at clojure.test$compose_fixtures$fn__9203$fn__9204.invoke(test.clj:693)
    at clojure.test$default_fixture.invokeStatic(test.clj:686)
    at clojure.test$default_fixture.invoke(test.clj:682)
    at clojure.test$compose_fixtures$fn__9203.invoke(test.clj:693)
    at clojure.test$test_vars$fn__9235.invoke(test.clj:734)
    at etaoin.api_test$report_browsers.invokeStatic(api_test.clj:70)
    at etaoin.api_test$report_browsers.invoke(api_test.clj:68)
    at clojure.test$compose_fixtures$fn__9203$fn__9204.invoke(test.clj:693)
    at clojure.test$default_fixture.invokeStatic(test.clj:686)
    at clojure.test$default_fixture.invoke(test.clj:682)
    at clojure.test$compose_fixtures$fn__9203.invoke(test.clj:693)
    at clojure.test$test_vars.invokeStatic(test.clj:730)
    at clojure.test$test_all_vars.invokeStatic(test.clj:736)
    at clojure.test$test_ns.invokeStatic(test.clj:757)
    at clojure.test$test_ns.invoke(test.clj:742)
    at clojure.core$map$fn__5587.invoke(core.clj:2747)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.Cons.next(Cons.java:39)
    at clojure.lang.RT.boundedLength(RT.java:1785)
    at clojure.lang.RestFn.applyTo(RestFn.java:130)
    at clojure.core$apply.invokeStatic(core.clj:659)
    at clojure.test$run_tests.invokeStatic(test.clj:767)
    at clojure.test$run_tests.doInvoke(test.clj:767)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at clojure.core$apply.invokeStatic(core.clj:657)
    at clojure.core$apply.invoke(core.clj:652)
    at cognitect.test_runner$test.invokeStatic(test_runner.clj:74)
    at cognitect.test_runner$test.invoke(test_runner.clj:62)
    at cognitect.test_runner$_main.invokeStatic(test_runner.clj:127)
    at cognitect.test_runner$_main.doInvoke(test_runner.clj:115)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at clojure.lang.Var.applyTo(Var.java:702)
    at clojure.core$apply.invokeStatic(core.clj:657)
    at clojure.main$main_opt.invokeStatic(main.clj:317)
    at clojure.main$main_opt.invoke(main.clj:313)
    at clojure.main$main.invokeStatic(main.clj:424)
    at clojure.main$main.doInvoke(main.clj:387)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at clojure.lang.Var.applyTo(Var.java:702)
    at clojure.main.main(main.java:37)

Thoughts

This is different from #464. The default socket timeout is 60 seconds.

I've not done much verification to try to reproduce this one on my Windows VM yet.

My wild stab guess is that the WebDriver process has maybe crashed, but I dunno yet. Maybe we are asking the WebDriver to create a session before it is ready to do so.

We could try this idea from #464:

When Etaoin is the one who has launched the WebDriver process it could also check if the process actually still seems to be running (I don't think we currently do this?).

lread commented 2 years ago

It feels to me like maybe the Windows platform, for whatever reasons, is simply less stable.

Sometimes when GitHub Actions tries to launch an app, we'll get a -1073741502 exit code, example raw log:

2022-06-26T13:46:31.9821729Z TASK tools-versions done.
2022-06-26T13:46:32.3019582Z ##[group]Run bb test unit --platform=jvm
2022-06-26T13:46:32.3019965Z bb test unit --platform=jvm
2022-06-26T13:46:32.3078141Z shell: C:\Program Files\PowerShell\7\pwsh.EXE -command ". '{0}'"
2022-06-26T13:46:32.3078439Z env:
2022-06-26T13:46:32.3078785Z   JAVA_HOME: C:\hostedtoolcache\windows\Java_Temurin-Hotspot_jdk\11.0.15-10\x64
2022-06-26T13:46:32.3079128Z ##[endgroup]
2022-06-26T13:46:32.3253202Z ##[error]Process completed with exit code -1073741502.

This translates to NTSTATUS code 0xC0000142 https://stackoverflow.com/a/38246636 which means DLL initialization failed. I see this sometimes for other projects too.

This is sporadic so I could maybe have GitHub Actions retry a command if it returns this exit code, but I am not sure what I am working-around here. Windows? GitHub Actions Windows platform? Some issue with the binaries I am launching? I'll probably never know.

lread commented 2 years ago

It is odd that the socket timeout exception seems to only occur on Windows.

I have seen it when talking to:

lread commented 2 years ago

A bit of a long shot, but maybe related?: https://github.com/actions/virtual-environments/issues/1187

Specifically this might help? https://github.com/actions/virtual-environments/issues/1187#issuecomment-687364572

lread commented 2 years ago

I'll report on any future Windows SocketTimeoutException occurrences here. If we don't see any for a while, will close this issue.

lread commented 2 years ago

While working on clj-http-lite I've noticed that process-launch->available-connection on Windows can sometimes be extremely slow. On ubuntu ~5s is normal, but on Windows ~10s and then sometimes a whopping 90s.

lread commented 1 year ago

As of this writing, we are still getting many sporadic failures on GitHub Actions Windows.

Currently our default *wait-timeout* is global for all platforms. Currently 7 seconds. I'll think about ways to make this more configurable. Maybe allow for a fn here that can vary based on OS type, CI or whatever.

Or maybe just letting the user bind it to whatever is good enough. Will see how I can apply a larger default for CI + Windows tests only.

lread commented 1 year ago

Note that a socket timeout exception can mean that the WebDriver process is hung/failing somehow. In these cases, it is valuable to see the WebDriver logs.

By default, we do not capture WebDriver logs, they get sent to /dev/null.

For our tests, it might be interesting to capture WebDriver logs but dump them only for failing tests.

It is probably interesting to enable trace/debug-level logging, although a con of doing so is that this could potentially impact WebDriver behaviour. I'm probably willing to take that risk.

If we go down this road, we also don't want to pollute a system with a bunch of WebDriver log files. So some strategy to either not do that, or easily clean up would be necessary.

lread commented 4 months ago

As of 2024-05-16 Windows is still very flaky. It is normal to have to rerun failing Windows jobs to get a successful CI run.