Need help figuring out a timeout from `wiremock::MockServer`

Enrico2 commented 4 years ago

See the test failure here and all relevant logs: https://github.com/apollographql/rust/runs/789581734?check_suite_focus=true

specifically:

[2020-06-19T21:57:13Z DEBUG wiremock::server_actor] Mock server started listening on 127.0.0.1:8000!
[2020-06-19T21:57:13Z DEBUG apollo_cli::telemetry::session] Sending telemetry to http://127.0.0.1:8000/telemetry
[2020-06-19T21:57:13Z DEBUG reqwest::connect] starting new connection: http://127.0.0.1:8000/
[2020-06-19T21:57:13Z DEBUG hyper::client::connect::http] connecting to 127.0.0.1:8000
[2020-06-19T21:57:13Z DEBUG hyper::client::connect::http] connected to 127.0.0.1:8000
[2020-06-19T21:57:13Z DEBUG apollo_cli::telemetry::session] Telemetry request done
[2020-06-19T21:57:13Z DEBUG apollo_cli::telemetry::session] telemetry request failed: error sending request for url (http://127.0.0.1:8000/telemetry): operation timed out

test code is here: https://github.com/apollographql/rust/blob/ran/workflow_revamp/cli/src/telemetry/session.rs#L163-L190

the failure comes from here: https://github.com/apollographql/rust/blob/ran/workflow_revamp/cli/src/telemetry/session.rs#L146

The timeout we currently set is 300ms, but trying to change this to 3000ms resulted in the same problem.

We have multiple tests that use wiremock that are running in parallel, but as far as I can tell they have no dependencies between each other and are using different ports.

I can't reproduce this issue locally on my mac (I used flaky-finder to run the suite many times in parallel), so I suspect this has something to do perhaps with Github's runtime environment for actions.

Could you assist in trying to debug this? Thanks.

LukeMathWalker commented 4 years ago

Interesting - out of curiosity (not sure how complicated this is for you), does the issue still verify if you replace blocking reqwest with, for example, ureq or surf? Just trying to narrow down to problem space and exclude weird compatibility issues between runtimes/executors.

Enrico2 commented 4 years ago

Hmm, interesting: So this run replaced the failing instance of a test with ureq: https://github.com/apollographql/rust/runs/791473995?check_suite_focus=true

What failed before has now passed, but there is another test that uses MockServer, and it hung until it failed. The original one failed fast because we set a timeout, this one took 30 seconds to fail, we don't set a timeout on it, which I suspect is the default timeout for reqwest?

Notice the timestamps of the first and last lines.

[2020-06-20T21:46:17Z DEBUG apollo_cli::version] Checking to see if there is a newer version
[2020-06-20T21:46:17Z DEBUG apollo_cli::version] Fetching latest version from http://127.0.0.1:8002/cli/darwin
[2020-06-20T21:46:17Z DEBUG reqwest::connect] starting new connection: http://127.0.0.1:8002/
[2020-06-20T21:46:17Z DEBUG hyper::client::connect::http] connecting to 127.0.0.1:8002
[2020-06-20T21:46:17Z DEBUG hyper::client::connect::http] connected to 127.0.0.1:8002
[2020-06-20T21:46:17Z DEBUG wiremock::server_actor] Starting new connection from 127.0.0.1:49404
[2020-06-20T21:46:17Z DEBUG hyper::proto::h1::io] flushed 88 bytes
[2020-06-20T21:46:48Z DEBUG apollo_cli::version] Failed to determined if new updated was needed:
    Unable to fetch release

Anyway, seems like something is ... off with either reqwest or MockServer under specific conditions..

LukeMathWalker commented 4 years ago

30 seconds should indeed be the default timeout. It is a bit difficult for me to help you debug this given that it only happens in your CI environment, but another possible line of investigation is the interaction between the blocking code you are calling (reqwest/ureq) and the async context of your test. Could you try wrapping the blocking calls in your tests witth spawn_blocking?

Enrico2 commented 4 years ago

using spawn_blocking is a bit more complex in in this situation. The result of the blocking call doesn't implement Send, and piping that through everywhere is starting to get complex, and collides with Errors source method which we override that expects a result type of dyn Error + 'static (which doesn't mention Send).

I'm going to try swapping out reqwest for ureq everywhere and see if that matters.

Enrico2 commented 4 years ago

I've made some progress. Turns out it was only reproducible in CI because we didn't check in our Cargo.lock file. I was able to reproduce locally by deleting my local copy of Cargo.lock, tests started failing.

I then compared the passing vs. failing Cargo.lock, and specifically focused on differences that related to this test.

So, here's what I found: While we specifically depend on async-std = "1.6.1", wiremock has that dependency as version = "1", which brings in 1.6.2 (that's what the failing Cargo.lock had)

So, in my Cargo.lock, I forced 1.6.1. That change on its own was not enough, I had to also revert smol from 0.1.18 to 0.1.11.

THAT was the thing that enabled tests to pass.

Enrico2 commented 4 years ago

Adding this line on our Cargo.toml file resolves the issue:

smol = "= 0.1.12"

Note that 0.1.13 fails on the timeout I mentioned.

LukeMathWalker commented 4 years ago

So it is in the end an issue with executors - it would probably be worth raising an issue on smol/async-std's repositories to shine some light here. They certainly know better what changed between those two versions.

Enrico2 commented 4 years ago

Thanks for the help!

LukeMathWalker / wiremock-rs

Need help figuring out a timeout from `wiremock::MockServer` #13