Open efredine opened 5 days ago
Thanks @efredine for reporting this. I haven't dug into this too much yet, but I think this is somehow related to different network behavior on macOS and Linux. The following command can reproduce it:
RUST_TEST_THREADS=1 cargo test --test glue_catalog_test
I'll go on with the investigation, but others are also welcome to do it.
I did some tests and found that this may be a bug or limitation of OrbStack. Here is what I've observed:
RUST_TEST_THREADS=1 cargo test --test glue_catalog_test
This command works well since it limits test concurrency to 1, which means we don't create too many containers at the same time. I think the conclusion is that creating a set of containers for each test function may not be scalable; we may need to do some refactoring of our tests.
I did a few more tests using this approach.
Even with RUST_TEST_THREADS=4 cargo test --test glue_catalog_test
and OrbStack allocated 16 GB of memory, I get intermittent failures.
I ran it 10 times: 5 successes, 4 stalls, 1 error.
The error:
failures:
---- test_list_namespace stdout ----
Error: Unexpected => Operation failed for hitting aws skd error
Source: aws sdk error: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Io, Os { code: 54, kind: ConnectionReset, message: "Connection reset by peer" }), connection: Unknown } })
I've also observed it sometimes failing with an IOError.
The tests are very fast when they work, and I like the isolation of the tests, as you say, so these intermittent errors are very frustrating!
I guess the IOError happens because the service is not ready yet.
I'm running the tests locally on an M1 Mac using OrbStack.
I run the tests with:
It runs some of the tests successfully (in parallel, as expected), but eventually it stalls because a couple of tests fail to start. In the test output:
These are tests from
glue_catalog_test
. They never start, so the tests just hang. However, running the tests directly and individually works fine. For example:
Starts and executes successfully.
But this also fails in the same way:
But notice that in this run, a different set of tests failed to start. The containers don't seem to be starved for resources (i.e., failing due to lack of memory or disk space).
It seems like there is perhaps some sort of race condition in bootstrapping the related containers?
In a Slack conversation, @liurenjie1024 said he seems to have a similar problem.