bug: Glue catalog integration tests hangs on M1 macos + orbstack

efredine commented 5 days ago

I'm running the tests locally on an M1 mac using OrbStack.

I run the tests with:

cargo test

It runs some of the tests successfully (in parallel as expected) but eventually it stalls because a couple of tests fail to start. In the test output:

test test_load_table has been running for over 60 seconds
test test_rename_table has been running for over 60 seconds

These are tests from glue_catalog_test. They never start, so the tests just hang.

However, running the tests directly and individually works fine. For example:

cargo test --test glue_catalog_test test_rename_table

Starts and executes successfully.

But this also fails in the same way:

cargo test --test glue_catalog_test
...
test test_load_table has been running for over 60 seconds
test test_namespace_exists has been running for over 60 seconds

Note that a different set of tests failed to start in this run. The containers don't seem to be starved for resources (i.e. they are not failing due to lack of memory or disk space).

It seems like there is perhaps some sort of race condition in bootstrapping the related containers?

@liurenjie1024 in a slack conversation said he seems to have a similar problem.

liurenjie1024 commented 4 days ago

Thanks @efredine for reporting this. I haven't dug into this too much yet, but I think it is somehow related to differences in network behavior between macOS and Linux. The following command can reproduce this:

RUST_TEST_THREADS=1 cargo test --test glue_catalog_test

liurenjie1024 commented 4 days ago

I'll go on with the investigation, but others are also welcome to do it.

liurenjie1024 commented 1 day ago

I did some tests and found that this may be a bug or limitation of OrbStack. Here is what I've observed:

RUST_TEST_THREADS=1 cargo test --test glue_catalog_test: this command works well since it limits test concurrency to 1. That means we will not create too many containers at the same time.
When I increase OrbStack's memory (in Settings -> System) to a larger value, say 6g, I can increase RUST_TEST_THREADS to 6 and it still works well.
But it never passes when we don't set a limit for RUST_TEST_THREADS.

I think the conclusion is that creating a set of containers for each test function may not be scalable, so we may need to refactor our tests.
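
One possible shape for such a refactoring is to start the containers once per test binary and let every test function reuse them. The sketch below only illustrates that pattern with the standard library; `SharedFixture`, `start_fixture`, `shared_fixture`, `unique_namespace`, and the endpoint value are hypothetical names for illustration, not part of iceberg-rust, and the container startup is replaced by a placeholder.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for the containers a test needs
/// (not an iceberg-rust type).
#[derive(Debug)]
pub struct SharedFixture {
    /// Hypothetical endpoint that the tests would connect to.
    pub endpoint: String,
}

/// Counts how often the expensive startup ran (for illustration only).
pub static STARTUPS: AtomicUsize = AtomicUsize::new(0);

/// Holds the single fixture shared by all tests in this binary.
static FIXTURE: OnceLock<SharedFixture> = OnceLock::new();

/// Placeholder for the expensive step, e.g. starting the containers.
/// The endpoint is an assumed value, not taken from the real test setup.
fn start_fixture() -> SharedFixture {
    STARTUPS.fetch_add(1, Ordering::SeqCst);
    SharedFixture {
        endpoint: "http://localhost:5000".to_string(),
    }
}

/// Returns the shared fixture, starting it on first use only.
/// Concurrent callers block until the first initialization finishes.
pub fn shared_fixture() -> &'static SharedFixture {
    FIXTURE.get_or_init(start_fixture)
}

/// Gives each test its own namespace name so that tests sharing one
/// service do not collide with each other.
pub fn unique_namespace(test_name: &str) -> String {
    static NEXT_ID: AtomicUsize = AtomicUsize::new(0);
    format!("{}_{}", test_name, NEXT_ID.fetch_add(1, Ordering::SeqCst))
}
```

Each test would call `shared_fixture()` instead of creating its own containers, and use something like `unique_namespace("test_load_table")` to keep its data separate. Two trade-offs to keep in mind: a `static` value is never dropped, so teardown of the containers has to happen some other way, and each file under `tests/` is compiled as a separate binary, so the sharing only applies within one test file.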

efredine commented 1 day ago

I did a few more tests using this approach.

Even with RUST_TEST_THREADS=4 cargo test --test glue_catalog_test and OrbStack allocated 16g of memory, I get intermittent failures.

I ran it 10 times: 5 successes, 4 stalls, 1 error.

The error:

failures:

---- test_list_namespace stdout ----
Error: Unexpected => Operation failed for hitting aws skd error

Source: aws sdk error: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Io, Os { code: 54, kind: ConnectionReset, message: "Connection reset by peer" }), connection: Unknown } })

I've also observed it sometimes failing with an IOError.

The tests are very fast when they work and I like the isolation of the tests as you say, so these intermittent errors are very frustrating!

liurenjie1024 commented 18 hours ago

I guess the IOError happens because the service is not ready.
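
If readiness is indeed the cause, one common mitigation is to poll the service port before the test sends its first request. The following is a minimal sketch using only the standard library; `wait_for_port` is a hypothetical helper, not something that exists in the iceberg-rust test suite, and the 500 ms connect timeout and 100 ms retry interval are arbitrary choices.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::thread;
use std::time::{Duration, Instant};

/// Polls `addr` until a TCP connection succeeds or `timeout` elapses.
/// Returns the last connection error if the deadline passes first.
pub fn wait_for_port(addr: SocketAddr, timeout: Duration) -> io::Result<()> {
    let deadline = Instant::now() + timeout;
    loop {
        match TcpStream::connect_timeout(&addr, Duration::from_millis(500)) {
            // The port accepts connections: treat the service as reachable.
            Ok(_) => return Ok(()),
            // Out of time: report the most recent error to the caller.
            Err(err) if Instant::now() >= deadline => return Err(err),
            // Not reachable yet: wait briefly and try again.
            Err(_) => thread::sleep(Duration::from_millis(100)),
        }
    }
}
```

A test would call this with the address of the container's exposed port right after the containers are started. Note that a successful TCP connection only shows that something is listening on the port. A port-forwarding proxy may accept connections before the service behind it can answer requests, so an application-level check may still be needed.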

apache / iceberg-rust

bug: Glue catalog integration tests hangs on M1 macos + orbstack #420