apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.54k stars 3.54k forks source link

[CI][C++] Fix arrow-s3fs-test timeouts on macOS C++ job #40410

Open amoeba opened 8 months ago

amoeba commented 8 months ago

Describe the bug, including details regarding any error messages, version, and platform.

The arrow-s3fs-test has been failing due to what appears to be a timeout on the "AMD64 macOS 12 C++" job for a while.

A recent example is: https://github.com/apache/arrow/actions/runs/8179809257/job/22366589254?pr=40373#step:13:265 and the relevant output is:

2024-03-06T23:05:01.0379130Z 97/97 Test #73: arrow-s3fs-test ..............................***Timeout 300.04 sec
2024-03-06T23:05:01.0381760Z Running arrow-s3fs-test, redirecting output into /Users/runner/work/arrow/arrow/build/cpp/build/test-logs/arrow-s3fs-test.txt (attempt 1/1)
2024-03-06T23:05:01.0383120Z 
2024-03-06T23:05:01.0383360Z       Start 73: arrow-s3fs-test
2024-03-06T23:10:01.0805520Z     Test #73: arrow-s3fs-test ..............................***Timeout 300.04 sec
2024-03-06T23:10:01.0808220Z Running arrow-s3fs-test, redirecting output into /Users/runner/work/arrow/arrow/build/cpp/build/test-logs/arrow-s3fs-test.txt (attempt 1/1)
2024-03-06T23:10:01.0810650Z 
2024-03-06T23:10:01.0810890Z       Start 73: arrow-s3fs-test
2024-03-06T23:15:01.1167650Z     Test #73: arrow-s3fs-test ..............................***Timeout 300.04 sec
2024-03-06T23:15:01.1171300Z Running arrow-s3fs-test, redirecting output into /Users/runner/work/arrow/arrow/build/cpp/build/test-logs/arrow-s3fs-test.txt (attempt 1/1)
2024-03-06T23:15:01.1172580Z 
2024-03-06T23:15:01.1175960Z 
2024-03-06T23:15:01.1176500Z 99% tests passed, 1 tests failed out of 97
2024-03-06T23:15:01.1176970Z 
2024-03-06T23:15:01.1177690Z Label Time Summary:
2024-03-06T23:15:01.1178690Z arrow-compute-tests    =  89.33 sec*proc (13 tests)
2024-03-06T23:15:01.1179450Z arrow-tests            = 449.67 sec*proc (41 tests)
2024-03-06T23:15:01.1183940Z arrow_acero            = 148.43 sec*proc (13 tests)
2024-03-06T23:15:01.1184610Z arrow_dataset          = 101.76 sec*proc (14 tests)
2024-03-06T23:15:01.1185230Z Errors while running CTest
2024-03-06T23:15:01.1185750Z arrow_flight           =  14.60 sec*proc (2 tests)
2024-03-06T23:15:01.1186400Z arrow_substrait        =   1.58 sec*proc (1 test)
2024-03-06T23:15:01.1187050Z filesystem             = 334.44 sec*proc (5 tests)
2024-03-06T23:15:01.1187830Z gandiva-tests          =  28.20 sec*proc (4 tests)
2024-03-06T23:15:01.1188560Z parquet-tests          =  59.08 sec*proc (9 tests)
2024-03-06T23:15:01.1189210Z unittest               = 892.66 sec*proc (97 tests)
2024-03-06T23:15:01.1189650Z 
2024-03-06T23:15:01.1189820Z Total Test time (real) = 1008.26 sec
2024-03-06T23:15:01.1190170Z 
2024-03-06T23:15:01.1190330Z The following tests FAILED:
2024-03-06T23:15:01.1190830Z     73 - arrow-s3fs-test (Timeout)

gist for posterity.

I expect this causes a lot of noise for maintainers so figuring out why this test is timing out would be good.

Component(s)

C++, Continuous Integration

amoeba commented 8 months ago

When running this test locally on my non-AMD64 Mac, I actually get three failures:

[ RUN      ] TestS3FS.CreateDir
/Users/bryce/src/apache/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:934: Failure
Failed
Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got OK
[  FAILED  ] TestS3FS.CreateDir (37 ms)
[ RUN      ] TestS3FSGeneric.CopyFile
Warning: The standard parity is set to 0. This can lead to data loss.
/Users/bryce/src/apache/arrow/cpp/src/arrow/filesystem/test_util.cc:570: Failure
Failed
Expected 'fs->CopyFile("AB/abc", "def/mno")' to fail with IOError, but got OK
[  FAILED  ] TestS3FSGeneric.CopyFile (66 ms)
[ RUN      ] TestS3FSGeneric.MoveFile
Warning: The standard parity is set to 0. This can lead to data loss.
/Users/bryce/src/apache/arrow/cpp/src/arrow/filesystem/test_util.cc:448: Failure
Failed
Expected 'fs->Move("AB/pqr", "xxx/mno")' to fail with IOError, but got OK
[  FAILED  ] TestS3FSGeneric.MoveFile (81 ms)

I see in this job we run ctest with,

ctest --label-regex unittest --output-on-failure --parallel 4 --repeat until-pass:3 --timeout 300

My hunch is that it's possible the test is failing and ctest is retrying. My local machine is fast enough that it can get 3 repeats in before the timeout but it's possible GitHub's runner is slow enough that the timeout happens first.

Another thing I noticed is that, when ctest automatically re-attempts the test as requested, it still prints "attempt 1/1" each time:

Running arrow-s3fs-test, redirecting output into /Users/bryce/src/apache/arrow/cpp/build/build/test-logs/arrow-s3fs-test.txt (attempt 1/1)
amoeba commented 8 months ago

I reverted my minio and mc binaries to the ones we pin for the tests and the above failures go away so I don't think that's what's going on. Though the above errors may be something to be aware of we ever need to upgrade minio/mc.

felipecrv commented 8 months ago

Relevant link -> https://github.com/apache/arrow/pull/34671

amoeba commented 8 months ago

I haven't been able to reproduce locally (without modifying anything) but I have been able to reproduce once on CI with a stripped-down workflow in this run https://github.com/amoeba/arrow/actions/runs/8194916281/job/22416431654 so I'm going to try debugging that with tmate.

pitrou commented 8 months ago

Good luck with that, my attempts at reproducing while logging in using tmate actually failed here: https://github.com/apache/arrow/pull/40401

amoeba commented 7 months ago

I'm having success reproducing the timeout by SSH'ing into GHA so this has now turned into me figuring out how to get useful information out of the hanging test with dtrace. A did a bit of searching to find out how to get thread stack traces from a running process and ended up trying sudo dtrace -p $PID -n 'profile-997 /pid == '$PID'/ { @[ustack(100)] = count();}' -o somefile which I don'k think is close but not quite what I'm after. It behaves like it captures just a single stack trace so I'm going to keep experimenting.

But so far, it produces this more often than not, which is interesting,

libsystem_kernel.dylib`__pipe+0xa
libcurl.4.dylib`curl_easy_perform+0x77
libaws-cpp-sdk-core.dylib`Aws::Http::CurlHttpClient::MakeRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::Utils::RateLimits::RateLimiterInterface*, Aws::Utils::RateLimits::RateLimiterInterface*) const+0xda6
libaws-cpp-sdk-core.dylib`std::__1::__function::__func<Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const::$_24, std::__1::allocator<Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const::$_24>, std::__1::shared_ptr<Aws::Http::HttpResponse> ()>::operator()()+0x26
libaws-cpp-sdk-core.dylib`std::__1::shared_ptr<Aws::Http::HttpResponse> smithy::components::tracing::TracingUtils::MakeCallWithTiming<std::__1::shared_ptr<Aws::Http::HttpResponse> >(std::__1::function<std::__1::shared_ptr<Aws::Http::HttpResponse> ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, smithy::components::tracing::Meter const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > >&&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+0x48
libaws-cpp-sdk-core.dylib`Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const+0x779
libaws-cpp-sdk-core.dylib`Aws::Client::AWSClient::AttemptExhaustively(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const+0x43b
libaws-cpp-sdk-core.dylib`Aws::Client::AWSXMLClient::MakeRequest(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const+0x42
libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::S3Client::GetBucketRegion(Aws::S3::Model::HeadBucketRequest const&)+0xf2
libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::S3Client::GetBucketRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+0xb8
libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::RegionResolver::ResolveRegionUncached(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+0xda
libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::RegionResolver::ResolveRegion(d::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+0x12c
libarrow.1600.0.0.dylib`arrow::fs::ResolveS3BucketRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+0x265a
libarrow_gtestd.1.11.0.dylib`void testing::internal::HandleSehExceptionsInMethodIfSuppord<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x
libarrow_gtestd.1.11.0.dylib`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x6a
libarrow_gtestd.1.11.0.dylib`testing::TestInfo::Run()+0xlibarrow_gtestd.1.11.0.dylib`testing::TestSuite::Run()+0x104
pitrou commented 7 months ago

More easily perhaps, you can use gdb -p to take control of a running process and debug it. lldb has something similar.

amoeba commented 7 months ago

Thanks. A bt all gives something similar. In an attempt to catch when the test was approaching timeout, I got a baseline for how long the test takes to complete normally which was pretty consistently ~25s. I ran the test by itself until it took longer and waited until ~60s and attached using lldb -p and then ran bt all:

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007ff8179a40aa libsystem_kernel.dylib`__ulock_wait + 10
    frame #1: 0x00007ff8179e0c95 libsystem_pthread.dylib`_pthread_join + 358
    frame #2: 0x00007ff82c299cda libcurl.4.dylib`Curl_thread_join + 23
    frame #3: 0x00007ff82c28c45c libcurl.4.dylib`Curl_resolver_kill + 41
    frame #4: 0x00007ff82c2c21a0 libcurl.4.dylib`multi_done + 75
    frame #5: 0x00007ff82c2c5876 libcurl.4.dylib`multi_handle_timeout + 291
    frame #6: 0x00007ff82c2c4254 libcurl.4.dylib`multi_runsingle + 3542
    frame #7: 0x00007ff82c2c3391 libcurl.4.dylib`curl_multi_perform + 212
    frame #8: 0x00007ff82c29e7eb libcurl.4.dylib`curl_easy_perform + 299
    frame #9: 0x0000000102139df2 libaws-cpp-sdk-core.dylib`Aws::Http::CurlHttpClient::MakeRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::Utils::RateLimits::RateLimiterInterface*, Aws::Utils::RateLimits::RateLimiterInterface*) const + 3494
    frame #10: 0x000000010212b61c libaws-cpp-sdk-core.dylib`std::__1::__function::__func<Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const::$_24, std::__1::allocator<Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const::$_24>, std::__1::shared_ptr<Aws::Http::HttpResponse> ()>::operator()() + 38
    frame #11: 0x0000000102116e78 libaws-cpp-sdk-core.dylib`std::__1::shared_ptr<Aws::Http::HttpResponse> smithy::components::tracing::TracingUtils::MakeCallWithTiming<std::__1::shared_ptr<Aws::Http::HttpResponse> >(std::__1::function<std::__1::shared_ptr<Aws::Http::HttpResponse> ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, smithy::components::tracing::Meter const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > >&&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 72
    frame #12: 0x0000000102113883 libaws-cpp-sdk-core.dylib`Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const + 1913
    frame #13: 0x00000001021116af libaws-cpp-sdk-core.dylib`Aws::Client::AWSClient::AttemptExhaustively(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const + 1083
    frame #14: 0x000000010211c612 libaws-cpp-sdk-core.dylib`Aws::Client::AWSXMLClient::MakeRequest(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const + 66
    frame #15: 0x0000000120bbb4d2 libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::S3Client::GetBucketRegion(Aws::S3::Model::HeadBucketRequest const&) + 242
    frame #16: 0x0000000120bbb368 libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::S3Client::GetBucketRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 184
    frame #17: 0x0000000120bbac9a libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::RegionResolver::ResolveRegionUncached(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 218
    frame #18: 0x0000000120b56d3c libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::RegionResolver::ResolveRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 300
    frame #19: 0x0000000120b49ab5 libarrow.1600.0.0.dylib`arrow::fs::ResolveS3BucketRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 613
    frame #20: 0x0000000100c7cfd4 arrow-s3fs-test`arrow::fs::S3RegionResolutionTest_NonExistentBucket_Test::TestBody() + 68
    frame #21: 0x00000001026db47b libarrow_gtestd.1.11.0.dylib`void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 123
    frame #22: 0x000000010269dcba libarrow_gtestd.1.11.0.dylib`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 106
    frame #23: 0x000000010269dc03 libarrow_gtestd.1.11.0.dylib`testing::Test::Run() + 195
    frame #24: 0x000000010269ed81 libarrow_gtestd.1.11.0.dylib`testing::TestInfo::Run() + 241
    frame #25: 0x000000010269fe24 libarrow_gtestd.1.11.0.dylib`testing::TestSuite::Run() + 260
    frame #26: 0x00000001026af16b libarrow_gtestd.1.11.0.dylib`testing::internal::UnitTestImpl::RunAllTests() + 987
    frame #27: 0x00000001026e04bb libarrow_gtestd.1.11.0.dylib`bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 123
    frame #28: 0x00000001026aeb2a libarrow_gtestd.1.11.0.dylib`bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 106
    frame #29: 0x00000001026aea15 libarrow_gtestd.1.11.0.dylib`testing::UnitTest::Run() + 197
    frame #30: 0x00000001011e0f21 libarrow_gtest_maind.1.11.0.dylib`RUN_ALL_TESTS() + 17
    frame #31: 0x00000001011e0efd libarrow_gtest_maind.1.11.0.dylib`main + 61
    frame #32: 0x000000010afe352e dyld`start + 462
  thread #2
    frame #0: 0x00007ff8179a401a libsystem_kernel.dylib`__workq_kernreturn + 10
    frame #1: 0x00007ff8179dc034 libsystem_pthread.dylib`_pthread_wqthread + 426
    frame #2: 0x00007ff8179daf57 libsystem_pthread.dylib`start_wqthread + 15
  thread #3, name = 'AwsEventLoop 1'
    frame #0: 0x00007ff8179a730e libsystem_kernel.dylib`kevent + 10
    frame #1: 0x0000000101fe99ec libaws-c-io.1.0.0.dylib`aws_event_loop_thread + 404
    frame #2: 0x00000001024ffd3d libaws-c-common.1.0.0.dylib`thread_fn + 345
    frame #3: 0x00007ff8179df4e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #4: 0x00007ff8179daf6b libsystem_pthread.dylib`thread_start + 15
  thread #4
    frame #0: 0x00007ff8179a730e libsystem_kernel.dylib`kevent + 10
    frame #1: 0x00007ff817a052ee libsystem_info.dylib`_mdns_search_ex + 1217
    frame #2: 0x00007ff817a07c1c libsystem_info.dylib`mdns_addrinfo + 483
    frame #3: 0x00007ff817a079e4 libsystem_info.dylib`search_addrinfo + 164
    frame #4: 0x00007ff817a00659 libsystem_info.dylib`si_addrinfo + 1811
    frame #5: 0x00007ff8179ffeac libsystem_info.dylib`getaddrinfo + 176
    frame #6: 0x00007ff82c296186 libcurl.4.dylib`Curl_getaddrinfo_ex + 36
    frame #7: 0x00007ff82c28ca24 libcurl.4.dylib`getaddrinfo_thread + 83
    frame #8: 0x00007ff82c299c8d libcurl.4.dylib`curl_thread_create_thunk + 29
    frame #9: 0x00007ff8179df4e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #10: 0x00007ff8179daf6b libsystem_pthread.dylib`thread_start + 15
pitrou commented 7 months ago

Thanks a lot @amoeba . It appears therefore that it is timing out when trying to join a thread using Curl_thread_join. This is quite unexpected, though there seems to be a similar issue reported on the Internet: https://github.com/nim-lang/Nim/issues/15019#issuecomment-661208069

Edit: at second sight, the sentence above is incorrect. There is a thread pending on Curl_getaddrinfo_ex, which means that the DNS request is still ongoing. Presumably DNS requests on macOS can take a lot of time to time out?

pitrou commented 7 months ago

It would be nice if you could try again to catch another backtrace, so that we can see if it's always the same test timing out. If it is, we could just disable that test on macOS...

pitrou commented 7 months ago

Even better: can you try getting a traceback later? It seems DNS resolution timeout on macOS is 30 seconds.

pitrou commented 1 month ago

I've tried reenabling it but it still times out

pitrou commented 1 month ago

@amoeba Could you make progress on the diagnosis here? Otherwise, perhaps we can simply skip the NonExistentBucket test on macOS, which would hopefully allow re-enabling the other S3 tests on CI.

amoeba commented 1 month ago

Yeah, I can pick this back up this week. Sorry I dropped it!

amoeba commented 1 month ago

I spent some time Saturday trying to reproduce the timeout like I'd done before and couldn't until I was about to head out so I'll keep trying this week.

amoeba commented 1 month ago

I think I was able to reproduce the hang this morning and I collected 6 backtraces spread out of 1-2min that were all essentially identical:

bt all output ```sh (lldb) bt all * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP * frame #0: 0x000000019213bee8 libsystem_kernel.dylib`__select + 8 frame #1: 0x00000001a9c3cc64 libcurl.4.dylib`Curl_poll + 516 frame #2: 0x00000001a9c3501c libcurl.4.dylib`multi_wait + 640 frame #3: 0x00000001a9c0eda8 libcurl.4.dylib`curl_easy_perform + 268 frame #4: 0x0000000105b7f720 libaws-cpp-sdk-core.dylib`Aws::Http::CurlHttpClient::MakeRequest(std::__1::shared_ptr const&, Aws::Utils::RateLimits::RateLimiterInterface*, Aws::Utils::RateLimits::RateLimiterInterface*) const + 3524 frame #5: 0x0000000105b425a8 libaws-cpp-sdk-core.dylib`std::__1::shared_ptr smithy::components::tracing::TracingUtils::MakeCallWithTiming>(std::__1::function ()>, std::__1::basic_string, std::__1::allocator> const&, smithy::components::tracing::Meter const&, std::__1::map, std::__1::allocator>, std::__1::basic_string, std::__1::allocator>, std::__1::less, std::__1::allocator>>, std::__1::allocator, std::__1::allocator> const, std::__1::basic_string, std::__1::allocator>>>>&&, std::__1::basic_string, std::__1::allocator> const&) + 84 frame #6: 0x0000000105b4e24c libaws-cpp-sdk-core.dylib`Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const + 1772 frame #7: 0x0000000105b4c3e8 libaws-cpp-sdk-core.dylib`Aws::Client::AWSClient::AttemptExhaustively(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const + 948 frame #8: 0x0000000105b55a0c libaws-cpp-sdk-core.dylib`Aws::Client::AWSXMLClient::MakeRequest(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const + 64 frame #9: 0x0000000105eec284 libaws-cpp-sdk-s3.dylib`std::__1::__function::__func, Aws::Utils::Outcome ()>::operator()() + 1624 frame #10: 0x0000000105e94050 libaws-cpp-sdk-s3.dylib`Aws::Utils::Outcome smithy::components::tracing::TracingUtils::MakeCallWithTiming>(std::__1::function ()>, std::__1::basic_string, std::__1::allocator> const&, smithy::components::tracing::Meter const&, std::__1::map, std::__1::allocator>, std::__1::basic_string, std::__1::allocator>, std::__1::less, std::__1::allocator>>, std::__1::allocator, std::__1::allocator> const, std::__1::basic_string, std::__1::allocator>>>>&&, std::__1::basic_string, std::__1::allocator> const&) + 84 frame #11: 0x0000000105e936c0 libaws-cpp-sdk-s3.dylib`Aws::S3::S3Client::PutObject(Aws::S3::Model::PutObjectRequest const&) const + 1512 frame #12: 0x0000000110ef0060 libarrow.1800.dylib`arrow::fs::S3FileSystem::Impl::CreateEmptyDir(std::__1::basic_string, std::__1::allocator> const&, std::__1::basic_string_view>) + 524 frame #13: 0x0000000110eeecf4 libarrow.1800.dylib`arrow::fs::S3FileSystem::CreateDir(std::__1::basic_string, std::__1::allocator> const&, bool) + 1820 frame #14: 0x00000001046e0a30 arrow-s3fs-test`arrow::fs::FileSystem::CreateDir(std::__1::basic_string, std::__1::allocator> const&) + 56 frame #15: 0x00000001046e349c arrow-s3fs-test`arrow::fs::TestS3FS_CreateDir_Test::TestBody() + 4432 frame #16: 0x00000001060dfad8 libarrow_gtestd.1.11.0.dylib`void testing::internal::HandleSehExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*) + 132 frame #17: 0x000000010609d8b8 libarrow_gtestd.1.11.0.dylib`void testing::internal::HandleExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*) + 96 frame #18: 0x000000010609d808 libarrow_gtestd.1.11.0.dylib`testing::Test::Run() + 192 frame #19: 0x000000010609e8ac libarrow_gtestd.1.11.0.dylib`testing::TestInfo::Run() + 244 frame #20: 0x000000010609f99c libarrow_gtestd.1.11.0.dylib`testing::TestSuite::Run() + 276 frame #21: 0x00000001060ad70c libarrow_gtestd.1.11.0.dylib`testing::internal::UnitTestImpl::RunAllTests() + 1008 frame #22: 0x00000001060e6d6c libarrow_gtestd.1.11.0.dylib`bool testing::internal::HandleSehExceptionsInMethodIfSupported(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 132 frame #23: 0x00000001060ad0e0 libarrow_gtestd.1.11.0.dylib`bool testing::internal::HandleExceptionsInMethodIfSupported(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 96 frame #24: 0x00000001060acfcc libarrow_gtestd.1.11.0.dylib`testing::UnitTest::Run() + 216 frame #25: 0x0000000104dabef0 libarrow_gtest_maind.1.11.0.dylib`RUN_ALL_TESTS() + 16 frame #26: 0x0000000104dabed4 libarrow_gtest_maind.1.11.0.dylib`main + 76 frame #27: 0x0000000191de7154 dyld`start + 2476 thread #2 frame #0: 0x0000000192132aa4 libsystem_kernel.dylib`__workq_kernreturn + 8 thread #3, name = 'AwsEventLoop 1' frame #0: 0x0000000192136f40 libsystem_kernel.dylib`kevent + 8 frame #1: 0x0000000105cea0d4 libaws-c-io.1.0.0.dylib`aws_event_loop_thread + 412 frame #2: 0x0000000105d4cd18 libaws-c-common.1.dylib`thread_fn + 340 frame #3: 0x0000000192171f94 libsystem_pthread.dylib`_pthread_start + 136 thread #4 frame #0: 0x00000001921345ec libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x000000019217255c libsystem_pthread.dylib`_pthread_cond_wait + 1228 frame #2: 0x0000000192097b14 libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock&) + 28 frame #3: 0x000000011135129c libarrow.1800.dylib`arrow::internal::WorkerLoop(std::__1::shared_ptr, std::__1::__list_iterator) + 1040 frame #4: 0x0000000111350e54 libarrow.1800.dylib`arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6::operator()() const + 88 frame #5: 0x0000000111350dc8 libarrow.1800.dylib`decltype(std::declval()()) std::__1::__invoke[abi:ue170006](arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6&&) + 24 frame #6: 0x0000000111350da4 libarrow.1800.dylib`void std::__1::__thread_execute[abi:ue170006]>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>(std::__1::tuple>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>&, std::__1::__tuple_indices<>) + 28 frame #7: 0x0000000111350a1c libarrow.1800.dylib`void* std::__1::__thread_proxy[abi:ue170006]>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>>(void*) + 84 frame #8: 0x0000000192171f94 libsystem_pthread.dylib`_pthread_start + 136 thread #5 frame #0: 0x00000001921345ec libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x000000019217255c libsystem_pthread.dylib`_pthread_cond_wait + 1228 frame #2: 0x0000000192097b14 libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock&) + 28 frame #3: 0x000000011135129c libarrow.1800.dylib`arrow::internal::WorkerLoop(std::__1::shared_ptr, std::__1::__list_iterator) + 1040 frame #4: 0x0000000111350e54 libarrow.1800.dylib`arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6::operator()() const + 88 frame #5: 0x0000000111350dc8 libarrow.1800.dylib`decltype(std::declval()()) std::__1::__invoke[abi:ue170006](arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6&&) + 24 frame #6: 0x0000000111350da4 libarrow.1800.dylib`void std::__1::__thread_execute[abi:ue170006]>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>(std::__1::tuple>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>&, std::__1::__tuple_indices<>) + 28 frame #7: 0x0000000111350a1c libarrow.1800.dylib`void* std::__1::__thread_proxy[abi:ue170006]>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>>(void*) + 84 frame #8: 0x0000000192171f94 libsystem_pthread.dylib`_pthread_start + 136 thread #6 frame #0: 0x00000001921345ec libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x000000019217255c libsystem_pthread.dylib`_pthread_cond_wait + 1228 frame #2: 0x0000000192097b14 libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock&) + 28 frame #3: 0x000000011135129c libarrow.1800.dylib`arrow::internal::WorkerLoop(std::__1::shared_ptr, std::__1::__list_iterator) + 1040 frame #4: 0x0000000111350e54 libarrow.1800.dylib`arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6::operator()() const + 88 frame #5: 0x0000000111350dc8 libarrow.1800.dylib`decltype(std::declval()()) std::__1::__invoke[abi:ue170006](arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6&&) + 24 frame #6: 0x0000000111350da4 libarrow.1800.dylib`void std::__1::__thread_execute[abi:ue170006]>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>(std::__1::tuple>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>&, std::__1::__tuple_indices<>) + 28 frame #7: 0x0000000111350a1c libarrow.1800.dylib`void* std::__1::__thread_proxy[abi:ue170006]>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>>(void*) + 84 frame #8: 0x0000000192171f94 libsystem_pthread.dylib`_pthread_start + 136 thread #7 frame #0: 0x00000001921345ec libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x000000019217255c libsystem_pthread.dylib`_pthread_cond_wait + 1228 frame #2: 0x0000000192097b14 libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock&) + 28 frame #3: 0x000000011135129c libarrow.1800.dylib`arrow::internal::WorkerLoop(std::__1::shared_ptr, std::__1::__list_iterator) + 1040 frame #4: 0x0000000111350e54 libarrow.1800.dylib`arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6::operator()() const + 88 frame #5: 0x0000000111350dc8 libarrow.1800.dylib`decltype(std::declval()()) std::__1::__invoke[abi:ue170006](arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6&&) + 24 frame #6: 0x0000000111350da4 libarrow.1800.dylib`void std::__1::__thread_execute[abi:ue170006]>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>(std::__1::tuple>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>&, std::__1::__tuple_indices<>) + 28 frame #7: 0x0000000111350a1c libarrow.1800.dylib`void* std::__1::__thread_proxy[abi:ue170006]>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>>(void*) + 84 frame #8: 0x0000000192171f94 libsystem_pthread.dylib`_pthread_start + 136 thread #8 frame #0: 0x00000001921345ec libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x000000019217255c libsystem_pthread.dylib`_pthread_cond_wait + 1228 frame #2: 0x0000000192097b14 libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock&) + 28 frame #3: 0x000000011135129c libarrow.1800.dylib`arrow::internal::WorkerLoop(std::__1::shared_ptr, std::__1::__list_iterator) + 1040 frame #4: 0x0000000111350e54 libarrow.1800.dylib`arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6::operator()() const + 88 frame #5: 0x0000000111350dc8 libarrow.1800.dylib`decltype(std::declval()()) std::__1::__invoke[abi:ue170006](arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6&&) + 24 frame #6: 0x0000000111350da4 libarrow.1800.dylib`void std::__1::__thread_execute[abi:ue170006]>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>(std::__1::tuple>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>&, std::__1::__tuple_indices<>) + 28 frame #7: 0x0000000111350a1c libarrow.1800.dylib`void* std::__1::__thread_proxy[abi:ue170006]>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>>(void*) + 84 frame #8: 0x0000000192171f94 libsystem_pthread.dylib`_pthread_start + 136 thread #9 frame #0: 0x00000001921345ec libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x000000019217255c libsystem_pthread.dylib`_pthread_cond_wait + 1228 frame #2: 0x0000000192097b14 libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock&) + 28 frame #3: 0x000000011135129c libarrow.1800.dylib`arrow::internal::WorkerLoop(std::__1::shared_ptr, std::__1::__list_iterator) + 1040 frame #4: 0x0000000111350e54 libarrow.1800.dylib`arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6::operator()() const + 88 frame #5: 0x0000000111350dc8 libarrow.1800.dylib`decltype(std::declval()()) std::__1::__invoke[abi:ue170006](arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6&&) + 24 frame #6: 0x0000000111350da4 libarrow.1800.dylib`void std::__1::__thread_execute[abi:ue170006]>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>(std::__1::tuple>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>&, std::__1::__tuple_indices<>) + 28 frame #7: 0x0000000111350a1c libarrow.1800.dylib`void* std::__1::__thread_proxy[abi:ue170006]>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>>(void*) + 84 frame #8: 0x0000000192171f94 libsystem_pthread.dylib`_pthread_start + 136 thread #10 frame #0: 0x00000001921345ec libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x000000019217255c libsystem_pthread.dylib`_pthread_cond_wait + 1228 frame #2: 0x0000000192097b14 libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock&) + 28 frame #3: 0x000000011135129c libarrow.1800.dylib`arrow::internal::WorkerLoop(std::__1::shared_ptr, std::__1::__list_iterator) + 1040 frame #4: 0x0000000111350e54 libarrow.1800.dylib`arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6::operator()() const + 88 frame #5: 0x0000000111350dc8 libarrow.1800.dylib`decltype(std::declval()()) std::__1::__invoke[abi:ue170006](arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6&&) + 24 frame #6: 0x0000000111350da4 libarrow.1800.dylib`void std::__1::__thread_execute[abi:ue170006]>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>(std::__1::tuple>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>&, std::__1::__tuple_indices<>) + 28 frame #7: 0x0000000111350a1c libarrow.1800.dylib`void* std::__1::__thread_proxy[abi:ue170006]>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>>(void*) + 84 frame #8: 0x0000000192171f94 libsystem_pthread.dylib`_pthread_start + 136 thread #11 frame #0: 0x00000001921345ec libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x000000019217255c libsystem_pthread.dylib`_pthread_cond_wait + 1228 frame #2: 0x0000000192097b14 libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock&) + 28 frame #3: 0x000000011135129c libarrow.1800.dylib`arrow::internal::WorkerLoop(std::__1::shared_ptr, std::__1::__list_iterator) + 1040 frame #4: 0x0000000111350e54 libarrow.1800.dylib`arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6::operator()() const + 88 frame #5: 0x0000000111350dc8 libarrow.1800.dylib`decltype(std::declval()()) std::__1::__invoke[abi:ue170006](arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6&&) + 24 frame #6: 0x0000000111350da4 libarrow.1800.dylib`void std::__1::__thread_execute[abi:ue170006]>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>(std::__1::tuple>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>&, std::__1::__tuple_indices<>) + 28 frame #7: 0x0000000111350a1c libarrow.1800.dylib`void* std::__1::__thread_proxy[abi:ue170006]>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>>(void*) + 84 frame #8: 0x0000000192171f94 libsystem_pthread.dylib`_pthread_start + 136 (lldb) ```

This trace is sufficiently different from the previous one that I'm going to try to capture another trace later today. The earlier trace was in a different test and stuck on a different syscall. Will report back.

amoeba commented 1 month ago

I wasn't able to reproduce the hang and get a debugger attached after numerous attempts today. I can keep trying but I'm not sure what a good next step here would be. The process I've been taking has been to run a modified cpp.yml workflow file, conditionally start a tmate session if the tests timeout, ssh into the VM, re-run the tests and wait until it appears things have timed out, then finally attach via lldb.

pitrou commented 1 month ago

Thanks @amoeba . I think we're quite clear now that every hang that could be reproduced was simply stalled waiting for a network response, so it's probably not a bug in our S3 filesystem at least.

We could perhaps try to disable the non-existent bucket tests on macOS to see if that makes our CI more reliable...