amoeba opened this issue 8 months ago
When running this test locally on my non-AMD64 Mac, I actually get three failures:
[ RUN ] TestS3FS.CreateDir
/Users/bryce/src/apache/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:934: Failure
Failed
Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got OK
[ FAILED ] TestS3FS.CreateDir (37 ms)
[ RUN ] TestS3FSGeneric.CopyFile
Warning: The standard parity is set to 0. This can lead to data loss.
/Users/bryce/src/apache/arrow/cpp/src/arrow/filesystem/test_util.cc:570: Failure
Failed
Expected 'fs->CopyFile("AB/abc", "def/mno")' to fail with IOError, but got OK
[ FAILED ] TestS3FSGeneric.CopyFile (66 ms)
[ RUN ] TestS3FSGeneric.MoveFile
Warning: The standard parity is set to 0. This can lead to data loss.
/Users/bryce/src/apache/arrow/cpp/src/arrow/filesystem/test_util.cc:448: Failure
Failed
Expected 'fs->Move("AB/pqr", "xxx/mno")' to fail with IOError, but got OK
[ FAILED ] TestS3FSGeneric.MoveFile (81 ms)
I see that in this job we run ctest with:
ctest --label-regex unittest --output-on-failure --parallel 4 --repeat until-pass:3 --timeout 300
My hunch is that it's possible the test is failing and ctest is retrying. My local machine is fast enough that it can get 3 repeats in before the timeout but it's possible GitHub's runner is slow enough that the timeout happens first.
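For reference, a rough way to check that locally might be to mirror the CI invocation for just this one test (assuming it's registered with ctest as arrow-s3fs-test and this is run from the cpp build directory):
ctest -R arrow-s3fs-test --output-on-failure --repeat until-pass:3 --timeout 300
That should make it easier to see whether a failing first attempt plus retries is what pushes things past the timeout.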
Another thing I noticed is that, when ctest automatically re-attempts the test as requested, it still prints "attempt 1/1" each time:
Running arrow-s3fs-test, redirecting output into /Users/bryce/src/apache/arrow/cpp/build/build/test-logs/arrow-s3fs-test.txt (attempt 1/1)
I reverted my minio and mc binaries to the ones we pin for the tests and the above failures go away, so I don't think that's what's going on. Though the above errors may be something to be aware of if we ever need to upgrade minio/mc.
Relevant link -> https://github.com/apache/arrow/pull/34671
I haven't been able to reproduce locally (without modifying anything) but I have been able to reproduce once on CI with a stripped-down workflow in this run https://github.com/amoeba/arrow/actions/runs/8194916281/job/22416431654 so I'm going to try debugging that with tmate.
Good luck with that; my attempts at reproducing while logged in via tmate actually failed here: https://github.com/apache/arrow/pull/40401
I'm having success reproducing the timeout by SSH'ing into GHA, so this has now turned into me figuring out how to get useful information out of the hanging test with dtrace. I did a bit of searching to find out how to get thread stack traces from a running process and ended up trying
sudo dtrace -p $PID -n 'profile-997 /pid == '$PID'/ { @[ustack(100)] = count();}' -o somefile
which I think is close but not quite what I'm after. It behaves like it captures just a single stack trace, so I'm going to keep experimenting.
But so far, it produces this more often than not, which is interesting,
libsystem_kernel.dylib`__pipe+0xa
libcurl.4.dylib`curl_easy_perform+0x77
libaws-cpp-sdk-core.dylib`Aws::Http::CurlHttpClient::MakeRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::Utils::RateLimits::RateLimiterInterface*, Aws::Utils::RateLimits::RateLimiterInterface*) const+0xda6
libaws-cpp-sdk-core.dylib`std::__1::__function::__func<Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const::$_24, std::__1::allocator<Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const::$_24>, std::__1::shared_ptr<Aws::Http::HttpResponse> ()>::operator()()+0x26
libaws-cpp-sdk-core.dylib`std::__1::shared_ptr<Aws::Http::HttpResponse> smithy::components::tracing::TracingUtils::MakeCallWithTiming<std::__1::shared_ptr<Aws::Http::HttpResponse> >(std::__1::function<std::__1::shared_ptr<Aws::Http::HttpResponse> ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, smithy::components::tracing::Meter const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > >&&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+0x48
libaws-cpp-sdk-core.dylib`Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const+0x779
libaws-cpp-sdk-core.dylib`Aws::Client::AWSClient::AttemptExhaustively(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const+0x43b
libaws-cpp-sdk-core.dylib`Aws::Client::AWSXMLClient::MakeRequest(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const+0x42
libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::S3Client::GetBucketRegion(Aws::S3::Model::HeadBucketRequest const&)+0xf2
libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::S3Client::GetBucketRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+0xb8
libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::RegionResolver::ResolveRegionUncached(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+0xda
libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::RegionResolver::ResolveRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+0x12c
libarrow.1600.0.0.dylib`arrow::fs::ResolveS3BucketRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+0x265a
libarrow_gtestd.1.11.0.dylib`void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x
libarrow_gtestd.1.11.0.dylib`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x6a
libarrow_gtestd.1.11.0.dylib`testing::TestInfo::Run()+0x
libarrow_gtestd.1.11.0.dylib`testing::TestSuite::Run()+0x104
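One thing I might try next is a variant that samples for a fixed window and only dumps the aggregated stacks on exit, something like this (untested sketch, run against the hanging test's $PID as above):
sudo dtrace -p $PID -n '
  profile-997 /pid == $target/ { @[ustack(100)] = count(); }
  tick-30s { exit(0); }
' -o stacks.txt
The tick probe just bounds the sampling window, so the output should hopefully show the hottest user stacks rather than a single snapshot.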
More easily perhaps, you can use gdb -p to take control of a running process and debug it. lldb has something similar.
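For example, something along these lines should work (the pid is a placeholder; detach afterwards so the process keeps running):
lldb -p <pid-of-arrow-s3fs-test>
(lldb) bt all
(lldb) detach
(lldb) quit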
Thanks. A bt all gives something similar. In an attempt to catch the test as it approached the timeout, I first got a baseline for how long the test normally takes to complete, which was pretty consistently ~25s. I then ran the test by itself until a run took longer, waited until ~60s, attached using lldb -p, and ran bt all:
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00007ff8179a40aa libsystem_kernel.dylib`__ulock_wait + 10
frame #1: 0x00007ff8179e0c95 libsystem_pthread.dylib`_pthread_join + 358
frame #2: 0x00007ff82c299cda libcurl.4.dylib`Curl_thread_join + 23
frame #3: 0x00007ff82c28c45c libcurl.4.dylib`Curl_resolver_kill + 41
frame #4: 0x00007ff82c2c21a0 libcurl.4.dylib`multi_done + 75
frame #5: 0x00007ff82c2c5876 libcurl.4.dylib`multi_handle_timeout + 291
frame #6: 0x00007ff82c2c4254 libcurl.4.dylib`multi_runsingle + 3542
frame #7: 0x00007ff82c2c3391 libcurl.4.dylib`curl_multi_perform + 212
frame #8: 0x00007ff82c29e7eb libcurl.4.dylib`curl_easy_perform + 299
frame #9: 0x0000000102139df2 libaws-cpp-sdk-core.dylib`Aws::Http::CurlHttpClient::MakeRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::Utils::RateLimits::RateLimiterInterface*, Aws::Utils::RateLimits::RateLimiterInterface*) const + 3494
frame #10: 0x000000010212b61c libaws-cpp-sdk-core.dylib`std::__1::__function::__func<Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const::$_24, std::__1::allocator<Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const::$_24>, std::__1::shared_ptr<Aws::Http::HttpResponse> ()>::operator()() + 38
frame #11: 0x0000000102116e78 libaws-cpp-sdk-core.dylib`std::__1::shared_ptr<Aws::Http::HttpResponse> smithy::components::tracing::TracingUtils::MakeCallWithTiming<std::__1::shared_ptr<Aws::Http::HttpResponse> >(std::__1::function<std::__1::shared_ptr<Aws::Http::HttpResponse> ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, smithy::components::tracing::Meter const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > >&&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 72
frame #12: 0x0000000102113883 libaws-cpp-sdk-core.dylib`Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const + 1913
frame #13: 0x00000001021116af libaws-cpp-sdk-core.dylib`Aws::Client::AWSClient::AttemptExhaustively(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const + 1083
frame #14: 0x000000010211c612 libaws-cpp-sdk-core.dylib`Aws::Client::AWSXMLClient::MakeRequest(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const + 66
frame #15: 0x0000000120bbb4d2 libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::S3Client::GetBucketRegion(Aws::S3::Model::HeadBucketRequest const&) + 242
frame #16: 0x0000000120bbb368 libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::S3Client::GetBucketRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 184
frame #17: 0x0000000120bbac9a libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::RegionResolver::ResolveRegionUncached(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 218
frame #18: 0x0000000120b56d3c libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::RegionResolver::ResolveRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 300
frame #19: 0x0000000120b49ab5 libarrow.1600.0.0.dylib`arrow::fs::ResolveS3BucketRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 613
frame #20: 0x0000000100c7cfd4 arrow-s3fs-test`arrow::fs::S3RegionResolutionTest_NonExistentBucket_Test::TestBody() + 68
frame #21: 0x00000001026db47b libarrow_gtestd.1.11.0.dylib`void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 123
frame #22: 0x000000010269dcba libarrow_gtestd.1.11.0.dylib`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 106
frame #23: 0x000000010269dc03 libarrow_gtestd.1.11.0.dylib`testing::Test::Run() + 195
frame #24: 0x000000010269ed81 libarrow_gtestd.1.11.0.dylib`testing::TestInfo::Run() + 241
frame #25: 0x000000010269fe24 libarrow_gtestd.1.11.0.dylib`testing::TestSuite::Run() + 260
frame #26: 0x00000001026af16b libarrow_gtestd.1.11.0.dylib`testing::internal::UnitTestImpl::RunAllTests() + 987
frame #27: 0x00000001026e04bb libarrow_gtestd.1.11.0.dylib`bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 123
frame #28: 0x00000001026aeb2a libarrow_gtestd.1.11.0.dylib`bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 106
frame #29: 0x00000001026aea15 libarrow_gtestd.1.11.0.dylib`testing::UnitTest::Run() + 197
frame #30: 0x00000001011e0f21 libarrow_gtest_maind.1.11.0.dylib`RUN_ALL_TESTS() + 17
frame #31: 0x00000001011e0efd libarrow_gtest_maind.1.11.0.dylib`main + 61
frame #32: 0x000000010afe352e dyld`start + 462
thread #2
frame #0: 0x00007ff8179a401a libsystem_kernel.dylib`__workq_kernreturn + 10
frame #1: 0x00007ff8179dc034 libsystem_pthread.dylib`_pthread_wqthread + 426
frame #2: 0x00007ff8179daf57 libsystem_pthread.dylib`start_wqthread + 15
thread #3, name = 'AwsEventLoop 1'
frame #0: 0x00007ff8179a730e libsystem_kernel.dylib`kevent + 10
frame #1: 0x0000000101fe99ec libaws-c-io.1.0.0.dylib`aws_event_loop_thread + 404
frame #2: 0x00000001024ffd3d libaws-c-common.1.0.0.dylib`thread_fn + 345
frame #3: 0x00007ff8179df4e1 libsystem_pthread.dylib`_pthread_start + 125
frame #4: 0x00007ff8179daf6b libsystem_pthread.dylib`thread_start + 15
thread #4
frame #0: 0x00007ff8179a730e libsystem_kernel.dylib`kevent + 10
frame #1: 0x00007ff817a052ee libsystem_info.dylib`_mdns_search_ex + 1217
frame #2: 0x00007ff817a07c1c libsystem_info.dylib`mdns_addrinfo + 483
frame #3: 0x00007ff817a079e4 libsystem_info.dylib`search_addrinfo + 164
frame #4: 0x00007ff817a00659 libsystem_info.dylib`si_addrinfo + 1811
frame #5: 0x00007ff8179ffeac libsystem_info.dylib`getaddrinfo + 176
frame #6: 0x00007ff82c296186 libcurl.4.dylib`Curl_getaddrinfo_ex + 36
frame #7: 0x00007ff82c28ca24 libcurl.4.dylib`getaddrinfo_thread + 83
frame #8: 0x00007ff82c299c8d libcurl.4.dylib`curl_thread_create_thunk + 29
frame #9: 0x00007ff8179df4e1 libsystem_pthread.dylib`_pthread_start + 125
frame #10: 0x00007ff8179daf6b libsystem_pthread.dylib`thread_start + 15
Thanks a lot @amoeba. It appears therefore that it is timing out when trying to join a thread using Curl_thread_join. This is quite unexpected, though there seems to be a similar issue reported on the Internet: https://github.com/nim-lang/Nim/issues/15019#issuecomment-661208069
Edit: on second look, the sentence above is incorrect. There is a thread pending on Curl_getaddrinfo_ex, which means that the DNS request is still ongoing. Presumably DNS requests on macOS can take a long time to time out?
It would be nice if you could try again to catch another backtrace, so that we can see if it's always the same test timing out. If it is, we could just disable that test on macOS...
Even better: can you try getting a backtrace later? It seems the DNS resolution timeout on macOS is 30 seconds.
I've tried reenabling it but it still times out
@amoeba Were you able to make progress on the diagnosis here? Otherwise, perhaps we can simply skip the NonExistentBucket test on macOS, which would hopefully allow re-enabling the other S3 tests on CI.
Yeah, I can pick this back up this week. Sorry I dropped it!
I spent some time Saturday trying to reproduce the timeout like I'd done before and couldn't until I was about to head out so I'll keep trying this week.
I think I was able to reproduce the hang this morning and I collected 6 backtraces spread out over 1-2 minutes that were all essentially identical:
This trace is sufficiently different from the previous one that I'm going to try to capture another trace later today. The earlier trace was in a different test and stuck on a different syscall. Will report back.
I wasn't able to reproduce the hang and get a debugger attached after numerous attempts today. I can keep trying, but I'm not sure what a good next step would be. The process I've been taking is to run a modified cpp.yml workflow file, conditionally start a tmate session if the tests time out, SSH into the VM, re-run the tests and wait until things appear to have timed out, then finally attach via lldb.
Thanks @amoeba . I think we're quite clear now that every hang that could be reproduced was simply stalled waiting for a network response, so it's probably not a bug in our S3 filesystem at least.
We could perhaps try to disable the non-existent bucket tests on macOS to see if that makes our CI more reliable...
Describe the bug, including details regarding any error messages, version, and platform.
The arrow-s3fs-test has been failing due to what appears to be a timeout on the "AMD64 macOS 12 C++" job for a while. A recent example is: https://github.com/apache/arrow/actions/runs/8179809257/job/22366589254?pr=40373#step:13:265 and the relevant output is:
gist for posterity.
I expect this causes a lot of noise for maintainers, so figuring out why this test is timing out would be good.
Component(s)
C++, Continuous Integration