apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[CI][FS][Azure] Azurite tests are flaking on `main` #40121

Open Tom-Newton opened 8 months ago

Tom-Newton commented 8 months ago

Describe the bug, including details regarding any error messages, version, and platform.

The Azurite tests fail intermittently with errors like:

C++ exception with description "Connection closed before getting full response or response is less than expected. Expected response length = 254. Read until now = 231" thrown in the test body.
2024-02-18T12:50:20.039Z ada6933e-9c33-47d2-86f6-29e9aa01f713 info: BlobStorageContextMiddleware: RequestMethod=DELETE RequestURL=http://127.0.0.1/devstoreaccount1/container?restype=container RequestHeaders:{"authorization":"SharedKey devstoreaccount1:hYh+JRj5cBYqdqOyM2wB3EZizQ/s2DiIoDI0CIF2EXM=","host":"127.0.0.1:10000","user-agent":"azsdk-cpp-storage-blobs/12.10.0-beta.1 (Linux 6.2.0-1019-azure x86_64 #19~22.04.1-Ubuntu SMP Wed Jan 10 22:57:03 UTC 2024)","x-ms-client-request-id":"be6819a2-72b8-4630-8eb0-4a88e7cb3061","x-ms-date":"Sun, 18 Feb 2024 12:50:20 GMT","x-ms-version":"2022-11-02"} ClientIP=127.0.0.1 Protocol=http HTTPVersion=1.1

I've seen this occur in different test cases and in different test suites.

Example failures:

One flake on main:
https://github.com/apache/arrow/actions/runs/7915689559/job/21608061673

Flakes on my recent PRs:
https://github.com/apache/arrow/actions/runs/7951594516/job/21705210845?pr=40080
https://github.com/apache/arrow/actions/runs/7949050250/job/21699789831?pr=40080

Component(s)

C++, Continuous Integration

kou commented 8 months ago

Hmm, it seems that the failed tests aren't the same... Can we re-run failed tests? For example:

diff --git a/ci/scripts/cpp_test.sh b/ci/scripts/cpp_test.sh
index 1d685c51a9..a23ea8eb1c 100755
--- a/ci/scripts/cpp_test.sh
+++ b/ci/scripts/cpp_test.sh
@@ -86,6 +86,7 @@ ctest \
     --label-regex unittest \
     --output-on-failure \
     --parallel ${n_jobs} \
+    --repeat until-pass:3 \
     --timeout ${ARROW_CTEST_TIMEOUT:-300} \
     "${ctest_options[@]}" \
     "$@"

Tom-Newton commented 8 months ago

I expect retries would be an effective mitigation.
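
For anyone unfamiliar with the option, `--repeat until-pass:3` (available in CMake 3.17 and newer) re-runs each failing test until it passes, up to three runs in total. The sketch below only illustrates that behaviour in plain bash; `retry_until_pass` and `flaky` are hypothetical names made up for this example and are not part of `ci/scripts/cpp_test.sh` or ctest.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration, not Arrow or ctest code):
# run a command up to max_attempts times and stop at the first success.
# This roughly mirrors what `ctest --repeat until-pass:<n>` does per test.
retry_until_pass() {
  local max_attempts="$1"
  shift
  local attempt
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    if "$@"; then
      return 0
    fi
    echo "attempt ${attempt}/${max_attempts} failed: $*" >&2
  done
  return 1
}

# Hypothetical stand-in for a flaky test: fails on its first two calls,
# then succeeds. The call count is kept in a temporary file.
counter_file="$(mktemp)"
echo 0 > "${counter_file}"
flaky() {
  local n
  n=$(($(cat "${counter_file}") + 1))
  echo "${n}" > "${counter_file}"
  [ "${n}" -ge 3 ]
}

retry_until_pass 3 flaky && echo "passed after $(cat "${counter_file}") attempts"
# prints "passed after 3 attempts"
```

Retries like this hide the flake rather than fix it, so the underlying connection error would presumably still need investigating.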