Open derekbruening opened 8 months ago
Xref client.attach_test failing with a similar error; the frequency of its failures may have increased with PR #6513? Maybe not enough data to say for that but it failed 2 out of 4 times in recent PR's.
@onroadmuwl hoping you can help eliminate the flakiness in this new test that you added
@onroadmuwl hoping you can help eliminate the flakiness in this new test that you added
I'm sorry to hear that. I'm not able to reproduce this problem in my local testing. And I haven't seen similar problems when I submitted code for previous PR. I guess there may be a conflict with the newly submitted code. I'm trying to debug the test suite step by step in a new PR #6537. If I delete the test code for Linux's detachment, the attach test works well. So I think there may be some problems in the detach test instead of real detach code's problem.
The attach test was failing every once in a while with the same output: #6452. So it may be the same underlying pre-existing issue that is just more frequent in the detach test.
From the output it looks like the attach never happened b/c it never prints "thank you for testing..."?
@onroadmuwl did you try running many times inside the ctest framework to see if it reproduces non-deterministically on your local machine? E.g.:
$ ctest --repeat-until-fail 500 -R client.detach
Test project /usr/local/google/home/bruening/dr/git/build_x64_dbg_tests
Start 185: code_api|client.detach_test
Test #185: code_api|client.detach_test ...... Passed 0.35 sec
Start 185: code_api|client.detach_test
Test #185: code_api|client.detach_test ...... Passed 0.34 sec
Start 185: code_api|client.detach_test
Test #185: code_api|client.detach_test ...... Passed 0.34 sec
Start 185: code_api|client.detach_test
Test #185: code_api|client.detach_test ...... Passed 0.34 sec
Start 185: code_api|client.detach_test
Test #185: code_api|client.detach_test ...... Passed 0.34 sec
Start 185: code_api|client.detach_test
Test #185: code_api|client.detach_test ......***Failed Required regular expression not found. Regex=[^starting
thank you for testing detach
thread init
detach
done
$
] 0.44 sec
0% tests passed, 1 tests failed out of 1
The attach
is confirmed by a variable saw_attach_event
because of the uncertain order of thread init
and attach
. And we never wait for the output of thank you for testing...
inside the runall.cmake file.
dr_exit(void)
{
if (!saw_attach_event)
dr_fprintf(STDERR, "Error: never saw attach event!\n");
#ifdef WINDOWS
dr_fprintf(STDERR, "done\n");
#else
/* The app prints 'done' for us. */
#endif
}
Is it possible to solve this problem by letting the detach test execute by launching instead of attaching?
@onroadmuwl did you try running many times inside the ctest framework to see if it reproduces non-deterministically on your local machine? E.g.:
$ ctest --repeat-until-fail 500 -R client.detach Test project /usr/local/google/home/bruening/dr/git/build_x64_dbg_tests Start 185: code_api|client.detach_test Test #185: code_api|client.detach_test ...... Passed 0.35 sec Start 185: code_api|client.detach_test Test #185: code_api|client.detach_test ...... Passed 0.34 sec Start 185: code_api|client.detach_test Test #185: code_api|client.detach_test ...... Passed 0.34 sec Start 185: code_api|client.detach_test Test #185: code_api|client.detach_test ...... Passed 0.34 sec Start 185: code_api|client.detach_test Test #185: code_api|client.detach_test ...... Passed 0.34 sec Start 185: code_api|client.detach_test Test #185: code_api|client.detach_test ......***Failed Required regular expression not found. Regex=[^starting thank you for testing detach thread init detach done $ ] 0.44 sec 0% tests passed, 1 tests failed out of 1
Thank you, this method is quite useful. But it seems that this error output in local testing isn't equal to the output of github's test suite.
177/468 Testing: code_api|client.detach_test
177/468 Test: code_api|client.detach_test
Command: "/usr/bin/cmake" "-D" "toolbindir=/home/wlmu/dynamorio/build/bin64" "-D" "cmd=/home/wlmu/dynamorio/build/bin64/run_in_bg@-out@/home/wlmu/dynamorio/build/suite/tests/client.detach_test-out@/home/wlmu/dynamorio/build/bin64/drrun@-no_inject@--@/home/wlmu/dynamorio/build/suite/tests/bin/linux.infloop@-v" "-D" "out=/home/wlmu/dynamorio/build/suite/tests/client.detach_test-out" "-D" "nudge=<detach>@-quiet@-stderr_mask@0xC@-dumpcore_mask@0@-code_api@-c@/home/wlmu/dynamorio/build/suite/tests/bin/libclient.detach_test.dll.so" "-D" "clear=" "-D" "pidfile=" "-P" "/home/wlmu/dynamorio/suite/tests/runall.cmake"
Directory: /home/wlmu/dynamorio/build/suite/tests
"code_api|client.detach_test" start time: Jan 03 23:23 CST
Output:
----------------------------------------------------------
<end of output>
Test time = 1500.13 sec
----------------------------------------------------------
Test Fail Reason:
Required regular expression not found. Regex=[^starting
thank you for testing detach
thread init
detach
done
$
]
"code_api|client.detach_test" end time: Jan 03 23:48 CST
"code_api|client.detach_test" time elapsed: 00:25:00
----------------------------------------------------------
End testing: Jan 03 23:48 CST
Your local run times out (1500s == 25:00). BTW that is a long timeout: maybe it should be shortened (xref #6127). It is still definitely worth debugging and fixing that local timeout -- aarch64 sees a timeout (#5740). Plus the fix may end up affecting the GA behavior too.
After completing next week's exams, I will attempt to address the above questions.
Sure, thank you for your efforts.
It seems that the flakiness of attach and detach has disappeared inexplicably in PR 6537. In the current code, the done
string is printed in infloop.c. With reference to Windows, I once thought that we can reduce attach's flakiness by print done
in the attach_test.dll.c. The previous tests follow this rule. But we can't get such a conclusion in recent tests.
It seems that the flakiness of attach and detach has disappeared inexplicably in https://github.com/DynamoRIO/dynamorio/pull/6537.
You mean you can't reproduce the failures on the Github Actions (GA) VM's? I thought you were able to reproduce locally by running --repeat-until-fail
or just a shell loop?
You mean you can't reproduce the failures on the Github Actions (GA) VM's? I thought you were able to reproduce locally by running
--repeat-until-fail
or just a shell loop? The flakiness is caused bykill cannot find target process
, I can't reproduce the same error on my machine.
Hmm, it did repro on my machine every so often as shown at https://github.com/DynamoRIO/dynamorio/issues/6536#issuecomment-1875548248. Not sure when I will have time to try to debug it though.
The error of my local testing is 120 - code_api|client.detach_test (Timeout)
, I can fix my error through the newest code in PR 6537, which modifies the position of printing "done\n"
. But it can't always fix the errors of Github Actions (GA) VM. You can try to see if this modification works on your machine.
Start 120: code_api|client.detach_test
Test #120: code_api|client.detach_test ...... Passed 0.25 sec
Start 120: code_api|client.detach_test
Test #120: code_api|client.detach_test ...... Passed 0.25 sec
Start 120: code_api|client.detach_test
Test #120: code_api|client.detach_test ...... Passed 0.25 sec
......
Start 120: code_api|client.detach_test
Test #120: code_api|client.detach_test ......***Timeout 1500.16 sec
0% tests passed, 1 tests failed out of 1
Total Test time (real) = 1520.32 sec
The following tests FAILED:
120 - code_api|client.detach_test (Timeout)
Errors while running CTest
Does your timeout fix solve the timeouts that we see in these tests in the ci-aarch64-native
run on Github Actions? If so, or even if not, it is worth checking in any improvement.
The timeouts turn out to be just missing ptrace privs: #6558
I have tested again in PR 6537, the detach test can be passed smoothly now. Would you like to tell me why the detach test was passed sometimes without setting ptrace_scope as 0 ? I'm curious about how the test did it.
I have tested again in PR 6537, the detach test can be passed smoothly now. Would you like to tell me why the detach test was passed sometimes without setting ptrace_scope as 0 ? I'm curious about how the test did it.
I think it failed every single time on the aarch64 machine due to the ptrace_scope.
I have tested again in https://github.com/DynamoRIO/dynamorio/pull/6537, the detach test can be passed smoothly now.
Oh you mean you figured out the error about not finding the process to kill and having no output?
I have tested again in #6537, the detach test can be passed smoothly now.
Oh you mean you figured out the error about not finding the process to kill and having no output?
Please change #6537 from Draft and request a review.
All right,I haven‘t find the solution of this problem. I misunderstand that the flakiness on X64 is also caused by ptrace privilege.
This reproduces in release build on a local machine.
Success:
183: Running |dr/git/build_x64_rel_tests/bin64/run_in_bg;-out;dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out;dr/git/build_x64_rel_tests/bin64/drrun;-no_inject;--;dr/git/build_x64_rel_tests/suite/tests/bin/linux.infloop;-v|
183: Waiting for |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out|
183: Waiting for |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out| contents
183: Running |run_in_bg dr/git/build_x64_rel_tests/bin64/drrun;-attach;3620449;-takeover_sleep;-takeovers;1000;-quiet;-stderr_mask;0xC;-dumpcore_mask;0;-code_api;-c;dr/git/build_x64_rel_tests/suite/tests/bin/libclient.detach_test.dll.so|
183: Waiting for thread init in |starting
183: |
183: Running |drconfig -detach 3620449|
183: Waiting for detach in |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out|
183: Waiting for done in |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out|
183: starting
183: thank you for testing detach
183: thread init
183: detach
183: done
Failure:
183: Running |dr/git/build_x64_rel_tests/bin64/run_in_bg;-out;dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out;dr/git/build_x64_rel_tests/bin64/drrun;-no_inject;--;dr/git/build_x64_rel_tests/suite/tests/bin/linux.infloop;-v|
183: Waiting for |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out|
183: Waiting for |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out| contents
183: Running |run_in_bg dr/git/build_x64_rel_tests/bin64/drrun;-attach;3616219;-takeover_sleep;-takeovers;1000;-quiet;-stderr_mask;0xC;-dumpcore_mask;0;-code_api;-c;dr/git/build_x64_rel_tests/suite/tests/bin/libclient.detach_test.dll.so|
183: Waiting for thread init in |starting
183: |
183: Running |drconfig -detach 3616219|
183: Waiting for detach in |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out|
183: CMake Error at dr/git/src/suite/tests/runall.cmake:131 (message):
183: *** kill failed (1): /usr/bin/kill: (3616219): No such process
The file:
$ cat dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out
starting
thank you for testing detach
thread init
detach
That kill that failed I believe is sending SIGTERM which is the expected method for the infloop app to exit. So I would assume that the detach crashed the process (after printing "detach") and so it's gone by the time the SIGTERM is sent.
If the detach crashed the process on X86-64 platform sometimes , it may happen at the stage of setting nudged thread's context. On the X86-64 platform, I didn't use the methods of sigreturn before due to the lack of fpstate information inside dcontext. But the fpstate information can be also obtained by the sc(sigcontext). So I create a new pr in https://github.com/DynamoRIO/dynamorio/pull/6579 to detach the process by sigreturn on X86-64 platform. It works well on my local machine(repeat-until-fail). Could you try to see if it works on your machine stably?
It still fails locally and on the x86-64 job (after a re-run; passed 1st time): https://github.com/DynamoRIO/dynamorio/pull/6579#issuecomment-1909049628 You can test it by just re-running the x86-64 job as it seems to fail at least half the time.
We are going to have to mark this test as flaky to get our suite green. Please continue working on it and when the detach feature is fully working we can un-mark the test.
I'll work on it and submit the related code if feasible solutions are found.
So you still cannot reproduce on any of your local machines, and you couldn't reproduce under tmate on the Github Actions runners here? It is failing every other time so you'd think it would reproduce under tmate on the Actions runners.
I have reproduced the failure result of timeout on another new x86-64 machine. And I fixed this error by printing the done in detach_test.dll.c instead of infloop.c. It works well on the new machine now. I re-ran the GA action more than five time and the failure of detach_test hasn't appeared (PR https://github.com/DynamoRIO/dynamorio/pull/6579). Could you please try to test it on your machine again?
By the way, all of my modifications are based on the previous version of DynamoRIO. There will be the following error when I try to build the newest DynamoRIO. Would you like to tell me why this error occurred if you ever met it?
-- Using elfutils
CMake Error at ext/drsyms/CMakeLists.txt:176 (message):
Failed to find single source for
/home/wlmu/dynamorio-detach_x86/ext/drsyms/elfutils/dwarf_begin_elf.c.patch
-- Configuring incomplete, errors occurred!
See also "/home/wlmu/dynamorio-detach_x86/build/CMakeFiles/CMakeOutput.log".
See also "/home/wlmu/dynamorio-detach_x86/build/CMakeFiles/CMakeError.log".
Did you initialize the elfutils submodule? I.e., is the directory empty? It needs to be initialized.
And I fixed this error by printing the done in detach_test.dll.c instead of infloop.c
But wouldn't this method fail to catch a post-detach crash? How would we know the application survived past the detach?
But wouldn't this method fail to catch a post-detach crash? How would we know the application survived past the detach?
In the 317 line of suite/tests/runall.cmake file, there is a "_kill_backgroundprocess(OFF)" command. This command is executed after detaching and before printing done. If the program is crashed after detaching, there will print an error like "kill cannot find target process". So it can ensure that the program isn't crashed. And this modification refers to the detachment test on Windows platform.
Did you initialize the elfutils submodule? I.e., is the directory empty? It needs to be initialized.
It works for me, thanks.
But wouldn't this method fail to catch a post-detach crash? How would we know the application survived past the detach?
In the 317 line of suite/tests/runall.cmake file, there is a "_kill_backgroundprocess(OFF)" command. This command is executed after detaching and before printing done. If the program is crashed after detaching, there will print an error like "kill cannot find target process". So it can ensure that the program isn't crashed. And this modification refers to the detachment test on Windows platform.
It's also unstable sometimes. https://github.com/DynamoRIO/dynamorio/actions/runs/7698709453/job/20978621263?pr=6579
The Linux client.detach_test was just added in PR #6513 and it is failing (non-deterministically) on x86-64.
https://github.com/DynamoRIO/dynamorio/actions/runs/7381905578/job/20080993968?pr=6531