DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform
Other
2.62k stars 554 forks source link

client.detach_test failing: kill cannot find target process #6536

Open derekbruening opened 8 months ago

derekbruening commented 8 months ago

The Linux client.detach_test was just added in PR #6513 and it is failing (non-deterministically) on x86-64.

https://github.com/DynamoRIO/dynamorio/actions/runs/7381905578/job/20080993968?pr=6531

      Test #185: code_api|client.detach_test ......................................***Failed  Required regular expression not found. Regex=[^starting
thank you for testing detach
thread init
detach
done
$
]  0.26 sec
CMake Error at /home/runner/work/dynamorio/dynamorio/suite/tests/runall.cmake:130 (message):
  *** kill failed (1): kill: (24600): No such process
  ***
Call Stack (most recent call first):
  /home/runner/work/dynamorio/dynamorio/suite/tests/runall.cmake:317 (kill_background_process)
derekbruening commented 8 months ago

Xref client.attach_test failing with a similar error; the frequency of its failures may have increased with PR #6513? Maybe not enough data to say for that but it failed 2 out of 4 times in recent PR's.

derekbruening commented 8 months ago

@onroadmuwl hoping you can help eliminate the flakiness in this new test that you added

onroadmuwl commented 8 months ago

@onroadmuwl hoping you can help eliminate the flakiness in this new test that you added

I'm sorry to hear that. I'm not able to reproduce this problem in my local testing. And I haven't seen similar problems when I submitted code for previous PR. I guess there may be a conflict with the newly submitted code. I'm trying to debug the test suite step by step in a new PR #6537. If I delete the test code for Linux's detachment, the attach test works well. So I think there may be some problems in the detach test instead of real detach code's problem.

derekbruening commented 8 months ago

The attach test was failing every once in a while with the same output: #6452. So it may be the same underlying pre-existing issue that is just more frequent in the detach test.

From the output it looks like the attach never happened b/c it never prints "thank you for testing..."?

derekbruening commented 8 months ago

@onroadmuwl did you try running many times inside the ctest framework to see if it reproduces non-deterministically on your local machine? E.g.:

$ ctest --repeat-until-fail 500 -R client.detach
Test project /usr/local/google/home/bruening/dr/git/build_x64_dbg_tests
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.35 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.34 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.34 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.34 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.34 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......***Failed  Required regular expression not found. Regex=[^starting
thank you for testing detach
thread init
detach
done
$
]  0.44 sec

0% tests passed, 1 tests failed out of 1
onroadmuwl commented 8 months ago

The attachis confirmed by a variable saw_attach_event because of the uncertain order of thread init and attach. And we never wait for the output of thank you for testing... inside the runall.cmake file.

dr_exit(void)
{
    if (!saw_attach_event)
        dr_fprintf(STDERR, "Error: never saw attach event!\n");
#ifdef WINDOWS
    dr_fprintf(STDERR, "done\n");
#else
        /* The app prints 'done' for us. */
#endif
}

Is it possible to solve this problem by letting the detach test execute by launching instead of attaching?

onroadmuwl commented 8 months ago

@onroadmuwl did you try running many times inside the ctest framework to see if it reproduces non-deterministically on your local machine? E.g.:

$ ctest --repeat-until-fail 500 -R client.detach
Test project /usr/local/google/home/bruening/dr/git/build_x64_dbg_tests
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.35 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.34 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.34 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.34 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.34 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......***Failed  Required regular expression not found. Regex=[^starting
thank you for testing detach
thread init
detach
done
$
]  0.44 sec

0% tests passed, 1 tests failed out of 1

Thank you, this method is quite useful. But it seems that this error output in local testing isn't equal to the output of github's test suite.

177/468 Testing: code_api|client.detach_test
177/468 Test: code_api|client.detach_test
Command: "/usr/bin/cmake" "-D" "toolbindir=/home/wlmu/dynamorio/build/bin64" "-D" "cmd=/home/wlmu/dynamorio/build/bin64/run_in_bg@-out@/home/wlmu/dynamorio/build/suite/tests/client.detach_test-out@/home/wlmu/dynamorio/build/bin64/drrun@-no_inject@--@/home/wlmu/dynamorio/build/suite/tests/bin/linux.infloop@-v" "-D" "out=/home/wlmu/dynamorio/build/suite/tests/client.detach_test-out" "-D" "nudge=<detach>@-quiet@-stderr_mask@0xC@-dumpcore_mask@0@-code_api@-c@/home/wlmu/dynamorio/build/suite/tests/bin/libclient.detach_test.dll.so" "-D" "clear=" "-D" "pidfile=" "-P" "/home/wlmu/dynamorio/suite/tests/runall.cmake"
Directory: /home/wlmu/dynamorio/build/suite/tests
"code_api|client.detach_test" start time: Jan 03 23:23 CST
Output:
----------------------------------------------------------
<end of output>
Test time = 1500.13 sec
----------------------------------------------------------
Test Fail Reason:
Required regular expression not found. Regex=[^starting
thank you for testing detach
thread init
detach
done
$
]
"code_api|client.detach_test" end time: Jan 03 23:48 CST
"code_api|client.detach_test" time elapsed: 00:25:00
----------------------------------------------------------

End testing: Jan 03 23:48 CST
derekbruening commented 8 months ago

Your local run times out (1500s == 25:00). BTW that is a long timeout: maybe it should be shortened (xref #6127). It is still definitely worth debugging and fixing that local timeout -- aarch64 sees a timeout (#5740). Plus the fix may end up affecting the GA behavior too.

onroadmuwl commented 8 months ago

After completing next week's exams, I will attempt to address the above questions.

derekbruening commented 8 months ago

Sure, thank you for your efforts.

onroadmuwl commented 8 months ago

It seems that the flakiness of attach and detach has disappeared inexplicably in PR 6537. In the current code, the done string is printed in infloop.c. With reference to Windows, I once thought that we can reduce attach's flakiness by print donein the attach_test.dll.c. The previous tests follow this rule. But we can't get such a conclusion in recent tests.

derekbruening commented 8 months ago

It seems that the flakiness of attach and detach has disappeared inexplicably in https://github.com/DynamoRIO/dynamorio/pull/6537.

You mean you can't reproduce the failures on the Github Actions (GA) VM's? I thought you were able to reproduce locally by running --repeat-until-fail or just a shell loop?

onroadmuwl commented 8 months ago

You mean you can't reproduce the failures on the Github Actions (GA) VM's? I thought you were able to reproduce locally by running --repeat-until-fail or just a shell loop? The flakiness is caused by kill cannot find target process, I can't reproduce the same error on my machine.

derekbruening commented 8 months ago

Hmm, it did repro on my machine every so often as shown at https://github.com/DynamoRIO/dynamorio/issues/6536#issuecomment-1875548248. Not sure when I will have time to try to debug it though.

onroadmuwl commented 8 months ago

The error of my local testing is 120 - code_api|client.detach_test (Timeout), I can fix my error through the newest code in PR 6537, which modifies the position of printing "done\n". But it can't always fix the errors of Github Actions (GA) VM. You can try to see if this modification works on your machine.

    Start 120: code_api|client.detach_test
    Test #120: code_api|client.detach_test ......   Passed    0.25 sec
    Start 120: code_api|client.detach_test
    Test #120: code_api|client.detach_test ......   Passed    0.25 sec
    Start 120: code_api|client.detach_test
    Test #120: code_api|client.detach_test ......   Passed    0.25 sec
     ......
    Start 120: code_api|client.detach_test
    Test #120: code_api|client.detach_test ......***Timeout 1500.16 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) = 1520.32 sec

The following tests FAILED:
        120 - code_api|client.detach_test (Timeout)
Errors while running CTest
derekbruening commented 8 months ago

Does your timeout fix solve the timeouts that we see in these tests in the ci-aarch64-native run on Github Actions? If so, or even if not, it is worth checking in any improvement.

derekbruening commented 8 months ago

The timeouts turn out to be just missing ptrace privs: #6558

onroadmuwl commented 8 months ago

I have tested again in PR 6537, the detach test can be passed smoothly now. Would you like to tell me why the detach test was passed sometimes without setting ptrace_scope as 0 ? I'm curious about how the test did it.

derekbruening commented 8 months ago

I have tested again in PR 6537, the detach test can be passed smoothly now. Would you like to tell me why the detach test was passed sometimes without setting ptrace_scope as 0 ? I'm curious about how the test did it.

I think it failed every single time on the aarch64 machine due to the ptrace_scope.

derekbruening commented 8 months ago

I have tested again in https://github.com/DynamoRIO/dynamorio/pull/6537, the detach test can be passed smoothly now.

Oh you mean you figured out the error about not finding the process to kill and having no output?

derekbruening commented 8 months ago

I have tested again in #6537, the detach test can be passed smoothly now.

Oh you mean you figured out the error about not finding the process to kill and having no output?

Please change #6537 from Draft and request a review.

onroadmuwl commented 8 months ago

All right,I haven‘t find the solution of this problem. I misunderstand that the flakiness on X64 is also caused by ptrace privilege.

derekbruening commented 8 months ago

This reproduces in release build on a local machine.

Success:

183: Running |dr/git/build_x64_rel_tests/bin64/run_in_bg;-out;dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out;dr/git/build_x64_rel_tests/bin64/drrun;-no_inject;--;dr/git/build_x64_rel_tests/suite/tests/bin/linux.infloop;-v|
183: Waiting for |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out|
183: Waiting for |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out| contents
183: Running |run_in_bg dr/git/build_x64_rel_tests/bin64/drrun;-attach;3620449;-takeover_sleep;-takeovers;1000;-quiet;-stderr_mask;0xC;-dumpcore_mask;0;-code_api;-c;dr/git/build_x64_rel_tests/suite/tests/bin/libclient.detach_test.dll.so|
183: Waiting for thread init in |starting
183: |
183: Running |drconfig -detach 3620449|
183: Waiting for detach in |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out|
183: Waiting for done in |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out|
183: starting
183: thank you for testing detach
183: thread init
183: detach
183: done

Failure:

183: Running |dr/git/build_x64_rel_tests/bin64/run_in_bg;-out;dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out;dr/git/build_x64_rel_tests/bin64/drrun;-no_inject;--;dr/git/build_x64_rel_tests/suite/tests/bin/linux.infloop;-v|
183: Waiting for |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out|
183: Waiting for |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out| contents
183: Running |run_in_bg dr/git/build_x64_rel_tests/bin64/drrun;-attach;3616219;-takeover_sleep;-takeovers;1000;-quiet;-stderr_mask;0xC;-dumpcore_mask;0;-code_api;-c;dr/git/build_x64_rel_tests/suite/tests/bin/libclient.detach_test.dll.so|
183: Waiting for thread init in |starting
183: |
183: Running |drconfig -detach 3616219|
183: Waiting for detach in |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out|
183: CMake Error at dr/git/src/suite/tests/runall.cmake:131 (message):
183:   *** kill failed (1): /usr/bin/kill: (3616219): No such process

The file:

$ cat dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out
starting
thank you for testing detach
thread init
detach
derekbruening commented 8 months ago

That kill that failed I believe is sending SIGTERM which is the expected method for the infloop app to exit. So I would assume that the detach crashed the process (after printing "detach") and so it's gone by the time the SIGTERM is sent.

onroadmuwl commented 7 months ago

If the detach crashed the process on X86-64 platform sometimes , it may happen at the stage of setting nudged thread's context. On the X86-64 platform, I didn't use the methods of sigreturn before due to the lack of fpstate information inside dcontext. But the fpstate information can be also obtained by the sc(sigcontext). So I create a new pr in https://github.com/DynamoRIO/dynamorio/pull/6579 to detach the process by sigreturn on X86-64 platform. It works well on my local machine(repeat-until-fail). Could you try to see if it works on your machine stably?

derekbruening commented 7 months ago

It still fails locally and on the x86-64 job (after a re-run; passed 1st time): https://github.com/DynamoRIO/dynamorio/pull/6579#issuecomment-1909049628 You can test it by just re-running the x86-64 job as it seems to fail at least half the time.

We are going to have to mark this test as flaky to get our suite green. Please continue working on it and when the detach feature is fully working we can un-mark the test.

onroadmuwl commented 7 months ago

I'll work on it and submit the related code if feasible solutions are found.

derekbruening commented 7 months ago

So you still cannot reproduce on any of your local machines, and you couldn't reproduce under tmate on the Github Actions runners here? It is failing every other time so you'd think it would reproduce under tmate on the Actions runners.

onroadmuwl commented 7 months ago

I have reproduced the failure result of timeout on another new x86-64 machine. And I fixed this error by printing the done in detach_test.dll.c instead of infloop.c. It works well on the new machine now. I re-ran the GA action more than five time and the failure of detach_test hasn't appeared (PR https://github.com/DynamoRIO/dynamorio/pull/6579). Could you please try to test it on your machine again?

By the way, all of my modifications are based on the previous version of DynamoRIO. There will be the following error when I try to build the newest DynamoRIO. Would you like to tell me why this error occurred if you ever met it?

-- Using elfutils
CMake Error at ext/drsyms/CMakeLists.txt:176 (message):
  Failed to find single source for
  /home/wlmu/dynamorio-detach_x86/ext/drsyms/elfutils/dwarf_begin_elf.c.patch

-- Configuring incomplete, errors occurred!
See also "/home/wlmu/dynamorio-detach_x86/build/CMakeFiles/CMakeOutput.log".
See also "/home/wlmu/dynamorio-detach_x86/build/CMakeFiles/CMakeError.log".
derekbruening commented 7 months ago

Did you initialize the elfutils submodule? I.e., is the directory empty? It needs to be initialized.

derekbruening commented 7 months ago

And I fixed this error by printing the done in detach_test.dll.c instead of infloop.c

But wouldn't this method fail to catch a post-detach crash? How would we know the application survived past the detach?

onroadmuwl commented 7 months ago

But wouldn't this method fail to catch a post-detach crash? How would we know the application survived past the detach?

In the 317 line of suite/tests/runall.cmake file, there is a "_kill_backgroundprocess(OFF)" command. This command is executed after detaching and before printing done. If the program is crashed after detaching, there will print an error like "kill cannot find target process". So it can ensure that the program isn't crashed. And this modification refers to the detachment test on Windows platform.

onroadmuwl commented 7 months ago

Did you initialize the elfutils submodule? I.e., is the directory empty? It needs to be initialized.

It works for me, thanks.

onroadmuwl commented 7 months ago

But wouldn't this method fail to catch a post-detach crash? How would we know the application survived past the detach?

In the 317 line of suite/tests/runall.cmake file, there is a "_kill_backgroundprocess(OFF)" command. This command is executed after detaching and before printing done. If the program is crashed after detaching, there will print an error like "kill cannot find target process". So it can ensure that the program isn't crashed. And this modification refers to the detachment test on Windows platform.

It's also unstable sometimes. https://github.com/DynamoRIO/dynamorio/actions/runs/7698709453/job/20978621263?pr=6579