ASSERT: Race/crash within DR dispatch when performing unlink_flush during app thread creation

nextsilicon-itay-bookstein commented 4 years ago

Describe the bug I wrote a piece of minimally adaptive instrumentation for indirect branches. To adapt to dynamically discovered indirect branch targets, this implementation calls dr_unlink_flush_region (I also tried delay_flush) from a clean-call, flushing the fragment to which it's going to return so that my instrumentation code will be dynamically reconstructed with the newly discovered target.

A series of spurious crashes in the fragment unlink flow, application register corruptions, and other fun things led me to try and debug the problem and narrow the repro down to something more minimal, and I traced it down to the usage of dr_unlink_flush_region at the time that the application is creating a lot of threads. Because it was flaky/racy/non-deterministic, I tried to force/stress the problem by unlinking much more aggresively (I had a high threshold before deciding to unlink).

In addition, I tried using a Debug DR build from most recent master. The assert I encountered is this:

Internal Error: DynamoRIO debug check failure: ../core/dispatch.c:757 wherewasi == DR_WHERE_FCACHE || wherewasi == DR_WHERE_TRAMPOLINE || wherewasi == DR_WHERE_APP || (dcontext->go_native && wherewasi == DR_WHERE_DISPATCH)

Adding a print revealed that the relevant values for this assert were as follows:

wherewasi = 2 (DR_WHERE_DISPATCH), dcontext->go_native = 0

When I tried to use delay_flush instead of unlink_flush I encountered this assert:

Internal Error: DynamoRIO debug check failure: ../core/vmareas.c:9502 false && "stale multi-init entry on frags list"

Because both the application and the DR client are reasonably complex, I tried to narrow a minimal repro down to a tiny DR client and a tiny application. The tiny application simply creates a lot of threads, each calling printf 100 times. The client simply calls dr_unlink_flush_region from a clean-call out of every indirect call and indirect jump, once every few times that the clean-call happens. I had to toy with the thread count and the unlink_flush call ratio to get at a good deterministic repro. I've attached the code for the repro.

The attached tar.gz file contains build.sh, CMakeLists.txt, src/repro_client.c and src/repro_app.c. Note that build.sh nukes /build and re-creates it by invoking cmake. unlink_flush_repro.tar.gz

A plain drrun with the provided client and the provided app should trigger the issue. I haven't tested it on multiple machines to ensure that there's no dependence on core count or anything like that.

I can potentially try to debug this further, but at this point I thought asking here would be a good idea :)

Expected behavior Application should successfully run to completion (albeit slowly).

Versions

What version of DR are you using? Top of master branch from day of posting this
- What operating system version are you running on? Debian GNU/Linux 10 (buster)
- Is your application 32-bit or 64-bit? 64-bit

nextsilicon-itay-bookstein commented 4 years ago

I think the issue may somehow be related to the fcache_reset_all_caches_proactively flow since I experience a single print from there just prior to the assert (temporally), but I don't understand the code there enough yet to say anything definitive (and correlation does not imply causation :) )

derekbruening commented 4 years ago

I think the issue may somehow be related to the fcache_reset_all_caches_proactively flow

Does the problem disappear with -no_enable_reset?

nextsilicon-itay-bookstein commented 4 years ago

Does the problem disappear with -no_enable_reset?

It does indeed stop tripping the assert, letting the repro run to completion on Debug. For the original application from which I distilled the repro, it stops tripping on Internal Error: DynamoRIO debug check failure: ../core/dispatch.c:757 wherewasi == DR_WHERE_FCACHE || wherewasi == DR_WHERE_TRAMPOLINE || wherewasi == DR_WHERE_APP || (dcontext->go_native && wherewasi == DR_WHERE_DISPATCH) and starts tripping on Internal Error: DynamoRIO debug check failure: ../core/vmareas.c:9502 false && "stale multi-init entry on frags list". So it's possible that these are two separate issues and my repro only distills the first one.

derekbruening commented 4 years ago

Thank you for the report. The wherewasi assert may not be serious and may be isolated to the whereami tracking and not really affect much, but it's hard to say. Is it related to #3175?

The stale entry assert definitely seems more serious. The mentioned crashes are more serious still.

It would be great if you could look further into these.

nextsilicon-itay-bookstein commented 4 years ago

Regarding the wherewasi assert - It's the same assert, but the condition is false since the relaxation that was added in connection to #3175 checks that go_native is true, and in the case caused by the proactive reset it is not (but it is DR_WHERE_DISPATCH, though).

Seem to have hit some more asserts:

Internal Error: DynamoRIO debug check failure: ../core/dispatch.c:489 dr_get_isa_mode(dcontext) == FRAG_ISA_MODE(targetf->flags) IF_X64(|| (dr_get_isa_mode(dcontext) == DR_ISA_IA32 && !FRAG_IS_32(targetf->flags) && DYNAMO_OPTION(x86_to_x64)))

Internal Error: DynamoRIO debug check failure: ../core/vmareas.c:9810 pend->frags != NULL

I have a repro that sometimes hits the "multi-init" assert and sometimes hits the above asserts with a minimized client, but not with a minimized app yet; The app is QuickSilver from here. There's a precompiled binary attached inside the archive (drrun -c librepro.so -- qs should hit the issue a small time after printing "Finished initMesh"; making the modulo of the counter that controls the calls to dr_delay_flush_region larger should make that time shorter). It seems to always happen upon executing the first OpenMP parallel region within the app, upon the creation of many threads. The crashes/hangs/application-corruptions happen in DR Release, the asserts in DR Debug. This time around I'm using dr_delay_flush_region instead of dr_unlink_flush_region. I'll try to investigate some more; do you have any additional pointers/tips/insight that might make this more effective? qs_delay_flush.tar.gz

derekbruening commented 4 years ago

do you have any additional pointers/tips/insight that might make this more effective?

Nothing magical, I think it's just hand-to-hand combat vs each symptom to figure out what the problem is. Hopefully several of them are related and there's just one underlying race or other cause.

nextsilicon-itay-bookstein commented 4 years ago

This reproduces the Internal Error: DynamoRIO debug check failure: ../core/vmareas.c:9502 false && "stale multi-init entry on frags list" assert with a minimized client and a minimized application (both attached). Should repro with Debug DR by running drrun -c librepro.so -- repro_app. delay_flush_repro.tar.gz

This repro behaves essentially like a stress-test for multi-threaded delay-flush flows, where all threads 'dance' on the same piece of code.

abhinav92003 commented 4 years ago

Just to let you know, thewherewasi assert failure in core/dispatch.c (the initial issue reported here) was fixed in PR #4507.

DynamoRIO / dynamorio

ASSERT: Race/crash within DR dispatch when performing unlink_flush during app thread creation #4481