dvyukov / relacy

Automatically exported from code.google.com/p/relacy

Fix compile on Apple Clang #19

Closed: ccotter closed this 3 months ago

ccotter commented 3 months ago

Manually verified Linux GCC-11/Clang-18 still work.

ccotter commented 3 months ago

Btw, my interest in this project is specifically in getting it to work with https://github.com/NVIDIA/stdexec. relacy has helped find one bug (and verified another) in ways that TSAN was not able to!

dvyukov commented 3 months ago

> Btw, my interest in this project is specifically in getting it to work with https://github.com/NVIDIA/stdexec. relacy has helped find one bug (and verified another) in ways that TSAN was not able to!

Nice. Was the bug related to a specific thread interleaving that wasn't happening under TSAN? Or was it related to subtler memory model details?

dvyukov commented 3 months ago

As a side note: Relacy's macro-based instrumentation is a hack and a mess. It would benefit from compiler instrumentation similar to TSAN's (or perhaps just reusing TSAN's instrumentation). However, whenever I thought about it, the conclusion was always the same: I should probably just improve TSAN to do what Relacy can instead.

ccotter commented 3 months ago

https://github.com/NVIDIA/stdexec/pull/1400 describes the bug I was chasing. It was a rare interleaving. I was able to force it with a modified TSAN that injected usleep(rand() % 1000) into pthread_mutex_unlock, and then a colleague pointed me to Relacy as a more rigorous mechanism for forcing rare interleavings. I'd love to see a version of TSAN (or other instrumentation) that could do what Relacy does with compiler help.
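(Editorial aside: a standalone version of that kind of perturbation can be built without patching TSAN at all, e.g. as an LD_PRELOAD shim. This is a minimal sketch under that assumption; ccotter's actual change was made inside TSAN itself, and the file name here is invented.)

```cpp
// unlock_delay.cpp: interpose pthread_mutex_unlock and hold the lock
// a little longer, widening the window for rare interleavings.
// RTLD_NEXT requires _GNU_SOURCE; g++ defines it by default on glibc.
#include <dlfcn.h>
#include <pthread.h>
#include <unistd.h>
#include <cstdlib>

extern "C" int pthread_mutex_unlock(pthread_mutex_t* m) {
    // Resolve the real implementation once, on first use.
    static auto real_unlock = reinterpret_cast<int (*)(pthread_mutex_t*)>(
        dlsym(RTLD_NEXT, "pthread_mutex_unlock"));
    usleep(rand() % 1000);  // rand() is not thread-safe; fine for a hack
    return real_unlock(m);
}
```

Built with `g++ -shared -fPIC unlock_delay.cpp -o unlock_delay.so -ldl`, the test would then run under `LD_PRELOAD=./unlock_delay.so`.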

If you have more ideas on how TSAN could be improved to do what Relacy can do, I'd be curious to see if I can help out. As far as I understand, TSAN tracks memory accesses and verifies them against the memory model, but doesn't modify the schedule, for example. Relacy's distinguishing feature relative to TSAN is its custom scheduler. Perhaps TSAN could support an option to choose the scheduler ("native" or "relacy-fiber"), and then leverage TSAN's existing model tracking?

dvyukov commented 3 months ago

The simple first step is to add interception of all synchronization events (mutex, atomic, etc.) to the TSAN runtime and inject random sleeps (similar to what you did manually, but for all events and easy to use). Then it becomes possible to catch a number of bugs just by running a test in a loop. At some point I added this to an older version of TSAN, but never ported it to the current version. Later, more elaborate scheduling can be built on top of this.
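(Editorial sketch of that random-sleep hook, not actual TSAN code: a real implementation would call such a hook from every runtime interceptor, and the sampling rate and delay bound below are invented for illustration.)

```cpp
#include <chrono>
#include <random>
#include <thread>

// Hook to be called from every intercepted synchronization event
// (mutex lock/unlock, atomic access, cv wait/signal, thread create...).
void MaybeDelay() {
    // Per-thread RNG so threads don't contend on shared state.
    thread_local std::mt19937 rng{std::random_device{}()};
    // Only sleep at a small fraction of events; atomic-heavy code
    // would otherwise slow down unboundedly.
    if (std::uniform_int_distribution<int>(0, 99)(rng) != 0) return;
    int us = std::uniform_int_distribution<int>(1, 1000)(rng);
    std::this_thread::sleep_for(std::chrono::microseconds(us));
}
```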

I would avoid fibers in TSAN and just use threads. It's slower, but simpler, and it will support a wider range of code without changes (e.g. code that simply uses TLS, or reads gettid and expects to see different values, or uses signal masks, etc.).

See also: https://ceur-ws.org/Vol-2344/paper9.pdf https://www.semanticscholar.org/paper/Fuzz-Testing-of-Multithreaded-Applications-Based-on-Doronin-Dergun/54f21f53bd2dd5c7c78f251ec3ff6d0b0049a7c9

ccotter commented 3 months ago

Thanks for the links! I might try a simple version of the "wait+random" scheduler described in the first paper. Let me know if you know of any old branches lying around.

dvyukov commented 3 months ago

Actually there was https://reviews.llvm.org/D65383, which seems to have slipped through the cracks.

And here is an implementation for the old TSAN: https://github.com/dvyukov/data-race-test-1/tree/master/earthquake

I think an important aspect of making it usable for larger programs is to cap the additional slowdown. The slowdown could be controlled via the env options. Lots of large real programs may start flaking due to timeouts, etc., if a significant additional slowdown is added. Unit tests without timeouts can of course use more aggressive settings.
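(Editorial illustration of the capping idea: bound the total injected delay per thread so large programs don't start timing out. The FUZZ_MAX_DELAY_MS knob and all names are invented, not an existing TSAN option.)

```cpp
#include <chrono>
#include <cstdlib>
#include <thread>

// Total injected delay allowed per thread, read once from the
// environment. FUZZ_MAX_DELAY_MS is a hypothetical knob.
static long MaxDelayUs() {
    static const long cap_us = [] {
        const char* v = std::getenv("FUZZ_MAX_DELAY_MS");
        return (v ? std::atol(v) : 100 /* default: 100ms */) * 1000;
    }();
    return cap_us;
}

// Like the MaybeDelay() hook above, but stops perturbing once the
// per-thread budget is spent, capping the worst-case slowdown.
void MaybeDelayCapped(int us) {
    thread_local long injected_us = 0;  // budget spent by this thread
    if (injected_us + us > MaxDelayUs()) return;
    injected_us += us;
    std::this_thread::sleep_for(std::chrono::microseconds(us));
}
```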

ccotter commented 3 months ago

I manually applied https://reviews.llvm.org/D65383 (partially) to the latest upstream compiler-rt, but I'm seeing quite slow performance with mutexes/CVs. In one example, the program often gets stuck with the lone RUNNING thread inside pthread_mutex_lock and another thread in the WAIT state in pthread_cond_wait()->SynchronizationPoint(). The RUNNING thread is blocked by the OS, while the WAIT thread is blocked by the TSAN fuzz-scheduler runtime. The program only makes progress when the WatchDog thread times out the RUNNING thread and sets a new thread in motion. In some of my unit tests that use mutexes/CVs, I'm seeing a 100x slowdown between TSAN with and without the fuzz scheduler.

I wonder if the fuzz scheduler runtime needs to take over handling mutex lock/unlock and CV wait/notify, and avoid hitting the OS entirely. As far as I understand, Relacy itself is responsible for implementing lock/unlock/wait/notify. Thoughts?

CC @dorooleg - I realize it's been a long time since you last looked at this, but let me know if you have any thoughts. I'd like to see if we can resurrect https://reviews.llvm.org/D65383 and land it upstream.

dvyukov commented 3 months ago

> I wonder if the fuzz scheduler runtime needs to take over handling mutex lock/unlock and CV wait/notify, and avoid hitting the OS entirely.

Depends on what exactly you want to do differently (just doing the switching manually won't help much).

In real apps, lots of threads can be blocked for other reasons (IO, external libs, etc.), so I would avoid relying on precise knowledge of which threads are running.

ccotter commented 3 months ago

I might be missing something, but what appears to be missing from https://reviews.llvm.org/D65383 is any knowledge in the scheduler of which threads are blocked on a mutex acquisition or a CV wait. When I tried D65383 locally, I often found the program with the lone RUNNING thread blocked in a mutex acquire or CV wait, while the other thread needed to release the mutex or notify the CV was in the WAIT state (hence my 100x slowdown, with my unit test only making progress when the WatchDog thread timed out the RUNNING thread to let another thread run).

At the very least, I think the TSAN runtime scheduler should be aware of which threads are blocked on a mutex acquire or a CV wait, and should not consider those threads schedulable (just as Relacy's scheduler does: blocking on a held mutex removes the thread from running_threads, and whichever thread releases the mutex places a waiting thread back into running_threads).
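(Editorial sketch of that bookkeeping, loosely mirroring the running_threads scheme described above. All names are invented, it is not code from D65383 or Relacy, and a real TSAN integration would live in the mutex interceptors.)

```cpp
#include <deque>
#include <map>
#include <set>

using Tid = int;

struct FuzzScheduler {
    std::set<Tid> runnable;                    // threads eligible to run
    std::map<void*, Tid> owner;                // mutex address -> holder
    std::map<void*, std::deque<Tid>> waiters;  // mutex address -> blocked threads

    // Called when thread `t` attempts to acquire mutex `m`.
    void OnLock(Tid t, void* m) {
        if (owner.count(m)) {       // mutex held: t is no longer runnable
            runnable.erase(t);
            waiters[m].push_back(t);
        } else {
            owner[m] = t;
        }
    }

    // Called when thread `t` releases mutex `m`.
    void OnUnlock(Tid t, void* m) {
        owner.erase(m);             // (could assert owner was `t`)
        auto& q = waiters[m];
        if (!q.empty()) {           // hand the mutex to one waiter
            Tid next = q.front();
            q.pop_front();
            owner[m] = next;
            runnable.insert(next);  // waiter becomes schedulable again
        }
    }
    // The scheduler then only ever picks the next thread from
    // `runnable`, so it can never spin on a thread that the OS has
    // blocked inside pthread_mutex_lock.
};
```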

I've managed to update D65383 in my local branch to do exactly that, and one promising result: the bug in https://github.com/NVIDIA/stdexec/pull/1400 that plain TSAN was unable to find after tens of thousands of iterations is found by my TSAN + Relacy-style scheduler in ~50 iterations.

I think I'm on the right track, but please let me know if you have other feedback. Once I clean this up, I'd like to raise an RFC on the LLVM Discourse about getting a scheduler like this merged.

ccotter commented 2 months ago

I put up https://discourse.llvm.org/t/rfc-tsan-implementing-a-fuzz-scheduler-for-tsan/80969 discussing adding random scheduling to TSAN.

cc @dorooleg