Open laurynas-biveinis opened 11 months ago
@hermanlee , @luqun , I am PR'ing a possible fix but note that I'm not a replication stats thread expert nor do I see any specific MTR tests I could run for it.
Rebased on 8.0.32. The "This branch has conflicts that must be resolved" message looks bogus.
@george-reynya has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Fix START/STOP SLAVE deadlock caused by slave stats daemon
Under load, if START SLAVE IO_THREAD and STOP SLAVE execute concurrently, the following deadlock is possible:
The request of T54 is compatible with the current lock state, but according to POSIX, once a write request is pending, it is up to the implementation whether to grant further read requests or block them.
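To make the POSIX point concrete, here is a minimal sketch, not taken from the patch, that probes how the local implementation treats a new reader while a writer is queued. The function name and the 100 ms wait are illustrative assumptions. The result differs between platforms, so no particular outcome is claimed.

```cpp
#include <pthread.h>

#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical probe (not from the MySQL source): returns true if a new
// reader is admitted while another reader holds the lock and a writer is
// already waiting. POSIX leaves this outcome to the implementation.
bool reader_admitted_while_writer_pending() {
  pthread_rwlock_t lock;
  int rc = pthread_rwlock_init(&lock, nullptr);
  assert(rc == 0);

  rc = pthread_rwlock_rdlock(&lock);  // first reader holds the lock
  assert(rc == 0);

  // The writer blocks until the first reader releases the lock.
  std::thread writer([&lock] {
    pthread_rwlock_wrlock(&lock);
    pthread_rwlock_unlock(&lock);
  });

  // Heuristic: give the writer time to queue up (assumed to be enough).
  std::this_thread::sleep_for(std::chrono::milliseconds(100));

  // A second reader only tries the lock, so the probe itself cannot hang.
  int try_rc = 0;
  std::thread reader([&lock, &try_rc] {
    try_rc = pthread_rwlock_tryrdlock(&lock);
    if (try_rc == 0) pthread_rwlock_unlock(&lock);
  });
  reader.join();

  pthread_rwlock_unlock(&lock);  // release the first reader
  writer.join();
  pthread_rwlock_destroy(&lock);
  return try_rc == 0;
}
```

If the function returns false on a given platform, a blocking read request made in the same situation would wait behind the pending writer, which is the behaviour the deadlock depends on.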
For the fix, observe that the starting replica I/O thread only tries to signal the stats thread to start. Move this code to the thread executing the START REPLICA command instead, which already happens to hold the channel map lock. This in turn requires moving the stopping of the stats thread from the replica I/O thread to the thread executing the STOP REPLICA command.
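The shape of the change can be sketched roughly as follows. This is an illustration only: `ChannelMap`, `StatsDaemon`, `start_replica`, `stop_replica` and the channel counter are hypothetical stand-ins, not names or logic from the MySQL source, and taking the lock exclusively is an assumption. The point is that only the command-executing threads touch the stats daemon, and they do so while already holding the channel map lock, so the I/O thread no longer needs to request it.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <shared_mutex>

// Hypothetical stand-in for the replica stats daemon.
struct StatsDaemon {
  std::atomic<bool> running{false};
  void start() { running.store(true); }
  void stop() { running.store(false); }
};

// Hypothetical stand-in for the channel map and its lock.
struct ChannelMap {
  std::shared_mutex lock;
  int started_channels = 0;  // guarded by lock; an assumption of this sketch
};

// After the fix, the I/O thread body neither takes the channel map lock
// nor signals the stats daemon.
void io_thread_body() {}

// Runs on the thread executing START REPLICA.
void start_replica(ChannelMap &map, StatsDaemon &stats) {
  std::unique_lock<std::shared_mutex> guard(map.lock);
  if (map.started_channels++ == 0) stats.start();
  // ... the I/O thread running io_thread_body() would be spawned here ...
}

// Runs on the thread executing STOP REPLICA.
void stop_replica(ChannelMap &map, StatsDaemon &stats) {
  std::unique_lock<std::shared_mutex> guard(map.lock);
  assert(map.started_channels > 0);
  if (--map.started_channels == 0) stats.stop();
  // ... the I/O thread would be asked to terminate here ...
}
```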
This fixes intermittent but often-seen failures on rpl.rpl_multi_source_channel_map_stress under macOS.
Squash with b015dd3f3a75e50995443dd203d37a927126998b
Stacktraces:
Thread 44:
Thread 55:
Thread 54: