facebook / mysql-5.6

Facebook's branch of the Oracle MySQL database. This includes MyRocks.
http://myrocks.io
Other
2.48k stars 714 forks source link

Fix START/STOP SLAVE deadlock caused by slave stats daemon #1377

Open laurynas-biveinis opened 11 months ago

laurynas-biveinis commented 11 months ago

Fix START/STOP SLAVE deadlock caused by slave stats daemon

Under load, if START SLAVE IO_THREAD and STOP SLAVE execute concurrently, the following deadlock is possible:

The request of T54 is compatible with the current lock state, but according to POSIX, once a write request is pending, it is up to the implementation whether to satisfy them or block.

For the fix, observe that the starting replica I/O thread only tries to signal the stats thread to start, thus move this code to the START REPLICA command-executing thread instead, which already happens to hold the channel map lock. This also forces to move the stopping of the stats thread from the replica I/O thread to the STOP REPLICA command-executing thread.

This fixes intermittent but often-seen failures on rpl.rpl_multi_source_channel_map_stress under macOS.

Squash with b015dd3f3a75e50995443dd203d37a927126998b

Stacktraces:

Thread 44:

    frame #5: 0x0000000106483108 mysqld`inline_mysql_cond_timedwait(that=0x000000014585d438, mutex=0x000000014585d2f0, abstime=0x000000016db61588, src_file="/Users/laurynas/vilniusdb/fb-mysql/sql/rpl_replica.cc", src_line=2367) at mysql_cond.h:224:16
    frame #6: 0x0000000106460818 mysqld`terminate_slave_thread(thd=0x00000001458fb800, term_lock=0x000000014585d2f0, term_cond=0x000000014585d438, slave_running=0x000000014585d4f4, stop_wait_timeout=0x000000016db61700, need_lock_term=false, force=false) at rpl_replica.cc:2367:9
    frame #7: 0x0000000106460188 mysqld`terminate_slave_threads(mi=0x000000014585ce00, thread_mask=1, stop_wait_timeout=31536000, need_lock_term=false) at rpl_replica.cc:2204:18
    frame #8: 0x000000010645873c mysqld`stop_slave(thd=0x0000000145952800, mi=0x000000014585ce00, net_report=true, for_one_channel=true, push_temp_tables_warning=0x000000016db61a37, invoked_by_raft=false) at rpl_replica.cc:10032:9
    frame #9: 0x00000001064593c4 mysqld`stop_slave_cmd(thd=0x0000000145952800) at rpl_replica.cc:971:13

Thread 55:

    frame #2: 0x000000018e5bf73c libsystem_pthread.dylib`_pthread_rwlock_lock_slow + 720
    frame #3: 0x0000000104c3d2ac mysqld`native_rw_wrlock(rwp=0x00006000039dc0e8) at thr_rwlock.h:101:10
    frame #4: 0x0000000104c3d21c mysqld`inline_mysql_rwlock_wrlock(that=0x00006000039dc0e8, src_file="/Users/laurynas/vilniusdb/fb-mysql/sql/rpl_gtid.h", src_line=473) at mysql_rwlock.h:410:12
    frame #5: 0x0000000104c3c6f0 mysqld`Checkable_rwlock::wrlock(this=0x00006000039dc0e0) at rpl_gtid.h:473:5
    frame #6: 0x00000001054dc6e0 mysqld`Multisource_info::wrlock(this=0x000000010a182980) at rpl_msr.h:441:25
    frame #7: 0x0000000106458bb0 mysqld`start_slave_cmd(thd=0x0000000130008a00) at rpl_replica.cc:832:15

Thread 54:

    frame #2: 0x000000018e5bf73c libsystem_pthread.dylib`_pthread_rwlock_lock_slow + 720
    frame #3: 0x0000000104b62d90 mysqld`native_rw_rdlock(rwp=0x00006000039dc0e8) at thr_rwlock.h:82:10
    frame #4: 0x0000000104b62d00 mysqld`inline_mysql_rwlock_rdlock(that=0x00006000039dc0e8, src_file="/Users/laurynas/vilniusdb/fb-mysql/sql/rpl_gtid.h", src_line=464) at mysql_rwlock.h:340:12
    frame #5: 0x0000000104b62bcc mysqld`Checkable_rwlock::rdlock(this=0x00006000039dc0e0) at rpl_gtid.h:464:5
    frame #6: 0x0000000104b61cec mysqld`Multisource_info::rdlock(this=0x000000010a182980) at rpl_msr.h:415:25
    frame #7: 0x0000000104b618f0 mysqld`start_handle_slave_stats_daemon() at slave_stats_daemon.cc:233:15
    frame #8: 0x00000001064627dc mysqld`handle_slave_io(arg=0x000000014585ce00) at rpl_replica.cc:6050:36
laurynas-biveinis commented 11 months ago

@hermanlee , @luqun , I am PR'ing a possible fix but note that I'm not a replication stats thread expert nor do I see any specific MTR tests I could run for it.

laurynas-biveinis commented 10 months ago

Rebased on 8.0.32. "This branch has conflicts that must be resolved" looks bogus

facebook-github-bot commented 7 months ago

@george-reynya has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.