Shopify / ghostferry

The swiss army knife of live data migrations
https://shopify.github.io/ghostferry
MIT License
694 stars 65 forks source link

Possible deadlock issue in logrus, or deadlock between the data writer and the DataIterator #280

Open shuhaowu opened 3 years ago

shuhaowu commented 3 years ago

While testing, I've observed an intermittent test failure within the ruby test InterruptResumeTest#test_interrupt_resume_idempotence_with_writes_to_source. When debugging that, I can see that all the Ghostferry goroutine to be stuck at some logrus mutex methods[1]. The ruby process is always trying to modify a row a the same time via the data writer at the same time. This modification query appears to be always stuck waiting for lock, presumably because the data iterator just went over that row and issued a FOR UPDATE against that row (BEFORE_ROW_COPY and AFTER_ROW_COPY in the integrationferry is sent while the FOR UPDATE is held, which makes the situation worse).

I'm not quite sure why this is the case. It's possible that the problem lies in an upstream logrus issue (https://github.com/sirupsen/logrus/issues/1201) or it's somehow a race between DataIterator which holds the lock while trying to send BEFORE_ROW_COPY and AFTER_ROW_COPY to the integration test server (reader should check integrationferry.go to see that, and check how the listeners on the data iterator + cursor works with the FOR UPDATE lock). I'm not sure which one is more likely, as maybe we should see the same deadlock in production if it is purely a problem with logrus.

I've found that by reducing the frequency of the data writer, the deadlock appears to be gone. However, it seems that even a very small sleep (5ms) solves this issue. Such a small sleep only reduces the total number of modification query from about 30 to 22 (for the duration over which the datawriter is active). I'm not sure why this make such a big difference, either.

A failure grinding script is included below, at [2]


[1] Ghostferry per go routine stack trace:

goroutine profile: total 19
3 @ 0x43c2e5 0x44d297 0x76ce25 0x472bc1
#   0x76ce24    github.com/go-sql-driver/mysql.(*mysqlConn).startWatcher.func1+0xa4 /home/shuhao/src/ghostferry/vendor/github.com/go-sql-driver/mysql/connection_go18.go:178

2 @ 0x43c2e5 0x43493b 0x46d175 0x4c9385 0x4ca135 0x4ca117 0x58020f 0x58c111 0x5543a2 0x47e067 0x721347 0x72130a 0x7210db 0x730bdc 0x472bc1
#   0x46d174    internal/poll.runtime_pollWait+0x54                     /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/netpoll.go:222
#   0x4c9384    internal/poll.(*pollDesc).wait+0x44                     /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_poll_runtime.go:87
#   0x4ca134    internal/poll.(*pollDesc).waitRead+0x1d4                    /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_poll_runtime.go:92
#   0x4ca116    internal/poll.(*FD).Read+0x1b6                          /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_unix.go:166
#   0x58020e    net.(*netFD).Read+0x4e                              /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/fd_posix.go:55
#   0x58c110    net.(*conn).Read+0x90                               /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/net.go:183
#   0x5543a1    bufio.(*Reader).Read+0x221                          /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/bufio/bufio.go:227
#   0x47e066    io.ReadAtLeast+0x86                             /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/io/io.go:328
#   0x721346    io.ReadFull+0x86                                /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/io/io.go:347
#   0x721309    github.com/siddontang/go-mysql/packet.(*Conn).ReadPacketTo+0x49         /home/shuhao/src/ghostferry/vendor/github.com/siddontang/go-mysql/packet/conn.go:96
#   0x7210da    github.com/siddontang/go-mysql/packet.(*Conn).ReadPacket+0xda           /home/shuhao/src/ghostferry/vendor/github.com/siddontang/go-mysql/packet/conn.go:85
#   0x730bdb    github.com/siddontang/go-mysql/replication.(*BinlogSyncer).onStream+0x2fb   /home/shuhao/src/ghostferry/vendor/github.com/siddontang/go-mysql/replication/binlogsyncer.go:647

2 @ 0x43c2e5 0x44d297 0x700e35 0x472bc1
#   0x700e34    database/sql.(*DB).connectionOpener+0xb4    /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/database/sql/sql.go:1133

1 @ 0x40dfb4 0x46f3c5 0x779d65 0x472bc1
#   0x46f3c4    os/signal.signal_recv+0xa4  /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/sigqueue.go:168
#   0x779d64    os/signal.loop+0x24     /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/os/signal/signal_unix.go:23

1 @ 0x43c2e5 0x40824f 0x407e8b 0x77e2e5 0x7b2d77 0x472bc1
#   0x77e2e4    github.com/Shopify/ghostferry.(*BinlogWriter).Run+0x364 /home/shuhao/src/ghostferry/binlog_writer.go:36
#   0x7b2d76    github.com/Shopify/ghostferry.(*Ferry).Run.func7+0x56   /home/shuhao/src/ghostferry/ferry.go:705

1 @ 0x43c2e5 0x40824f 0x407e8b 0x7b29eb 0x472bc1
#   0x7b29ea    github.com/Shopify/ghostferry.(*Ferry).Run.func5+0xea   /home/shuhao/src/ghostferry/ferry.go:673

1 @ 0x43c2e5 0x43493b 0x46d175 0x4c9385 0x4cac12 0x4cabf4 0x5810c5 0x594f12 0x593ea5 0x67d765 0x67d47a 0x7c0938 0x7c0902 0x472bc1
#   0x46d174    internal/poll.runtime_pollWait+0x54     /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/netpoll.go:222
#   0x4c9384    internal/poll.(*pollDesc).wait+0x44     /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_poll_runtime.go:87
#   0x4cac11    internal/poll.(*pollDesc).waitRead+0x211    /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_poll_runtime.go:92
#   0x4cabf3    internal/poll.(*FD).Accept+0x1f3        /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_unix.go:401
#   0x5810c4    net.(*netFD).accept+0x44            /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/fd_unix.go:172
#   0x594f11    net.(*TCPListener).accept+0x31          /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/tcpsock_posix.go:139
#   0x593ea4    net.(*TCPListener).Accept+0x64          /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/tcpsock.go:261
#   0x67d764    net/http.(*Server).Serve+0x284          /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/server.go:2981
#   0x67d479    net/http.(*Server).ListenAndServe+0xb9      /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/server.go:2910
#   0x7c0937    net/http.ListenAndServe+0x57            /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/server.go:3164
#   0x7c0901    main.main.func1+0x21                /home/shuhao/src/ghostferry/test/lib/go/integrationferry.go:337

1 @ 0x43c2e5 0x44d297 0x72dc65 0x7afd6c 0x77b49e 0x7b2eda 0x472bc1
#   0x72dc64    github.com/siddontang/go-mysql/replication.(*BinlogStreamer).GetEvent+0x104 /home/shuhao/src/ghostferry/vendor/github.com/siddontang/go-mysql/replication/binlogstreamer.go:29
#   0x7afd6b    github.com/Shopify/ghostferry.(*BinlogStreamer).Run.func2+0x8b          /home/shuhao/src/ghostferry/binlog_streamer.go:164
#   0x77b49d    github.com/Shopify/ghostferry.(*BinlogStreamer).Run+0x19d           /home/shuhao/src/ghostferry/binlog_streamer.go:170
#   0x7b2ed9    github.com/Shopify/ghostferry.(*Ferry).Run.func9+0x59               /home/shuhao/src/ghostferry/ferry.go:721

1 @ 0x43c2e5 0x44dd65 0x44dd4e 0x46eba5 0x47db85 0x792e57 0x7c0896 0x472bc1
#   0x46eba4    sync.runtime_Semacquire+0x44                /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/sema.go:56
#   0x47db84    sync.(*WaitGroup).Wait+0x64             /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/sync/waitgroup.go:130
#   0x792e56    github.com/Shopify/ghostferry.(*Ferry).Run+0x536    /home/shuhao/src/ghostferry/ferry.go:785
#   0x7c0895    main.(*IntegrationFerry).Main.func1+0x55        /home/shuhao/src/ghostferry/test/lib/go/integrationferry.go:157

1 @ 0x43c2e5 0x44dd65 0x44dd4e 0x46eba5 0x47db85 0x7bef68 0x7c0133 0x43beb6 0x472bc1
#   0x46eba4    sync.runtime_Semacquire+0x44        /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/sema.go:56
#   0x47db84    sync.(*WaitGroup).Wait+0x64     /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/sync/waitgroup.go:130
#   0x7bef67    main.(*IntegrationFerry).Main+0x247 /home/shuhao/src/ghostferry/test/lib/go/integrationferry.go:169
#   0x7c0132    main.main+0x2d2             /home/shuhao/src/ghostferry/test/lib/go/integrationferry.go:340
#   0x43beb5    runtime.main+0x255          /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/proc.go:225

1 @ 0x43c2e5 0x44dd65 0x44dd4e 0x46ecc7 0x47c0e5 0x70e8ba 0x70e8a9 0x70e735 0x70ef66 0x796076 0x79601e 0x795d67 0x794865 0x7b2694 0x472bc1
#   0x46ecc6    sync.runtime_SemacquireMutex+0x46               /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/sema.go:71
#   0x47c0e4    sync.(*Mutex).lockSlow+0x104                    /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/sync/mutex.go:138
#   0x70e8b9    sync.(*Mutex).Lock+0x259                    /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/sync/mutex.go:81
#   0x70e8a8    github.com/sirupsen/logrus.(*MutexWrap).Lock+0x248      /home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/logger.go:58
#   0x70e734    github.com/sirupsen/logrus.(*Entry).log+0xd4            /home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:233
#   0x70ef65    github.com/sirupsen/logrus.(*Entry).Log+0x85            /home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:293
#   0x796075    github.com/sirupsen/logrus.(*Entry).Debug+0x2b5         /home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:302
#   0x79601d    github.com/Shopify/ghostferry.postCallback+0x25d        /home/shuhao/src/ghostferry/http_callback.go:41
#   0x795d66    github.com/Shopify/ghostferry.HTTPCallback.Post+0xe6        /home/shuhao/src/ghostferry/http_callback.go:25
#   0x794864    github.com/Shopify/ghostferry.(*Ferry).ReportProgress+0x2a4 /home/shuhao/src/ghostferry/ferry.go:1028
#   0x7b2693    github.com/Shopify/ghostferry.(*Ferry).Run.func3+0x93       /home/shuhao/src/ghostferry/ferry.go:644

1 @ 0x43c2e5 0x44dd65 0x44dd4e 0x46ecc7 0x47c0e5 0x70e8ba 0x70e8a9 0x70e735 0x70ef66 0x796076 0x79601e 0x795d67 0x794b54 0x7b2814 0x472bc1
#   0x46ecc6    sync.runtime_SemacquireMutex+0x46               /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/sema.go:71
#   0x47c0e4    sync.(*Mutex).lockSlow+0x104                    /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/sync/mutex.go:138
#   0x70e8b9    sync.(*Mutex).Lock+0x259                    /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/sync/mutex.go:81
#   0x70e8a8    github.com/sirupsen/logrus.(*MutexWrap).Lock+0x248      /home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/logger.go:58
#   0x70e734    github.com/sirupsen/logrus.(*Entry).log+0xd4            /home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:233
#   0x70ef65    github.com/sirupsen/logrus.(*Entry).Log+0x85            /home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:293
#   0x796075    github.com/sirupsen/logrus.(*Entry).Debug+0x2b5         /home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:302
#   0x79601d    github.com/Shopify/ghostferry.postCallback+0x25d        /home/shuhao/src/ghostferry/http_callback.go:41
#   0x795d66    github.com/Shopify/ghostferry.HTTPCallback.Post+0xe6        /home/shuhao/src/ghostferry/http_callback.go:25
#   0x794b53    github.com/Shopify/ghostferry.(*Ferry).ReportState+0x133    /home/shuhao/src/ghostferry/ferry.go:1044
#   0x7b2813    github.com/Shopify/ghostferry.(*Ferry).Run.func4+0x93       /home/shuhao/src/ghostferry/ferry.go:662

1 @ 0x46cd7d 0x6e5dae 0x6e5b85 0x6e2712 0x6f3fe5 0x6f5837 0x67a584 0x67c40d 0x67d383 0x6792cd 0x472bc1
#   0x46cd7c    runtime/pprof.runtime_goroutineProfileWithLabels+0x5c   /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/mprof.go:716
#   0x6e5dad    runtime/pprof.writeRuntimeProfile+0xcd          /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/pprof/pprof.go:724
#   0x6e5b84    runtime/pprof.writeGoroutine+0xa4           /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/pprof/pprof.go:684
#   0x6e2711    runtime/pprof.(*Profile).WriteTo+0x3f1          /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/pprof/pprof.go:331
#   0x6f3fe4    net/http/pprof.handler.ServeHTTP+0x384          /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/pprof/pprof.go:253
#   0x6f5836    net/http/pprof.Index+0x8d6              /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/pprof/pprof.go:371
#   0x67a583    net/http.HandlerFunc.ServeHTTP+0x43         /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/server.go:2069
#   0x67c40c    net/http.(*ServeMux).ServeHTTP+0x1ac            /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/server.go:2448
#   0x67d382    net/http.serverHandler.ServeHTTP+0xa2           /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/server.go:2887
#   0x6792cc    net/http.(*conn).serve+0x8cc                /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/server.go:1952

1 @ 0x4b2845 0x4b0bfa 0x4ca0a5 0x4ca07b 0x4ca061 0x58020f 0x58c111 0x672ad8 0x472bc1
#   0x4b2844    syscall.Syscall+0x4             /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/syscall/asm_linux_amd64.s:20
#   0x4b0bf9    syscall.read+0x59               /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/syscall/zsyscall_linux_amd64.go:686
#   0x4ca0a4    syscall.Read+0x144              /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/syscall/syscall_unix.go:187
#   0x4ca07a    internal/poll.ignoringEINTRIO+0x11a     /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_unix.go:581
#   0x4ca060    internal/poll.(*FD).Read+0x100          /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_unix.go:162
#   0x58020e    net.(*netFD).Read+0x4e              /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/fd_posix.go:55
#   0x58c110    net.(*conn).Read+0x90               /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/net.go:183
#   0x672ad7    net/http.(*connReader).backgroundRead+0x57  /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/server.go:692

1 @ 0x4b2845 0x4b0dfa 0x4ca65a 0x4ca642 0x4ca744 0x4cf42e 0x4cf3fe 0x70edb8 0x70e7eb 0x70ef66 0x77b896 0x77b83e 0x7b2e1b 0x472bc1
#   0x4b2844    syscall.Syscall+0x4                     /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/syscall/asm_linux_amd64.s:20
#   0x4b0df9    syscall.write+0x59                      /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/syscall/zsyscall_linux_amd64.go:914
#   0x4ca659    syscall.Write+0x139                     /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/syscall/syscall_unix.go:212
#   0x4ca641    internal/poll.ignoringEINTRIO+0x121             /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_unix.go:581
#   0x4ca743    internal/poll.(*FD).Write+0x223                 /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_unix.go:274
#   0x4cf42d    os.(*File).write+0x8d                       /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/os/file_posix.go:48
#   0x4cf3fd    os.(*File).Write+0x5d                       /nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/os/file.go:174
#   0x70edb7    github.com/sirupsen/logrus.(*Entry).write+0x157         /home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:286
#   0x70e7ea    github.com/sirupsen/logrus.(*Entry).log+0x18a           /home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:251
#   0x70ef65    github.com/sirupsen/logrus.(*Entry).Log+0x85            /home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:293
#   0x77b895    github.com/sirupsen/logrus.(*Entry).Debug+0x595         /home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:302
#   0x77b83d    github.com/Shopify/ghostferry.(*BinlogStreamer).Run+0x53d   /home/shuhao/src/ghostferry/binlog_streamer.go:187
#   0x7b2e1a    github.com/Shopify/ghostferry.(*Ferry).Run.func8+0x5a       /home/shuhao/src/ghostferry/ferry.go:712

[2] Failure grinding script

Before running this, you should comment out the sleep in test/helpers/data_writer_helper.rb, after the call to write_data.

#!/bin/bash

i=0
while true; do
  i=$((i+1))
  echo "testing run $i"
  DEBUG=1 bundle exec ruby test/main.rb -v -n "InterruptResumeTest#test_interrupt_resume_idempotence_with_writes_to_source" 2>&1 >test.log
  if [ "$?" == "1" ]; then
    tail -n15 test.log
    echo "bad test found!"
    exit 0
  else
    grep "data writes" test.log
    tail -n15 test.log
    echo "test run $i succeeded, trying again"
  fi
done