Shopify / ghostferry

The swiss army knife of live data migrations
https://shopify.github.io/ghostferry
MIT License
693 stars 65 forks source link

Ensure BinlogEventBuffer Channel is initialized before running the BinlogStreamer #331

Open Scalvando opened 2 years ago

Scalvando commented 2 years ago

Problem

During a run of copydb we found a problem where the binlog streamer lag kept increasing and the lag was almost identical to the run time of the copy.

Tracing and Investigation

A huge thanks to @shuhaowu for helping me investigate this problem and helping to find the bug.

Initially I looked at the logs for the copydb container:

time="2022-02-11T16:17:51Z" level=info msg="starting binlog streaming" file=mysql-bin.001433 host=/app/tmp/mysql_connection.sock port=0 position=9494960 tag=source_binlog_streamer
...
time="2022-02-11T16:17:52Z" level=info msg="starting binlog streamer" tag=source_binlog_streamer
time="2022-02-11T16:17:52Z" level=info msg="binlog file rotated" last_filename=mysql-bin.001433 last_position=9494960 new_filename=mysql-bin.001433 new_position=9494960 tag=source_binlog_streamer

This was the last mention of the binlog streamer in the logs, I assumed at this point that the streamer had stopped for some reason or another.

Trying to investigate further we used the pprof endpoint for the copydb instance and we were able to get some more information.

First we took a 30 second trace from the copydb instance. The blocking information showed that the main program thread's time was spent almost entirely on the inline verifier's PeriodicallyVerifyBinlogEvents

Screen Shot 2022-02-15 at 4 19 46 PM

There wasn't much more useful information from the trace aside from the binlog streamer not appearing in it. We looked to see if anything was deadlocked or waiting to get a mutex but nothing showed up.

We went back to the pprof endpoint for copydb and looked at the goroutine stack dump which just takes a snapshot of the running goroutines. After eliminating the go routines that weren't relevant we were left with the ones related to binlogs.

Of the remaining stack dumps these 2 were of importance:

Goroutine Stack Dump

goroutine 34 [chan receive, 6019 minutes]:
github.com/Shopify/ghostferry.(*BinlogWriter).Run(0xc00028da00)
    /tmp/ghostferry/binlog_writer.go:36 +0x355
github.com/Shopify/ghostferry.(*Ferry).Run.func8(0xc0004781f0, 0xc0000c00f0)
    /tmp/ghostferry/ferry.go:743 +0x57
created by github.com/Shopify/ghostferry.(*Ferry).Run
    /tmp/ghostferry/ferry.go:741 +0x2ed

goroutine 35 [chan send (nil chan), 6019 minutes]:
github.com/Shopify/ghostferry.(*BinlogWriter).BufferBinlogEvents(0xc00028da00, 0xc00048c2f0, 0x1, 0x1, 0x3ff0000000000000, 0x1)
    /tmp/ghostferry/binlog_writer.go:75 +0x7a
github.com/Shopify/ghostferry.(*BinlogStreamer).handleRowsEvent(0xc0002a05a0, 0xc0002ef290, 0x0, 0x0, 0x0, 0xc00047af00, 0x16)
    /tmp/ghostferry/binlog_streamer.go:403 +0x6aa
github.com/Shopify/ghostferry.(*BinlogStreamer).Run(0xc0002a05a0)
    /tmp/ghostferry/binlog_streamer.go:234 +0xc5e
github.com/Shopify/ghostferry.(*Ferry).Run.func9(0xc0004781f0, 0xc0000c00f0)
    /tmp/ghostferry/ferry.go:750 +0x5b
created by github.com/Shopify/ghostferry.(*Ferry).Run
    /tmp/ghostferry/ferry.go:747 +0x339

Here we can see that while the BinlogWriter was reading from the channel, the BinlogStreamer was sending the events into a nil channel. We refreshed a number of times to make sure it wasn't an anomaly.

The minutes also corresponded with the lag and run time for the copy showing that this had been happening since almost the beginning of the run.

Bug

When ferry.go creates a new BinlogWriter, it does not create the event buffer channel. That happens in BinlogWriter.Run().

The problem is in https://github.com/Shopify/ghostferry/blob/abcda3af1ee7d42b7820205d78ef2b6c655eb574/ferry.go#L741-L752

Here there is a chance that BinlogStreamer.Run() runs before BinlogWriter.Run() has the chance to create the channel. This can result in the event listener for the binlog streamer (BinlogWritter.BufferBinlogEvents) trying to write binlog events into the uninitialized BinlogEventBuffer channel.

Here's an example of a similar scenario we used to reproduce the stack dump:

//Taken from https://golangbyexample.com/send-receive-nil-channel-go/
package main

import (
    "fmt"
    "time"
)

func main() {
    var ch chan int
    go send(ch)
    ch = make(chan int)
    <-ch
    time.Sleep(time.Second * 1)
}

func send(ch chan int) {
    fmt.Println("Sending value to channnel start")
    ch <- 1
    fmt.Println("Sending value to channnel finish")
}

The resulting goroutine dump is very similar to what we saw in ghostferry.

Sending value to channnel start fatal error: all goroutines are asleep - deadlock!

goroutine 1 [chan receive]:
main.main()
    /tmp/sandbox3043519188/prog.go:13 +0x90

goroutine 6 [chan send (nil chan)]:
main.send(0x0)
    /tmp/sandbox3043519188/prog.go:19 +0x6c
created by main.main
    /tmp/sandbox3043519188/prog.go:11 +0x7b  Program exited.

Possible Solution

Currently NewBinlogWriter in ferry.go just creates a new BinlogWriter but doesn't initialize the channel

https://github.com/Shopify/ghostferry/blob/abcda3af1ee7d42b7820205d78ef2b6c655eb574/ferry.go#L163-L178

What could be done is that instead of immediately returning the writer, initialize the buffer channel on the writer in ferry.go and then return it.

This would be similar to what we do in NewBatchWriter https://github.com/Shopify/ghostferry/blob/abcda3af1ee7d42b7820205d78ef2b6c655eb574/ferry.go#L186-L202