Shopify / ghostferry

The swiss army knife of live data migrations
https://shopify.github.io/ghostferry
MIT License
693 stars 65 forks

Resuming can cause missed replication events #351

Closed SpencerMalone closed 1 month ago

SpencerMalone commented 1 month ago

I've seen a few cases of missed rows that I believe can be tied to resuming, but please correct me if I'm wrong:

With state tracking, the binlog processing algorithm looks something like this:

  1. Streamer is a MySQL replication client that retrieves events
  2. Events are passed to defaultEventHandler by default
  3. defaultEventHandler calls handleRowsEvent
  4. handleRowsEvent filters events and groups them into batches
  5. handleRowsEvent hands the batches off to event listeners (more on this step below)
  6. The binlog position is recorded in state

The default event listener is BinlogWriter.BufferBinlogEvents, which pushes events onto a channel while a separate thread pulls them off the channel and processes them.

Here's the problem: for an applicable event, step 5 only blocks on pushing the event onto the channel. The actual processing happens in a separate thread, so it does not hold up step 5 unless the channel backs up enough to fill. Step 6 can therefore record a binlog position for events that are still sitting in the channel. In practice, when write load is high and a ferry run is interrupted, it is fairly common for the events in the channel to be lost, and because the saved position is already past them, I believe a resume skips them.
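
To make the ordering concrete, here is a minimal Go sketch of steps 5 and 6 as described above. It is not Ghostferry's code: `event` and `simulateInterruptedRun` are hypothetical names, binlog positions are plain integers, and the writer side runs sequentially instead of in its own thread so the result is deterministic.

```go
package main

import "fmt"

// event is a hypothetical stand-in for one replicated row event.
type event struct {
	pos int
}

// simulateInterruptedRun queues `total` events, recording the position after
// each push, then lets the writer apply at most `appliedBeforeStop` of them
// before the run is interrupted. It returns the position saved in state, the
// position of the last event actually applied, and the number of events that
// were still buffered (and are therefore lost) at interruption.
func simulateInterruptedRun(total, appliedBeforeStop int) (recordedPos, appliedPos, lost int) {
	// The buffer is large enough that pushing never blocks, which models
	// the case where the channel has not filled up.
	buffer := make(chan event, total)

	for pos := 1; pos <= total; pos++ {
		// Step 5: the listener returns as soon as the event is queued.
		buffer <- event{pos: pos}
		// Step 6: the position is recorded even though nothing was applied.
		recordedPos = pos
	}

	// Writer side: only some events are applied before the interruption.
	for i := 0; i < appliedBeforeStop && len(buffer) > 0; i++ {
		ev := <-buffer
		appliedPos = ev.pos
	}

	// Whatever is still in the channel disappears with the process.
	lost = len(buffer)
	return recordedPos, appliedPos, lost
}

func main() {
	recorded, applied, lost := simulateInterruptedRun(5, 2)
	fmt.Printf("recorded=%d applied=%d lost=%d\n", recorded, applied, lost)
	// prints "recorded=5 applied=2 lost=3"
}
```

A resume that starts from the recorded position (5 in this run) would never see positions 3 through 5 again, which matches the missed rows described above. Recording the position only after the writer has applied an event would close the gap in this toy model; whether that is the right fix for Ghostferry is a separate question.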

riccardo-casazza commented 1 month ago

@SpencerMalone, if you configure a verifier, the move will fail and you will be notified.