Shopify / ghostferry

The swiss army knife of live data migrations
https://shopify.github.io/ghostferry
MIT License

Alternate exit criteria for DataIterators #335

Open vyshah opened 1 year ago

vyshah commented 1 year ago

For some context, the schema I'm trying to use Ghostferry with is based on EAV. Each table is a property table and an object will typically correspond to multiple records across various tables.

I'm also trying to implement sharding support with an implementation of CopyFilter.

My CopyFilter implementation generates object IDs for a given shard in batches and uses those IDs in BuildSelect to form the base SQL query to copy records, e.g.

SELECT ... FROM ... WHERE pagination_key IN (object_id1, ..., object_id20)
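To make the query shape above concrete, here is a minimal standalone sketch in Go of how a batch of object IDs could be turned into a parameterized IN clause. The helper name buildBatchSelect and all of its parameters are hypothetical and are not part of Ghostferry's API; a real CopyFilter would build the equivalent query inside BuildSelect.

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchSelect is a hypothetical helper (not a Ghostferry function).
// It returns a parameterized SELECT that matches every row whose
// paginationColumn value is in ids, plus the arguments for the placeholders.
// Table and column names are interpolated as-is, so they must come from a
// trusted source such as the loaded table schema.
func buildBatchSelect(table string, columns []string, paginationColumn string, ids []uint64) (string, []interface{}, error) {
	if len(ids) == 0 {
		// "IN ()" is not valid SQL, so an empty batch is rejected up front.
		return "", nil, fmt.Errorf("empty object ID batch for table %s", table)
	}

	// One "?" placeholder per ID, joined by commas: "?,?,?".
	placeholders := strings.TrimSuffix(strings.Repeat("?,", len(ids)), ",")

	args := make([]interface{}, len(ids))
	for i, id := range ids {
		args[i] = id
	}

	query := fmt.Sprintf(
		"SELECT %s FROM %s WHERE %s IN (%s)",
		strings.Join(columns, ", "), table, paginationColumn, placeholders,
	)
	return query, args, nil
}
```

Note that a query like this can legitimately return 0 rows when none of the objects in the batch has a record in the given property table.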

I'm currently running into a problem where this query sometimes yields 0 rows for a property table for a batch of object IDs, leading the Cursor to terminate iteration early even though there are still objects in the shard left to process.

Ideally, the flow I need looks something like:

  1. Generate a batch of object IDs
  2. Copy all records across all property tables corresponding to these object IDs
  3. Terminate DataIterators if no more object IDs remain in shard
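The three steps above can be sketched as a plain loop. This is only an illustration of the control flow I have in mind, not Ghostferry code: NextBatchFunc, CopyFunc, and copyShard are hypothetical names, and the actual DataIterator and Cursor are not involved. The key point is that the only exit criterion is an empty ID batch, while a table that returns 0 rows for a batch does not stop iteration.

```go
package main

// NextBatchFunc is a hypothetical callback that returns the next batch of
// object IDs for the shard. An empty batch means the shard is exhausted.
type NextBatchFunc func() ([]uint64, error)

// CopyFunc is a hypothetical callback that copies all rows of one property
// table belonging to the given object IDs and reports how many rows it copied.
type CopyFunc func(table string, ids []uint64) (int, error)

// copyShard copies every property table for every batch of object IDs and
// returns the total number of rows copied.
func copyShard(tables []string, next NextBatchFunc, copyRows CopyFunc) (int, error) {
	total := 0
	for {
		// Step 1: generate a batch of object IDs.
		ids, err := next()
		if err != nil {
			return total, err
		}

		// Step 3: terminate only when no object IDs remain in the shard.
		if len(ids) == 0 {
			return total, nil
		}

		// Step 2: copy the records for this batch from every property table.
		for _, table := range tables {
			n, err := copyRows(table, ids)
			if err != nil {
				return total, err
			}
			// n == 0 just means this table has no rows for this batch;
			// it is deliberately not treated as a stop signal.
			total += n
		}
	}
}
```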

I don't believe this is possible without opening a PR against Ghostferry, but let me know if I'm missing something. If it isn't possible, do you have any suggestions on how to implement this?

I'm thinking I'll need to have the DataIterator understand these object ID batches and do multiple dataIterator.Run()s for each batch before exiting. I see a few options here:

Any guidance here would be appreciated - thanks for your time.