ExpediaGroup / circus-train

Circus Train is a dataset replication tool that copies Hive tables between clusters and clouds.
Apache License 2.0

Support for replication after non-backwards-compatible schema changes #183

Closed Dima1224 closed 4 years ago

Dima1224 commented 4 years ago

Is your feature request related to a problem? Please describe.

Early on in the development lifecycle, it is common to change a table's schema in non-backwards-compatible ways and repopulate the table from scratch. This could also happen with a mature table, though much more rarely.

Describe the solution you'd like

I'd like Circus Train to support this use case and replicate the update. As a user of Circus Train, I am not expecting it to protect me from inadvertent breaking changes; I just expect it to copy data and metadata from one place to the next.

If there are Circus Train (CT) users who depend on CT to protect them from breaking schema changes, I propose we add some metadata to indicate which tables should be protected and which shouldn't. This can be thought of as a dev/prod distinction or something along the lines of safe/unsafe. CT could still attempt to detect partitions which weren't updated to match the schema change and refuse to copy if such partitions exist. That said, I'd argue that this isn't the job of the copy tool, but rather of the Data Lake tooling surrounding the upstream schema change.
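To make the proposal concrete, here is a minimal sketch of what such a check could look like. It is not Circus Train code: the `protected` flag, the function names, and the dict-based schemas are all hypothetical, and a real check would read schemas from the Hive metastore.

```python
# Hypothetical sketch of the proposed guard; none of these names exist in Circus Train.

def stale_partitions(table_schema, partition_schemas):
    """Return the partition keys whose recorded schema differs from the table's."""
    return sorted(key for key, schema in partition_schemas.items()
                  if schema != table_schema)


def check_replicable(table_schema, partition_schemas, protected):
    """Raise if a protected table has partitions that lag behind the table schema.

    For unprotected tables the stale partitions are only reported, not rejected.
    """
    stale = stale_partitions(table_schema, partition_schemas)
    if protected and stale:
        raise ValueError(f"partitions not matching table schema: {stale}")
    return stale


schema = {"id": "string"}
parts = {"2020-01-01": {"id": "int"}, "2020-02-01": {"id": "string"}}
print(check_replicable(schema, parts, protected=False))  # ['2020-01-01']
```

With `protected=True` the same call would raise instead of returning, which is the "safe" behaviour described above.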

This has been discussed with @massdosage

JayGreeeen commented 4 years ago

Thanks for the extension suggestion!

I'm working on adding a FULL_OVERWRITE replication mode, which, when specified in the config file, will drop the existing target table and replace it with a copy of the source table, keeping the same name as the dropped target table. Effectively, it does a full update of data and metadata each time. Is this what you're after?
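The difference between the two modes can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not the Circus Train implementation: the `Table` class and `replicate` function are hypothetical, and the FULL branch is simplified to "update the schema and copy source partitions, leaving other target partitions in place".

```python
# Toy model of the two modes; all names are hypothetical, not Circus Train APIs.
from dataclasses import dataclass, field


@dataclass
class Table:
    schema: dict                                     # column name -> type
    partitions: dict = field(default_factory=dict)   # partition key -> rows


def replicate(source, catalog, name, mode):
    """Copy `source` into `catalog[name]` according to `mode`."""
    if mode == "FULL_OVERWRITE":
        # Drop whatever is on the target; it is recreated below under the same name.
        catalog.pop(name, None)
    target = catalog.setdefault(name, Table(schema=dict(source.schema)))
    target.schema = dict(source.schema)
    # Copy every source partition; partitions that exist only on the target
    # are left untouched (assumed FULL behaviour in this toy model).
    for key, rows in source.partitions.items():
        target.partitions[key] = list(rows)


catalog = {"db.events": Table({"id": "int"}, {"2020-01-01": [(1,)]})}
source = Table({"id": "string", "ts": "timestamp"}, {"2020-02-01": [("a", "t")]})
replicate(source, catalog, "db.events", "FULL_OVERWRITE")
print(sorted(catalog["db.events"].partitions))  # ['2020-02-01']
```

In this model, running the same call with `"FULL"` would leave the old `2020-01-01` partition on the target alongside the new one, which is the kind of stale data a repopulated table wants to avoid.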

Dima1224 commented 4 years ago

Yep, that sounds like it would do the trick. One thing I'd keep in mind is that this will likely be the mode for tables that are being actively iterated on. At some point that table will be hardened and the consumer will likely want to switch the replication mode to "regular."

JayGreeeen commented 4 years ago

Great! And yep, that's fine: when you no longer want to overwrite the target each time, you can just use the FULL replication mode, as before.

JayGreeeen commented 4 years ago

The PRs relating to this issue have both been merged, so I will now close this ticket.