ExpediaGroup / circus-train

Circus Train is a dataset replication tool that copies Hive tables between clusters and clouds.
Apache License 2.0

Don't override target cluster replication factor #132

Closed patduin closed 5 years ago

patduin commented 5 years ago

As a user of CT, I'd like CT to use the target cluster's replication factor by default, so that I don't silently overwrite it.

CT uses the Hadoop configuration (core-site.xml, etc.) of the cluster it runs on to configure the M/R job that copies the data. If you push data to a different cluster, you might therefore override the dfs.replication factor for that data with the source cluster's setting instead of the target cluster's. We should change that behaviour so that the setting is only overridden when it is explicitly set in the copier-options section of the CT yml configuration.

The override is particularly tricky when you run CT in EMR (dfs.replication=1 by default) and replicate to an on-premises HDFS cluster, which usually has dfs.replication=3.
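
To make the requested precedence concrete, here is a minimal sketch of the rule described above, modelled with plain dictionaries. It is not CT code and does not touch Hadoop: the function name `build_copy_job_conf` and both parameters are hypothetical, and the sketch assumes that leaving `dfs.replication` out of the job configuration stands for "don't override the target". Whether Hadoop can actually honour that is outside the scope of the sketch.

```python
from typing import Dict, Mapping

REPLICATION_KEY = "dfs.replication"


def build_copy_job_conf(
    local_conf: Mapping[str, str],
    copier_options: Mapping[str, object],
) -> Dict[str, str]:
    """Hypothetical helper: derive the copy job's settings.

    local_conf     -- settings read from the cluster CT runs on
    copier_options -- the copier-options section of the CT yml, as a mapping
    """
    # Start from the local cluster's settings without mutating the input.
    job_conf = dict(local_conf)

    # Never inherit the replication factor from the cluster CT runs on.
    # Assumption: an absent key models "leave it to the target cluster".
    job_conf.pop(REPLICATION_KEY, None)

    # Only an explicit copier-options entry sets the replication factor.
    if REPLICATION_KEY in copier_options:
        job_conf[REPLICATION_KEY] = str(copier_options[REPLICATION_KEY])

    return job_conf
```

For the EMR case, `build_copy_job_conf({"dfs.replication": "1"}, {})` yields a configuration without `dfs.replication`, while passing `{"dfs.replication": 3}` as the copier options yields one that carries `"3"`.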

Acceptance Criteria:

patduin commented 5 years ago

We can't really do this correctly, see comments in the PR. I'm closing this.