ExpediaGroup / circus-train

Circus Train is a dataset replication tool that copies Hive tables between clusters and clouds.
Apache License 2.0

Avro external schemas set on partitions are copied for all partitions #203

Closed. patduin closed this issue 4 years ago.

patduin commented 4 years ago

Replicating a partitioned table whose partitions have serde properties pointing to an external "avro.schema.url" generates a copy job (M/R) for every partition, even if the schema file is the same for all partitions.

We need to see if we can optimise that into a single copy.
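One possible direction, sketched below, is to collect the distinct "avro.schema.url" values across all partitions first and then schedule one copy per distinct URL instead of one per partition. This is only an illustration of the idea, not Circus Train's actual code: the class `SchemaCopyPlanner`, the method `distinctSchemaUrls`, and the use of plain maps to stand in for each partition's serde parameters are all hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: not part of Circus Train. */
public class SchemaCopyPlanner {

  /** Serde property that points at an external Avro schema file. */
  static final String AVRO_SCHEMA_URL = "avro.schema.url";

  /**
   * Returns each distinct schema URL once, in first-seen order.
   * Each map stands in for the serde parameters of one partition (an assumption
   * made to keep the sketch free of Hive dependencies).
   */
  public static Set<String> distinctSchemaUrls(List<Map<String, String>> partitionSerdeParameters) {
    Set<String> urls = new LinkedHashSet<>();
    for (Map<String, String> parameters : partitionSerdeParameters) {
      String url = parameters.get(AVRO_SCHEMA_URL);
      // Partitions without an external schema contribute nothing.
      if (url != null && !url.isEmpty()) {
        urls.add(url);
      }
    }
    return urls;
  }
}
```

With this in place, a table with thousands of partitions that all share one schema file would need a single schema copy; partitions that genuinely point at different files would still each get their own.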