ExpediaGroup / circus-train

Circus Train is a dataset replication tool that copies Hive tables between clusters and clouds.
Apache License 2.0

Circus Train should handle Avro Table replication out of the box #131

Closed abhimanyugupta07 closed 5 years ago

abhimanyugupta07 commented 5 years ago

Circus Train should detect when a table is an Avro table, possibly with an external schema, and automatically trigger the ct-avro transform so that the external schema is copied to the replica data lake.

Context

At the moment, the circus-train-avro transform is triggered only when the following configuration is provided in the CT config file:

transform-options:
  avro-serde-options:
    base-url: s3://shunting-yard-target/bdp/abhi_avro_test

If this configuration is not provided, CT treats the replication as a regular one. As a result, the replica table's avro.schema.url parameter still points to the source table's schema location, which is incorrect.

Proposed solution:

CT should detect that the table being replicated is an Avro table, trigger the ct-avro transform automatically, and use the table's location as the default location for the schema.
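
A minimal sketch of what the detection and fallback could look like, assuming the table's SerDe class name and parameters are already available as plain Java values. The class `AvroDetectionSketch` and its method names are hypothetical and are not part of Circus Train; only the `avro.schema.url` parameter (from this issue) and the Hive Avro SerDe class name are real. Treating "the table's location" as the replica table's location is also an assumption.

```java
import java.util.Map;

/** Hypothetical sketch only; this is not Circus Train's actual implementation. */
final class AvroDetectionSketch {

  // Fully qualified name of Hive's Avro SerDe.
  static final String AVRO_SERDE = "org.apache.hadoop.hive.serde2.avro.AvroSerDe";

  // Table parameter that points at an external Avro schema (named in this issue).
  static final String AVRO_SCHEMA_URL = "avro.schema.url";

  private AvroDetectionSketch() {}

  /**
   * Treats a table as Avro if it uses the Avro SerDe or declares an external
   * schema URL in its parameters.
   */
  static boolean isAvroTable(String serdeClass, Map<String, String> tableParameters) {
    return AVRO_SERDE.equals(serdeClass)
        || (tableParameters != null && tableParameters.containsKey(AVRO_SCHEMA_URL));
  }

  /**
   * Uses the configured base-url when one is present; otherwise falls back to
   * the table location (assumed here to be the replica table's location).
   */
  static String resolveSchemaBaseUrl(String configuredBaseUrl, String tableLocation) {
    if (configuredBaseUrl != null && !configuredBaseUrl.trim().isEmpty()) {
      return configuredBaseUrl;
    }
    return tableLocation;
  }
}
```

With a check like this, the transform could run whenever `isAvroTable` returns true, and the explicit `base-url` setting would remain an optional override rather than the trigger.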