apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Backwards Incompatible Schema Evolution #1480

Closed · symfrog closed this issue 4 years ago

symfrog commented 4 years ago

Describe the problem you faced

Is there any way to retain the commit instant time for records when using delta streamer with a Hudi table source?

I took a look at the code, and it does not seem possible.

I am trying to migrate tables, but would like downstream clients to be able to continue doing incremental pulls transparently using their existing instant time values after migration.

Is there any other way to achieve this?

It seems it might be possible when constructing a HoodieWriteClient directly and using startCommitWithTime (https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L867). Would this be a viable route?
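For illustration, a minimal sketch of that route, assuming the HoodieWriteClient API on master at the time (package locations, the table name, and the record RDD are placeholders, not a tested migration path):

```java
import org.apache.hudi.client.HoodieWriteClient;
import org.apache.hudi.client.WriteStatus;
import org.apache.hudi.common.model.HoodieAvroPayload;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReuseInstantExample {

  // Write records into the new table while reusing an instant time carried
  // over from the old table, instead of letting the client generate one.
  public static void writeWithExistingInstant(JavaSparkContext jsc,
                                              String basePath,
                                              String newAvroSchema,
                                              JavaRDD<HoodieRecord<HoodieAvroPayload>> records,
                                              String instantTime) {
    HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
        .withPath(basePath)
        .withSchema(newAvroSchema)   // schema after the incompatible evolution
        .forTable("new_table")       // hypothetical table name
        .build();

    HoodieWriteClient<HoodieAvroPayload> client = new HoodieWriteClient<>(jsc, cfg);
    try {
      client.startCommitWithTime(instantTime);                           // reuse the old instant
      JavaRDD<WriteStatus> statuses = client.upsert(records, instantTime);
      client.commit(instantTime, statuses);
    } finally {
      client.close();
    }
  }
}
```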

Environment Description

bvaradar commented 4 years ago

It is not clear what the advantage of retaining instant timestamps between pipelines is. Can you elaborate more on what the issue is when migrating tables?

To your question related to code, it is not just commits that work on instant times. Other background actions like compaction and cleaning use internally generated timestamps, so those need to be handled too. My suggestion is to really make sure this way of operating has a valid use case and is really needed.

symfrog commented 4 years ago

@bvaradar the purpose relates to an unavoidable schema evolution that is not backward compatible: we would maintain the original tables for some period of time to allow downstream clients to migrate to the new set of tables.

The new set of tables would be a transformation (e.g. rename columns) of the original tables.

However, when they switch over to the new tables (which conform to the new schema), we would like downstream clients to be able to keep using their existing instant values for incremental pulls without receiving data they have already processed.

The new tables would be created during an initialization process that ingests all the data from the old tables and transforms it to the new schema. After this initialization, we would like the instant timestamps in the new target tables to match the originals, so that downstream clients can continue to use their existing instant values when performing incremental pull queries.
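For reference, a downstream client's incremental pull is driven entirely by the begin instant it passes to the read. A sketch of what that looks like (base path and instant value are placeholders; the option keys are the standard Hudi DataSource read options):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalPullExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("incremental-pull").getOrCreate();

    // Pull only records committed after the instant this client last consumed.
    // If the new table preserves the original instant timestamps, the stored
    // value can be reused unchanged after switching over to the migrated table.
    Dataset<Row> newRows = spark.read()
        .format("org.apache.hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20200315000000") // placeholder instant
        .load("s3://bucket/new_table");                                       // hypothetical base path

    newRows.show(false);
  }
}
```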

vinothchandar commented 4 years ago

we would like the instant timestamps to be the same in the new target tables after the transformation so that downstream clients can continue to use their existing instant values while performing incremental pull queries.

IIUC the current initialization process hands you a single commit for the first ingest, but you basically want a physical copy of the old data, as the new data, with just renamed fields/new schema. In general, this may be worth adding support for in the new exporter tool cc @xushiyan ... wdyt? Essentially, something that will preserve file names and just transform the data.

For now, even if you create those commit timeline files yourself in .hoodie, it may not work since the metadata inside will point to files that no longer exist in the new table. Here's an approach that could work: writing a small program that copies the timeline as-is and rewrites each data file under its original file name, changing only the schema.

Essentially, you will have the same file names and same timeline (.hoodie) metadata, just with a different schema.
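As a rough sketch of that idea (not a complete tool: it assumes a COW table with parquet files, ignores the parquet footer metadata Hudi writes, and the paths and column rename are made up), such a program could look like:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SchemaMigrationCopy {

  public static void main(String[] args) throws IOException {
    String oldBase = args[0];   // e.g. s3://bucket/old_table
    String newBase = args[1];   // e.g. s3://bucket/new_table

    SparkSession spark = SparkSession.builder().appName("hudi-schema-migration").getOrCreate();
    Configuration conf = spark.sparkContext().hadoopConfiguration();
    FileSystem fs = new Path(oldBase).getFileSystem(conf);

    // 1. Copy the timeline metadata verbatim so every instant is preserved.
    FileUtil.copy(fs, new Path(oldBase, ".hoodie"), fs, new Path(newBase, ".hoodie"), false, conf);

    // 2. Rewrite every data file under the same relative path and file name,
    //    applying only the schema transformation (here: a column rename).
    Path tmp = new Path(newBase, ".migration_tmp");
    for (FileStatus f : listDataFiles(fs, new Path(oldBase))) {
      String relative = f.getPath().toString().substring(oldBase.length() + 1);
      Dataset<Row> df = spark.read().parquet(f.getPath().toString())
          .withColumnRenamed("old_col", "new_col");   // placeholder transformation

      // Spark writes a directory, so write to a temp location and move the
      // single part file into place under the original file name.
      df.coalesce(1).write().mode("overwrite").parquet(tmp.toString());
      Path part = fs.globStatus(new Path(tmp, "part-*"))[0].getPath();
      Path target = new Path(newBase, relative);
      fs.mkdirs(target.getParent());
      fs.rename(part, target);
    }
    fs.delete(tmp, true);
  }

  // Recursively collect parquet data files, skipping the .hoodie folder.
  private static List<FileStatus> listDataFiles(FileSystem fs, Path base) throws IOException {
    List<FileStatus> out = new ArrayList<>();
    for (FileStatus s : fs.listStatus(base)) {
      if (s.getPath().getName().equals(".hoodie")) {
        continue;
      }
      if (s.isDirectory()) {
        out.addAll(listDataFiles(fs, s.getPath()));
      } else if (s.getPath().getName().endsWith(".parquet")) {
        out.add(s);
      }
    }
    return out;
  }
}
```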

Let's also wait to hear from @xushiyan. Maybe the exporter tool could be reused here.

symfrog commented 4 years ago

@vinothchandar yes, exactly. Some schema evolution operations may also involve the splitting or merging of tables.

xushiyan commented 4 years ago

@vinothchandar Yes, the exporter tool can be used for this purpose, with some changes. It currently supports copying a Hudi dataset as-is. For this migration use case, we could extend it to apply a transformation when --output-format is hudi, using a custom Transformer.
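For illustration, a custom Transformer for the column-rename case could be as small as the sketch below (this assumes the DeltaStreamer Transformer interface; the column names are placeholders and package locations may differ across Hudi versions):

```java
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

/**
 * Hypothetical transformer that renames a column as part of the
 * backwards-incompatible schema migration discussed in this thread.
 */
public class RenameColumnTransformer implements Transformer {

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    // "old_col" / "new_col" stand in for the real rename mapping, which could
    // also be supplied through the properties.
    return rowDataset.withColumnRenamed("old_col", "new_col");
  }
}
```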

Though MOR is a bit troublesome due to log file conversion, we could start with COW table support? Does this work for your case? @symfrog

As for the splitting/merging use cases, those are feasible as well; there is some more logic to implement for the exporter to take multiple source/target paths, and some effort needed to support multiple datasets in the Transformer interface.

@vinothchandar Are my thoughts above aligned with yours?

symfrog commented 4 years ago

@xushiyan Yes, thanks, that would work. I am using COW for the tables.

vinothchandar commented 4 years ago

we could start with COW table support?

Sounds good, as long as we throw a loud exception saying MOR + transformer is not supported :)

Time to file a JIRA?

xushiyan commented 4 years ago

@vinothchandar filed https://jira.apache.org/jira/browse/HUDI-767 https://jira.apache.org/jira/browse/HUDI-768

@bvaradar @vinothchandar ok to close this?

bvaradar commented 4 years ago

Thanks @xushiyan. If you are planning to have the JIRAs done in 0.6.0, can you mark the fix versions accordingly?

xushiyan commented 4 years ago

@bvaradar Yes, I marked 767 for 0.6.0. I'll put 768 on the waiting list for the moment 😄