Closed: symfrog closed this issue 4 years ago
It is not clear what the advantage of retaining instant timestamps between pipelines is. Can you elaborate on what the issue is when migrating tables?
To your question about the code: it is not just commits that operate on instant times. Other background actions like compaction and cleaning use internally generated timestamps, so those would need to be handled too. My suggestion is to first make sure this way of operating has a valid use-case and is really needed.
@bvaradar The purpose is to handle an unavoidable schema evolution that is not backward compatible: we would maintain the original tables for some period of time to allow downstream clients to migrate to the new set of tables.
The new set of tables would be a transformation (e.g. rename columns) of the original tables.
However, we would like downstream clients to be able to use their instant values to continue to do incremental pulls without receiving data they have already processed when they switch over to the new tables (conforming to the new schema).
The new tables would be created during an initialization process to ingest all the data from the old tables and transform it to the new schema. After this initialization process, we would like the instant timestamps to be the same in the new target tables after the transformation so that downstream clients can continue to use their existing instant values while performing incremental pull queries.
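For concreteness, the incremental pulls on the client side are just the standard Hudi datasource incremental query, along the lines of the sketch below. The lastInstant value is the checkpoint each client persists after a run; the exact option keys vary a little between Hudi releases (older ones use hoodie.datasource.view.type rather than the query-type key).

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalPullExample {

  // lastInstant is the checkpoint instant each downstream client persists after a run.
  public static Dataset<Row> pullSince(SparkSession spark, String basePath, String lastInstant) {
    return spark.read()
        .format("org.apache.hudi")
        // On older releases the first key is hoodie.datasource.view.type instead.
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", lastInstant)
        .load(basePath);
  }
}
```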
we would like the instant timestamps to be the same in the new target tables after the transformation so that downstream clients can continue to use their existing instant values while performing incremental pull queries.
IIUC the current initialization process hands you a single commit for the first ingest.. but you basically want a physical copy of the old data, as the new data, with just renamed fields/new schema.. In general, this may be worth adding support for in the new exporter tool cc @xushiyan ... wdyt? essentially, something that will preserve file names and just transform the data.
For now, even if you create those commit timeline files yourself in .hoodie, it may not work, since the metadata inside will point to files that no longer exist in the new table.. Here's an approach that could work.. Writing a small program, that will

1. Copy the .hoodie folder to the new table location
2. Read the data files from the old table, transform the records to the new schema, and write them back out under the same file names, next to the .hoodie folder you copied above

Essentially, you will have the same file names and same timeline (.hoodie) metadata, just with a different schema..
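To make that concrete, here is a rough, untested sketch of that small program, assuming a COW table with only parquet base files. It glosses over Hudi-specific parquet footer metadata (bloom filters, min/max record keys), so treat it purely as an illustration of the copy-timeline-and-preserve-file-names idea; the transform() helper is hypothetical and just maps columns by position for a plain rename.

```java
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class CopyTableWithNewSchema {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path oldTable = new Path(args[0]);
    Path newTable = new Path(args[1]);
    // Target Avro schema, passed in as a JSON string (or load it from a file).
    Schema newSchema = new Schema.Parser().parse(args[2]);

    // Step 1: copy the .hoodie folder so the new table keeps the exact same timeline,
    // and therefore the same instant times, as the old table.
    FileUtil.copy(fs, new Path(oldTable, ".hoodie"),
                  fs, new Path(newTable, ".hoodie"), false, conf);

    // Step 2: rewrite every parquet base file under the SAME relative path and file
    // name, converting each record to the new schema along the way.
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(oldTable, true);
    while (files.hasNext()) {
      Path src = files.next().getPath();
      if (src.toString().contains("/.hoodie/")) {
        continue; // timeline already copied above
      }
      // Naive relative-path computation; good enough for a sketch.
      String relative = src.toString().substring(oldTable.toString().length());
      Path dst = new Path(newTable + relative);
      if (!src.getName().endsWith(".parquet")) {
        // e.g. .hoodie_partition_metadata files: copy them unchanged.
        FileUtil.copy(fs, src, fs, dst, false, conf);
        continue;
      }
      try (ParquetReader<GenericRecord> reader =
               AvroParquetReader.<GenericRecord>builder(src).withConf(conf).build();
           ParquetWriter<GenericRecord> writer =
               AvroParquetWriter.<GenericRecord>builder(dst)
                   .withSchema(newSchema).withConf(conf).build()) {
        GenericRecord record;
        while ((record = reader.read()) != null) {
          writer.write(transform(record, newSchema));
        }
      }
    }
  }

  // Hypothetical transform for a pure column rename: map fields by position from the
  // old schema to the new one. Anything more involved would replace this.
  private static GenericRecord transform(GenericRecord oldRecord, Schema newSchema) {
    GenericRecord out = new GenericData.Record(newSchema);
    List<Schema.Field> oldFields = oldRecord.getSchema().getFields();
    List<Schema.Field> newFields = newSchema.getFields();
    for (int i = 0; i < newFields.size(); i++) {
      out.put(newFields.get(i).name(), oldRecord.get(oldFields.get(i).name()));
    }
    return out;
  }
}
```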
Let's also wait to hear from @xushiyan; maybe the exporter tool could be reused here.
@vinothchandar yes, exactly, some schema evolution operations may also involve the splitting or merging of tables
@vinothchandar Yes, the exporter tool can be used for this purpose, with some changes. It currently supports copying a Hudi dataset as-is. For this migration use case, we could extend it to apply a transformation when --output-format hudi is used, via a custom Transformer.
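For illustration, the Transformer I have in mind is the same interface DeltaStreamer uses today (package paths shift a little between releases). A trivial column-rename version, with hypothetical old_name/new_name columns, might look like:

```java
import org.apache.hudi.common.config.TypedProperties; // org.apache.hudi.common.util.TypedProperties on older releases
import org.apache.hudi.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical transformer: rename old_name to new_name during export.
// A real one could read the rename mapping from the supplied properties.
public class RenameColumnsTransformer implements Transformer {

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    return rowDataset.withColumnRenamed("old_name", "new_name");
  }
}
```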
Though MOR is a bit troublesome with log file conversion, we could start with COW table support? Does this work for your case? @symfrog
As for splitting/merging use cases, that is feasible as well; it needs some more logic in the exporter to take multiple source/target paths, and some effort to support multiple datasets in the Transformer interface.
@vinothchandar Are my thoughts above aligned with yours?
@xushiyan Yes, thanks, that would work. I am using COW for the tables.
we could start with COW tables support?
sg. as long as we throw a loud exception saying MOR + transformer is not supported :)
Time to file a JIRA?
@vinothchandar filed https://jira.apache.org/jira/browse/HUDI-767 https://jira.apache.org/jira/browse/HUDI-768
@bvaradar @vinothchandar ok to close this?
Thanks @xushiyan. If you are planning to have the JIRAs done in 0.6.0, can you mark the fix versions accordingly?
@bvaradar Yes, I marked 767 for 0.6.0. I'll put 768 on waiting list at the moment 😄
Describe the problem you faced
Is there any way to retain the commit instant time for records when using delta streamer with a Hudi table source?
I took a look at the code, and it does not seem possible.
I am trying to migrate tables, but would like downstream clients to be able to continue doing incremental pulls transparently using their existing instant time values after migration.
Is there any other way to achieve this?
It seems it might be possible when constructing a HoodieWriteClient directly and using startCommitWithTime (https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L867). Would this be a viable route?
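A rough, untested sketch of what I am imagining is below; class and package names are taken from the current master layout, so they may move around between releases, and the records are assumed to already be keyed and transformed to the new schema.

```java
import org.apache.hudi.client.HoodieWriteClient;
import org.apache.hudi.client.WriteStatus;
import org.apache.hudi.common.model.HoodieAvroPayload;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PreserveInstantWriter {

  // Writes records (already keyed and transformed to the new schema) into the target
  // table, reusing an instant time taken from the source table's timeline rather than
  // letting the client generate a fresh one.
  public static JavaRDD<WriteStatus> writeWithOriginalInstant(
      JavaSparkContext jsc, String targetBasePath, String writeSchema, String tableName,
      JavaRDD<HoodieRecord<HoodieAvroPayload>> records, String originalInstantTime) {

    HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
        .withPath(targetBasePath)
        .withSchema(writeSchema)
        .forTable(tableName)
        .build();

    HoodieWriteClient<HoodieAvroPayload> client = new HoodieWriteClient<>(jsc, config);
    // The key part: start the commit with the carried-over instant time.
    client.startCommitWithTime(originalInstantTime);
    JavaRDD<WriteStatus> statuses = client.upsert(records, originalInstantTime);
    client.close();
    return statuses;
  }
}
```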
Environment Description