apache / incubator-xtable

Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.
https://xtable.apache.org/
Apache License 2.0

add support for converting traditional hive tables to iceberg/delta/hudi #550

Open djouallah opened 1 day ago

djouallah commented 1 day ago

Feature Request / Improvement

There are a lot of systems that produce only parquet files. It would be useful if XTable could convert those parquet files to modern table formats continuously, without rewriting data, just by adding metadata.

Delta does that already, but it is a one-off operation and can't accept new files.


JDLongPLMR commented 14 hours ago

This seems like a pretty easy lift. There are a number of use cases where simply adding parquet files to the table would be handy.

vinishjail97 commented 11 hours ago

Yes, this can be done. We need to implement a parquet source class that does two things: retrieve the snapshot, and retrieve the change log since lastSyncTime.

Using List files

  1. List all parquet files in ObjectStorage or HDFS root path to retrieve the snapshot. This would be a simple list call.
  2. Fetch the parquet files that have been added since the last syncInstant to retrieve the change log. We can do this via the same list call; filtering files based on their creationTime is the simplest way, but it's expensive.
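The listing approach above could be sketched roughly like this. Class and method names are hypothetical, not part of XTable's API, and `java.nio.file` stands in for the ObjectStorage/HDFS client you would actually use:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the listing-based parquet source: a full listing
// gives the snapshot; filtering on file modification time gives the change
// log since the last sync instant.
public class ParquetListingSource {

  // Snapshot: every parquet file under the root path (a simple recursive list).
  static List<Path> listSnapshot(Path root) throws IOException {
    try (Stream<Path> files = Files.walk(root)) {
      return files
          .filter(p -> p.toString().endsWith(".parquet"))
          .collect(Collectors.toList());
    }
  }

  // Change log: files whose modification time is after the last sync instant.
  // Note this re-lists everything and filters, which is the "simple but
  // expensive" approach described above.
  static List<Path> listChangesSince(Path root, Instant lastSyncInstant) throws IOException {
    try (Stream<Path> files = Files.walk(root)) {
      return files
          .filter(p -> p.toString().endsWith(".parquet"))
          .filter(p -> {
            try {
              FileTime mtime = Files.getLastModifiedTime(p);
              return mtime.toInstant().isAfter(lastSyncInstant);
            } catch (IOException e) {
              return false; // skip files we cannot stat
            }
          })
          .collect(Collectors.toList());
    }
  }
}
```

The cost concern is visible in the sketch: `listChangesSince` still walks the full tree on every sync, so the list call scales with total file count, not with the number of new files.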

Using cloud notifications queue

  1. The efficient way of doing this for object stores would be to set up a notifications queue for the bucket, consume it, and insert the file location, creationTime, etc. into a key-value store or a hudi/delta/iceberg table (let's call this the events table). To handle duplicate notifications in the queue, the events table would use the versioned parquet file location as the primary key. If you are using hudi for the events table, we don't need to write new code here; this step would be setting up the cloud queue via terraform and starting a job which consumes from it. Ref: S3EventsSource GcsEventsSource SQS PubSub
  2. XTable parquet source class would trigger an incremental query to the events table to get the new files that have been added since the lastSyncTime and generate hudi, iceberg and delta metadata for them.

The design is similar to what hudi does for ingesting a large number of files; steps 7 and 8 in the architecture would become XTable sync. https://hudi.apache.org/blog/2021/08/23/s3-events-source/

If you are using HDFS or object stores which don't support a queue based system for file notifications, we need to build/re-use existing queue implementation for file notifications.
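The events-table mechanics above could be sketched with an in-memory stand-in. Everything here is illustrative: in practice the map would be a key-value store or a hudi/delta/iceberg table, and the incremental query would be that table's native incremental read:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory stand-in for the "events table": duplicate queue
// notifications collapse on the versioned file location (the primary key),
// and the XTable parquet source issues an incremental query by creation time.
public class EventsTable {

  record FileEvent(String versionedLocation, long creationTimeMs) {}

  // Primary key -> event; re-inserting the same key is an idempotent upsert,
  // which absorbs duplicate notifications from the queue.
  private final Map<String, FileEvent> events = new HashMap<>();

  void consumeNotification(FileEvent e) {
    events.put(e.versionedLocation(), e);
  }

  // Incremental query: files added since lastSyncTime, for which the XTable
  // parquet source would then generate hudi/iceberg/delta metadata.
  List<FileEvent> newFilesSince(long lastSyncTimeMs) {
    return events.values().stream()
        .filter(e -> e.creationTimeMs() > lastSyncTimeMs)
        .sorted(Comparator.comparingLong(FileEvent::creationTimeMs))
        .collect(Collectors.toList());
  }
}
```

Unlike the listing approach, the per-sync cost here scales with the number of new events, not the total file count, at the price of operating the queue and the events table.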

vinishjail97 commented 11 hours ago

@djouallah @JDLongPLMR Let me know what you think of the two approaches. We can write this as a utility tool in xtable-utilities, similar to RunSync.

https://github.com/apache/incubator-xtable/blob/main/xtable-utilities/src/main/java/org/apache/xtable/utilities/RunSync.java

alberttwong commented 9 hours ago

I believe you can convert parquet files to hudi via hudi bootstrapping (https://hudi.apache.org/docs/migration_guide). Once it's in hudi, you can use Apache XTable to convert to other formats. Onehouse can do this automatically.

djouallah commented 8 hours ago

> @djouallah @JDLongPLMR Let me know what you think of the two approaches. We can write this as a utility tool in xtable-utilities, similar to RunSync.
>
> https://github.com/apache/incubator-xtable/blob/main/xtable-utilities/src/main/java/org/apache/xtable/utilities/RunSync.java

Using listing files seems good for my use case.