Support source/sink for plain Parquet/ORC/Avro Tables

anoopj commented 1 year ago

Supporting plain Parquet/ORC/Avro (partitioned as well as unpartitioned) may be useful for "upgrading" legacy data to table formats. Sink may be useful for exporting a specific snapshot for interoperability reasons.

This feature is lower priority, as Iceberg, Delta, etc. have native support for metadata-only conversions and offer Spark procedures.

the-other-tim-brown commented 1 year ago

@anoopj what would the metadata look like for a sink export?

I like the idea of a generic bootstrap so that users could take existing data and try out all 3 formats if they want to do some testing with other tools.

anoopj commented 1 year ago

> @anoopj what would the metadata look like for a sink export?

Sink could be based on manifest files in SymlinkTextInputFormat. BigQuery also now supports manifest files.
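As a rough illustration of the manifest idea, here is a minimal sketch in Python. SymlinkTextInputFormat reads text files in which each line is the path of a data file, so a sink could write the live files of a snapshot into such manifests. The function name `write_symlink_manifests`, the one-manifest-per-partition-directory layout, and the example `s3://` URIs are assumptions made for illustration; none of this is existing XTable code.

```python
import posixpath
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable


def write_symlink_manifests(
    table_root: str, data_files: Iterable[str], manifest_dir: str
) -> Dict[str, str]:
    """Write one text manifest per partition directory.

    Each manifest lists, one per line, the data files of a snapshot that
    live in that partition. Returns {partition path: manifest file path}.
    """
    root = table_root.rstrip("/") + "/"
    files_by_partition = defaultdict(list)
    for uri in data_files:
        if not uri.startswith(root):
            raise ValueError(f"{uri} is not under {table_root}")
        # Directory part relative to the table root; "" when unpartitioned.
        partition = posixpath.dirname(uri[len(root):])
        files_by_partition[partition].append(uri)

    written = {}
    for partition, uris in sorted(files_by_partition.items()):
        target = Path(manifest_dir, partition, "manifest")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("\n".join(sorted(uris)) + "\n")
        written[partition] = str(target)
    return written
```

The list of data files is taken as input rather than discovered by listing directories, so the manifests reflect one specific snapshot of the source table.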

> I like the idea of a generic bootstrap so that users could take existing data and try out all 3 formats if they want to do some testing with other tools.

Yes, bootstrap is probably higher priority than sink.

the-other-tim-brown commented 10 months ago

@jackwener any interest in looking into something like this?

marqub commented 6 months ago

@the-other-tim-brown I'm trying to find a good first issue to ramp up on XTable. Can I take a look at this one? Perhaps we can split it into different issues. One initial task could be to add support for the Parquet input data format, for example. I'm not sure what the code looks like, but ultimately, we can create something modular enough to extend to Avro or other formats later, if it hasn't been done already. I would be interested in discussing the possible approaches to fill in the partitioning and statistics info...

However, just to check that I understand the scenario correctly: if today I wanted to bootstrap 2 different systems, Hudi and Iceberg, with existing Parquet files, couldn't I use the native capabilities of either system for a 1st initial import, and then use the current XTable to generate the metafiles for the remaining system?

the-other-tim-brown commented 6 months ago

> @the-other-tim-brown I'm trying to find a good first issue to ramp up on XTable. Can I take a look at this one? Perhaps we can split it into different issues. One initial task could be to add support for the Parquet input data format, for example. I'm not sure what the code looks like, but ultimately, we can create something modular enough to extend to Avro or other formats later, if it hasn't been done already. I would be interested in discussing the possible approaches to fill in the partitioning and statistics info...

I think it makes sense to start with just one of the file formats like Parquet. We can discuss how to get the info you would need.
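For the partitioning part, one possible starting point is to read partition values off the file paths. The sketch below assumes a Hive-style directory layout (`key=value` path segments), which this thread does not specify; the function name `partition_values` and the example paths are hypothetical. Column statistics are not covered here; those would have to come from the Parquet file footers.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional
from urllib.parse import unquote

# Hive writes this directory name when the partition value is null.
HIVE_NULL = "__HIVE_DEFAULT_PARTITION__"


def partition_values(table_root: str, file_path: str) -> Dict[str, Optional[str]]:
    """Extract partition column values from a Hive-style data file path.

    Returns an empty dict for a file that sits directly under the table root.
    Values are returned as strings; typing them is left to the caller.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(table_root))
    values: Dict[str, Optional[str]] = {}
    for segment in relative.parts[:-1]:  # skip the file name itself
        key, sep, raw = segment.partition("=")
        if not sep:
            raise ValueError(f"not a key=value partition directory: {segment}")
        # Hive percent-encodes special characters in partition values.
        values[key] = None if raw == HIVE_NULL else unquote(raw)
    return values
```

For example, `partition_values("/data/sales", "/data/sales/year=2024/month=01/part-0.parquet")` yields `{"year": "2024", "month": "01"}`. A table that does not follow this layout would need the partition spec supplied some other way, such as user configuration.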

> However, just to check that I understand the scenario correctly: if today I wanted to bootstrap 2 different systems, Hudi and Iceberg, with existing Parquet files, couldn't I use the native capabilities of either system for a 1st initial import, and then use the current XTable to generate the metafiles for the remaining system?

Yes you could do that as well.

There is another issue I had my eye on that I could guide you through as well if you are interested: https://github.com/apache/incubator-xtable/issues/411

marqub commented 6 months ago

> I think it makes sense to start with just one of the file formats like Parquet. We can discuss how to get the info you would need.
>
> > However, just to check that I understand the scenario correctly: if today I wanted to bootstrap 2 different systems, Hudi and Iceberg, with existing Parquet files, couldn't I use the native capabilities of either system for a 1st initial import, and then use the current XTable to generate the metafiles for the remaining system?
>
> Yes you could do that as well.

Ok, if you agree that we want to move away from this workaround approach, then I think supporting Parquet is a good first issue for me to smooth the learning curve.

> There is another issue I had my eye on that I could guide you through as well if you are interested: #411

OK, this one could be a good next step, but for now I prefer to limit the amount of novelty.

I should have some time to start on the Parquet issue next week. How do you prefer to communicate? Is there a Slack channel?

the-other-tim-brown commented 6 months ago

@marqub we do not have a Slack set up for the project yet. I can shoot you an email to connect and discuss any of the details in the meantime.

Reactor11 commented 1 month ago

Hi, is someone working on this? I am new to this project and would like to get started.

the-other-tim-brown commented 3 weeks ago

> Hi, Is someone working on it? I am new to this project and would like to get started.

@Reactor11 there is a similar effort for a Parquet file source that is being worked on: https://github.com/apache/incubator-xtable/issues/553

apache / incubator-xtable

Support source/sink for plain Parquet/ORC/Avro Tables #166