apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

Explore integration with Delta Lake #174

Open sunchao opened 8 months ago

sunchao commented 8 months ago

What is the problem the feature request solves?

Comet currently supports only Spark's built-in data sources and Iceberg (WIP). We should also consider supporting Delta Lake in the future, especially given that it already has a Rust implementation, delta-rs. To achieve that, however, we may first need to move away from our hybrid Parquet reader implementation to a fully native one.

cc @dennyglee per our discussion

Describe the potential solution

Integrate Comet with delta-rs, so Spark queries reading from Delta Lake tables can also leverage Comet native execution.
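For illustration, here is a minimal sketch (not Comet code) of what the native side of such an integration could look like: delta-rs resolves the table snapshot, and the resulting file list is exactly what a native Parquet reader would scan. The table path is hypothetical, and the `open_table` / `get_file_uris` names are assumptions based on recent delta-rs releases; details vary by version.

```rust
use deltalake::open_table; // delta-rs

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Resolve the latest snapshot of the Delta table (path is hypothetical).
    let table = open_table("/data/events_delta").await?;
    println!("snapshot version: {}", table.version());

    // The snapshot's add-file list is the set of live Parquet files
    // a native reader (such as Comet's) would need to scan.
    for uri in table.get_file_uris()? {
        println!("data file: {uri}");
    }
    Ok(())
}
```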

Additional context

No response

viirya commented 8 months ago

I'm wondering: if we move to a fully native reader, does that mean we need to drop the current JVM-based sources (built-in and Iceberg)? Or are we going to have two types of readers? I ask because I think they might affect how we handle native execution, and they might conflict with each other. (Maybe not, if we have different native source operators. 🤔)

sunchao commented 8 months ago

... does it mean we need to drop current JVM-based source (built-in and Iceberg)?

TBH I don't have concrete ideas at the moment on what the switch to a fully native Parquet reader will look like. But we definitely still need to support these after the migration. We might want to keep both implementations around for some time until the new implementation matures, and eventually remove the old one.

rtyler commented 8 months ago

Hello friends! Checking in from the delta-rs project :smile: We rely heavily on the parquet crate, which now has support for almost all the data types I have ever seen in the wild with Apache Parquet. We recently broke the project up into subcrates, so if you have your own file-reading tools, pulling in deltalake-core will give you the smallest dependency surface area necessary to process a Delta table.

We also publish deltalake-aws, deltalake-azure, and deltalake-gcp, which handle storage-specific requirements, such as some of the silly hacks needed to work around the lack of atomic renames in S3.

You can also take the metacrate deltalake if you want the whole bucket of fun :smile:
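To make the read path concrete, here is a hedged sketch under the crate layout described above: deltalake-core (or the deltalake metacrate, with whichever storage-backend features you need) resolves the table's live files, and the parquet crate's Arrow reader decodes each data file. The file name below is hypothetical, and exact signatures vary across parquet crate versions.

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A data file taken from a Delta snapshot's add-file list (name is hypothetical).
    let file = File::open("part-00000-abc.snappy.parquet")?;

    // The parquet crate decodes Parquet directly into Arrow record batches,
    // the same in-memory format Comet's native operators consume.
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    println!("file schema: {:?}", builder.schema());

    let mut reader = builder.build()?;
    while let Some(batch) = reader.next() {
        println!("decoded {} rows", batch?.num_rows());
    }
    Ok(())
}
```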