apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Add `deltalake` feature #2025

Open matthewmturner opened 2 years ago

matthewmturner commented 2 years ago

Is your feature request related to a problem or challenge?

I would like to be able to register a Delta Lake table from SQL as part of my work on datafusion-tui. For example:

CREATE EXTERNAL TABLE dt
STORED AS DELTATABLE
LOCATION 's3://bucket/schema/table'

From what I've seen, this would require adding a FileType and FileFormat for deltatable under a deltalake feature, similar to the existing avro feature.

While I understand a delta table isn't exactly a file type or format, I think it meets the definition for our purposes. I've queried delta tables before, and they are registered with register_table rather than register_listing_table. So I think we would just need to match on FileType and, for a delta table, call register_table instead.
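To make the proposed dispatch concrete, here is a minimal, self-contained sketch. The `FileType` and `Registration` enums and both functions are hypothetical stand-ins written for illustration, not DataFusion's actual types; only the names `register_table` and `register_listing_table` come from the discussion above.

```rust
/// Hypothetical stand-in for the file types accepted by `STORED AS`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FileType {
    Parquet,
    Csv,
    Avro,
    /// Assumed to be compiled in only when a `deltalake` feature is enabled.
    DeltaTable,
}

/// Which registration path a given file type would take.
#[derive(Debug, PartialEq)]
enum Registration {
    /// Corresponds to calling `register_listing_table`.
    ListingTable,
    /// Corresponds to calling `register_table`.
    Table,
}

/// Parse the `STORED AS` keyword, ignoring case.
fn parse_file_type(s: &str) -> Option<FileType> {
    match s.to_ascii_uppercase().as_str() {
        "PARQUET" => Some(FileType::Parquet),
        "CSV" => Some(FileType::Csv),
        "AVRO" => Some(FileType::Avro),
        "DELTATABLE" => Some(FileType::DeltaTable),
        _ => None,
    }
}

/// Match on the file type and pick the registration path:
/// delta tables go through `register_table`, everything else
/// through `register_listing_table`.
fn registration_for(file_type: FileType) -> Registration {
    match file_type {
        FileType::DeltaTable => Registration::Table,
        FileType::Parquet | FileType::Csv | FileType::Avro => Registration::ListingTable,
    }
}
```

With these definitions, `STORED AS DELTATABLE` would parse to `FileType::DeltaTable` and be routed to the `register_table` path, while the existing formats keep their current behaviour.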

Describe the solution you'd like

Enable a deltatable FileFormat and FileType behind a deltalake feature.

matthewmturner commented 2 years ago

@houqp I imagine you'll have a view on this.

houqp commented 2 years ago

I think you should be able to add deltalake support to datafusion-tui by leveraging the existing table provider directly without touching datafusion core, see: https://github.com/delta-io/delta-rs/blob/main/rust/src/delta_datafusion.rs.

There is no need to add a new file type or file format, because Delta Lake only uses Parquet, which DataFusion already supports.

matthewmturner commented 2 years ago

Interesting, OK. I thought there was more magic going on under the hood (though I hadn't really had a chance to look into it), but maybe that only comes into play with some of Delta Lake's more advanced features, like time travel, which I don't think is doable without SQL extensions.

I'll try it out and get back to you. Thanks!

avantgardnerio commented 2 years ago

I think @matthewmturner is on to something. The SQL in this issue is straightforward and makes sense from an intuitive user perspective. The reason it doesn't work seems to be a limitation of DataFusion:

  1. Additional TableProviders can be registered in Rust applications (e.g. datafusion-tui or ballista)
  2. Files can be registered in SQL dynamically at run time - but for built-in TableProviders only
  3. However, there is no dynamic (SQL) way to register a new table with a custom table provider

Our use case is running a Ballista server with delta-rs compiled in, so that users can register tables in locations we can't know at compile time. Unfortunately, I don't think the way FileFormats currently work makes this possible.
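One way to picture the missing piece is a registry that maps the `STORED AS` name to a factory supplied by the embedding application. The sketch below is purely illustrative and uses only the standard library: `Table`, `TableFactory`, `FactoryRegistry`, `register_factory`, and `create_table` are hypothetical names, not part of DataFusion, Ballista, or delta-rs.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a registered table. A real engine would
/// hold a table provider here rather than two strings.
#[derive(Debug, PartialEq)]
struct Table {
    provider: String,
    location: String,
}

/// Hypothetical factory: builds a table from the LOCATION given in SQL.
type TableFactory = Box<dyn Fn(&str) -> Result<Table, String>>;

/// Maps a `STORED AS` name to the factory that handles it.
#[derive(Default)]
struct FactoryRegistry {
    factories: HashMap<String, TableFactory>,
}

impl FactoryRegistry {
    /// Called once at startup by the embedding application,
    /// e.g. a server built with a Delta Lake provider compiled in.
    fn register_factory(&mut self, stored_as: &str, factory: TableFactory) {
        self.factories
            .insert(stored_as.to_ascii_uppercase(), factory);
    }

    /// Called at run time when a statement such as
    /// `CREATE EXTERNAL TABLE ... STORED AS <stored_as> LOCATION '<location>'`
    /// is executed. The location is only known at this point.
    fn create_table(&self, stored_as: &str, location: &str) -> Result<Table, String> {
        let factory = self
            .factories
            .get(&stored_as.to_ascii_uppercase())
            .ok_or_else(|| format!("no table factory registered for {}", stored_as))?;
        factory(location)
    }
}
```

Under this assumed design, the set of supported formats is fixed at compile time by whichever factories the application registers, while table locations remain a run-time input from SQL.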