apache / iceberg-rust

Apache Iceberg
https://rust.iceberg.apache.org/
Apache License 2.0

Query specific table snapshot with datafusion. #702

Closed ryzhyk closed 2 days ago

ryzhyk commented 3 days ago

I would like to query an iceberg table via datafusion, but I need to run the query against a specific snapshot of the table. I've been studying the datafusion table provider implementation and, if my understanding is correct, it always runs the query against the latest snapshot of the table. So my question is whether it is currently possible to register a specific table snapshot with datafusion.

Thank you!

ryzhyk commented 2 days ago

I created #707 to try to address this

liurenjie1024 commented 2 days ago

Hi @ryzhyk, this is possible using iceberg-rust's API: https://github.com/apache/iceberg-rust/blob/6e0bcf56028e0d20d5ceeedf87dbb3db7c929ee3/crates/iceberg/src/scan.rs#L131

But it is currently not possible with datafusion, since the datafusion integration provides no place for the user to specify the snapshot id of a table.

ryzhyk commented 2 days ago

Thank you!

Yes, I understand it's doable directly with the iceberg crate, but I prefer to use datafusion in this case, as it allows running a SQL statement over the Iceberg table. In the past I implemented the same functionality using delta-rs and their datafusion adapter. Their API is similar in spirit to what I implemented in my PR: the DeltaLake table is configured with a specific version before exposing it to datafusion.
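
The pattern described above, pinning the table to one snapshot before handing it to the query engine, can be sketched in plain Rust. This is only an illustrative model: it uses neither the iceberg nor the datafusion crate, and `Snapshot`, `Table`, `PinnedTable`, `pin`, and `example_table` are hypothetical names, not APIs from either project or from the PR.

```rust
/// Hypothetical stand-in for a table snapshot (not the iceberg crate's type).
#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    pub id: i64,
    pub rows: Vec<String>,
}

/// Hypothetical table: a history of snapshots, oldest first.
#[derive(Debug, Clone)]
pub struct Table {
    pub snapshots: Vec<Snapshot>,
}

/// A view of the table fixed to a single snapshot.
/// In the real setting, this is the object a query engine would be given.
#[derive(Debug)]
pub struct PinnedTable<'a> {
    snapshot: &'a Snapshot,
}

impl Table {
    /// Pin the table to `snapshot_id`, or to the latest snapshot when `None`.
    pub fn pin(&self, snapshot_id: Option<i64>) -> Result<PinnedTable<'_>, String> {
        let snapshot = match snapshot_id {
            // An explicit id must exist in the history, otherwise it is an error.
            Some(id) => self
                .snapshots
                .iter()
                .find(|s| s.id == id)
                .ok_or_else(|| format!("snapshot {} not found", id))?,
            // No id given: fall back to the most recent snapshot.
            None => self
                .snapshots
                .last()
                .ok_or_else(|| "table has no snapshots".to_string())?,
        };
        Ok(PinnedTable { snapshot })
    }
}

impl PinnedTable<'_> {
    /// Id of the snapshot this view is fixed to.
    pub fn snapshot_id(&self) -> i64 {
        self.snapshot.id
    }

    /// Rows visible in the pinned snapshot only.
    pub fn scan(&self) -> &[String] {
        &self.snapshot.rows
    }
}

/// Small two-snapshot table used for illustration.
pub fn example_table() -> Table {
    Table {
        snapshots: vec![
            Snapshot { id: 1, rows: vec!["a".to_string()] },
            Snapshot { id: 2, rows: vec!["a".to_string(), "b".to_string()] },
        ],
    }
}
```

With this shape, `example_table().pin(Some(1))` yields a view that only ever sees the rows of snapshot 1, while `pin(None)` reproduces the "always latest" behaviour discussed in this issue. The snapshot choice is made once, when the table is registered, so the SQL layer needs no snapshot-specific syntax.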