apache / iceberg-rust

Apache Iceberg
https://rust.iceberg.apache.org/
Apache License 2.0
477 stars 97 forks source link

Basic Integration with Datafusion #324

Closed marvinlanhenke closed 2 months ago

marvinlanhenke commented 3 months ago

I had some fun, working with Datafusion and I want to share the (rough) design with you.

As outlined in #242 I went ahead and:


The overall structure should be extensible, meaning - if we want to support an integration with e.g. polars we simply create a new subfolder and implement the traits.

The test infra is supposed to be used by every integration.

The IcebergCatalogProvider, for example, should be usable by any Catalog (since it Arc for the client (dynamic dispatch))

I really would appreciate your thoughts on this approach - so we can discuss how to proceed, whats missing upstream (like #277), etc.

marvinlanhenke commented 3 months ago

@liurenjie1024 @ZENOTME @Fokko PTAL and let me know what you think.

liurenjie1024 commented 3 months ago

Thanks @marvinlanhenke This is amazing, I'll take a review later!

ZENOTME commented 2 months ago

Thanks! Sorry for replying late. I think this is a good start for the integration work. And I have completed https://github.com/apache/iceberg-rust/pull/277. Maybe we can convert this PR to ready for review now. @marvinlanhenke

marvinlanhenke commented 2 months ago

Thanks! Sorry for replying late. I think this is a good start for the integration work. And I have completed #277. Maybe we can convert this PR to ready for review now. @marvinlanhenke

thanks for the feedback - i made some minor changes based on the suggestions; and converted this PR into ready for review. If we agree on the basic design here, we could merge and split the actual impl into multiple issues.

@liurenjie1024 @ZENOTME @Fokko @Xuanwo @sdd @tshauck PTAL - especially, the part about "feature-flags" I have no idea whats best practice in 🦀

marvinlanhenke commented 2 months ago

I've left some comment to improve, but it looks great! I'll invite datafusion community to help review.

@liurenjie1024 Thanks for the review. I fixed most of the basic issues regarding structure, naming, async-trait etc.

I'll work on the missing impls over the next couple of days - and see how far we can get with the current state of iceberg-rust.

marvinlanhenke commented 2 months ago

@liurenjie1024 @ZENOTME @viirya @simonvandel ...I just went ahead and pushed the recent updates; PTAL

unresolved / todo:

liurenjie1024 commented 2 months ago

@liurenjie1024 @ZENOTME @viirya @simonvandel ...I just went ahead and pushed the recent updates; PTAL

unresolved / todo:

  • [ ] proper integration tests for table scan (requires us to setup a table with actual snapshots / data )
  • [ ] a cache for IcebergCatalogProvider and IcebergSchemaProvider / so data does not become stale
  • [ ] improve impl ExecutionPlan for IcebergTableScan (once we support filter pushdown, etc.)

Hi, @marvinlanhenke How about opening a tracking issue to track the issues related to datafusion integration? I guess we will have more things to do, and we can edit on that tracking issue.

liurenjie1024 commented 2 months ago

Let's wait a moment to see what others think about the catalog snapshot problem. cc @Xuanwo @Fokko @viirya @sdd PTAL

marvinlanhenke commented 2 months ago

Let's wait a moment to see what others think about the catalog snapshot problem. cc @Xuanwo @Fokko @viirya @sdd PTAL

If it turns out, that the PR is okay (for now) - let me update the missing docs; before we merge - should make the follow up work a lot easier in a couple of days/weeks.

marvinlanhenke commented 2 months ago

I've addressed those issues raised by @Xuanwo - thanks for the review.

I think we can merge this PR for now - and continue working on #357; especially the caching issues; and proper integration testing with data.

@Fokko @liurenjie1024 I haven't checked now, but from memory I believe py-iceberg uses pyspark to setup tables with actual data for proper integration testing? Would this make sense to tackle such an issue in the next weeks in order to provide proper integration testing? Or should we wait once #349 lands and we have a dedicated crate for e2e testing?

viirya commented 2 months ago

Thanks for working on this @marvinlanhenke . Looks good to me. There are some more integrations with DataFusion I'm interested to work with like improving ExecutionPlan for IcebergTableScan. Let's move this forward.

liurenjie1024 commented 2 months ago

I haven't checked now, but from memory I believe py-iceberg uses pyspark to setup tables with actual data for proper integration testing?

I think so. I think datafusion also python bindings so we can use python for all tests?

I've not reviewed #349 yet, but for datafusion, we can go on without a separate crate for integrate tests?

liurenjie1024 commented 2 months ago

Thanks @marvinlanhenke for this pr, and @Fokko @Xuanwo @viirya @simonvandel @tshauck for review!