Closed marvinlanhenke closed 2 months ago
@liurenjie1024 @ZENOTME @Fokko PTAL and let me know what you think.
Thanks @marvinlanhenke This is amazing, I'll take a review later!
Thanks! Sorry for replying late. I think this is a good start for the integration work. And I have completed https://github.com/apache/iceberg-rust/pull/277. Maybe we can convert this PR to ready for review now. @marvinlanhenke
Thanks! Sorry for replying late. I think this is a good start for the integration work. And I have completed #277. Maybe we can convert this PR to ready for review now. @marvinlanhenke
thanks for the feedback - i made some minor changes based on the suggestions; and converted this PR into ready for review. If we agree on the basic design here, we could merge and split the actual impl into multiple issues.
@liurenjie1024 @ZENOTME @Fokko @Xuanwo @sdd @tshauck PTAL - especially, the part about "feature-flags" I have no idea whats best practice in 🦀
I've left some comment to improve, but it looks great! I'll invite datafusion community to help review.
@liurenjie1024 Thanks for the review. I fixed most of the basic issues regarding structure, naming, async-trait etc.
I'll work on the missing impls over the next couple of days - and see how far we can get with the current state of iceberg-rust.
@liurenjie1024 @ZENOTME @viirya @simonvandel ...I just went ahead and pushed the recent updates; PTAL
IcebergCatalogProvider
and IcebergSchemaProvider
/ so data does not become stale@liurenjie1024 @ZENOTME @viirya @simonvandel ...I just went ahead and pushed the recent updates; PTAL
unresolved / todo:
- [ ] proper integration tests for table scan (requires us to setup a table with actual snapshots / data )
- [ ] a cache for
IcebergCatalogProvider
andIcebergSchemaProvider
/ so data does not become stale- [ ] improve impl ExecutionPlan for IcebergTableScan (once we support filter pushdown, etc.)
Hi, @marvinlanhenke How about opening a tracking issue to track the issues related to datafusion integration? I guess we will have more things to do, and we can edit on that tracking issue.
Let's wait a moment to see what others think about the catalog snapshot problem. cc @Xuanwo @Fokko @viirya @sdd PTAL
Let's wait a moment to see what others think about the catalog snapshot problem. cc @Xuanwo @Fokko @viirya @sdd PTAL
If it turns out, that the PR is okay (for now) - let me update the missing docs; before we merge - should make the follow up work a lot easier in a couple of days/weeks.
I've addressed those issues raised by @Xuanwo - thanks for the review.
I think we can merge this PR for now - and continue working on #357; especially the caching issues; and proper integration testing with data.
@Fokko @liurenjie1024 I haven't checked now, but from memory I believe py-iceberg uses pyspark to setup tables with actual data for proper integration testing? Would this make sense to tackle such an issue in the next weeks in order to provide proper integration testing? Or should we wait once #349 lands and we have a dedicated crate for e2e testing?
Thanks for working on this @marvinlanhenke . Looks good to me. There are some more integrations with DataFusion I'm interested to work with like improving ExecutionPlan for IcebergTableScan. Let's move this forward.
I haven't checked now, but from memory I believe py-iceberg uses pyspark to setup tables with actual data for proper integration testing?
I think so. I think datafusion also python bindings so we can use python for all tests?
I've not reviewed #349 yet, but for datafusion, we can go on without a separate crate for integrate tests?
Thanks @marvinlanhenke for this pr, and @Fokko @Xuanwo @viirya @simonvandel @tshauck for review!
I had some fun, working with Datafusion and I want to share the (rough) design with you.
As outlined in #242 I went ahead and:
integrations
datafusion
with its respective modules (catalog.rs, schema.rs, etc.)CatalogProvider
The overall structure should be extensible, meaning - if we want to support an integration with e.g.
polars
we simply create a new subfolder and implement the traits.The test infra is supposed to be used by every integration.
The IcebergCatalogProvider, for example, should be usable by any Catalog (since it Arc for the client (dynamic dispatch))
I really would appreciate your thoughts on this approach - so we can discuss how to proceed, whats missing upstream (like #277), etc.