Fokko opened 9 months ago
Hi @Fokko,
Thanks for making this issue! Funnily enough, Jay and I were talking about implementing something similar yesterday, after running some initial profiling of the PyIceberg query path on a table with a large number of partitions (we can share the results offline). But we found that pyiceberg spent a bunch of time creating record objects for partitions that could have been pruned at the manifest level.
I'm thinking we can implement a visitor trait in Rust and then expose it as an ABC in Python. The pattern you have in your visitor interfaces looks pretty good; I can try to follow that as much as possible.
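For the Python side, here is a minimal sketch of what that ABC could look like, loosely modeled on PyIceberg's `BooleanExpressionVisitor`. All the method names are hypothetical; the real trait would live in Daft's Rust core:

```python
# A hypothetical sketch of the Python-facing ABC, not an existing API.
from abc import ABC, abstractmethod
from typing import Any, Generic, List, TypeVar

T = TypeVar("T")

class ExpressionVisitor(ABC, Generic[T]):
    """Walks a Daft expression tree and produces a value of type T
    (e.g. a PyIceberg BooleanExpression) at each node."""

    @abstractmethod
    def visit_column(self, name: str) -> T: ...

    @abstractmethod
    def visit_literal(self, value: Any) -> T: ...

    @abstractmethod
    def visit_and(self, children: List[T]) -> T: ...

    @abstractmethod
    def visit_or(self, children: List[T]) -> T: ...

    @abstractmethod
    def visit_binary_op(self, op: str, left: T, right: T) -> T: ...
```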
The main gotcha I'm thinking about is what happens if we run into an expression that can't be translated to an Iceberg expression (like a UDF or one of our IO function expressions). Should we filter out these subtrees before beginning the traversal in Python? Another alternative is to throw an exception for non-supported nodes; the user could then decide how to deal with them, as in the sketch below.
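As a rough illustration of that second alternative, a minimal Python sketch; `UnsupportedExpressionError`, `translate`, and `pushdown_filter` are all hypothetical names, not an existing Daft or PyIceberg API:

```python
# A sketch of the "throw on unsupported nodes" option. Only AlwaysTrue and
# BooleanExpression come from PyIceberg; everything else is hypothetical.
from pyiceberg.expressions import AlwaysTrue, BooleanExpression

class UnsupportedExpressionError(Exception):
    """Raised when a subtree (e.g. a UDF) has no Iceberg equivalent."""

def translate(expr) -> BooleanExpression:
    # Stand-in for the visitor: a real implementation would walk the
    # Daft expression tree and build the PyIceberg equivalent.
    raise UnsupportedExpressionError(f"cannot translate: {expr!r}")

def pushdown_filter(expr) -> BooleanExpression:
    try:
        return translate(expr)
    except UnsupportedExpressionError:
        # Fall back to no pruning; correctness is preserved because Daft
        # still applies the full filter to the rows after the scan.
        return AlwaysTrue()
```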
> But we found that pyiceberg spent a bunch of time creating record objects for partitions that could have been pruned at the manifest level.
Yes, that's exactly the case now :) And I would love to implement a visitor, similar to what we did for PyArrow.
> The main gotcha I'm thinking about is what happens if we run into an expression that can't be translated to an Iceberg expression (like a UDF or one of our IO function expressions). Should we filter out these subtrees before beginning the traversal in Python? Another alternative is to throw an exception for non-supported nodes; the user could then decide how to deal with them.
This is an interesting case. There are a couple of functions we can add support for, `lower()` for example, but there will always be functions that PyIceberg won't be able to support. That results in a situation similar to having no statistics: we'll just include those manifests in the query.
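One conservative way to handle an unsupported function is to weaken just that subtree to `AlwaysTrue()`, which is sound wherever the subtree sits in a conjunction (but not under a `NOT`): the pruned set then over-approximates the true result. A toy sketch, where the `lower(region)` predicate is the hypothetical untranslatable part:

```python
# Weakening an untranslatable subtree inside an AND is safe: every file the
# original filter would keep is still kept, only the pruning gets coarser.
from pyiceberg.expressions import And, AlwaysTrue, GreaterThan

# Original (hypothetical): lower(region) = 'eu' AND ts > '2024-03-01'
# `lower(region) = 'eu'` can't be translated, so it becomes AlwaysTrue():
partial = And(AlwaysTrue(), GreaterThan("ts", "2024-03-01T00:00:00"))
# `partial` is then safe to pass as the row_filter for manifest pruning.
```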
There is also an interesting optimization we can do here. Let's say your table is partitioned daily and you query the data for the last 24 hours. For the current day's partition we don't need to do any further filtering, which might be interesting from a performance standpoint; most engines today still apply the row filter anyway. In Iceberg we call this residual filtering.
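A toy illustration of the idea (plain Python, not a PyIceberg API): with a `ts >= now - 24h` filter on a daily-partitioned table, the current day's partition lies entirely past the cutoff, so its residual is trivially true:

```python
# Toy residual computation for a one-sided `ts >= cutoff` predicate on a
# daily-partitioned table. Column name `ts` and the times are made up.
from datetime import datetime, timedelta

now = datetime(2024, 3, 2, 18, 0)    # assumed "current" time
cutoff = now - timedelta(hours=24)   # filter: ts >= cutoff

def residual_for_day(day_start: datetime) -> str:
    """Filter still needed for rows in partition [day_start, day_start + 1d)."""
    if day_start >= cutoff:
        return "ALWAYS_TRUE"              # whole partition passes the predicate
    return f"ts >= {cutoff.isoformat()}"  # partially covered: keep the filter

print(residual_for_day(datetime(2024, 3, 2)))  # current day -> ALWAYS_TRUE
print(residual_for_day(datetime(2024, 3, 1)))  # yesterday  -> residual filter
```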
Let me know if there is anything I can help with, I'm very excited to see this happening 👍
Is your feature request related to a problem? Please describe.
For the current Iceberg implementation, we read more metadata than necessary. The query planning phase of an Iceberg query starts with the manifest list, and for each manifest in that list it is known which partitions are stored. Please refer to the spec, where `507: partitions (list<508: field_summary>)` contains the upper and lower bounds of the partitions, tracked under the manifest's `502: partition_spec_id (int)`.
I would love to write a visitor to convert a Daft expression into a PyIceberg expression. Passing this to `.scan(limit=limit, row_filter=BooleanExpression).plan_files()` will skip manifests that don't match the predicate.
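For reference, a hedged sketch of what that call could look like with PyIceberg's existing scan API; the catalog name, table identifier, and `ts` column are made up for the example:

```python
# A minimal sketch of the pruning call, assuming a configured catalog named
# "default" and a table `db.events` partitioned on a timestamp column `ts`
# (all hypothetical names).
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import And, GreaterThanOrEqual, LessThan

catalog = load_catalog("default")
table = catalog.load_table("db.events")

row_filter = And(
    GreaterThanOrEqual("ts", "2024-03-01T00:00:00"),
    LessThan("ts", "2024-03-02T00:00:00"),
)

# plan_files() evaluates the filter against the partition summaries in the
# manifest list, so manifests whose bounds don't overlap are never opened.
for task in table.scan(row_filter=row_filter, limit=1000).plan_files():
    print(task.file.file_path)
```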
I tried inspecting `pushdowns.filters`, but it didn't give me anything to work with. Very few properties are exposed in Python, and it looks like the object mainly lives in Rust.

Describe the solution you'd like
I can work with:
Describe alternatives you've considered
While https://github.com/Eventual-Inc/Daft/issues/1798 aims to include the statistics in Daft, which is great, I think we should add this as well. Evaluating the metrics lazily will probably take quite a while, and this would be an easy win to speed up query planning.