Closed tombarti closed 1 week ago
I think this could be done, but the fix would have to be in Spark: Spark needs to be able to convert Substring => StartsWith on its end.
Thanks for the quick reply @RussellSpitzer, so what you are saying is that this really should be implemented in Spark and once it is, there is nothing much to do on the Iceberg side?
Iceberg uses the DataSource API from Spark, so we only see filters and expressions that Spark decides to pass through to us. In this case "substring" is just not an expression it can push through. What it can push through is "StartsWith", so in Spark we would want an analysis rule that converts Substring(1, X) => StartsWith.
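To make the proposed rewrite concrete, here is a plain-Python sketch of the equivalence such an analysis rule would rely on. This is not Spark's Catalyst API — the function names here are hypothetical — it only demonstrates the logic: `substring(value, 1, n) = target` holds exactly when `value.startswith(target)`, provided the substring starts at position 1 and `len(target) == n`.

```python
def substring_predicate(value: str, pos: int, length: int, target: str) -> bool:
    """SQL-style SUBSTRING(value, pos, length) = target (1-based position)."""
    return value[pos - 1 : pos - 1 + length] == target

def try_rewrite_to_starts_with(pos: int, length: int, target: str):
    """Return a StartsWith predicate when the rewrite is valid, else None.

    The rewrite only applies from position 1 with a full-length comparison;
    otherwise the original Substring predicate must be kept.
    """
    if pos == 1 and len(target) == length:
        return lambda value: value.startswith(target)
    return None

# The rewritten predicate agrees with the original on every input,
# including strings shorter than the requested length.
starts_with = try_rewrite_to_starts_with(1, 2, "gb")
for value in ["gbsuv7z", "u10hb42", "gb", "g"]:
    assert starts_with(value) == substring_predicate(value, 1, 2, "gb")
```

A rule like this matters precisely because `StartsWith` is one of the filters Spark can hand to a data source, while `Substring` inside an equality is not.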
Another possible avenue to support this sort of thing would be to use the Iceberg truncate expression and an in clause. That may be possible in just Iceberg.
Thanks for taking the time to explain, that all makes sense now!
I can see that #7886 in Iceberg 1.4.0 could be helpful for the other avenue you are suggesting!
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Feature Request / Improvement
Summary
When filtering an Iceberg table in Spark, would it be possible to push down `SUBSTRING` filters when the substring begins with the start of the word (position 1)? For example, would it be possible to push down this filter to the `BatchScan`:

Since it is equivalent to:

Which does indeed get pushed down, as I can see from the physical plan that it is included in the `BatchScan`:

Use Case
Suppose I have a table which contains location-related data with a geohash column which is used to partition the data as follows:
Now let's insert some data:
I would like the filter to be pushed down when performing the following sort of query:

Where `n` could vary in size from one query to another, depending on the precision (the length) of the geohashes we want to filter on. For example, if we are interested in geohashes of precision 2, this would be:

This is currently not the case, as can be seen from the physical plan generated by the above query:
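To pin down the intended semantics for varying `n`, here is a small plain-Python sketch over hypothetical sample geohashes (illustrative only, not data from the issue): the prefix filter `substring(geohash, 1, n) = prefix` selects exactly the rows a `geohash LIKE 'prefix%'` (i.e. StartsWith) predicate would, for any precision `n`.

```python
# Hypothetical sample geohashes, chosen only to illustrate the filter.
rows = ["gbsuv7z", "gbsuv7x", "u10hb42", "gcpvj0d"]

def prefix_filter(rows, prefix):
    """Rows matching substring(geohash, 1, n) = prefix, with n = len(prefix).

    The same row set falls out of geohash LIKE '<prefix>%', which is the
    StartsWith form that Spark can push down to the scan.
    """
    n = len(prefix)
    by_substring = [r for r in rows if r[0:n] == prefix]
    by_startswith = [r for r in rows if r.startswith(prefix)]
    assert by_substring == by_startswith
    return by_startswith

print(prefix_filter(rows, "gb"))    # precision 2 -> ['gbsuv7z', 'gbsuv7x']
print(prefix_filter(rows, "gbsu"))  # precision 4 -> ['gbsuv7z', 'gbsuv7x']
```

Since the two forms are equivalent row-for-row, rewriting the first into the second (or into a truncate-transform expression) is purely a planning concern and would not change query results.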
Query engine
Spark