apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
6.37k stars 2.2k forks source link

Pushdown SUBSTRING filter when equivalent to STARTSWITH #8911

Closed tombarti closed 1 week ago

tombarti commented 12 months ago

Feature Request / Improvement

Summary

When filtering an Iceberg table in Spark, would it be possible to pushdownSUBSTRING filters when the substring begins with the start of the word (position 1)?

For example, would it be possible to push down to the BatchScan this filter:

WHERE SUBSTRING(mycol, 1, 3) IN ('foo', 'bar', 'baz')

Since it is equivalent to:

WHERE STARTSWITH(mycol, 'foo') OR STARTSWITH(mycol, 'bar') OR STARTSWITH(mycol, 'baz')

Which does indeed get pushed down as I can see from the physical plan that it is included in the BatchScan:

== Physical Plan ==
*(1) ColumnarToRow
+- BatchScan catalog.db.mytable[mycol#9, name#10] catalog.db.mytable (branch=null) [filters=((mycol LIKE 'foo%' OR mycol LIKE 'bar%') OR mycol LIKE 'baz%'), groupedBy=] RuntimeFilters: []

Use Case

Suppose I have a table which contains location related data with a geohash column which is used to partition the data as follows:

CREATE TABLE IF NOT EXISTS dev.db.locations (
    geohash string,
    location_name string,
    identifier integer
)
USING ICEBERG
PARTITIONED BY (truncate(2, geohash))

Now let's insert some data:

INSERT INTO dev.db.locations VALUES
   ('fec229', 'name_1', 1),
   ('fez228', 'name_2', 2),
   ('f2c229', 'name_1', 3),
   ('c2c23c', 'name_2', 4),
   ('c2c22c', 'name_3', 5),
   ('h3c236', 'name_3', 6),

I would like for the filter to be pushed down when perform the following sort of query:

SELECT *
FROM dev.db.locations
WHERE SUBSTRING(geohash, 1, n) IN (...)

Where n could vary in size from one query to another depending on the precision (the length) of geohashes we want to filter on. For example, if we are interested in geohashes of precision 2, this would be:

SELECT *
FROM dev.db.locations
WHERE SUBSTRING(geohash, 1, 2) IN ('fe', 'c2')

This is currently not the case as can be seen by the physical plan generated by the above query:

== Physical Plan ==
*(1) Filter substring(geohash#9, 1, 2) IN (fe,c2)
+- *(1) ColumnarToRow
   +- BatchScan dev.db.locations[geohash#9, location_name#10, identifier#11] dev.db.locations (branch=null) [filters=, groupedBy=] RuntimeFilters: []

[!IMPORTANT] Note that in this use case, the IN (...) set could contain hundreds of thousands of elements. Would this be viable?

Query engine

Spark

RussellSpitzer commented 12 months ago

I think this could be done but the fix would have to be in Spark I believe. Spark needs to be able to convert Substring => StartsWith on their end

tombarti commented 12 months ago

Thanks for the quick reply @RussellSpitzer, so what you are saying is that this really should be implemented in Spark and once it is, there is nothing much to do on the Iceberg side?

RussellSpitzer commented 12 months ago

Iceberg uses the Datasource API from Spark, so we only see filters and expressions that Spark decides to pass through to us. In this case "substring" is just not an expression it can push through. What it can push through is "StartsWith" so in Spark we would want an analysis rule that converted Substring(1, X) => StartsWith.

Another possible avenue to support this sort of thing would be to use the Iceberg truncate expression and an in clause. That may be possible in just Iceberg.

tombarti commented 12 months ago

Thanks for taking the time to explain, that all makes sense now!

I can see that #7886 in Iceberg 1.4.0 could be handful for the other avenue you are suggesting!

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 1 week ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'