Closed RussellSpitzer closed 2 months ago
I will give this a try.
I did some initial changes here https://github.com/apache/iceberg/pull/3745.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Currently we rely on several Spark internal classes when listing the contents of file-based tables for several of our migrate/add_files functions.
See
https://github.com/apache/iceberg/blob/f5a753791f4dc6aca78569a14f731feda9edf462/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java#L810-L854
The cost of this operation scales directly with the number of files/folders in the table, regardless of the partition filter we are applying. It may make sense to push down the filters used in the operation (in the case of add_files) or to do the listing in a more economical way.
For example: imagine a user calls add_files and specifies a single partition in a table. The current code performs a full listing of every directory (and many of the files in the table) before filtering that list down to only the partitions that match the request.
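As a rough illustration of the difference, the sketch below contrasts a full recursive listing with resolving a single partition's directory directly from the filter. This is not the actual Spark3Util code; the method names, the `Map`-based partition filter, and the Hive-style `key=value` directory layout are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class PartitionListingSketch {

  // What happens today (conceptually): a full recursive walk whose cost
  // scales with every file and folder in the table, even when we only
  // care about one partition.
  static List<Path> listAll(Path tableLocation) throws IOException {
    try (Stream<Path> walk = Files.walk(tableLocation)) {
      return walk.filter(Files::isRegularFile).collect(Collectors.toList());
    }
  }

  // Hypothetical "pushed down" listing: build the Hive-style partition
  // path (e.g. table/ds=2021-01-01) from the filter and list only that
  // directory, skipping every non-matching partition entirely.
  static List<Path> listPartition(Path tableLocation, Map<String, String> partitionFilter)
      throws IOException {
    Path partitionDir = tableLocation;
    for (Map.Entry<String, String> e : partitionFilter.entrySet()) {
      partitionDir = partitionDir.resolve(e.getKey() + "=" + e.getValue());
    }
    try (Stream<Path> walk = Files.walk(partitionDir)) {
      return walk.filter(Files::isRegularFile).collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    // Build a tiny fake table layout in a temp dir: two partitions, one file each.
    Path table = Files.createTempDirectory("tbl");
    for (String ds : new String[] {"2021-01-01", "2021-01-02"}) {
      Path dir = Files.createDirectories(table.resolve("ds=" + ds));
      Files.createFile(dir.resolve("data.parquet"));
    }

    System.out.println("full listing:      " + listAll(table).size() + " files");
    System.out.println("partition listing: "
        + listPartition(table, Map.of("ds", "2021-01-01")).size() + " files");
  }
}
```

With a real object store (S3, etc.) the gap is much larger than this toy example suggests, since each directory walked is one or more remote LIST calls rather than a local filesystem operation.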
I don't have a good plan for this at the moment, since our code relies so heavily on Spark to perform the listing, but I assume we can do better.