Closed RussellSpitzer closed 2 months ago
I will give this a try.
I did some initial changes here https://github.com/apache/iceberg/pull/3745.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Currently we rely on several Spark internal classes when listing the contents of file-based tables for several of our migrate/add_files functions.
See
https://github.com/apache/iceberg/blob/f5a753791f4dc6aca78569a14f731feda9edf462/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java#L810-L854
The cost of this operation scales directly with the number of files/folders in the table, regardless of the partition filter we are applying. It may make sense to push down the filters used in the operation (in the case of add_files) or to do the listing in a more economical way.
For example: imagine a user calls add_files and specifies a single partition in a table. The current code performs a full listing of every directory (and many of the files in the table) before filtering that list down to only the partitions that match the request.
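As a rough illustration of the difference, the sketch below contrasts a full recursive listing with resolving a single partition's directory directly from the filter. This is not the actual Spark3Util code; the method names, the `Map`-based partition filter, and the Hive-style `key=value` directory layout are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class PartitionListingSketch {

  // What happens today (conceptually): a full recursive walk whose cost
  // scales with every file and folder in the table, even when we only
  // care about one partition.
  static List<Path> listAll(Path tableLocation) throws IOException {
    try (Stream<Path> walk = Files.walk(tableLocation)) {
      return walk.filter(Files::isRegularFile).collect(Collectors.toList());
    }
  }

  // Hypothetical "pushed down" listing: build the Hive-style partition
  // path (e.g. table/ds=2021-01-01) from the filter and list only that
  // directory, skipping every non-matching partition entirely.
  static List<Path> listPartition(Path tableLocation, Map<String, String> partitionFilter)
      throws IOException {
    Path partitionDir = tableLocation;
    for (Map.Entry<String, String> e : partitionFilter.entrySet()) {
      partitionDir = partitionDir.resolve(e.getKey() + "=" + e.getValue());
    }
    try (Stream<Path> walk = Files.walk(partitionDir)) {
      return walk.filter(Files::isRegularFile).collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    // Build a tiny fake table layout in a temp dir: two partitions, one file each.
    Path table = Files.createTempDirectory("tbl");
    for (String ds : new String[] {"2021-01-01", "2021-01-02"}) {
      Path dir = Files.createDirectories(table.resolve("ds=" + ds));
      Files.createFile(dir.resolve("data.parquet"));
    }

    System.out.println("full listing:      " + listAll(table).size() + " files");
    System.out.println("partition listing: "
        + listPartition(table, Map.of("ds", "2021-01-01")).size() + " files");
  }
}
```

With a real object store (S3, etc.) the gap is much larger than this toy example suggests, since each directory walked is one or more remote LIST calls rather than a local filesystem operation.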
I don't have a good plan for this at the moment, since our code relies so heavily on Spark to perform the listing, but I assume we can do better.