apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

[feat] optimize read, pushdown `limit` to file level for `to_arrow` #1038

Closed. kevinjqliu closed this issue 1 month ago

kevinjqliu commented 2 months ago

Feature Request / Improvement

As of now, `limit` is applied only after an entire Parquet file has been read, so a scan with a small limit still pays the cost of reading whole files. https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1360-L1390

The optimization is to push `limit` down to the Parquet reading level, so that reading can stop as soon as enough rows have been collected.

For more details, see this comment
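To illustrate the idea, here is a minimal sketch of applying a limit while batches are being streamed instead of after a whole file is loaded. It is not the PyIceberg implementation: `limit_batches` and `fake_file_batches` are hypothetical names, and plain Python lists stand in for record batches so the example runs without any dependencies. With PyArrow, the truncation step would presumably use `RecordBatch.slice(0, remaining)` instead of list slicing.

```python
from typing import Iterable, Iterator, List, Optional


def limit_batches(
    batches: Iterable[List[int]], limit: Optional[int]
) -> Iterator[List[int]]:
    """Yield batches until `limit` rows have been emitted, then stop.

    Hypothetical helper: each "batch" is a list of rows standing in for
    a record batch read from a Parquet file.
    """
    if limit is None:
        # No limit requested: pass every batch through unchanged.
        yield from batches
        return

    remaining = limit
    if remaining <= 0:
        return

    for batch in batches:
        if len(batch) >= remaining:
            # This batch satisfies the limit: truncate it and stop
            # pulling from the source, so later batches are never read.
            yield batch[:remaining]
            return
        yield batch
        remaining -= len(batch)


def fake_file_batches(consumed: List[int]) -> Iterator[List[int]]:
    """Simulate a file with 5 batches of 4 rows, recording each read."""
    for i in range(5):
        consumed.append(i)
        yield list(range(i * 4, i * 4 + 4))


consumed: List[int] = []
rows = [
    row
    for batch in limit_batches(fake_file_batches(consumed), limit=6)
    for row in batch
]
print(rows)      # the first 6 rows
print(consumed)  # indices of the batches that were actually read
```

Because the source is consumed lazily, only the batches needed to satisfy the limit are ever produced; in this simulation, the remaining three batches are never read.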

soumya-ghosh commented 2 months ago

@kevinjqliu I would like to work on this one.

kevinjqliu commented 2 months ago

sure @soumya-ghosh, assigned to you

The solution might look similar to what is already done for `project_batches` in #1042: https://github.com/apache/iceberg-python/blob/f05b1aedee8451d981188adf68be5e8b360a9ca1/pyiceberg/io/pyarrow.py#L1457-L1479

kevinjqliu commented 1 month ago

Closed by #1043 (see comment)