apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

PyArrow: Avoid buffer overflow by avoiding a sort #1555

Closed · Fokko closed this 2 weeks ago

Fokko commented 2 weeks ago

Second attempt of https://github.com/apache/iceberg-python/pull/1539

This was already being discussed back here: https://github.com/apache/iceberg-python/issues/208#issuecomment-1889891973

This PR changes the approach from sorting the table and then making a single pass over it, to determining the unique partition tuples and filtering on each of them individually.

Fixes https://github.com/apache/iceberg-python/issues/1491

The sort caused buffers to be joined in a way that overflows in Arrow. I think this is an issue on the Arrow side: it should automatically break the data up into smaller buffers. The combine_chunks method does this correctly.

Now:

Run 0 took: 0.42877754200890195
Run 1 took: 0.2507691659993725
Run 2 took: 0.24833179199777078
Run 3 took: 0.24401691700040828
Run 4 took: 0.2419595829996979
Average runtime of 0.28 seconds

Before:

Run 0 took: 1.0768639159941813
Run 1 took: 0.8784021250030492
Run 2 took: 0.8486490420036716
Run 3 took: 0.8614017910003895
Run 4 took: 0.8497851670108503
Average runtime of 0.9 seconds

So it comes with a nice speedup as well :)
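The numbers above could come from a simple timing harness along these lines (a hypothetical sketch; the actual benchmark code is not shown in the thread):

```python
import time

def benchmark(fn, runs: int = 5) -> list[float]:
    """Time fn over several runs and print per-run and average timings."""
    timings = []
    for i in range(runs):
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        print(f"Run {i} took: {elapsed}")
        timings.append(elapsed)
    print(f"Average runtime of {sum(timings) / len(timings):.2f} seconds")
    return timings
```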

kevinjqliu commented 2 weeks ago

make: *** [Makefile:55: test-integration] Aborted (core dumped)

uh oh

Fokko commented 2 weeks ago

@kevinjqliu I think the test is a bit too heavy; according to your comment here https://github.com/apache/iceberg-python/pull/1539#discussion_r1922705843 the test allocates almost 5 GB 😀

kevinjqliu commented 2 weeks ago

2^32 bytes (4_294_967_296) is around 4 GB; we just need to test a scenario greater than that
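The sizing arithmetic is straightforward; a quick sketch of how one might compute the row count needed to cross the threshold (the per-value size is an assumption for illustration):

```python
# 2**32 bytes is the ~4 GB threshold referenced above.
threshold = 2 ** 32
assert threshold == 4_294_967_296
assert threshold / 1024 ** 3 == 4.0  # exactly 4 GiB

# To exceed it with fixed-size values, pick a value size (assumed here)
# and compute how many rows push the total past the threshold.
value_size = 1024  # bytes per value (assumption for illustration)
rows_needed = threshold // value_size + 1
assert rows_needed * value_size > threshold
```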