apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0
Parquet column array<struct<>> with null value is read in as empty list #251

Open puchengy opened 8 months ago

puchengy commented 8 months ago

Apache Iceberg version

main (development)

Please describe the bug 🐞

An Iceberg table column of type array<struct<>> that contains a null value is read in as an empty list; it should be read in as None instead.

Reproduction script: https://github.com/puchengy/iceberg-python/commit/3fd6d3d3e4b237bda98e40c36bb07e7e4035c2f2

The test fails with:

>       assert pyberg_val == direct_val
E       assert [] == None

Fokko commented 8 months ago

Great catch @puchengy, let me see what's needed to fix this

Fokko commented 8 months ago

I've found the issue. We don't respect the null count when fetching the array through the accessor:

[image not preserved]

We just return the array and then create a new array with offset 1, which injects a [] in the slot where the null should be.
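To illustrate the mechanism, here is a minimal pure-Python sketch of an Arrow-style list layout (offsets, child values, and a validity list). It is not PyIceberg's or PyArrow's actual code; `rebuild_lists` and the sample data are hypothetical names made up for this example. It only assumes that a null list slot typically has equal start and end offsets, so rebuilding from offsets alone yields an empty list for it.

```python
from typing import Any, List, Optional


def rebuild_lists(
    offsets: List[int],
    values: List[Any],
    validity: Optional[List[bool]] = None,
) -> List[Optional[List[Any]]]:
    """Rebuild Python lists from an Arrow-style (offsets, values) pair.

    Hypothetical helper for illustration only. When `validity` is None,
    every slot is treated as non-null, which mirrors the bug described above.
    """
    rows: List[Optional[List[Any]]] = []
    for i in range(len(offsets) - 1):
        if validity is not None and not validity[i]:
            # Respect the null marker instead of slicing the child values.
            rows.append(None)
        else:
            rows.append(values[offsets[i]:offsets[i + 1]])
    return rows


# Three rows: [{"a": 1}], null, [{"a": 2}, {"a": 3}]
values = [{"a": 1}, {"a": 2}, {"a": 3}]
offsets = [0, 1, 1, 3]            # the null row spans an empty range (1..1)
validity = [True, False, True]    # False marks the null row

without_validity = rebuild_lists(offsets, values)
with_validity = rebuild_lists(offsets, values, validity)

print(without_validity[1])  # the null row comes back as an empty list
print(with_validity[1])     # the null row is preserved as None
```

In this model, the only difference between the two results is whether the validity information is carried along when the list is rebuilt, which is the same distinction as `[]` versus `None` in the failing assertion.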

HonahX commented 7 months ago

One edge case is still unfixed. We need to wait for an upstream fix: https://github.com/apache/arrow/issues/38809

ref: https://github.com/apache/iceberg-python/pull/252#discussion_r1467065763

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

HonahX commented 1 month ago

Replying to re-activate the issue :)