apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Schema issue between Arrow and PyIceberg #8913

Open asheeshgarg opened 8 months ago

asheeshgarg commented 8 months ago

Apache Iceberg version

1.4.1 (latest release)

Query engine

Other

Please describe the bug 🐞

@Fokko we have a table in Iceberg where some of the column names begin with numbers. We are able to scan the table using PyIceberg. When we try to bind it to Arrow or DuckDB, we get an Arrow invalid error: FieldRef.Name no match for field.

What we observe is that in Arrow a field name beginning with a number, like 2030_ABC, is renamed to _2030_ABC, while the Iceberg schema correctly defines it as 2030_ABC, which matches the original data. This name mismatch triggers the issue.
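To make the mismatch concrete, here is a rough sketch, assuming the rename simply prefixes an underscore to any name that starts with a digit (which is what we see for 2030_ABC). The helper names and the `id` column are made up for illustration; they are not PyIceberg or Arrow APIs, and the real sanitization rule may cover more cases.

```python
# Hypothetical helpers for illustration only; not part of PyIceberg or PyArrow.


def sanitize_leading_digit(name: str) -> str:
    """Prefix an underscore when a name starts with a digit (assumed rule)."""
    if name and name[0].isdigit():
        return "_" + name
    return name


def find_renamed_fields(iceberg_names, arrow_names):
    """Return (iceberg_name, arrow_name) pairs where the Iceberg name is
    missing on the Arrow side but its sanitized form is present."""
    arrow_set = set(arrow_names)
    renamed = []
    for name in iceberg_names:
        candidate = sanitize_leading_digit(name)
        if name not in arrow_set and candidate in arrow_set:
            renamed.append((name, candidate))
    return renamed


iceberg_names = ["id", "2030_ABC"]  # names as defined in the Iceberg schema
arrow_names = ["id", "_2030_ABC"]   # names as they appear on the Arrow side

print(find_renamed_fields(iceberg_names, arrow_names))
# prints [('2030_ABC', '_2030_ABC')]
```

A lookup for 2030_ABC against the Arrow-side names fails because only _2030_ABC exists there, which is consistent with the "no match for field" error.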

This seems to be more of an Arrow bug, and I'm happy to open it at the Arrow project. Let me know.

Fokko commented 8 months ago

Thanks @asheeshgarg for raising this. The PyIceberg repository has been moved to https://github.com/apache/iceberg-python. To get the quickest answers, it is best to raise the question over there.

We recently merged a PR that fixes this when reading Avro fields: https://github.com/apache/iceberg-python/pull/83

Can you check whether this issue still persists on the main branch? You can easily install it using:

pip install "git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[arrow]"

If you run into anything, could you also share the error? Thanks 🙌