Open bama-chi opened 5 months ago
@bama-chi would you be able to provide a file to test this with?
Yes @jorisvandenbossche, sure. Here is a sample of a file that I couldn't read: sample.zip
Did you manage to test it, @jorisvandenbossche?
Yes, I can confirm the error, also with the latest development version of Arrow.
Opening the file with fastparquet and printing the schema:
In [63]: import fastparquet
In [64]: f = fastparquet.ParquetFile("../Downloads/sample/part-00000-e69412f4-236c-436a-a4cd-89318d2aaa3d-c000.snappy.parquet")
In [65]: print(f.schema.text)
- spark_schema:
| - id: BYTE_ARRAY, STRING, UTF8, OPTIONAL
| - email_sha256: BYTE_ARRAY, STRING, UTF8, OPTIONAL
| - params: MAP, MAP, OPTIONAL
|   - map: UNKNOWN, MAP_KEY_VALUE, REPEATED
|   | - key: BYTE_ARRAY, STRING, UTF8, REQUIRED
|     - value: BYTE_ARRAY, STRING, UTF8, OPTIONAL
  - master_id: BYTE_ARRAY, STRING, UTF8, OPTIONAL
I suppose the error is coming from the MAP type column "params" (since the other columns are simple, non-nested columns).
According to the spec (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps), map columns consist of a 3-level structure, as is also shown above. The second (middle) level, "named key_value, must be a repeated group".
So, to start with, it seems that Spark doesn't use the correct name, since above that level is named "map" and not "key_value".
But I assume the issue is that this middle level is annotated with the "UNKNOWN" logical type. In our code, we don't allow a group node to have such a logical type (to my understanding, this annotation also doesn't make sense here, as it should be used to indicate that all values in that column are null).
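To make the check described above concrete, here is a minimal sketch in plain Python. The `Node` class and `find_unknown_groups` helper are hypothetical names invented for illustration (they are not part of pyarrow, fastparquet, or parquet-mr); the schema is transcribed by hand from the fastparquet output above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a Parquet schema node."""
    name: str
    logical_type: Optional[str] = None    # e.g. "STRING", "MAP", "UNKNOWN"
    converted_type: Optional[str] = None  # e.g. "UTF8", "MAP", "MAP_KEY_VALUE"
    children: List["Node"] = field(default_factory=list)

    @property
    def is_group(self) -> bool:
        # In this simplified model, a node with children is a group node.
        return bool(self.children)


def find_unknown_groups(node: Node, path: str = "") -> List[str]:
    """Return dotted paths of group nodes annotated with logical type UNKNOWN."""
    here = f"{path}.{node.name}" if path else node.name
    hits = [here] if node.is_group and node.logical_type == "UNKNOWN" else []
    for child in node.children:
        hits.extend(find_unknown_groups(child, here))
    return hits


# Schema of the sample file, transcribed from the fastparquet output above.
schema = Node("spark_schema", children=[
    Node("id", "STRING", "UTF8"),
    Node("email_sha256", "STRING", "UTF8"),
    Node("params", "MAP", "MAP", children=[
        Node("map", "UNKNOWN", "MAP_KEY_VALUE", children=[
            Node("key", "STRING", "UTF8"),
            Node("value", "STRING", "UTF8"),
        ]),
    ]),
    Node("master_id", "STRING", "UTF8"),
])

print(find_unknown_groups(schema))
```

Only the middle level of the "params" map is flagged, which matches the suspicion that the other, non-nested columns are fine.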
Looking at the parquet-mr repo, it seems this was fixed in version 1.12 (https://github.com/apache/parquet-mr/pull/798 / https://issues.apache.org/jira/browse/PARQUET-1879). So if you update your Spark (and parquet-mr) version and write the file again, I assume it will be readable by Arrow.
Now, this still means that we cannot read files written by parquet-mr < 1.12. I assume it should also be possible on the Arrow side to add a workaround that ignores the UNKNOWN logical type of a group node if the converted type is MAP_KEY_VALUE.
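As a rough sketch of what such a workaround rule could look like, here it is expressed in Python for readability. This is not Arrow's actual code (the Parquet reader in Arrow is implemented in C++), and `effective_logical_type` is a hypothetical helper name used only to illustrate the rule.

```python
from typing import Optional


def effective_logical_type(
    logical_type: Optional[str],
    converted_type: Optional[str],
    is_group: bool,
) -> Optional[str]:
    """Return the logical type a lenient reader might use for a schema node.

    Assumed rule: an UNKNOWN annotation on a group node whose converted type
    is MAP_KEY_VALUE is treated as "no logical type". Every other annotation
    is passed through unchanged.
    """
    if is_group and logical_type == "UNKNOWN" and converted_type == "MAP_KEY_VALUE":
        return None
    return logical_type


# The middle level of the "params" map from the sample file: annotation dropped.
print(effective_logical_type("UNKNOWN", "MAP_KEY_VALUE", is_group=True))
# A primitive column annotated UNKNOWN (all values null): annotation kept.
print(effective_logical_type("UNKNOWN", None, is_group=False))
```

Keeping the rule this narrow means that UNKNOWN on any other group node would still be rejected, so only the known parquet-mr < 1.12 pattern is tolerated.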
Describe the bug, including details regarding any error messages, version, and platform.
I'm trying to read a Parquet file with pandas using the 'pyarrow' engine, and I'm getting an error while reading it. The stack trace error:
Here is the schema of the Parquet file that I'm trying to read:
Otherwise, when I read the same file with fastparquet, everything goes smoothly.
pandas version: 1.5.0
pyarrow version: 14.0.1
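For anyone hitting the same problem, a possible stopgap is to try the 'pyarrow' engine first and fall back to 'fastparquet'. This is only a sketch: `read_parquet_with_fallback` and its `reader` parameter are hypothetical names introduced here, and it assumes both engines are installed. `pandas.read_parquet` and its `engine` argument are the real pandas API.

```python
def read_parquet_with_fallback(path, engines=("pyarrow", "fastparquet"), reader=None):
    """Try each engine in order and return the first successful result.

    `reader` defaults to pandas.read_parquet; it is a parameter only so the
    helper can be exercised without real files.
    """
    if reader is None:
        import pandas as pd  # imported lazily, only needed for the default reader
        reader = pd.read_parquet

    errors = {}
    for engine in engines:
        try:
            return reader(path, engine=engine)
        except Exception as exc:  # broad on purpose: engines raise different error types
            errors[engine] = exc
    raise RuntimeError(f"could not read {path!r} with any engine: {errors}")
```

Usage would be something like `df = read_parquet_with_fallback("sample.parquet")`, where the file name is a placeholder. Note that the two engines can return slightly different dtypes for nested columns such as maps, so the results are not guaranteed to be identical.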
Component(s)
Python