apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] Cannot read Parquet file with map column generated by pyspark / parquet-mr < 1.12 #39540

Open bama-chi opened 5 months ago

bama-chi commented 5 months ago

Describe the bug, including details regarding any error messages, version, and platform.

I'm trying to read a Parquet file with pandas using the 'pyarrow' engine, and I'm having a problem while reading it. The stack trace:

  File "<stdin>", line 1, in <module>
  File "/home/bama/.pyenv/versions/3.10.4/lib/python3.10/site-packages/pandas/io/parquet.py", line 501, in read_parquet
    return impl.read(
  File "/home/bama/.pyenv/versions/3.10.4/lib/python3.10/site-packages/pandas/io/parquet.py", line 249, in read
    result = self.api.parquet.read_table(
  File "/home/bama/.pyenv/versions/3.10.4/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 2956, in read_table
    dataset = _ParquetDatasetV2(
  File "/home/bama/.pyenv/versions/3.10.4/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 2496, in __init__
    [fragment], schema=schema or fragment.physical_schema,
  File "pyarrow/_dataset.pyx", line 1358, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: Could not open Parquet input source '<Buffer>': Logical type Null can not be applied to group node

Here is the footer metadata of the Parquet file that I'm trying to read (extracted from the raw file):

org.apache.spark.version: 2.4.7
org.apache.spark.sql.parquet.row.metadata: {"type":"struct","fields":[{"name":"id","type":"string","nullable":true,"metadata":{}},{"name":"uid","type":"string","nullable":true,"metadata":{}},{"name":"params","type":{"type":"map","keyType":"string","valueType":{"type":"array","elementType":"string","containsNull":true},"valueContainsNull":true},"nullable":true,"metadata":{}},{"name":"utc_date","type":"timestamp","nullable":true,"metadata":{}},{"name":"host","type":"string","nullable":true,"metadata":{}},{"name":"customer_id","type":"string","nullable":true,"metadata":{}}]}
created_by: parquet-mr version 1.10.99.7.1.7.0-550 (build 27a2f693f9b09573ead42e85bee2a649ac904119)

However, when I read the same file with fastparquet, everything works fine.

pandas version: 1.5.0
pyarrow version: 14.0.1

Component(s)

Python

jorisvandenbossche commented 5 months ago

@bama-chi would you be able to provide a file to test this with?

bama-chi commented 5 months ago

Yes @jorisvandenbossche, sure. Here is a sample of a file that I couldn't read: sample.zip

bama-chi commented 5 months ago

Did you manage to test it, @jorisvandenbossche?

jorisvandenbossche commented 5 months ago

Yes, I can confirm the error, also with the latest development version of Arrow.

Opening the file with fastparquet and printing its schema:

In [63]: import fastparquet

In [64]: f = fastparquet.ParquetFile("../Downloads/sample/part-00000-e69412f4-236c-436a-a4cd-89318d2aaa3d-c000.snappy.parquet")

In [65]: print(f.schema.text)
- spark_schema: 
| - id: BYTE_ARRAY, STRING, UTF8, OPTIONAL
| - email_sha256: BYTE_ARRAY, STRING, UTF8, OPTIONAL
| - params: MAP, MAP, OPTIONAL
|   - map: UNKNOWN, MAP_KEY_VALUE, REPEATED
|   | - key: BYTE_ARRAY, STRING, UTF8, REQUIRED
|     - value: BYTE_ARRAY, STRING, UTF8, OPTIONAL
  - master_id: BYTE_ARRAY, STRING, UTF8, OPTIONAL

I suppose the error is coming from the MAP type column "params" (since the other columns are simple, non-nested columns).

According to the spec (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps), map columns consist of a 3-level structure, as is also shown above. The second (middle) level must be a repeated group named "key_value". So as a start, it seems that Spark doesn't use the correct name, since above that level is named "map" and not "key_value".

But I assume the actual issue is that this middle level is annotated with the "UNKNOWN" logical type, and our code doesn't allow a group node to carry that logical type. (To my understanding, the annotation also doesn't make sense there, as UNKNOWN should be used to indicate that all values in a column are null.)

Looking at the parquet-mr repo, it seems this was fixed in version 1.12 (https://github.com/apache/parquet-mr/pull/798 / https://issues.apache.org/jira/browse/PARQUET-1879). So if you update your Spark (and thus parquet-mr) version and write the file again, I assume it will be readable by Arrow.

Now, this still means that we cannot read files written by parquet-mr < 1.12. I assume it should also be possible to add a workaround on the Arrow side: ignore the UNKNOWN logical type on a group node when its converted type is MAP_KEY_VALUE.