apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Docs][Python] Uncertainty with ParquetFile.schema vs ParquetFile.schema_arrow #44741

Open JacobWBradford opened 6 days ago

JacobWBradford commented 6 days ago

Describe the bug, including details regarding any error messages, version, and platform.

There appears to be some ambiguity in the names attribute on pyarrow.parquet.ParquetSchema vs pyarrow.Schema when it comes to columns with nested logical types.

I generated a parquet file with three columns, one containing list objects and the other two containing strings. When loading the two available schemas from pyarrow.parquet.ParquetFile (schema and schema_arrow), the corresponding <schema>.names differ for the List type column. schema_arrow appears to correctly list the column name, whereas schema lists the name of the lowest-level field in the structure (which, going off of the Logical Types documentation, would usually be element).

I've included a simple example below, tested using pyarrow 17.0.0:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create parquet file
filepath = 'new_parquet_file.parquet'
df = pd.DataFrame({'random_strings1': ["string1", "string2", "string3"],
           'my_list': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
           'random_strings2': ["string4", "string5", "string6"]})
table = pa.Table.from_pandas(df)
pq.write_table(table, filepath)

# Examine parquet file schema
with pq.ParquetFile(filepath) as file:
    print(file.schema.names)
    print(file.schema_arrow.names)
['random_strings1', 'element', 'random_strings2']
['random_strings1', 'my_list', 'random_strings2']

Is this intended behavior from pyarrow.parquet.ParquetSchema? The inclusion of schema_arrow means it's still trivial to get the names of the columns, but it still poses a problem for those who aren't aware of the logic.

Component(s)

Documentation, Python

mapleFU commented 4 days ago

So the problem is 'element' vs 'my_list'?

JacobWBradford commented 3 days ago

Yes, the actual file has that column name as my_list, but schema.names shows the column name as element.
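
For readers who only have the ParquetSchema at hand, here is a minimal sketch of one way to get from leaf columns back to top-level column names. It assumes that each leaf's ColumnSchema.path is a dot-separated string whose first segment is the top-level field, which means it would misbehave for column names that themselves contain a dot. The helpers leaf_paths, top_level_names, and demo are hypothetical names made up for this illustration and are not part of pyarrow.

```python
import os
import tempfile


def leaf_paths(parquet_schema):
    """Collect the dotted path of every leaf column in a ParquetSchema."""
    return [parquet_schema.column(i).path for i in range(len(parquet_schema))]


def top_level_names(paths):
    """Return the distinct first path segments, in first-seen order.

    Assumption: top-level field names do not contain a '.' themselves.
    """
    seen = []
    for path in paths:
        root = path.split(".", 1)[0]
        if root not in seen:
            seen.append(root)
    return seen


def demo():
    """Write a small file shaped like the one in the report and inspect it."""
    # Imported here so the helpers above stay usable without pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "random_strings1": ["string1", "string2", "string3"],
        "my_list": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
        "random_strings2": ["string4", "string5", "string6"],
    })
    with tempfile.TemporaryDirectory() as tmp:
        filepath = os.path.join(tmp, "new_parquet_file.parquet")
        pq.write_table(table, filepath)
        with pq.ParquetFile(filepath) as file:
            paths = leaf_paths(file.schema)
            arrow_names = file.schema_arrow.names
    print(paths)                   # full path of each leaf column
    print(top_level_names(paths))  # top-level names derived from the paths
    print(arrow_names)             # names reported by the Arrow schema
```

Calling demo() prints the leaf paths next to the names derived from them. When the Arrow schema is available, file.schema_arrow.names remains the simpler route, and the path-based approach is only a fallback.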