Describe the bug, including details regarding any error messages, version, and platform.
There appears to be some ambiguity between the names attribute on pyarrow.parquet.ParquetSchema and the one on pyarrow.Schema when it comes to columns with nested logical types.
I generated a parquet file with three columns, one containing list objects and the other two containing strings. When loading the two available schemas from pyarrow.parquet.ParquetFile (schema and schema_arrow), the corresponding <schema>.names differ for the List type column. schema_arrow appears to correctly list the column name, whereas schema lists the name of the lowest-level field in the structure (which, going by the Logical Types documentation, would usually be element).
I've included a simple example below, tested using pyarrow 17.0.0:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create parquet file
filepath = "new_parquet_file.parquet"
df = pd.DataFrame({'random_strings1': ["string1", "string2", "string3"],
                   'my_list': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   'random_strings2': ["string4", "string5", "string6"]})
table = pa.Table.from_pandas(df)
pq.write_table(table, filepath)

# Examine parquet file schema
with pq.ParquetFile(filepath) as file:
    print(file.schema.names)        # names from the Parquet schema
    print(file.schema_arrow.names)  # names from the Arrow schema
Is this intended behavior from pyarrow.parquet.ParquetSchema? The inclusion of schema_arrow means it's still trivial to get the names of the columns, but it still poses a problem for those who aren't aware of the logic.
Component(s)
Documentation, Python