apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

Different behavior in datafusion 35.0.0 in reading hive-partitioned parquet data #579

Open jwimberl opened 5 months ago

jwimberl commented 5 months ago

Describe the bug

pip recently started installing datafusion with version string '35.0.0'. Compared to a previous installation of version '34.0.0', creating an external table from hive-partitioned parquet data following the [documented instructions](https://arrow.apache.org/datafusion/user-guide/sql/ddl.html) no longer works. All the partition columns show up as columns of the table, but the columns from the parquet data themselves do not appear.
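
For context, hive partitioning stores column values in directory names of the form `<column>=<value>`, so a path like `./fake=0/data.parquet` carries `fake = 0` in the path rather than in the file. A minimal sketch of that convention follows; `hive_partition_values` is a hypothetical helper written for illustration, not DataFusion's actual implementation.

```python
from pathlib import PurePosixPath


def hive_partition_values(path):
    """Collect key=value directory segments from a file path (illustrative only)."""
    values = {}
    # Only directory components can be partition segments, so skip the file name.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


print(hive_partition_values("./fake=0/data.parquet"))  # {'fake': '0'}
```

A reader that handles this layout is expected to expose both the partition values from the path and the columns stored inside the parquet file.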

To Reproduce

# prepare fake data
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

data = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
table = pa.Table.from_pandas(data)
os.mkdir("fake=0")  # hive-style partition directory: <column>=<value>
pq.write_table(table, "./fake=0/data.parquet")

# load into datafusion
import datafusion as df
ctx = df.SessionContext()
ctx.sql("""
CREATE EXTERNAL TABLE data
STORED AS PARQUET
PARTITIONED BY (fake)
LOCATION './*/data.parquet'
""")

The loaded data is missing col1 and col2:

>>> ctx.sql("SELECT * FROM data")
DataFrame()
+------+
| fake |
+------+
| 0    |
| 0    |
+------+
>>> ctx.sql("SELECT table_name, column_name FROM information_schema.columns")
DataFrame()
+------------+-------------+
| table_name | column_name |
+------------+-------------+
| data       | fake        |
+------------+-------------+

Expected behavior

The same steps with DataFusion 34.0.0 produce the following output:

>>> ctx.sql("SELECT * FROM data");
DataFrame()
+------+------+------+
| col1 | col2 | fake |
+------+------+------+
| 1    | 3    | 0    |
| 2    | 4    | 0    |
+------+------+------+
>>> ctx.sql("SELECT table_name, column_name FROM information_schema.columns")
DataFrame()
+------------+-------------+
| table_name | column_name |
+------------+-------------+
| data       | col1        |
| data       | col2        |
| data       | fake        |
+------------+-------------+
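
The difference between the two `information_schema.columns` listings can also be expressed as a small check, which may be handy when testing other versions. This is only a sketch: `missing_columns` is a hypothetical helper, and the column lists are copied by hand from the outputs above rather than queried from DataFusion.

```python
def missing_columns(expected, actual):
    """Return the expected column names that are absent from actual, in order."""
    present = set(actual)
    return [name for name in expected if name not in present]


expected = ["col1", "col2", "fake"]  # columns listed under 34.0.0
observed = ["fake"]                  # columns listed under 35.0.0

print(missing_columns(expected, observed))  # ['col1', 'col2']
```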

Additional context

Operating system: Rocky 8
Python version: 3.10.11
DataFusion version: 35.0.0, recently installed via pip
pyarrow version: 15.0.0
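
Since the behavior depends on which release pip resolved, it may help to confirm the installed versions when reproducing. The sketch below uses only the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper name, and it returns `None` when a package is not installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed distribution version, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("datafusion", "pyarrow"):
    print(name, installed_version(name))
```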