Describe the bug
With CSV:
printf 'a,b\n1,2\n' > data1.csv
mkdir a=2
printf 'b\n3\n' > a=2/data2.csv
datafusion-cli
> SELECT * FROM '**/*.csv';
Arrow error: Csv error: incorrect number of fields for line 1, expected 2 got 1
With Parquet:
import os
import polars as pl
pl.DataFrame({'a': [1], 'b': [2]}).write_parquet('data1.parquet')
os.mkdir('a=2')
pl.DataFrame({'b': [3]}).write_parquet('a=2/data2.parquet')
datafusion-cli
> SELECT * FROM '**/*.parquet';
+---+---+
| b | a |
+---+---+
| 2 | 1 |
| 3 |   |
+---+---+
2 row(s) fetched.
Elapsed 0.055 seconds.
To Reproduce
No response
Expected behavior
Partition evolution is handled, and both cases return:
+---+---+
| b | a |
+---+---+
| 2 | 1 |
| 3 | 2 |
+---+---+
Additional context
Having played around quite a bit with ParquetExec and the SchemaAdapter machinery, I think what should happen is:
- Partition values are tracked per file, on each PartitionedFile rather than on the FileScanConfig.
- Partition values are passed into the SchemaAdapter machinery, which decides for each file whether it needs to add a column generated from partition values.
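To make the proposal concrete, here is a minimal Python sketch of the per-file decision. It is not DataFusion code: `partition_values`, `adapt_rows`, and `partition_types` are hypothetical names invented for illustration, and rows are plain tuples rather than Arrow batches. It only shows the idea that each file carries its own partition values and a partition column is generated only when the file itself lacks that column.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict[str, str]:
    """Parse Hive-style `key=value` directory segments from a file path."""
    values = {}
    for segment in PurePosixPath(path).parts[:-1]:  # skip the file name
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def adapt_rows(path, file_columns, rows, table_columns, partition_types):
    """Map one file's rows onto the table schema (hypothetical adapter).

    A column stored in the file is read from the file. A column missing from
    the file is filled from that file's partition values, or None if the path
    carries no value for it.
    """
    # Path segments are strings, so cast them to the declared column type.
    parts = {
        key: partition_types.get(key, str)(value)
        for key, value in partition_values(path).items()
    }
    adapted = []
    for row in rows:
        record = dict(zip(file_columns, row))
        adapted.append(
            tuple(record[c] if c in record else parts.get(c) for c in table_columns)
        )
    return adapted


# The two files from the Parquet example in this report.
table_columns = ["b", "a"]
partition_types = {"a": int}
result = (
    adapt_rows("data1.parquet", ["a", "b"], [(1, 2)], table_columns, partition_types)
    + adapt_rows("a=2/data2.parquet", ["b"], [(3,)], table_columns, partition_types)
)
print(result)
```

Under these assumptions, the row from data1.parquet keeps the `a` value stored in the file, while the row from a=2/data2.parquet takes `a` from its path, which matches the expected output above.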