ListingTable cannot handle partition evolution

adriangb commented 2 weeks ago

Describe the bug

With CSV:

printf 'a,b\n1,2\n' > data1.csv
mkdir a=2
printf 'b\n3\n' > a=2/data2.csv
datafusion-cli
> SELECT * FROM '**/*.csv';
Arrow error: Csv error: incorrect number of fields for line 1, expected 2 got 1

With Parquet:

import os
import polars as pl

pl.DataFrame({'a': [1], 'b': [2]}).write_parquet('data1.parquet')
os.mkdir('a=2')
pl.DataFrame({'b': [3]}).write_parquet('a=2/data2.parquet')

datafusion-cli
> SELECT * FROM '**/*.parquet';
+---+---+
| b | a |
+---+---+
| 2 | 1 |
| 3 |   |
+---+---+
2 row(s) fetched.
Elapsed 0.055 seconds.

To Reproduce

No response

Expected behavior

Partition evolution is handled, and both the CSV and Parquet cases return:

+---+---+
| b | a |
+---+---+
| 2 | 1 |
| 3 | 2 |
+---+---+

Additional context

Having played around quite a bit with ParquetExec and the SchemaAdapter machinery, I think what should happen is:

Partition values are tracked on a per-file basis, i.e. on each PartitionedFile rather than on the FileScanConfig.
Partition values are passed into the SchemaAdapter machinery, which decides for each file whether it needs to add a column generated from partition values or can read that column from the file itself.
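
To make the two points above concrete, here is a minimal pure-Python sketch of the per-file decision. It is not DataFusion code: `partition_values` and `adapt_rows` are hypothetical helpers invented for illustration, and the rule that a column physically present in the file takes precedence over a partition value is an assumption. Partition values are left as strings; casting them to the table's column type is omitted.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict[str, str]:
    """Extract Hive-style key=value segments from the directory part of a path."""
    values = {}
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def adapt_rows(path, file_columns, rows, table_columns):
    """Map one file's rows onto the table schema (hypothetical helper).

    For each table column, in order of preference (an assumption):
    1. read it from the file if the file contains it,
    2. otherwise use this file's partition value for it,
    3. otherwise fill it with None.
    """
    parts = partition_values(path)
    index = {name: i for i, name in enumerate(file_columns)}
    adapted = []
    for row in rows:
        out = []
        for col in table_columns:
            if col in index:
                out.append(row[index[col]])
            elif col in parts:
                out.append(parts[col])  # still a string; no type casting here
            else:
                out.append(None)
        adapted.append(tuple(out))
    return adapted


# The two files from the reproduction above, with their own column lists.
table_columns = ["b", "a"]
files = [
    ("data1.parquet", ["a", "b"], [(1, 2)]),
    ("a=2/data2.parquet", ["b"], [(3,)]),
]

result = []
for path, cols, rows in files:
    result.extend(adapt_rows(path, cols, rows, table_columns))
print(result)
```

Because the partition values are derived from each file's own path, the unpartitioned `data1.parquet` reads `a` from its data while `a=2/data2.parquet` gets `a` from its directory name, which is the behavior the expected output asks for.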

adriangb commented 2 weeks ago

cc @alamb I had promised you this a long time ago but only got around to it now

alamb commented 2 weeks ago

Thanks @adriangb

apache / datafusion

ListingTable cannot handle partition evolution #13270
