apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

add support for reading csv with variable number of columns #891

Open djouallah opened 1 month ago

djouallah commented 1 month ago

I am running an ETL benchmark that reads CSV files with a variable number of columns, applies some transformations, and writes the result back as Delta. I tested it with 7 Python engines; unfortunately, DataFusion only supports CSV files with a fixed schema.

FWIW, the notebook with a reproducible data source is here: https://github.com/djouallah/Fabric_Notebooks_Demo/blob/main/ETL/Light_ETL_Python_Notebook.ipynb

timsaucer commented 1 month ago

This is a very good addition to work on, but I suspect we will need to do it upstream in the datafusion core repo and then expose the options in this repo.
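
Until such an option exists upstream, one possible interim workaround is to pad the ragged rows to a common width before handing the file to a fixed-schema reader. The sketch below uses only the Python standard library; the helper name `pad_ragged_csv` and the sample data are hypothetical and not part of DataFusion. It assumes the file fits in memory and that empty strings are an acceptable fill value.

```python
import csv
import io


def pad_ragged_csv(src, dst, fill=""):
    """Copy CSV rows from src to dst, padding short rows to the widest row.

    Hypothetical helper for illustration; not a DataFusion API.
    Returns the number of columns in the normalized output.
    """
    # Read everything first so the maximum row width is known up front.
    rows = list(csv.reader(src))
    width = max((len(row) for row in rows), default=0)

    writer = csv.writer(dst, lineterminator="\n")
    for row in rows:
        # Append fill values so every row ends up with `width` fields.
        writer.writerow(row + [fill] * (width - len(row)))
    return width


# Sample input: rows with 3, 2, and 4 fields.
raw = "a,b,c\n1,2\n3,4,5,6\n"
out = io.StringIO()
width = pad_ragged_csv(io.StringIO(raw), out)
normalized = out.getvalue()
print(width)
print(normalized)
```

Note that the header row is padded as well, so any extra columns end up with empty names; a real pipeline would probably want to assign names to them. The normalized output could then be written to disk and read with a fixed-schema CSV reader such as `SessionContext.read_csv`.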