I am running an ETL benchmark that reads CSV files with a variable number of columns, applies some transformations, and writes the result back as Delta. I tested it with 7 Python engines; unfortunately, DataFusion only supports CSV files with a fixed schema.
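To make the problem concrete, here is a minimal sketch of the behaviour being asked for, assuming "variable number of columns" means that rows can carry fewer or more fields than the expected schema. It uses only the Python standard library, not DataFusion; the function name `read_ragged_csv` and the column names are hypothetical and chosen only for illustration.

```python
import csv
import io


def read_ragged_csv(text, expected_columns):
    """Parse CSV text whose rows may have fewer or more fields than expected.

    Hypothetical helper for illustration only: short rows are padded with
    None and extra trailing fields are dropped.
    """
    width = len(expected_columns)
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        # Truncate long rows, then pad short rows up to the expected width.
        padded = fields[:width] + [None] * (width - len(fields))
        rows.append(dict(zip(expected_columns, padded)))
    return rows


# Three rows with 2, 4 and 3 fields respectively (no header line).
sample = "1,2\n3,4,5,6\n7,8,9\n"
rows = read_ragged_csv(sample, ["col1", "col2", "col3"])
```

Padding and truncating is only one possible policy; an engine could equally raise an error on the mismatch or expose the choice as a reader option.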
This would be a very good addition to work on, but I suspect we will need to implement it upstream in the DataFusion core repo first and then expose the options in this repo.
FWIW, the notebook is here, with a reproducible data source: https://github.com/djouallah/Fabric_Notebooks_Demo/blob/main/ETL/Light_ETL_Python_Notebook.ipynb