apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Improve schema merging #4223

Open andygrove opened 1 year ago

andygrove commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I am trying to work with the NYC taxi Parquet data set, which has one file per month. Over time, some of the column types have changed. For example, passenger_count started out as Int64 and was later changed to Float64.

arrow-rs cannot merge these schemas because the field types differ.

Other solutions (such as DuckDB) will merge these schemas and pick the least restrictive type (Float64).
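
To make the requested behaviour concrete, here is a minimal, self-contained sketch of "pick the least restrictive type" merging. It does not use the arrow-rs or DataFusion APIs: `ColType`, `widen`, and `merge_schemas` are hypothetical names invented for illustration, and the only widening rule assumed is Int64 to Float64 (which can lose precision for very large integers).

```rust
use std::collections::BTreeMap;

/// Simplified, hypothetical stand-in for Arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColType {
    Int64,
    Float64,
    Utf8,
}

/// Returns the least restrictive type that can hold values of both inputs,
/// or `None` when this sketch defines no common type.
/// Assumption: Int64 widens to Float64 (lossy above 2^53).
pub fn widen(a: ColType, b: ColType) -> Option<ColType> {
    use ColType::*;
    match (a, b) {
        (x, y) if x == y => Some(x),
        (Int64, Float64) | (Float64, Int64) => Some(Float64),
        _ => None,
    }
}

/// Merges per-file schemas (column name -> type) into a single schema.
/// A column that appears in only some files keeps the type it has there.
pub fn merge_schemas(
    schemas: &[BTreeMap<String, ColType>],
) -> Result<BTreeMap<String, ColType>, String> {
    let mut merged: BTreeMap<String, ColType> = BTreeMap::new();
    for schema in schemas {
        for (name, &ty) in schema {
            let new_ty = match merged.get(name) {
                // Column seen before: widen, or report the conflict.
                Some(&existing) => widen(existing, ty).ok_or_else(|| {
                    format!("cannot merge column {name}: {existing:?} vs {ty:?}")
                })?,
                // First time this column is seen.
                None => ty,
            };
            merged.insert(name.clone(), new_ty);
        }
    }
    Ok(merged)
}
```

With one schema declaring passenger_count as Int64 and another declaring it as Float64, `merge_schemas` yields Float64 for that column, while an Int64/Utf8 conflict is still reported as an error.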

Describe the solution you'd like

Describe alternatives you've considered

Additional context

tustvold commented 1 year ago

I would have thought this logic would exist within the query engine, i.e. DataFusion, not the compute engine? In particular, I would have thought it would be a TableProvider detail that generates plans with the relevant schema coercion logic?
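
As a rough illustration of what coercion at scan time could look like, the sketch below casts a column read from an older file to the type chosen for the table. It is self-contained and does not use the real TableProvider trait or Arrow arrays: `Column`, `TargetType`, and `coerce` are hypothetical names, and only the Int64 to Float64 cast is assumed to be allowed.

```rust
/// Simplified, hypothetical stand-in for an Arrow array: one column from one file.
/// `None` represents a null value.
#[derive(Debug, Clone, PartialEq)]
pub enum Column {
    Int64(Vec<Option<i64>>),
    Float64(Vec<Option<f64>>),
}

/// The column type chosen for the table as a whole.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TargetType {
    Int64,
    Float64,
}

/// Casts a file's column to the table-level type.
/// Assumption: only widening (Int64 -> Float64) is permitted; narrowing is an error.
pub fn coerce(col: Column, target: TargetType) -> Result<Column, String> {
    match (col, target) {
        // Types already match: pass the column through unchanged.
        (c @ Column::Int64(_), TargetType::Int64) => Ok(c),
        (c @ Column::Float64(_), TargetType::Float64) => Ok(c),
        // Widen integers to floats, preserving nulls.
        (Column::Int64(v), TargetType::Float64) => Ok(Column::Float64(
            v.into_iter().map(|x| x.map(|i| i as f64)).collect(),
        )),
        // Narrowing could drop fractional parts, so this sketch rejects it.
        (Column::Float64(_), TargetType::Int64) => {
            Err("narrowing Float64 to Int64 is not supported in this sketch".to_string())
        }
    }
}
```

In this model, a file written while passenger_count was Int64 would be cast on read, so every file presents the same Float64 column to the rest of the plan.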

andygrove commented 1 year ago

I'm fine with implementing this in DataFusion. It currently delegates to Schema::try_merge in this repo, though, so it would likely mean duplicating some of this code in DF. I'll transfer this issue.