Closed tustvold closed 7 months ago
I'm not entirely sure of the history of this split, but to the uninitiated the split is confusing and frustrating. It also results in a non-trivial amount of schema munging logic to convert to/from the relevant representations
I wrote the original DFSchema
wrapping logic back in the day. If it can be improved that would be great.
The key design goal is that DataFusion should always use a DFSchema
(aka a qualified name) so there should not be an implicit conversion from SchemaRef
--> (implicitly unqualified) DFSchema
to avoid subtle bugs where the qualifiers are stripped off.
Going the other way, DFSchema
--> Schema
is 👍
cc @Dandandan @jackwener @mingmwang @andygrove in case you have additional thoughts
This mostly works but runs into issues with the handling of schemas in datafusion-expr. In particular they have a fairly hard-coded assumption that they can pass around DFField, concatenate them and munge them back into a DFSchema. It would be possible to workaround this, but it would be a fairly non-trivial amount of code-churn.
Attaching link to DFSchema struct for convenience https://github.com/apache/arrow-datafusion/blob/main/datafusion/common/src/dfschema.rs#L108
Ive been doing some work on planning performance and in order to lay a good foundation for the next steps I think this is necessary. So my plan is to start working on this shortly.
CC @alamb @tustvold @comphead @avantgardnerio
Ive been doing some work on planning performance and in order to lay a good foundation for the next steps I think this is necessary. So my plan is to start working on this shortly.
CC @alamb @tustvold @comphead @avantgardnerio
Thank you @matthewmturner 🙏
BTW I started hacking on some version of what this might look like in https://github.com/apache/arrow-datafusion/pull/7944 but I haven't had a chance to finish it
@alamb yes i saw, thanks. havent decided yet whether i will pick up on that PR or create a new one.
Feel free to do anyting you want with the code / PR
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently DataFusion has two ways to represent a schema,
Schema
andDFSchema
. The former is the representation used by arrow, and in most components.DFSchema
appears to "enhance" an arrow schema with the notion of a qualifier.I'm not entirely sure of the history of this split, but to the uninitiated the split is confusing and frustrating. It also results in a non-trivial amount of schema munging logic to convert to/from the relevant representations
Describe the solution you'd like
I would like to change
DfSchema
to beWe could then make
DfSchema
automatically deref toSchemaRef
, or at the very least implementAsRef<SchemaRef>
, avoiding a lot of code that ends up looking likeComponents wishing to combine the information can easily zip the two together, we could even assist this by adding
Describe alternatives you've considered
We could not do this
Additional context Add any other context or screenshots about the feature request here.