apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.27k stars 1.19k forks source link

Make DfSchema wrap SchemaRef #4680

Closed tustvold closed 7 months ago

tustvold commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently DataFusion has two ways to represent a schema, Schema and DFSchema. The former is the representation used by arrow, and in most components. DFSchema appears to "enhance" an arrow schema with the notion of a qualifier.

I'm not entirely sure of the history of this split, but to the uninitiated the split is confusing and frustrating. It also results in a non-trivial amount of schema munging logic to convert to/from the relevant representations

Describe the solution you'd like

I would like to change DfSchema to be

struct DfSchema {
    schema: SchemaRef,
    fields: Vec<DfFieldMetadata>
}

struct DfFieldMetadata {
    qualifier: Option<String>,
}

We could then make DfSchema automatically deref to SchemaRef, or at the very least implement AsRef<SchemaRef>, avoiding a lot of code that ends up looking like

let schema: Schema = self.plan.schema().as_ref().into();
Arc::new(schema)

Components wishing to combine the information can easily zip the two together, we could even assist this by adding

struct DfField<'a> {
    field: &'a Field,
    metadata: &'a DfFieldMetadata
}

impl DfSchema {

    fn df_fields() -> impl Iterator<Item=DfField<'_>> + '_ {
        self.arrow_schema.fields().iter().zip(&self.fields).map(|(field, metadata)| DfField { field, metadata })
    }
}

Describe alternatives you've considered

We could not do this

Additional context Add any other context or screenshots about the feature request here.

alamb commented 1 year ago

I'm not entirely sure of the history of this split, but to the uninitiated the split is confusing and frustrating. It also results in a non-trivial amount of schema munging logic to convert to/from the relevant representations

I wrote the original DFSchema wrapping logic back in the day. If it can be improved that would be great.

The key design goal is that DataFusion should always use a DFSchema (aka a qualified name) so there should not be an implicit conversion from SchemaRef --> (implicitly unqualified) DFSchema to avoid subtle bugs where the qualifiers are stripped off.

Going the other way, DFSchema --> Schema is 👍

cc @Dandandan @jackwener @mingmwang @andygrove in case you have additional thoughts

tustvold commented 1 year ago

This mostly works but runs into issues with the handling of schemas in datafusion-expr. In particular they have a fairly hard-coded assumption that they can pass around DFField, concatenate them and munge them back into a DFSchema. It would be possible to workaround this, but it would be a fairly non-trivial amount of code-churn.

comphead commented 10 months ago

Attaching link to DFSchema struct for convenience https://github.com/apache/arrow-datafusion/blob/main/datafusion/common/src/dfschema.rs#L108

matthewmturner commented 10 months ago

Ive been doing some work on planning performance and in order to lay a good foundation for the next steps I think this is necessary. So my plan is to start working on this shortly.

CC @alamb @tustvold @comphead @avantgardnerio

alamb commented 10 months ago

Ive been doing some work on planning performance and in order to lay a good foundation for the next steps I think this is necessary. So my plan is to start working on this shortly.

CC @alamb @tustvold @comphead @avantgardnerio

Thank you @matthewmturner 🙏

BTW I started hacking on some version of what this might look like in https://github.com/apache/arrow-datafusion/pull/7944 but I haven't had a chance to finish it

matthewmturner commented 10 months ago

@alamb yes i saw, thanks. havent decided yet whether i will pick up on that PR or create a new one.

alamb commented 10 months ago

Feel free to do anyting you want with the code / PR