apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

DataFusion does not validate that Substrait NamedScan schemas match registered tables #12223

Closed vbarua closed 5 days ago

vbarua commented 2 weeks ago

Describe the bug

As written, the test assertion in https://github.com/apache/datafusion/blob/1fce2a98ef9c7f8dbd7f3dedcaf4aa069ab92154/datafusion/substrait/tests/cases/logical_plans.rs#L46-L50 should fail because DataFusion registers the data table with 5 fields [a, b, c, d, e] but the schema for the table in the Substrait plan only has a single field [D].

To Reproduce

No response

Expected behavior

DataFusion should reject Substrait plans in which a NamedScan schema does not match the schema of the corresponding registered table.

Additional context

Generally speaking, if the plan consumer (DataFusion) and the producer do not agree on column names and types, it is unlikely that execution will be meaningful.
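To make the expected check concrete, here is a minimal sketch of a strict schema comparison in plain Rust. It is not DataFusion's API: `Field`, `check_exact_match`, the string-typed `data_type`, and the case-sensitive name comparison are all assumptions made for illustration.

```rust
/// Hypothetical, simplified stand-in for a schema field (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: String,
}

impl Field {
    pub fn new(name: &str, data_type: &str) -> Self {
        Field {
            name: name.to_string(),
            data_type: data_type.to_string(),
        }
    }
}

/// Returns Ok(()) only if both schemas list the same fields, in the same
/// order, with identical names and types. Names are compared
/// case-sensitively, which is an assumption of this sketch.
pub fn check_exact_match(registered: &[Field], plan: &[Field]) -> Result<(), String> {
    if registered.len() != plan.len() {
        return Err(format!(
            "field count mismatch: table has {}, plan has {}",
            registered.len(),
            plan.len()
        ));
    }
    for (i, (r, p)) in registered.iter().zip(plan.iter()).enumerate() {
        if r != p {
            return Err(format!(
                "field {} mismatch: table has {:?}, plan has {:?}",
                i, r, p
            ));
        }
    }
    Ok(())
}
```

With a registered table of five fields [a, b, c, d, e] and a plan schema of [D], this check returns an error on the field count alone.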

vbarua commented 2 weeks ago

I'm in the process of preparing a PR for this issue.

vbarua commented 1 week ago

From conversations with @Blizzara, the requirement that the DataFusion and Substrait schemas match exactly is stricter than it needs to be. In practice, if the Substrait schema is a subset of the DataFusion schema, the consumer can adapt the plan as it consumes it to make it match the shape expected by Substrait.

For example, if DataFusion has a schema [a, b, c] for table t and Substrait has a schema [b, c] for table t, then as DataFusion consumes the plan it may add a projection of fields [b, c] immediately after the read from table t to bring the output in line with what the Substrait plan expects.
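A rough sketch of the subset idea, again in plain Rust rather than DataFusion's actual consumer code: the hypothetical helper `projection_indices` maps each field name in the Substrait schema to its position in the registered table, and fails if a name is missing. It assumes case-sensitive name matching and ignores types, which a real check would also need to compare.

```rust
/// For each field name in the plan schema, find its index in the registered
/// table schema. The resulting indices describe the projection to apply
/// right after the table read. Returns an error if the plan references a
/// field that the table does not have.
pub fn projection_indices(table: &[&str], plan: &[&str]) -> Result<Vec<usize>, String> {
    plan.iter()
        .map(|name| {
            table
                .iter()
                .position(|t| t == name)
                .ok_or_else(|| format!("plan field '{}' not found in registered table", name))
        })
        .collect()
}
```

For table [a, b, c] and plan [b, c], the helper yields the indices 1 and 2, while a plan field that is absent from the table produces an error instead of a silently mismatched scan.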