Closed vbarua closed 5 days ago
I'm in the process of preparing a PR for this issue.
From conversations with @Blizzara, the requirement that the DataFusion and Substrait schemas match exactly is stricter than it needs to be. In practice, if the Substrait schema is a subset of the DataFusion schema, the consumer can adapt the plan as it consumes it to make it match the shape expected by Substrait.
For example, if DataFusion has a schema [a, b, c] for table t, and Substrait has a schema [b, c] for table t, as DataFusion consumes the plan it may add a projection for fields [b, c] immediately after the read from table t to bring it in line with what the Substrait plan expects.
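To make the idea concrete, here is a rough sketch of how the projection could be computed. It only uses plain field names and the standard library; `projection_for` is a hypothetical helper, not an existing DataFusion or Substrait function, and it assumes exact, case-sensitive name matching.

```rust
/// Hypothetical helper (not part of DataFusion's API).
/// Returns the indices into `table_fields` that the consumer would project,
/// in the order the Substrait plan expects them.
/// Assumption: names are compared exactly, including case.
fn projection_for(table_fields: &[&str], plan_fields: &[&str]) -> Result<Vec<usize>, String> {
    plan_fields
        .iter()
        .map(|name| {
            table_fields
                .iter()
                .position(|f| f == name)
                // A plan field missing from the table means the plan schema
                // is not a subset, so the plan cannot be adapted.
                .ok_or_else(|| format!("field '{}' not found in registered table", name))
        })
        .collect()
}
```

With a table schema of [a, b, c] and a plan schema of [b, c], this yields indices 1 and 2, which is the projection that would be placed right after the read. A plan field that the table does not have produces an error instead of a projection.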
Describe the bug
As written, the test assertion in https://github.com/apache/datafusion/blob/1fce2a98ef9c7f8dbd7f3dedcaf4aa069ab92154/datafusion/substrait/tests/cases/logical_plans.rs#L46-L50 should fail because DataFusion registers the data table with 5 fields [a, b, c, d, e] but the schema for the table in the Substrait plan only has a single field [D].
To Reproduce
No response
Expected behavior
DataFusion should reject Substrait plans in which NamedScan schemas do not match the corresponding table that is registered.
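As an illustration of the strict behavior described here, the check could look roughly like the sketch below. The `Field` struct, the `field` constructor, and `ensure_schemas_match` are all hypothetical names made up for this example, and data types are represented as plain strings rather than real Arrow or Substrait types.

```rust
/// Hypothetical, simplified field description (name plus a type label).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

/// Convenience constructor for the sketch.
fn field(name: &str, data_type: &str) -> Field {
    Field {
        name: name.to_string(),
        data_type: data_type.to_string(),
    }
}

/// Rejects a plan whose NamedScan schema does not match the registered
/// table field for field, by position, name, and type.
fn ensure_schemas_match(table: &[Field], plan: &[Field]) -> Result<(), String> {
    if table.len() != plan.len() {
        return Err(format!(
            "registered table has {} fields, plan declares {}",
            table.len(),
            plan.len()
        ));
    }
    for (i, (t, p)) in table.iter().zip(plan).enumerate() {
        if t != p {
            return Err(format!(
                "field {} differs: table has {:?}, plan has {:?}",
                i, t, p
            ));
        }
    }
    Ok(())
}
```

Under this check, a table registered with five fields and a plan that declares a single field [D] is rejected on the field count alone, before any names are compared.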
Additional context
Generally speaking, if the plan consumer (DataFusion) and the producer do not agree on column names and types, it is unlikely that execution will be meaningful.