apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine
https://datafusion.apache.org/ballista
Apache License 2.0

ShuffleWriterExec::schema mismatch #483

Open tustvold opened 2 years ago

tustvold commented 2 years ago

Describe the bug

ShuffleWriterExec::schema() returns the schema of the underlying plan. However, ShuffleWriterExec::execute returns a stream of RecordBatch containing metadata, which consequently has a completely different schema.

To Reproduce

Use ShuffleWriterExec and compare the schema returned by schema() with the schema of the stream returned by execute().

Expected behavior

ExecutionPlan::schema should return the same schema as the SendableRecordBatchStream yielded by ExecutionPlan::execute.
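The expected invariant can be sketched in a few lines of Rust. This is only a toy model, not the real DataFusion or Ballista API: `Schema`, `BatchStream`, `Plan`, `ScanPlan`, `ShuffleWriterLike`, and `check_schema_consistency` are hypothetical stand-ins, and the metadata column names are illustrative assumptions. It shows a well-behaved plan passing the check and a shuffle-writer-like plan failing it because it reports its input schema while yielding metadata batches.

```rust
// Toy stand-ins for illustration only; these are NOT the real
// DataFusion/Ballista types.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<String>,
}

// Stands in for a SendableRecordBatchStream: it only carries its schema.
struct BatchStream {
    schema: Schema,
}

// Stands in for the ExecutionPlan trait.
trait Plan {
    fn schema(&self) -> Schema;
    fn execute(&self) -> BatchStream;
}

// A plan whose declared schema matches what it yields.
struct ScanPlan {
    schema: Schema,
}

impl Plan for ScanPlan {
    fn schema(&self) -> Schema {
        self.schema.clone()
    }
    fn execute(&self) -> BatchStream {
        BatchStream { schema: self.schema.clone() }
    }
}

// A plan that mimics the reported behaviour: schema() reports the input
// schema, but execute() yields metadata batches with different columns.
struct ShuffleWriterLike {
    input: Schema,
}

impl Plan for ShuffleWriterLike {
    fn schema(&self) -> Schema {
        self.input.clone()
    }
    fn execute(&self) -> BatchStream {
        // Hypothetical metadata column names, chosen for illustration.
        let fields = vec!["partition".to_string(), "path".to_string()];
        BatchStream { schema: Schema { fields } }
    }
}

// Returns Ok if the declared schema equals the schema of the yielded stream.
fn check_schema_consistency(plan: &dyn Plan) -> Result<(), String> {
    let declared = plan.schema();
    let actual = plan.execute().schema;
    if declared == actual {
        Ok(())
    } else {
        Err(format!("declared {:?} but stream has {:?}", declared, actual))
    }
}
```

A check of this shape, run against every operator in a test suite, would flag the mismatch described in this issue.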

Additional context

There is a potentially valid question as to why we have the schema stored in so many places...

tustvold commented 2 years ago

I tried changing this in apache/arrow-datafusion#2428, but it causes distributed_join_plan to fail with:

Error: DataFusionError(Plan("The left or right side of the join does not have all columns on \"on\": \nMissing on the left: {Column { name: \"l_orderkey\", index: 0 }}\nMissing on the right: {Column { name: \"o_orderkey\", index: 0 }}"))

I'm not familiar enough with this code to know what is going on here, but something doesn't feel right.