apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine
https://datafusion.apache.org/ballista
Apache License 2.0

TPC-H queries are failing on main branch #1058

Open kaushik-pankaj opened 3 hours ago

kaushik-pankaj commented 3 hours ago

Describe the bug

When running the TPC-H queries in distributed mode (ballista-cli pointing to ballista-scheduler, with one ballista-scheduler and one ballista-executor), some queries pass and others fail.

Passed queries: q1, q3, q4, q5, q6, q11, q12, q13, q16, q17, q19, q20, q21
Failed queries: q2, q7, q8, q9, q10, q14, q15, q18, q22

The failed queries all give a similar error. For example, here is the error for query 2:

ballista_scheduler::scheduler_server::query_stage_scheduler] Failed to update 1 task statuses for Executor 167eb7c2-fc0f-4232-a279-47aa0d0f70e7: DataFusionError(Internal("PhysicalExpr Column references column 's_acctbal' at index 9 (zero-based) but input schema only has 9 columns: [\"s_name\", \"s_address\", \"s_nationkey\", \"s_phone\", \"s_acctbal\", \"s_comment\", \"p_partkey\", \"p_mfgr\", \"ps_supplycost\"]"))
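To make the error easier to read: the expression refers to 's_acctbal' by position 9, but the schema it is evaluated against has only 9 fields (valid indices 0 to 8), and 's_acctbal' actually sits at index 4. The following is a minimal, std-only sketch of that kind of bounds check. It is not DataFusion's or Ballista's code; `ColumnRef`, `check_column`, `rebind_by_name`, and `q2_schema` are hypothetical names introduced only for illustration.

```rust
/// Hypothetical stand-in for a physical column reference:
/// a column name plus a positional index into the input schema.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnRef {
    pub name: String,
    pub index: usize,
}

/// Returns Ok if `col.index` is in bounds for `schema` and the field at that
/// position has the same name; otherwise returns a descriptive error string.
pub fn check_column(col: &ColumnRef, schema: &[&str]) -> Result<(), String> {
    match schema.get(col.index) {
        None => Err(format!(
            "column '{}' at index {} (zero-based) but input schema only has {} columns",
            col.name,
            col.index,
            schema.len()
        )),
        Some(field) if *field != col.name.as_str() => Err(format!(
            "column '{}' at index {} points at field '{}'",
            col.name, col.index, field
        )),
        Some(_) => Ok(()),
    }
}

/// Recomputes the index by looking the column up by name.
/// Returns None if the schema has no field with that name.
pub fn rebind_by_name(col: &ColumnRef, schema: &[&str]) -> Option<ColumnRef> {
    schema
        .iter()
        .position(|field| *field == col.name.as_str())
        .map(|index| ColumnRef {
            name: col.name.clone(),
            index,
        })
}

/// The nine-field input schema copied from the error message above.
pub fn q2_schema() -> Vec<&'static str> {
    vec![
        "s_name",
        "s_address",
        "s_nationkey",
        "s_phone",
        "s_acctbal",
        "s_comment",
        "p_partkey",
        "p_mfgr",
        "ps_supplycost",
    ]
}
```

With these definitions, `check_column` rejects a `ColumnRef` for "s_acctbal" at index 9 against `q2_schema()`, while `rebind_by_name` finds it at index 4. This suggests the expression was built against a different (wider) schema than the one the stage actually receives.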

Note: this issue started appearing after commit 3b6964bd973d399619a33336d9cf618173985eb0.

To Reproduce

Steps to reproduce the behavior:

  1. Check out the main branch.
  2. Run cargo build to build the project.
  3. Run the scheduler and executor.
  4. Connect ballista-cli to the scheduler.
  5. Run the TPC-H queries in ballista-cli (https://github.com/apache/datafusion-ballista/tree/main/benchmarks/queries).

Expected behavior

A clear and concise description of what you expected to happen.

Additional context

Everything works with DataFusion version 35.0.0. As soon as the DataFusion version is upgraded to 39.0.0, the TPC-H queries start failing.

Dandandan commented 3 hours ago

We had a similar problem with joins in our fork of Ballista. We traced it down to https://github.com/apache/datafusion/pull/9236 and the JoinSelection rule applied when creating stages, which doesn't support projections yet.

Dandandan commented 31 minutes ago

Can you confirm it is "solved" by removing the line here: https://github.com/apache/datafusion-ballista/blob/e39a7e68da093ff6f0f002e7c6553d6fd4763dbb/ballista/scheduler/src/state/execution_graph/execution_stage.rs#L353 ?