apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine
https://datafusion.apache.org/ballista
Apache License 2.0

Ballista standalone mode tests fail: `context::tests::test_task_stuck_when_referenced_task_failed` #25

Open: alamb opened this issue 2 years ago

alamb commented 2 years ago

Describe the bug

The following Ballista test is failing. It is not clear when it started failing, because these tests were not run in CI until apache/arrow-datafusion#1839.

---- context::tests::test_task_stuck_when_referenced_task_failed stdout ----
Found object store LocalFileSystem for path /Users/alamb/Software/arrow-datafusion/parquet-testing/data/single_nan.parquet
thread 'context::tests::test_task_stuck_when_referenced_task_failed' panicked at 'called `Result::unwrap()` on an `Err` value: Execution("Job RcB8xKy failed: Task failed due to Tokio error: DataFusion error: Execution(\"ArrowError(ParseError(\\\"Error parsing line 2: Error(UnequalLengths { pos: Some(Position { byte: 104, line: 3, record: 2 }), expected_len: 2, len: 1 })\\\"))\")")', ballista/rust/client/src/context.rs:541:42
stack backtrace:
   0: rust_begin_unwind
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:498:5
   1: core::panicking::panic_fmt
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/panicking.rs:107:14
   2: core::result::unwrap_failed
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/result.rs:1613:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/result.rs:1295:23
   4: ballista::context::tests::test_task_stuck_when_referenced_task_failed::{{closure}}
             at ./src/context.rs:541:23
   5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/future/mod.rs:80:19
   6: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/future/future.rs:119:9
   7: tokio::runtime::basic_scheduler::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.16.1/src/runtime/basic_scheduler.rs:516:48
   8: tokio::coop::with_budget::{{closure}}
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.16.1/src/coop.rs:102:9
   9: std::thread::local::LocalKey<T>::try_with
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/thread/local.rs:399:16
  10: std::thread::local::LocalKey<T>::with
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/thread/local.rs:375:9
  11: tokio::coop::with_budget
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.16.1/src/coop.rs:95:5
  12: tokio::coop::budget
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.16.1/src/coop.rs:72:5
  13: tokio::runtime::basic_scheduler::CoreGuard::block_on::{{closure}}::{{closure}}
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.16.1/src/runtime/basic_scheduler.rs:516:25
  14: tokio::runtime::basic_scheduler::Context::enter
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.16.1/src/runtime/basic_scheduler.rs:374:19
  15: tokio::runtime::basic_scheduler::CoreGuard::block_on::{{closure}}
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.16.1/src/runtime/basic_scheduler.rs:515:36
  16: tokio::runtime::basic_scheduler::CoreGuard::enter::{{closure}}
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.16.1/src/runtime/basic_scheduler.rs:582:57
  17: tokio::macros::scoped_tls::ScopedKey<T>::set
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.16.1/src/macros/scoped_tls.rs:61:9
  18: tokio::runtime::basic_scheduler::CoreGuard::enter
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.16.1/src/runtime/basic_scheduler.rs:582:27
  19: tokio::runtime::basic_scheduler::CoreGuard::block_on
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.16.1/src/runtime/basic_scheduler.rs:506:9
  20: tokio::runtime::basic_scheduler::BasicScheduler::block_on
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.16.1/src/runtime/basic_scheduler.rs:182:24
  21: tokio::runtime::Runtime::block_on
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.16.1/src/runtime/mod.rs:475:46
  22: ballista::context::tests::test_task_stuck_when_referenced_task_failed
             at ./src/context.rs:542:9
  23: ballista::context::tests::test_task_stuck_when_referenced_task_failed::{{closure}}
             at ./src/context.rs:473:11
  24: core::ops::function::FnOnce::call_once
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ops/function.rs:227:5
  25: core::ops::function::FnOnce::call_once
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
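The nested error text in the panic appears to come from Arrow's CSV reader, which reports the `csv` crate's `UnequalLengths` error: a record at line 3 (record 2) had 1 field where 2 were expected. The sketch below illustrates that check using only the standard library. It is not the `csv` crate's implementation. `UnequalLengths` here is a hypothetical local struct and `check_field_counts` a hypothetical helper, and the naive comma split ignores quoting and blank lines.

```rust
/// Hypothetical illustration (not the `csv` crate's type): describes the first
/// row whose field count differs from the first row's.
#[derive(Debug, PartialEq)]
pub struct UnequalLengths {
    /// 1-based line number of the offending row.
    pub line: usize,
    /// 0-based record index (the first row is record 0).
    pub record: usize,
    /// Number of fields in the first row.
    pub expected_len: usize,
    /// Number of fields in the offending row.
    pub len: usize,
}

/// Hypothetical helper: returns an error for the first row whose field count
/// does not match the first row's field count.
pub fn check_field_counts(input: &str) -> Result<(), UnequalLengths> {
    let mut expected_len: Option<usize> = None;
    for (idx, line) in input.lines().enumerate() {
        // Naive comma split: ignores quoting and assumes one record per line
        // with no blank lines, so the record index equals the line index.
        let len = line.split(',').count();
        match expected_len {
            None => expected_len = Some(len),
            Some(expected) if expected != len => {
                return Err(UnequalLengths {
                    line: idx + 1,
                    record: idx,
                    expected_len: expected,
                    len,
                });
            }
            Some(_) => {}
        }
    }
    Ok(())
}
```

For example, the input `"a,b\n1,2\n3\n"` is rejected at line 3, record 2, with `expected_len: 2` and `len: 1`, the same shape as the error in the log.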

To Reproduce

Check out the code from https://github.com/apache/arrow-datafusion/pull/1839 and run:

cd arrow-datafusion/ballista
cargo test --no-default-features --features standalone -- --ignored

Expected behavior

The test should pass.


Ted-Jiang commented 2 years ago

I can reproduce this locally.

Ted-Jiang commented 2 years ago

@gaojun2048 I think this unit test, `test_task_stuck_when_referenced_task_failed`, is expected to return an error. Is that intended? As far as I can tell, the error is returned from the call to `collect()`. Am I missing something?
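If the failure is intended, the panic in the log comes from calling `unwrap()` on the `Err` returned by `collect()`. One option is to assert on the error instead. The sketch below is a minimal illustration, not Ballista's code: `collect_job` is a hypothetical stand-in for the real distributed query, and its error message is made up for the example. `Result::expect_err` is the standard-library method that panics on `Ok` and returns the error on `Err`.

```rust
/// Hypothetical stand-in for `collect()` on a job; not Ballista's API.
/// Returns an error when a referenced task fails.
pub fn collect_job(referenced_task_fails: bool) -> Result<Vec<u64>, String> {
    if referenced_task_fails {
        Err("Job failed: Task failed due to a CSV parse error".to_string())
    } else {
        Ok(vec![1, 2, 3])
    }
}

/// Asserts the expected failure instead of calling `unwrap()`, which would
/// panic on `Err` and make the test fail.
pub fn check_failure_is_reported() -> String {
    let err = collect_job(true)
        .expect_err("the query should fail when a referenced task fails");
    assert!(err.contains("Task failed"));
    err
}
```

With this shape, a job that fails as designed makes the test pass, while a job that unexpectedly succeeds makes `expect_err` panic.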

EricJoy2048 commented 2 years ago


Sorry for the late reply. Yes, I have a PR for this issue: https://github.com/apache/arrow-datafusion/issues/1654. The problem occurs when a referenced task fails: before my PR, the query hangs; once my PR is merged, the query returns an error instead.
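The distinction described above (a query that hangs versus one that returns an error) is what a test named `test_task_stuck_...` would need to detect. A minimal sketch, assuming the job can be wrapped in a closure, runs it on a thread and waits with a timeout via `std::sync::mpsc::Receiver::recv_timeout`. `Outcome` and `run_with_timeout` are hypothetical names, not part of Ballista. In an async test the same idea could use `tokio::time::timeout`; this sketch uses only the standard library.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical result of running a job with a deadline.
#[derive(Debug, PartialEq)]
pub enum Outcome<T, E> {
    /// The job completed (successfully or with an error) before the deadline.
    Finished(Result<T, E>),
    /// The job did not complete before the deadline (e.g. a hung query).
    Stuck,
    /// The job's thread panicked without sending a result.
    Panicked,
}

/// Runs `job` on a separate thread and waits at most `timeout` for its result.
pub fn run_with_timeout<T, E, F>(job: F, timeout: Duration) -> Outcome<T, E>
where
    F: FnOnce() -> Result<T, E> + Send + 'static,
    T: Send + 'static,
    E: Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore send errors: the receiver may have given up after the timeout.
        let _ = tx.send(job());
    });
    match rx.recv_timeout(timeout) {
        Ok(result) => Outcome::Finished(result),
        Err(RecvTimeoutError::Timeout) => Outcome::Stuck,
        // The sender was dropped without sending, so the job panicked.
        Err(RecvTimeoutError::Disconnected) => Outcome::Panicked,
    }
}
```

Under this scheme, the behavior described before the PR would show up as `Outcome::Stuck`, and the behavior after it as `Outcome::Finished(Err(..))`. Note that a stuck job's thread keeps running in the background; the sketch only stops waiting for it.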