NVIDIA / stdexec

`std::execution`, the proposed C++ framework for asynchronous and parallel programming.
Apache License 2.0

flaky behavior of the system context #1360

Closed: ericniebler closed this 2 months ago

ericniebler commented 2 months ago

I often see failures in CI for the system context when compiling with nvc++. I had suspected an nvc++ codegen bug, but I just saw a similar failure with gcc-11.

684/694 Test #684: get_completion_scheduler ......................................................................................................Subprocess aborted***Exception:   0.12 sec
Filters: get_completion_scheduler
===============================================================================
All tests passed (2 assertions in 1 test case)

pure virtual method called
terminate called without an active exception

As far as I can tell it can happen with any of the system context tests.

attn: @lucteo

lucteo commented 2 months ago

Tried running this a couple of thousand times, and I can't reproduce the bug on my machine. Trying harder...

ccotter commented 2 months ago

I found the CI failure for nvc++, but not gcc-11.

Could this be related to shutdown order, since the thread pool threads are shut down after main exits?

ericniebler commented 2 months ago

> I found the CI failure for nvc++, but not gcc-11.

shame on me. i reran the failed job and it passed. i should have saved it for posterity. :-/

ericniebler commented 2 months ago

> Could this be related to shutdown order, since the thread pool threads are shut down after main exits?

this is an interesting observation. the error is always "pure virtual method called", and the only pure virtual functions are in the numa_policy class, which is part of the static_thread_pool implementation. this does feel like a [con|de]structor order issue.

UPDATE: this is almost certainly the issue. get_numa_policy() returns a pointer to an object in thread-local storage. the destruction order is unspecified.
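
To illustrate the failure mode described above: when a polymorphic object is destroyed, its vtable pointer is reset to the abstract base class, so a later virtual call through a stale pointer can hit the "pure virtual method called" handler. The following is a minimal sketch of one common mitigation, a deliberately immortal instance. It is not stdexec's code and not necessarily the fix that was applied; `policy`, `default_policy`, `get_policy`, and `shutdown_user` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical interface standing in for a class with pure virtual members.
struct policy {
  virtual int num_nodes() const = 0;
  virtual ~policy() = default;
};

// Hypothetical concrete implementation.
struct default_policy final : policy {
  int num_nodes() const override { return 1; }
};

// Hazardous shape, shown for comparison only (assumed, not taken from stdexec):
//
//   policy* get_policy_tls() {
//     thread_local default_policy p;  // destroyed when the calling thread exits
//     return &p;                      // the pointer can outlive the object
//   }

// Mitigation sketch: allocate once and never destroy. The pointer stays valid
// for the whole process lifetime, including static and thread-local
// destruction at shutdown. The one-time allocation is leaked on purpose.
inline policy* get_policy() {
  static policy* const instance = new default_policy();
  return instance;
}

// Hypothetical object whose destructor may run late during shutdown and still
// needs the policy. With the immortal instance this call remains well defined.
struct shutdown_user {
  ~shutdown_user() {
    policy* p = get_policy();
    assert(p != nullptr);
    std::printf("nodes at shutdown: %d\n", p->num_nodes());
  }
};
```

Declaring a `static shutdown_user` at namespace scope and letting the program exit exercises the shutdown path. The trade-off is that the instance's destructor never runs, which is usually acceptable for a stateless policy object but would not be for one that owns resources needing cleanup.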

lucteo commented 2 months ago

Yesterday evening I let all the tests run in a loop for 1000 iterations, under Docker with gcc11. I got a failure on iteration 995 (funny, right?) on a different test:

654: Test command: /github/workspace/build/gcc11-release/test/test.stdexec "io_uring_context schedule_after -1s"
654: Working Directory: /github/workspace/build/gcc11-release/test
654: Test timeout computed to be: 60
654: Filters: io_uring_context schedule_after -1s
654:
654: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
654: test.stdexec is a Catch v2.13.6 host application.
654: Run with -? for options
654:
654: -------------------------------------------------------------------------------
654: io_uring_context schedule_after -1s
654: -------------------------------------------------------------------------------
654: /github/workspace/test/exec/test_io_uring_context.cpp:314
654: ...............................................................................
654:
654: /github/workspace/test/exec/test_io_uring_context.cpp:337: FAILED:
654:   CHECK( is_called_1 == true )
654: with expansion:
654:   false == true
654:
654: ===============================================================================
654: test cases: 1 | 1 failed
654: assertions: 3 | 2 passed | 1 failed
654:
654/704 Test #654: io_uring_context schedule_after -1s ...........................................................................................***Failed    0.00 sec
Filters: io_uring_context schedule_after -1s

So, no, I couldn't reproduce this.

I also tend to agree that the issue comes from static_thread_pool.

ericniebler commented 2 months ago

i think this issue is fixed now. i will reopen if i see this error again.