NVIDIA / stdexec

`std::execution`, the proposed C++ framework for asynchronous and parallel programming.
Apache License 2.0

race condition in io_uring scheduler #816

Closed ericniebler closed 1 year ago

ericniebler commented 1 year ago

One of my local test runs yielded this:

[ctest] 579/579 Test #575: After -1s .........................................................................................Subprocess aborted***Exception:   0.08 sec
[ctest] test.stdexec: /workspaces/stdexec/include/exec/linux/./__detail/io_uring_context.hpp:280: bool exec::__io_uring::__context::submit(exec::__io_uring::__task*): Assertion `__prev > 0' failed.
[ctest] Filters: After -1s
[ctest] 
[ctest] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[ctest] test.stdexec is a Catch v2.13.6 host application.
[ctest] Run with -? for options
[ctest] 
[ctest] -------------------------------------------------------------------------------
[ctest] After -1s
[ctest] -------------------------------------------------------------------------------
[ctest] /workspaces/stdexec/test/exec/test_io_uring_context.cpp:222
[ctest] ...............................................................................
[ctest] 
[ctest] /workspaces/stdexec/test/exec/test_io_uring_context.cpp:222: FAILED:
[ctest]   {Unknown expression after the reported line}
[ctest] due to a fatal error condition:
[ctest]   SIGABRT - Abort (abnormal termination) signal
[ctest] 
[ctest] ===============================================================================
[ctest] test cases: 1 | 1 failed
[ctest] assertions: 2 | 1 passed | 1 failed
[ctest] 
[ctest] 
[ctest] 
[ctest] 99% tests passed, 1 tests failed out of 579
[ctest] 
[ctest] Total Test time (real) =   0.78 sec
[ctest] 
[ctest] The following tests FAILED:
[ctest]     575 - After -1s (Subprocess aborted)
[ctest] Errors while running CTest

attn: @maikel

maikel commented 1 year ago

Can you tell me which compiler and which compile options you used?

maikel commented 1 year ago

I could reproduce it once on my machine. Looking :mag: 👀

maikel commented 1 year ago

Luckily, this was an error that I already fixed in PR #815.

This happened:

  1. io_thread: the io thread starts and attempts to run the io_uring context
  2. main thread: meanwhile, a schedule_after op enters the game and submits itself into the context; it increases __n_submissions_in_flight_ from 0 to 1
  3. io_thread: we enter run and unconditionally set __n_submissions_in_flight_ to 0 (caution: this is the error)
  4. main thread: we have pushed the op to the atomic intrusive queue and decrease the __n_submissions_in_flight_ counter... oh, it's already 0, so the assertion fires

The solution is to reset __n_submissions_in_flight_ in step 3 only if it was __no_new_submissions before.
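For anyone who wants to see the interleaving without the threads, here is a rough single-threaded replay of steps 2 to 4. It is only a sketch, not the actual stdexec code: `context_sketch`, its member functions, `replay`, and the choice of `-1` as the "no new submissions" sentinel are all assumptions made for illustration. The conditional reset is written with `compare_exchange_strong`, which is one plausible way to express "only if it was the sentinel before".

```cpp
#include <atomic>
#include <cassert>

// Hypothetical sentinel meaning "the context does not accept submissions".
// The value -1 is an assumption for this sketch.
constexpr int no_new_submissions = -1;

// Hypothetical stand-in for the context's submission counter.
struct context_sketch {
  std::atomic<int> n_submissions_in_flight{0};

  // Step 2: a producer announces that a submission is in flight.
  int begin_submit() {
    return n_submissions_in_flight.fetch_add(1, std::memory_order_relaxed);
  }

  // Step 4: the producer is done; returns the value seen before the decrement.
  // The assertion from the report corresponds to "returned value > 0".
  int end_submit() {
    return n_submissions_in_flight.fetch_sub(1, std::memory_order_relaxed);
  }

  // Step 3, buggy variant: wipes out a submission that is already in flight.
  void reset_unconditionally() {
    n_submissions_in_flight.store(0, std::memory_order_relaxed);
  }

  // Step 3, fixed variant: reopen the counter only if it held the sentinel.
  void reset_if_stopped() {
    int expected = no_new_submissions;
    n_submissions_in_flight.compare_exchange_strong(
        expected, 0, std::memory_order_relaxed);
  }
};

// Replays steps 2, 3, 4 in the problematic order and returns the value
// that step 4 observes before decrementing.
int replay(bool use_fix) {
  context_sketch ctx;
  ctx.begin_submit();            // step 2: 0 -> 1
  if (use_fix) {
    ctx.reset_if_stopped();      // step 3 (fixed): counter stays at 1
  } else {
    ctx.reset_unconditionally(); // step 3 (buggy): counter forced to 0
  }
  return ctx.end_submit();       // step 4
}
```

With the unconditional reset, `replay(false)` observes 0 in step 4, which is exactly the case the `__prev > 0` assertion catches. With the conditional reset, `replay(true)` observes 1 and the decrement is balanced.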

I love assertions. :heart:

maikel commented 1 year ago

Can this be closed?