Open jiixyj opened 15 hours ago
The scenario seems plausible. I think a nice way to work around the early destruction could be to increment the number of expected completions before iterating: the iteration over the children is conceptually an outstanding task. Once that is done the count is decremented and the appropriate completion is triggered if all outstanding work is completed.
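A minimal sketch of that counting idea, using toy stand-ins rather than the real `inplace_stop_source` and `when_all` state. All type and function names here (`toy_stop_source`, `toy_when_all_state`, `demo_guarded_stop`) are hypothetical, and the model is single-threaded:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for inplace_stop_source (hypothetical, single-threaded).
struct toy_stop_source {
    std::vector<std::function<void()>> callbacks;
    bool iterating = false;

    void request_stop() {
        iterating = true;
        // Reads the member vector on every step, so the owner must stay alive.
        for (std::size_t i = 0; i < callbacks.size(); ++i) callbacks[i]();
        iterating = false;
    }
};

// Toy stand-in for the when_all operation state (hypothetical).
struct toy_when_all_state {
    toy_stop_source stop_src;
    std::atomic<std::size_t> outstanding;
    std::function<void()> on_done;  // models set_stopped; may destroy *this

    toy_when_all_state(std::size_t children, std::function<void()> done)
        : outstanding(children), on_done(std::move(done)) {}

    void arrive() {
        if (outstanding.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            auto done = std::move(on_done);  // local copy survives destruction of *this
            done();                          // nothing may touch *this after this call
        }
    }

    void on_stop() {
        outstanding.fetch_add(1, std::memory_order_relaxed);  // the iteration is one more task
        stop_src.request_stop();                              // children may arrive() in here
        arrive();                                             // iteration done; may complete now
    }
};

// Returns true if completion happened, and only after the callback loop finished.
bool demo_guarded_stop() {
    bool completed = false;
    bool completed_inside_loop = false;
    toy_when_all_state* state = nullptr;
    state = new toy_when_all_state(2, [&] {
        completed = true;
        completed_inside_loop = state->stop_src.iterating;
        delete state;  // follow-up work destroys the opstate synchronously
    });
    // Each child completes synchronously from inside its stop callback.
    for (int child = 0; child < 2; ++child)
        state->stop_src.callbacks.push_back([&] { state->arrive(); });
    state->on_stop();
    assert(completed);
    return completed && !completed_inside_loop;
}
```

Because the extra count is only released after `request_stop()` returns, the last `arrive()` (and with it the destruction of the state) cannot happen while the callback list is still being walked.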
This seems eerily similar to an issue reported in libunifex some time back: https://github.com/facebookexperimental/libunifex/issues/445
The standard doesn't have that problem [yet?]: it sets up a callback using `on-stop-request` (see [exec.when.all p12]) which isn't defined/described. However, only `when_all`'s stop source is passed in, i.e., there is no option to play tricks with the count.
> The scenario seems plausible. I think a nice way to work around the early destruction could be to increment the number of expected completions before iterating: the iteration over the children is conceptually an outstanding task. Once that is done the count is decremented and the appropriate completion is triggered if all outstanding work is completed.
This is an excellent suggestion, thanks! Now, looking at libunifex's implementation, this is how they fixed it as well.
> This seems eerily similar to an issue reported in libunifex some time back: facebookexperimental/libunifex#445
Yep, seems to be the exact same issue. Thanks for the pointer! In the comments they mention `stop_when` as well, which is interesting, because this is where I originally stumbled across this (trying to implement the exposition-only `stop_when` sender algorithm from P3149R6). I was implementing it in a naive way, without stashing the original sender's result in a `result_variant` in the opstate. But I guess there's no way around it because of this lifetime issue.
Reading P3409R0 earlier, I had hopes that maybe `single_inplace_stop_source` could fix it, because it wouldn't need to loop over the list of stop callbacks as there is just a single one. But it needs to do some bookkeeping after calling the stop callback, so this is sadly not a correctness fix:
```cpp
inline bool single_inplace_stop_source::request_stop() noexcept {
  ...
    callback->execute(callback);
    state_.store(stop_requested_callback_done_state(), memory_order_release); // <<< this might access the opstate after its lifetime ended
    state_.notify_one();
  }
  return true;
}
```
> The standard doesn't have that problem [yet?]: it sets up a callback using `on-stop-request` (see [exec.when.all p12]) which isn't defined/described. However, only `when_all`'s stop source is passed in, i.e., there is no option to play tricks with the count.
It is mentioned in [exec.snd.expos]: https://eel.is/c++draft/exec#snd.expos-16
> It is mentioned in [exec.snd.expos]: https://eel.is/c++draft/exec#snd.expos-16
I didn't look there! Thanks for pointing that out. The implication is, of course, that the standard does have the problem. It may be reasonable to factor out the counting behavior and the stop callback handling into a separate entity used in relevant places: `on-stop-request` is also used for `split` (and other similar algorithms would use the same approach).
I may have stumbled across a nasty lifetime issue in the handling of stop callbacks in the `when_all`/`split` algorithms. But this would apply to any algorithm using an `inplace_stop_source` inside its operation state.

`when_all` has an `inplace_stop_source` (`stop_src`) inside its operation state, and then a stop callback (`on_stop`).

In `on_stop`, a stop callback is registered, which calls `stop_src.request_stop()` when there is a stop request on the receiver's stop token. All child senders of the `when_all` are registered on the `stop_src`. This propagates the stop request from the receiver to all children of the `when_all` sender.

Now, what I observed is the following chain of events:
1. The `on_stop` callback calls `stop_src.request_stop()`.
2. `stop_src` now iterates over its list of registered stop callbacks (those are the ones from the children) (*)
3. One of the children calls `set_stopped` synchronously from inside its stop callback. In my case it's from deregistering a sleep timer from a self-written epoll-based "io context".
4. `arrive()` of the `when_all` opstate gets called.
5. `complete()` of the `when_all` opstate gets called.
6. `ex::set_stopped` is called, satisfying the receiver contract of the `when_all` sender.
7. The `when_all` opstate is synchronously (!) destroyed from inside the `set_stopped` callback by some follow-up work. -> UB, since we are still iterating over the list of registered stop callbacks of `stop_src`! (the line marked with "*" above)

I've managed to "hack around" it by checking thread ids and deferring the completion to the stop callback if I detect that the completion is called synchronously from inside a stop callback. That means using a `stop_callback_with_thread_id` instead of `optional<stop_callback>` inside `when_all`'s opstate, and having a `complete()` that checks whether it is running on the thread that is currently executing the stop callback.

This all feels very hacky to me, though.
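A rough sketch of what such a thread-id workaround could look like, with toy stand-ins for the stop source and the opstate. Only the names `stop_callback_with_thread_id` and `complete()` come from the description above; everything else (`toy_stop_source`, `toy_opstate`, `finish()`, the member names) is hypothetical, and the sketch only covers the case where the child completes on the same thread that runs the stop callback:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Toy stand-in for inplace_stop_source (hypothetical, single-threaded).
struct toy_stop_source {
    std::vector<std::function<void()>> callbacks;
    bool iterating = false;

    void request_stop() {
        iterating = true;
        for (std::size_t i = 0; i < callbacks.size(); ++i) callbacks[i]();
        iterating = false;
    }
};

struct toy_opstate;

// Remembers which thread is currently running the stop callback.
struct stop_callback_with_thread_id {
    toy_opstate* self;
    std::atomic<std::thread::id> running_on{std::thread::id{}};
    bool completion_deferred = false;  // only written on the thread stored in running_on

    explicit stop_callback_with_thread_id(toy_opstate* s) : self(s) {}
    void operator()();
};

// Toy stand-in for the when_all operation state (hypothetical).
struct toy_opstate {
    toy_stop_source stop_src;
    stop_callback_with_thread_id on_stop{this};
    std::function<void()> on_done;  // models set_stopped; may destroy *this

    explicit toy_opstate(std::function<void()> done) : on_done(std::move(done)) {}

    void finish() {
        auto done = std::move(on_done);  // local copy survives destruction of *this
        done();
    }

    void complete() {
        // Called from inside our own stop callback? Then let the callback finish the job.
        if (on_stop.running_on.load() == std::this_thread::get_id()) {
            on_stop.completion_deferred = true;
            return;
        }
        finish();
    }
};

void stop_callback_with_thread_id::operator()() {
    running_on.store(std::this_thread::get_id());
    self->stop_src.request_stop();        // a child may call complete() in here
    running_on.store(std::thread::id{});  // default id: no thread is in the callback
    if (completion_deferred) self->finish();  // may destroy *self; touch nothing afterwards
}

// Returns true if completion happened, and only after the callback loop finished.
bool demo_deferred_completion() {
    bool completed = false;
    bool completed_inside_loop = false;
    toy_opstate* op = nullptr;
    op = new toy_opstate([&] {
        completed = true;
        completed_inside_loop = op->stop_src.iterating;
        delete op;  // follow-up work destroys the opstate synchronously
    });
    op->stop_src.callbacks.push_back([&] { op->complete(); });
    op->on_stop();
    assert(completed);
    return completed && !completed_inside_loop;
}
```

A completion arriving from a different thread while `request_stop()` is still iterating is not handled by this check, which is part of why the approach feels fragile.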
I haven't deeply investigated `split` yet, but I think the solution could be a bit simpler there: instead of `on-stop-request`, use a stop callback that just wraps the `request_stop()` between `inc_ref`/`dec_ref` to ensure the opstate object stays alive long enough.

I do wonder if there is a more elegant way to solve this issue. I don't think synchronous completions from stop callbacks should be outlawed -- it seems "natural" to me to do the `set_stopped` right inside the stop callback if possible. Or maybe synchronous destruction from inside the `set_stopped` completion of `when_all` is the problem? I've thought that you have to assume the lifetime of the opstate may end when calling the completion, though.
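For the `split` idea above, a minimal sketch of a stop callback that holds a reference across `request_stop()`. It assumes an intrusively reference-counted shared state; `toy_stop_source`, `toy_shared_state`, `ref_holding_stop_callback` and the `destroyed_inside_loop` test hook are hypothetical names, not the actual `split` implementation:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Toy stand-in for inplace_stop_source (hypothetical, single-threaded).
struct toy_stop_source {
    std::vector<std::function<void()>> callbacks;
    bool iterating = false;

    void request_stop() {
        iterating = true;
        for (std::size_t i = 0; i < callbacks.size(); ++i) callbacks[i]();
        iterating = false;
    }
};

// Toy stand-in for split's shared state with an intrusive reference count.
struct toy_shared_state {
    toy_stop_source stop_src;
    std::atomic<std::size_t> ref_count{1};  // one reference held by the last consumer
    bool* destroyed_inside_loop;            // test hook only, not part of the technique

    explicit toy_shared_state(bool* flag) : destroyed_inside_loop(flag) {}
    ~toy_shared_state() { *destroyed_inside_loop = stop_src.iterating; }

    void inc_ref() { ref_count.fetch_add(1, std::memory_order_relaxed); }
    void dec_ref() {
        if (ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1) delete this;
    }
};

// Stop callback that keeps the state alive while request_stop() iterates.
struct ref_holding_stop_callback {
    toy_shared_state* state;

    void operator()() const {
        toy_shared_state* s = state;  // copy first: *this may live inside *s
        s->inc_ref();
        s->stop_src.request_stop();   // consumers may drop their references in here
        s->dec_ref();                 // may destroy *s; nothing is touched afterwards
    }
};

// Returns true if the state was destroyed, and only after the callback loop finished.
bool demo_ref_counted_stop() {
    bool destroyed_inside_loop = true;  // overwritten by the destructor
    auto* state = new toy_shared_state(&destroyed_inside_loop);
    // The last consumer releases its reference synchronously from its stop callback.
    state->stop_src.callbacks.push_back([state] { state->dec_ref(); });
    ref_holding_stop_callback{state}();
    assert(!destroyed_inside_loop);
    return !destroyed_inside_loop;
}
```

Without the `inc_ref()`/`dec_ref()` pair, the consumer's `dec_ref()` would delete the state while `request_stop()` is still walking its callback list.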