Raku / problem-solving


Supply blocks may reorder flow of execution #364

Open patrickbkr opened 1 year ago

patrickbkr commented 1 year ago

This is a follow-up to https://github.com/rakudo/rakudo/issues/5141 and https://github.com/rakudo/rakudo/pull/5158.

Consider the following piece of code:

my $p = Promise.new;
my $s1 = supply {
    await $p;
    say "processing whenever"
};
react {
    whenever $s1 { }
    $p.keep;
    say "end of react block";
}

That piece of code prints

end of react block
processing whenever

This is unexpected. The order seems reversed. Why is this so?

By default, whenever blocks tap their supply immediately and, as a result, run the supply block they are connected to. This has the potential to deadlock when that supply block (or one nested via further whenevers) contains an emit, as the emit will call the whenever block of the original supply, while by definition only a single execution branch is allowed in a supply block at any time. Possibly (I'm not sure about that) there are other scenarios posing similar deadlock issues without recursion being involved.
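
A minimal, hypothetical sketch (mine, not taken from the linked tickets) of the recursion described above: tapping $inner runs its body, and its emit calls back into $outer's whenever block while $outer's setup still occupies the single allowed execution branch. With the protections described below this should run and print; without them, the emit would be waiting on a supply block that is still busy setting up.

my $inner = supply { emit "ping" };         # emits as soon as it is tapped
my $outer = supply {
    whenever $inner -> $msg {               # tapping $inner runs its body...
        say "outer got $msg";               # ...which re-enters $outer while its setup is still running
    }
};
react { whenever $outer { } }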

The currently implemented solution for this deadlock problem in https://github.com/rakudo/rakudo/commit/26a9c313297a21c11ac30f02349497822686f507 and https://github.com/rakudo/rakudo/commit/547839200a772e26ea164e9d1fd8c9cd4a5c2d9f works as follows. When an await happens in the setup phase, during the processing of a whenever tapping (be it a protect, acquire, lock or plain await), a continuation starting at the root of the whenever tapping is taken. That continuation is only resumed after the supply block itself has finished running. So the order in which the code is executed depends on whether there is any locking going on in code tapped by the whenever.

This behavior is difficult for an unknowing programmer to follow. Even a programmer who knows about this behavior will potentially have a hard time telling whether there is a lock somewhere in the tapped code that could cause this reordering of code. This is a case of action-at-a-distance.

Also, there is still a possibility of the code deadlocking. That's detailed in https://github.com/rakudo/rakudo/issues/5141. I consider that a problem separate from this issue.

patrickbkr commented 1 year ago

In https://github.com/rakudo/rakudo/pull/5158 I tried replacing the continuation-based solution strategy with a much simpler approach: always executing whenever tappings only after the supply block itself has run. That behavior is a lot easier to understand.

It does contradict the syntax though. Currently whenever blocks can be placed anywhere in a supply block, which strongly suggests that they are processed where they are placed. Only executing them after having processed the supply block is thus highly counterintuitive.

Additionally it makes code like the following impossible to write:

# Simplified from t/spec/S17-promise/nonblocking-await.t
# ($s-address and $port are defined in the surrounding test file)
my $started = Promise.new;
my $ok = start react {
    whenever IO::Socket::Async.listen($s-address, $port) -> $conn {
        whenever $conn.Supply(:bin) -> $buf {
            await $conn.write: $buf;
            $conn.close;
        }
    }
    $started.keep;
}

In that piece of code, a Promise is used to track the setup of the react block. When whenever tapping is delayed until after the react block has been processed, that $started.keep happens too early.

I thus consider https://github.com/rakudo/rakudo/pull/5158 a non-solution. At the moment I have run out of ideas for how to solve this.

jnthn commented 1 year ago

This is unexpected. The order seems reversed. Why is this so?

A supply block (and thus react) provides concurrency control by enforcing one-message-at-a-time semantics: that is, we must completely process a message (reach the end of the whenever block) before we process the next one. An await inside a whenever block thus prevents execution of any other whenever block from starting until the awaited event completes. (If one wants more concurrency, the answer is to instead use a nested whenever; were await to have different semantics, there would be no means to achieve back pressure.)
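
To make the back-pressure point concrete, here is a small sketch (mine, not part of the original comment; the Supplier-based setup is only for illustration): the await holds up delivery of the next message, while the commented-out nested whenever would let messages be handled with more concurrency.

my $s = Supplier.new;
react {
    whenever $s.Supply -> $n {
        # Back pressure: the next $n is not delivered to this block
        # until the awaited work below has completed.
        await start { sleep 0.1 };
        say "processed $n";
        # A more concurrent variant would use a nested whenever instead:
        #   whenever start { sleep 0.1; $n } -> $m { say "processed $m" }
    }
    start { $s.emit($_) for 1..3; $s.done }
}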

The body of a supply or react block itself is much like an OO constructor: just as we expect the constructor to complete before we have a complete object ready to receive method calls, we expect the setup phase of the supply or react block to run to completion before it processes any other messages.

Thus, the behavior in that example seems as intended.

Currently whenever blocks can be placed anywhere in a supply block, which strongly suggests that they are processed where they are placed.

Yes, it's important for correctness that the subscription is established right away.

The currently implemented solution for this deadlock problem in https://github.com/rakudo/rakudo/commit/26a9c313297a21c11ac30f02349497822686f507 and https://github.com/rakudo/rakudo/commit/547839200a772e26ea164e9d1fd8c9cd4a5c2d9f works as follows. When an await happens in the setup phase, during the processing of a whenever tapping (be it a protect, acquire, lock or plain await), a continuation starting at the root of the whenever tapping is taken. That continuation is only resumed after the supply block itself has finished running. So the order in which the code is executed depends on whether there is any locking going on in code tapped by the whenever.

This behavior is difficult for an unknowing programmer to follow. Even a programmer who knows about this behavior will potentially have a hard time telling whether there is a lock somewhere in the tapped code that could cause this reordering of code. This is a case of action-at-a-distance.

The core of the issue really is that supply and react are primarily a tool for managing existing concurrency, rather than as a means to introduce concurrency. However, in an attempt to DWIM, we try to permit a supply to immediately emit in its body, rather than just subscribe to other underlying sources and emit based upon events from those. The BlockAddWheneverAwaiter approach, as noted in https://github.com/rakudo/rakudo/issues/5141, seems to have shortcomings when one mixes supply blocks with use of a Supplier, for example.

I picked the continuation juggling approach to resolve the conflicts over the alternative of scheduling the continuations using the normal thread pool scheduler and letting the standard await mechanism take over, primarily on the basis that if supply blocks really were going to have to get into the business of introducing concurrency, then at least they should avoid introducing threading too. Perhaps, however, the latter would have been the lesser evil; I'm curious if that would solve the problem at hand here and be more robust generally.

patrickbkr commented 1 year ago

Thanks for your input!

This is unexpected. The order seems reversed. Why is this so?

A supply block (and thus react) provides concurrency control by enforcing one-message-at-a-time semantics: that is, we must completely process a message (reach the end of the whenever block) before we process the next one. An await inside a whenever block thus prevents execution of any other whenever block from starting until the awaited event completes. (If one wants more concurrency, the answer is to instead use a nested whenever; were await to have different semantics, there would be no means to achieve back pressure.)

That question was meant as the heading for the following paragraph. But thanks for explaining anyways! ;-)

Thus, the behavior in that example seems as intended.

Which example are you referring to? The one in the first post does trigger the await magic and manages to change the order in which the supply block and the tapping are executed.

Currently whenever blocks can be placed anywhere in a supply block, which strongly suggests that they are processed where they are placed.

Yes, it's important for correctness that the subscription is established right away.

I agree.

The core of the issue really is that supply and react are primarily a tool for managing existing concurrency, rather than as a means to introduce concurrency. However, in an attempt to DWIM, we try to permit a supply to immediately emit in its body, rather than just subscribe to other underlying sources and emit based upon events from those. The BlockAddWheneverAwaiter approach, as noted in rakudo/rakudo#5141, seems to have shortcomings when one mixes supply blocks with use of a Supplier, for example.

I picked the continuation juggling approach to resolve the conflicts over the alternative of scheduling the continuations using the normal thread pool scheduler and letting the standard await mechanism take over, primarily on the basis that if supply blocks really were going to have to get into the business of introducing concurrency, then at least they should avoid introducing threading too. Perhaps, however, the latter would have been the lesser evil

I can try implementing that. Just so I get it right: I keep the BAWA (BlockAddWheneverAwaiter), but instead of putting the continuations in a list for later processing, I'd just instantly schedule the continuations on the ThreadPoolScheduler (not on the outer $*AWAITER, which would be a BlockingAwaiter and thus block the BAWA itself).

I'm curious if that would solve the problem at hand here and be more robust generally.

I guess that depends on whether we want to be pedantic. If the aspiration is to never have surprising order reversals, then queuing things on the thread pool isn't a step forward. I am confident, though, that putting things on the thread pool will solve https://github.com/rakudo/rakudo/issues/5141 (the issue that made me explore the supply / whenever implementation in the first place). In that deadlock situation a continuation earlier in the list blocks on a lock that a later continuation holds. But there is no "A waits on B, B waits on A" situation. So if the other continuations were allowed to run, things would work out.

It still feels wrong to give locks that are not involved in the supply setup the same treatment. I think I'll start another experiment: I'll remove the entirety of the BlockAddWheneverAwaiter and see if the queue-on-recursion logic implemented in https://github.com/rakudo/rakudo/commit/547839200a772e26ea164e9d1fd8c9cd4a5c2d9f suffices to prevent deadlocks during construction. I think I have not yet seen a situation where a lock we need to resolve during construction has not been a recursion lock.

patrickbkr commented 1 year ago

Just removing the BlockAddWheneverAwaiter logic basically means that recursing emits during supply construction return instantly, which kills back pressure. All tests and spectests pass except for a single test:

tap-ok Supply.zip(
    Supply.from-list("a".."e"),
    Supply.from-list("f".."k"),
    Supply.from-list("l".."p")
  ),
  [<a f l>,<b g m>,<c h n>,<d i o>,<e j p>],
  "zipping with 3 supplies works";

The supply is done before a single value is emitted. That happens because the first supply gets to emit all its values and its done (each being put into the protect-or-queue-on-recursion queue) before the others do. That queue is then processed in order, causing the zip to be done as soon as all the messages of the first supply have been processed. Technically it still works as documented (the docs state: "The resulting supply is done as soon as any of the given supplies are done."), but I'm unsure whether that behavior is acceptable.
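
For reference, a small illustrative snippet (mine, not from roast) of the documented behavior being cited: the zipped supply is done as soon as any input is done, so the shorter input bounds how many tuples can possibly appear; how many of them actually appear before the done arrives depends on exactly the message ordering discussed in this issue.

react {
    whenever Supply.zip(
        Supply.from-list(1..3),
        Supply.from-list('a'..'e')
    ) -> @tuple {
        say @tuple;   # at most three tuples, then the zipped supply is done
    }
}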

patrickbkr commented 1 year ago

I have now also tried making the BlockAddWheneverAwaiter queue its continuations on the thread pool. With that, make test passes and make spectest hangs in S17-supply/[flat|batch|unique|merge|zip].t, but succeeds otherwise.

I have now pushed two branches to GitHub:

patrickbkr commented 1 year ago

Of note: both branches solve the deadlock problem in https://github.com/rakudo/rakudo/issues/5141.

patrickbkr commented 1 year ago

I'll see if I can get the put-continuations-on-thread-pool solution working (i.e. get the hanging tests to pass).

In general I like the solution removing the BlockAddWheneverAwaiter better. I believe it's easier to understand what's going on, as only the supplies' own locks are affected. Also, there are no continuations involved.

@jnthn What do you think?

patrickbkr commented 1 year ago

I found the reason for the hangs with the schedule-continuations-on-the-thread-pool approach. The hanging tests replace $*SCHEDULER with a CurrentThreadScheduler, which purposely blocks on calls to cue. Since we rely on the scheduler to execute our continuations concurrently to prevent deadlocks, the code can't resolve locks with a CurrentThreadScheduler.

The docs say the CurrentThreadScheduler is useful mostly for testing. So what's the point of having code run with the CurrentThreadScheduler? Is it forbidden to rely on the scheduler to resolve locks?
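
As an illustration (a minimal sketch of my own, not taken from the failing tests): CurrentThreadScheduler runs cued code synchronously on the calling thread, so nothing cued on it can run concurrently with the caller and release a lock the caller is waiting on.

my $*SCHEDULER = CurrentThreadScheduler.new;
$*SCHEDULER.cue({ say "runs right away, on the calling thread" });
say "reached only after the cued block has returned";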

patrickbkr commented 1 year ago

With the schedule-continuations-on-the-thread-pool approach there is also a test failure in t/spec/S17-supply/flat.t:

my $s = Supplier.new;
my $f = $s.Supply.flat;
my $seen1 = [];
my $t1 = $f.tap( { $seen1.push: $_ } );
$s.emit([1,2]);
is $seen1, [1,2], 'did we get the first emit (1)';

Cause: Emitting on a .flat-ed supply involves setting up a whenever and processing values in src/core.c/Supply-factories.pm6:460. That in turn activates the BlockAddWheneverAwaiter, which causes the value to be emitted asynchronously, making the $s.emit and the is check in the test racy.
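
A stripped-down sketch (hypothetical, not the actual .flat implementation) of why handing the emit off to another thread makes the test racy: the check can run before the tap callback has pushed anything.

my @seen;
my $emitted = start { sleep 0.01; @seen.push: 42 };   # stands in for the now-asynchronous emit
say @seen.elems;                                      # very likely still 0 here -- this is the race
await $emitted;
say @seen.elems;                                      # 1, once the asynchronous work has finished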

Perhaps, however, the latter would have been the lesser evil; I'm curious if that would solve the problem at hand here and be more robust generally.

My conclusion from the above analysis is that the schedule-continuations-on-the-thread-pool approach as I implemented it brings its own set of problems. I don't think these issues can be alleviated.

So the only approach I still have hope in is the remove-the-BlockAddWheneverAwaiter-completely approach I also explored. See my earlier comment about the Supply.zip spec test for the issue surrounding that approach.

Unless other ideas for approaching this come up, I think the solution space has been exhaustively explored. I'm now dependent on feedback on whether we should go forward with one of the two explored approaches.

patrickbkr commented 1 year ago

I'd like to make sure the state of this issue is clear.

jnthn has explained that we have to live with occasional reordering. That's known, intentional and we can't do much about it.

So the only problem left to be solved is the deadlock described in https://github.com/rakudo/rakudo/issues/5141. I explored two approaches trying to solve the deadlock issue:

  1. The remove-the-BlockAddWheneverAwaiter-completely approach. This solves the deadlock, but has the effect that recursing emits during whenever setup do not block anymore but are queued. This causes one spec test failure in a test that turned out to depend on non-deterministic behavior. A positive side-effect of this approach is that locks in general do not cause reordering anymore, only recursion that would otherwise deadlock causes reordering. In my opinion this makes it easier to understand what's going on.
  2. The schedule-continuations-on-the-thread-pool approach. This solves the deadlock as well, but it introduces concurrency. One spec test failure made clear that this concurrency is problematic. I think this renders this approach unsuitable.

So, to be able to continue working on this issue, I'd like to have either:

  1. Some feedback on whether the remove-the-BlockAddWheneverAwaiter-completely approach is feasible. If so, I could adapt the failing Supply.zip spec test and we could do a Blin run to see if there is more fallout.
  2. Some input on how else I could approach this problem.

So with the above summary I now ask for feedback. Ping @jnthn, @vrurg, @niner

vrurg commented 1 year ago

I'm not much into supplies from their implementation side, so not much help on this. But removal of a non-obvious deadlock cause does sound good to me. What's not good is that the spectest mentioned can hardly be changed without breaking backward compatibility. Thus I wonder whether option 1, should it eventually be accepted, could be a 6.e change. In that case the spectest would still define 6.c/d behavior and we'd have different behavior for 6.e.

Though it may also depend on the degree of change required for the spec to pass.

patrickbkr commented 1 year ago

What's not good is that the spectest mentioned can hardly be changed without breaking backward compatibility.

I only partly agree.

tap-ok Supply.zip(
    Supply.from-list("a".."e"),
    Supply.from-list("f".."k"),
    Supply.from-list("l".."p")
  ),
  [<a f l>,<b g m>,<c h n>,<d i o>,<e j p>],
  "zipping with 3 supplies works";

The test strongly relies on the supplies being processed round-robin, with each supply only getting to emit a single value per round. Supplies never gave any guarantee along those lines. That this works out in the current implementation is due to a delicate implementation detail.

The test reads as if zip were meant to process incoming values as long as at least one supply is still active. But that differs from both the current implementation and the documentation.

So I'm confident in saying that there is a mismatch between the test and the implementation. One of the two must be wrong.

I do agree though that the behavior of my proposed changes is suboptimal in this case (processing all values and the done from the first supply before working on the second).

niner commented 1 year ago

An alternative reading is that the spec really does require round-robin processing by Supply.zip, and the current implementation of this method is just very fragile and relies on implementation details of other parts of Supply. So, as part of the fix, method zip could be adjusted to deal with the new situation.

Maybe by creating a local queue for each of the child supplies, taking one element at a time out of this queue, and asking the child supply for more if the queue is empty? This way it wouldn't matter whether the child supply emits one or multiple values.
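
A rough sketch (hypothetical names zip-queued and maybe-emit; this is not the code from the draft PR mentioned below) of the buffering half of that idea: keep a queue per child supply, emit a zipped tuple whenever every queue holds a value, and finish once a done child's queue has drained.

sub zip-queued(*@sources) {
    supply {
        my @queues = [] xx @sources.elems;     # one buffer per child supply
        my @done   = False xx @sources.elems;
        sub maybe-emit() {
            while @queues».elems.all > 0 {
                emit [@queues.map(*.shift)];   # take one value from every buffer
            }
            # A done child with an empty buffer means no further tuples can form.
            done if (^@sources).first({ @done[$_] && @queues[$_].elems == 0 }).defined;
        }
        for @sources.kv -> $i, $source {
            whenever $source -> $value {
                @queues[$i].push: $value;
                maybe-emit;
                LAST { @done[$i] = True; maybe-emit }
            }
        }
    }
}

# Example: prints [1 a], [2 b], [3 c] regardless of how the inputs interleave.
react {
    whenever zip-queued(Supply.from-list(1..3), Supply.from-list(<a b c d>)) -> @t { say @t }
}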

jnthn commented 1 year ago

  1. The remove-the-BlockAddWheneverAwaiter-completely approach. This solves the deadlock, but has the effect that recursing emits during whenever setup do not block anymore but are queued. This causes one spec test failure in a test that turned out to depend on non-deterministic behavior (https://github.com/Raku/problem-solving/issues/364#issuecomment-1416843630). A positive side-effect of this approach is that locks in general do not cause reordering anymore, only recursion that would otherwise deadlock causes reordering. In my opinion this makes it easier to understand what's going on.

  2. The schedule-continuations-on-the-thread-pool approach. This solves the deadlock as well, but it introduces concurrency. One spec test failure (https://github.com/Raku/problem-solving/issues/364#issuecomment-1418285405) made clear that this concurrency is problematic. I think this renders this approach unsuitable.

If I understand option 1 correctly, this is simply saying that it is left to Lock::Async to do the queueing? In that case I think it's effectively achieving what 2 would, but by far more elegant and well-exercised means (e.g. we don't have to build anything new).

So long as we process one message at a time in a given supply and react instance (where the running of the supply or react block's body is considered a message too), it should be fine.

An alternative reading is, that the spec really does require a round-robin processing by Supply.zip and the current implementation of this method is just very fragile and relies on current implementation details of other parts of Supply. So as part of the fix, method zip could be adjusted to deal with the new situation.

That's a reasonable reading in my opinion. I did a draft PR https://github.com/rakudo/rakudo/pull/5200 that explores an alternative way to implement Supply.zip that passes all current spectests. It would be interesting to try it in combination with the deadlock fix; I think it should help.

patrickbkr commented 1 year ago

If I understand option 1 correctly, this is simply saying that it is left to Lock::Async to do the queueing?

Yes, that's exactly what happens.

So long as we process one message at a time in a given supply and react instance (where the running of the supply or react block's body is considered a message too), it should be fine.

As far as I know, that invariant is not broken. Each supply block is still guarded by a lock, and no code path enters that supply block without requesting that lock.

I did a draft PR rakudo/rakudo#5200 that explores an alternative way to implement Supply.zip that passes all current spectests. It would be interesting to try it in combination with the deadlock fix; I think it should help.

I have requested a change on that draft PR. With that change make test and make spectest both succeeded. I have now created a PR with those changes in rakudo/rakudo#5202.

patrickbkr commented 1 year ago

I now consider https://github.com/rakudo/rakudo/pull/5202 ready for review. That PR passes make test and make spectest, and resolves the original deadlock in https://github.com/rakudo/rakudo/issues/5141. I have turned that deadlocking code into a roast test in https://github.com/Raku/roast/pull/833.

The PR contains one controversial change in behavior: in https://github.com/patrickbkr/rakudo/commit/547839200a772e26ea164e9d1fd8c9cd4a5c2d9f the recursion breaking logic and Lock::Async::protect-or-queue-on-recursion were originally implemented. The commit message states: "A holder's recursion competes fairly with outside messages, thanks to the queueing being through Lock::Async." This is not true anymore, as the queued code objects are now called synchronously and no longer put on the thread pool.

This issue and the PR are now blocking on feedback / approval. @jnthn I think you have the deepest insight into this area of Raku. Thus I'd like to have your approval before going forward with my changes.

patrickbkr commented 9 months ago

I've merged the proposed fix and the respective test in https://github.com/rakudo/rakudo/pull/5202 and https://github.com/Raku/roast/pull/833. We now have a full month to find any regressions.

patrickbkr commented 9 months ago

Now that the changes have been merged, what do we do with this issue? Do I need to condense all the comments into some prose and do a problem-solving PR?