dbuenzli / affect

Composable concurrency primitives with OCaml effects handlers (unreleased)
https://erratique.ch/software/affect
ISC License
44 stars 3 forks source link

"Abortion" and resource-safety discipline #4

Closed gadmm closed 2 weeks ago

gadmm commented 2 years ago

At https://discuss.ocaml.org/t/affect-composable-concurrency-primitives-for-ocaml-5-0/9430 you propose to implement cancellation/abortion as follows:

Explicit fiber aborts raise the Abort exception in fibers. Combined with a disciplined use of Fun.protect a disciplined use of Fun.protect [...]

I suspect that there is a lot to unpack in this statement, and it might be valuable to write down what this discipline is. It looks a lot like the return of the old Thread.Exit exception which used to cause issues because no such discipline was advised to the user (e.g. third-party code that swallows the exception; one can argue that this is wrong, but more legitimate code could for instance wrap the exception into another exception).

I am curious about your point of view on the possible issues with cancellation described by @edwintorok (here) and how your discipline would go about it:

  • what if you defined some generic retry function that retries on all exceptions (it’ll need to ignore the canceled exception otherwise your computation won’t actually finish when the timeout is reached).
  • what if your cleanup function takes a long time? (if your cleanup function is just a file descriptor close, fine, but it could be an API call to a remote server, on which you may have forgotten to put a timeout thinking you have the timeout on the outer call, etc.)
  • what if your cleanup functions raise exceptions, where do those go?
  • there is no good to way to cancel some operations: you may be past a “point of no return” where you’ve already committed to finishing the operation and cancelling at that point will leave the system in a somewhat broken state (e.g. maybe the reason you’ve hit a timeout is that you were running the cleanup function of some other computation, and canceling that means you’d leak file descriptors. Not canceling it means you’ll exceed your timeout). This needs some way to mark regions of code as not cancelable, or take extra care that you really perform the most critical function of the cleanup handler even if the handler itself is canceled. (E.g. if you tried to log something in the cancel handler, cancel the logging but still close the file descriptor)

One issue that async Rust programmers have met is having to call async operations during cleanup. Can cleanup operations be aborted? If so, should Fun.protect be somehow aware of the Abort exception and not wrap it into Finally_raised?

I do not have more precise issues to point out right now, but since you called for discussions on the design, I think that this is an important design point to get right.

dbuenzli commented 2 years ago

Thanks for your questions. Here are a few, if not answers, at least comments, on these points.

I suspect that there is a lot to unpack in this statement, and it might be valuable to write down what this discipline is.

The preamble here and the documentation of the exception touch on this.

Yes this is simply Thread.Exit and/or a form of asynchronous exception. In my opinion it is the proper way to go to implement cancellation (no explicit threading of tokens and a notion of cancellation that aligns with the scope of functions which makes it easier to understand it's scope which is not the case for tokens which effectively allow to build complex and difficult to understand side effecting cancellation effects).

Given that your scheduler is implemented in "user" space, there's of course all sorts of business clients can do to derail the whole scheme. But that's about as much as you can do without cooperation from the runtime system/language.

If you go through the fiber termination function some of the cases where "something went" wrong are explicitely commented. While it does not do anything about these cases for now we could of course treat some of these cases as "programming error" or warn the end user if we had some kind of runtime system log.

  • what if you defined some generic retry function that retries on all exceptions (it’ll need to ignore the canceled exception otherwise your computation won’t actually finish when the timeout is reached).

Catch all exception handler are catch all exception handlers… Depending on what you do in your fiber function, you might eventually terminate "normally" with a value at which point it will finish it's abortion process and discard the value.

  • what if your cleanup function takes a long time?

The abort procedure will take a long time.

  • what if your cleanup functions raise exceptions, where do those go?

Following Fun.protect they must not and are treated the same way, see this code. Alternatively if we had a good design for a global exception trap (see https://github.com/ocaml/ocaml/issues/11074) we could give them to the trap – like Affect currently does in a few cases, e.g. with fibers that terminate by uncaught exception, see also point 2. here for more on this behaviour.

  • This needs some way to mark regions of code as not cancelable,

This is basically the block/unblock scopes of asynchronous exceptions mentioned in the above paper. This actually looks doable by enriching the fiber's Running state with a boolean indicating abortion should be delayed in the current state. Appropriate scoping combinators that toggle that variable (via an effect since the scheduler would be in charge) in the fiber's state could then be provided.

One issue that async Rust programmers have met is having to call async operations during cleanup.

In affect, if you try to cooperatively block a fiber (i.e. call an async operation) while it's in the aborting state it immediately aborts the blocking operation, that is calls the abort handler of the block operation you tried and raise Abort again. Basically this means that an aborting function can't perform cooperatively blocking operation anymore, they get immediately aborted.

This I don't see as an issue with (do you ?). I'm sure system interaction APIs can be used that cope with this – if only by using the abort handler function of the blocking operation in a suitable manner.

Can cleanup operations be aborted? If so, should Fun.protect be somehow aware of the Abort exception and not wrap it into Finally_raised?

Indeed there's an issue here.

Assuming a non aborting fiber yielded or blocked in a finally and gets aborted at that point, because of the discipline that you should not raise, you should swallow the Abort exception. But then you kill the abortion process (amusingly you do not if the abort happens outside a finally but your finally raises Abort, that one gets swallowed but the finally does the re-raise of the Abort your are treating).

Lacking this you should not yield or cooperatively block in finally functions for now (but is it a good idea anyways ?).

gadmm commented 2 years ago

I meant to come back to this with feedback but I am bit short on time to have a closer look at it.

One thought is that once you record the cancellation statefully, then you need not rely on testing the exception value. So you could explicitly allow the idiom of wrapping the exception into another one (as with Finally_raised). This is the choice taken by memprof-limits and eio. (Basically we all agree that the cancellable code is not arbitrary OCaml code and has to follow some discipline, but I believe that this discipline can be a bit more relaxed that you currently describe.)

Modulo this difference, I think that this is quite close in spirit to the memprof-limits/eio cancellation models. (Again, devil might be in details, and I did not have time to look further into the consequences of the differences in detail.)

The reason why I mentioned Rust's async Drop issue was not because their technical obstacles transpose to OCaml (I am not sure that they transpose), but because it seems to be strong empirical evidence that people sometimes need to cooperatively block during cleanup. (Is it a good idea though? For instance I tend to think that a file should not be flushed before closing if the cleanup follows from an exceptional condition, but that it is fine to flush before closing during normal cleanup, and that it is fine to react differently according to whether the cleanup is normal or due to a panic.)

gadmm commented 1 year ago

I think the idea of not testing the exception value but relying on the cancellation state, and the evidence from Rust "async Drop" that it is sometimes useful to block during clean-up, answers (albeit indirectly) to some of your points. (Further mechanism specific to cancellation with memprof-limits is that a "swallowed" interrupt will arise again soon after—which you have—and you can mask interrupts, e.g. during clean-up—which you do not have.) Are there still some points in the discussion where I could try to bring further clarifications? If not, should I close this issue?

dbuenzli commented 1 year ago

This is out of my head at the moment and I'm not sure what the conclusion is :-) But I'd like to go over this at some point before the release, the documentation does need clarification I think.

dbuenzli commented 2 weeks ago

I'm not sure what the conclusion is

The conclusion is that I'm no longer convinced by what I proposed then.

I don't think that cancellation should enforce the return values of fibers. This made the fiber termination code quite involved, made awaits painful (option values), forced to add a dubious concept to fibers (the finally handler, since it was possible for a fiber to be cancelled before being ever scheduled) and disallowed cancelled fibers to return partially computed results. I think that neither of these are very natural.

In c9fc5df3695f4b62026fa45a7a62d9ebc541622d I devised a new concurrency model centered on parallel asynchronous function calls with the idea of minimizing the difference between synchronous and asynchronous function calls (and thus fully embrace exceptions rather than try to contain them to fibers in the previous model). In this model I switched to a so called (by Swift) "cooperative cancellation" model. It remains structured cancellation as before (if you cancel a fiber so are its current and future descendents) but what you do about the cancellation is up to you. Since cancellation is now a fiber "flag" rather than a proper state, we get much closer to regular function calls and the finally handler can be dropped.

I'm not entirely happy with everything yet. Notably at the level of cancellation of blocking operation (for now I suggest to only raise if cancellation happens during blocking but not otherwise, I have the impression that it's not the best model to work with that leads to racy code). But I'm also not convinced by the the design of these blocking operations themselves yet, I'd like to get to something like CML's events. A new design round is needed, but not now :-)

However I'd say that the basic original issue is resolved in the new concurrency model. In particular:

One issue that async Rust programmers have met is having to call async operations during cleanup. Can cleanup operations be aborted? If so, should Fun.protect be somehow aware of the Abort exception and not wrap it into Finally_raised?

This issue no longer exists since what a fiber does when it is cancelled is entirely up to it.

Thanks for raising the issue @gadmm.