deepali2806 / unified_interface

Unified Interface to compose different schedulers in OCaml

Review and Commentary #1

Open polytypic opened 1 year ago

polytypic commented 1 year ago

I feel that it is important to get some kind of unified interface(s) for suspend/resume operations soon. Why?

So, I'm going to take a closer look at the work here, ask questions, and discuss/comment on some of the design choices and alternatives. I'll add these as separate comments to this issue.

polytypic commented 1 year ago

I'm looking at (part of) the take operation in MVar.ml:

let rec take mv =
  let old_contents = Atomic.get mv in
  match old_contents with
  | Empty q ->
      let p = ref true in
      (try
         perform
           (Sched.Suspend
              (fun r ->
                let newQueue = Fun_queue.push q r in
                let new_contents = Empty newQueue in
                p := Atomic.compare_and_set mv old_contents new_contents;
                !p))
       with Exit ->
         (* Printf.printf "\nInside Exit of take%!"; *)
         take mv)

Where is the Exit exception raised? I don't see any code in this repository that would do that and it is not raised by the Stdlib.

Also, as written, the use of a ref seems unnecessary.
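To make the suggestion concrete, here is a self-contained sketch of how the ref could be dropped: the function passed to Suspend can simply return the compare_and_set result. The Sched, Fun_queue and contents definitions below are minimal stand-ins I made up for illustration, not the repository's actual modules.

```ocaml
open Effect

type resume_result = Resume_success | Resume_failure
type 'a resumer = 'a -> resume_result

(* Assumed stand-ins for the repository's Sched and Fun_queue modules. *)
module Sched = struct
  type _ Effect.t += Suspend : ('a resumer -> bool) -> 'a Effect.t
end

module Fun_queue = struct
  type 'a t = 'a list

  let push q x = q @ [ x ]
end

type 'a contents = Empty of 'a resumer Fun_queue.t | Full of 'a

(* The function passed to Suspend returns the compare_and_set result
   directly, so no ref is needed. *)
let rec take mv =
  let old_contents = Atomic.get mv in
  match old_contents with
  | Empty q ->
      (try
         perform
           (Sched.Suspend
              (fun r ->
                Atomic.compare_and_set mv old_contents
                  (Empty (Fun_queue.push q r))))
       with Exit -> take mv)
  | Full v -> v (* simplified: a real take would also CAS the MVar to Empty *)
```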

polytypic commented 1 year ago

Continuing with (part of) the take operation in MVar.ml:

let rec take mv =
  let old_contents = Atomic.get mv in
  match old_contents with
  | Empty q ->
      let p = ref true in
      (try
         perform
           (Sched.Suspend
              (fun r ->
                let newQueue = Fun_queue.push q r in
                let new_contents = Empty newQueue in
                p := Atomic.compare_and_set mv old_contents new_contents;
                !p))
       with Exit ->
         (* Printf.printf "\nInside Exit of take%!"; *)
         take mv)

In another discussion here the idea was described as follows:

Also, in the type signature of the Suspend effect, we have a function that takes a resumer and returns a bool. It signifies thread safety while suspending the task. Thread safety, in this case, is achieved through a lock-free implementation, meaning we use atomic operations like compare-and-swap (CAS). When such a CAS operation is successful, we return true, indicating that the push to the suspended queue succeeded. In case of CAS failure, we have to retry the operation to get the most recent state of the queue or MVar.

I believe I understand the idea. It is an interesting idea to tailor the suspend/resume protocol to allow lock-free programming.
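To make the protocol concrete, here is a minimal, self-contained sketch of how I read it. The Suspend, resumer and Resume_success names mirror the discussion above, but the toy handler is my own assumption, not the actual scheduler in any of the libraries involved: if the enqueue function returns true, the fiber stays suspended for its resumer to continue later; if it returns false, the fiber is resumed with Exit so the operation can retry.

```ocaml
open Effect
open Effect.Deep

type resume_result = Resume_success | Resume_failure
type 'a resumer = 'a -> resume_result
type _ Effect.t += Suspend : ('a resumer -> bool) -> 'a Effect.t

(* Toy handler for the assumed protocol. *)
let run main =
  match_with main ()
    {
      retc = Fun.id;
      exnc = raise;
      effc =
        (fun (type a) (eff : a Effect.t) ->
          match eff with
          | Suspend enqueue ->
              Some
                (fun (k : (a, _) continuation) ->
                  let resumer v =
                    continue k v;
                    Resume_success
                  in
                  if enqueue resumer then
                    () (* suspended: the resumer will continue [k] later *)
                  else discontinue k Exit)
          | _ -> None);
    }

(* Demo: the fiber suspends, publishing its resumer; later some other
   party resumes it with a value. *)
let saved : int resumer option ref = ref None
let result = ref 0

let () =
  run (fun () ->
      let v = perform (Suspend (fun r -> saved := Some r; true)) in
      result := v);
  (match !saved with Some r -> ignore (r 42) | None -> assert false);
  assert (!result = 42)
```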

Unfortunately, this seems potentially problematic.

A problem is that now there is a relatively long period of time between let old_contents = Atomic.get mv and the Atomic.compare_and_set mv old_contents new_contents. Performing an effect to capture the continuation is (relatively speaking) expensive.

A further problem is that performing an effect is also essentially an indirect call to the scheduler that involves a search. One could hope that the effect is handled promptly and performed by the top-most handler, but that is not guaranteed. It could be that user code has added other effect handlers and the effect is not handled by the top-most handler. In fact, there could be arbitrarily many handlers on the stack. So, not only is the call indirect, it could also take a (relatively) long time to search for the handler.

This is a problem, because any additional time before the compare_and_set attempt increases the probability that another operation gets to update the MVar first and causes the compare_and_set to fail. Furthermore, every failure now involves an expensive call (effect) to the scheduler.

An alternative approach is to allocate some sort of location for the other side to fill with a result and quickly add that to the waiters queue. There are several examples of this. Here is an example from a draft in kcas. First:

(* Allocate a location for inter domain/fiber communication in order to resume later if needed. *)
let self = Atomic.make `Init in
(* Add the awaiter to kcas locations that support await *)
add_awaiter ~xt self cass;
if xt.cass == cass then
  (* This means there actually weren't any locations that support await. *)
  commit (Backoff.once backoff) mode scheduler_opt tx
else
  (* Next is a lock-free attempt to add `self` to awaitable locations. *)
  match determine_for_owner xt.casn xt.cass with
  | true ->
      (* Lock-free update was a success.  Here we check if we can actually avoid making a call to the scheduler. *)
      if Atomic.get self == `Init then
        scheduler (fun resume ->
            (* Failure of the CAS here does not imply retry. It means we already got resumed. *)
            if
              not (Atomic.compare_and_set self `Init (`Waiting resume))
            then resume ());

What happens above is that a location let self = Atomic.make `Init is allocated for the communication. It is then enqueued to the relevant places before involving the scheduler. Note that failure of the compare_and_set done after communicating with the scheduler does not imply a retry: it actually means that we were already resumed. Indeed, it is possible that another domain has already resumed us

match Atomic.exchange awaiter `Resumed with
| `Waiting resume -> resume () (* The awaiter got here first so we resume it. *)
| _ -> () (* We got here first.  The awaiter will resume itself. *)

before the scheduler call. The rendezvous PR in lockfree uses a similar approach. Eio also uses similar approaches (see e.g. sync.ml).
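Putting both sides together, here is a self-contained sketch of the location-based protocol. The `Init / `Waiting / `Resumed states follow the kcas excerpt above; everything else (enqueue, scheduler, the demo) is an assumed stand-in for illustration.

```ocaml
(* Suspending side: publish [self] quickly, then involve the scheduler.
   A CAS failure after the scheduler call means we were already resumed. *)
let suspend ~enqueue ~scheduler =
  let self = Atomic.make `Init in
  enqueue self;
  scheduler (fun resume ->
      if not (Atomic.compare_and_set self `Init (`Waiting resume)) then
        resume ())

(* Resuming side: whoever loses the exchange race performs the resume. *)
let resume_awaiter awaiter =
  match Atomic.exchange awaiter `Resumed with
  | `Waiting resume -> resume () (* The awaiter got here first; resume it. *)
  | _ -> () (* We got here first; the awaiter will resume itself. *)

(* Demo: the other party resumes before the scheduler installs `Waiting,
   so the suspending side's CAS fails and it resumes itself immediately. *)
let resumed = ref false

let () =
  let queue = ref [] in
  let enqueue self = queue := self :: !queue in
  let scheduler k =
    List.iter resume_awaiter !queue; (* other party resumes us early *)
    k (fun () -> resumed := true) (* CAS fails -> resume immediately *)
  in
  suspend ~enqueue ~scheduler;
  assert !resumed
```

Note that the user-supplied code (enqueue) runs before the scheduler is involved, so the critical window contains no scheduler call at all.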

My intuition would be that the approaches based on quickly enqueuing a location for communication will perform and scale better than the lock-free approach taken with the Suspend effect. That is because the time window for interference in the location-based approaches is likely to be shorter, and they can sometimes avoid the scheduler call altogether: an optimization for contended scenarios is to spin for some time on the location (waiting for the other party) before calling the scheduler to actually suspend. Roughly speaking, the location-based approaches need to make an allocation or two (very quick in OCaml) and perform a compare_and_set (also very quick) to enqueue that location. The Suspend effect based approach needs to perform all of those as well, but also needs to perform the indirect scheduler call, which is likely much more expensive and makes it more likely for the compare_and_set to fail. (You can, of course, simply ignore the possibility of returning false to indicate compare_and_set failure, but then the feature would be unused.)
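The spin optimization mentioned above can be sketched as follows: poll the location a bounded number of times before paying for the scheduler call. The names and the bound here are my assumptions, not code from any of the cited libraries.

```ocaml
let rec spin_then_suspend self scheduler spins =
  if Atomic.get self != `Init then
    () (* already resumed while spinning: no scheduler call needed *)
  else if spins > 0 then begin
    Domain.cpu_relax ();
    spin_then_suspend self scheduler (spins - 1)
  end
  else
    (* Spins exhausted: fall back to the scheduler as before. *)
    scheduler (fun resume ->
        if not (Atomic.compare_and_set self `Init (`Waiting resume)) then
          resume ())

let scheduler_called = ref false

let () =
  (* Demo 1: an already-resumed location never reaches the scheduler. *)
  let self = Atomic.make `Resumed in
  spin_then_suspend self (fun _ -> scheduler_called := true) 10;
  assert (not !scheduler_called);
  (* Demo 2: spins run out on an `Init location, so the scheduler is
     finally involved and installs the `Waiting resumer. *)
  let self2 = Atomic.make `Init in
  spin_then_suspend self2
    (fun k -> scheduler_called := true; k (fun () -> ()))
    3;
  assert !scheduler_called
```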

deepali2806 commented 1 year ago

Where is the Exit exception raised? I don't see any code in this repository that would do that and it is not raised by the Stdlib.

This Exit exception should be raised by the library that is actually handling the Suspend effect, in case of CAS failure. One such handler, from the Eio library, is shown here.

Also, as written, the use of a ref seems unnecessary.

Agreed. I was trying something else earlier and did not change it back afterwards. I will change it now.

kayceesrk commented 1 year ago

A further problem is that performing an effect is also essentially an indirect call to the scheduler that involves search. One could hope that the effect is handled promptly and that it would be performed by the top-most handler, but that is not given. It could be that user code has added other effect handlers and the effect is not handled by the top-most handler. In fact, there could be arbitrary many handlers on the stack. So, not only is the call indirect, it could be that it takes a (relatively) long time to search for the handler.

Deep handler stacks are not what we see in practice. If you take all of the somewhat realistic programs built using effect handlers (eio, domainslib, etc.), the stack is very shallow; in fact, there is only one handler in the program. I conjecture that real programs in OCaml will have shallow handler stacks. Effect handler papers tend to include micro-benchmarks that use effect handlers to simulate ambient effects that OCaml has natively, such as a handler for each reference allocated in the program. Such a program would be slow, but that's not a program an OCaml programmer will write.

polytypic commented 1 year ago

I'm now looking at the other end of the cancellation support as done in MVar:

match Fun_queue.pop q with
| None -> ()
| Some (x, newQueue) ->
    let resume = x in
    let new_contents = Empty newQueue in
    let ret = Atomic.compare_and_set mv old_contents new_contents in
    if ret then begin
      (* resume (Ok v); () *)
      (* This was added for cancellation purposes ::==> *)
      let ret1 = resume v in
      if ret1 = Resume_success then ()
      else
        (* Retrying *)
        put mv v
    end

So, what is happening here is that the resumer of a taker is taken from the queue and a resume is then attempted. The resume may fail, which is indicated by resume returning a result other than Resume_success.

I now see a subtle problem in the MVar implementation. Consider a scenario where a fiber tries to take from an empty MVar, suspends itself onto the MVar and, after having suspended itself, is cancelled. In the current implementation in this repository, the resumer is then left in the MVar. The resumer is only removed once put is called (sufficiently many times to reach the cancelled resumer). But that might not happen any time soon. So, the current implementation basically allows an unbounded leak: in the absence of calls to put, an unbounded number of takers could be suspended on an MVar and cancelled. So, this is basically a space leak.

polytypic commented 1 year ago

Such a program would be slow, but that's not a program an OCaml programmer will write.

I think it is unrealistic to assume that people wouldn't use effects at all. So, I would expect that real programs may well have relatively shallow, but not single-level, handler stacks. I would also expect the handler providing Suspend to typically be the outermost handler. Any handlers added by whatever additional libraries are being used would work to deteriorate the performance of the approach here. Also, any computation done by the Suspend handler increases the probability that some other operation invalidates the compare_and_set.

What I'm saying here, with respect to the way the Suspend effect is used, is that it is very likely to scale and perform poorly, and it is vulnerable to code outside the use of the effect making it perform even worse (potentially causing very expensive retry loops). Having the effect performed before the enqueuing Atomic.compare_and_set is an interesting, but, in my estimate, poorly performing idea. It is better to enqueue as quickly as possible (without inserting user code in between), perform the effect only if necessary, and keep that effect outside of the critical period between the Atomic.get and the Atomic.compare_and_set that enqueues the location (cell in Eio) for resuming. This way, scaling and performance are not directly dependent on the handler stack.

deepali2806 commented 1 year ago

So, this is basically a space leak.

Yes, we are aware of the space leaks in our implementation. By default, the interface provides lazy cleanup: we do not pop from the suspended queue unless there is a corresponding call for it. The advantage of lazy cleanup is that we don't have to search the entire MVar queue to remove cancelled tasks, so cancellation has a constant cost. If a library wants to free the resources held by a suspended task when it is cancelled, it can do so by providing a cancel function similar to Eio's set_cancel_fn.
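Something like the following sketch (my stand-in names, not the repository's actual code) illustrates the lazy cleanup idea: cancellation just flips the slot in O(1), and the corresponding put discards dead slots as it pops them.

```ocaml
type 'a slot_state = Waiting of ('a -> unit) | Cancelled

(* O(1) cancellation: no search through the queue. *)
let cancel slot = Atomic.set slot Cancelled

(* put-side helper: pop until a live waiter is found; cancelled slots
   are reclaimed here rather than searched for eagerly. *)
let rec pop_live = function
  | [] -> None
  | slot :: rest -> (
      match Atomic.get slot with
      | Cancelled -> pop_live rest (* lazily discard the cancelled slot *)
      | Waiting r -> Some (r, rest))

(* Demo: a cancelled waiter is skipped, the live one is resumed. *)
let hit = ref 0

let () =
  let a = Atomic.make (Waiting (fun v -> hit := v)) in
  let b = Atomic.make (Waiting (fun v -> hit := v + 1)) in
  cancel a;
  (match pop_live [ a; b ] with
   | Some (r, _) -> r 42
   | None -> assert false);
  assert (!hit = 43)
```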