Ada-Rapporteur-Group / User-Community-Input

Ada User Community Input Working Group - Github Mirror Prototype

Lock-free programming in Ada with "concurrent" objects #84

Open sttaft opened 5 months ago

sttaft commented 5 months ago

From Johann Blieberger [johann.blieberger@tuwien.ac.at](mailto:johann.blieberger@tuwien.ac.at):

We have done some work on how a memory model might be included in Ada. Part of this work has been published (mainly in the Ada-Europe conference), e.g. https://www.sciencedirect.com/science/article/pii/S1383762120300588?ref=cra_js_challenge&fr=RR-1

The main idea is to improve on how memory models are handled in C, C++, Java, ..., by introducing novel language features which take away most of the burdens from the programmer.

We have also started to build a precompiler to show the advantages of the approach.

We would like to add some new ideas and deepen the old ones. Here is a refinement of the Ada-Europe paper:

https://drive.google.com/file/d/1jBBCDWAk13zT8VKmT8Bb3cPlEpOMlEuN/view?usp=share_link

Below is the introduction from the proposal.


Memory models and memory model aware programming have gained considerable attention in recent years. To abstract from the diversity of hardware memory consistency models such as x86, ARM, MIPS, and RISC-V available in the market today, a programming language must provide a language-level memory model. Together with non-blocking synchronization constructs, memory models allow programmers to employ non-blocking synchronization in a hardware-agnostic way. A strict memory model has already been standardized for C++ ([ISO23, Wil12]). It was later adopted for C11, OpenCL 2.0, and X10 [ISO23, Wil12, BA08, MPA05, Zwi16]. However, the library-based C++ approach cannot take higher-level semantics into account.

To the author’s best knowledge, programming language support for memory model aware programming has not been proposed before [BB18a, YJM+20] and this paper. The state of the art is best presented in [Wil12]. Atomic variables and the C++ memory model provide the basis for memory model aware programming in C++. Atomic variables are read, written, and modified by calling operations defined in a library. The programmer selects the memory order via parameters. According to the C++ memory model [ISO23], memory order is specified via one of acquire, release, relaxed and a few more. The compiler then inserts suitable memory fences into the machine code.
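To illustrate the library-based style described above, here is a minimal C++ sketch (the variable and function names are ours, chosen for illustration): every access to an atomic variable is an operation call, the memory order is passed as a parameter, and the compiler inserts the fences that the chosen order requires.

```cpp
#include <atomic>

// Every access is a library call taking a memory-order argument;
// the compiler emits whatever fences the chosen order requires.
std::atomic<int> counter{0};

int bump_and_read() {
    counter.fetch_add(1, std::memory_order_relaxed);  // relaxed RMW
    return counter.load(std::memory_order_seq_cst);   // seq-cst read
}
```

Nothing here is specific to the proposal under discussion; it only shows the per-operation memory-order parameters that the C++ model makes the programmer's responsibility.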

One of the fundamental ideas of lock-free programming is that each thread is allowed to optimistically enter a critical section. Whenever a thread wants to update critical data, this is done via some sort of read-modify-write (RMW) operation. Such operations check if the critical data has been altered meanwhile. If not, the update is performed atomically inside the RMW operation. If the data has been modified by another thread, the “local” data is (again atomically) updated to the “new” value. The RMW operation returns which of these two paths has been taken. It is the task of the programmer to write code that is aware of the fact that RMW operations behave in this way. This results in various try-abort and retry loops. Such code is extremely hard to understand.

The approach taken in [BB18a] and in [YJM+20, Sec. 4] is different in that it hides most of the try-abort and retry loops inside the semantics of new language constructs, so-called concurrent objects.

It is the purpose of this paper to summarize [BB18a] and [YJM+20, Sec. 4] and to extend the approach given in these papers.

sttaft commented 5 months ago

It would help to have more direct comparisons between the proposed concurrent objects and solutions using System.Atomic_Operations, and solutions using the GNAT-specific "Lock_Free" aspect on protected types.

ARG-Editor commented 5 months ago

Ada of course pioneered Atomic and Volatile objects way back in Ada 95. So the idea that this sort of thing was recently "discovered" is more sad than anything. Tucker noted the addition of System.Atomic_Operations to make the use of Atomic objects portable in Ada 2022. He also noted the GNAT experiments with the Lock_Free aspect; we didn't standardize it mainly because the restrictions needed for it were not portably defined. But to me, it seems to be the way forward if a higher level lock free mechanism is needed.

blieberger commented 4 months ago

Answer to https://github.com/Ada-Rapporteur-Group/User-Community-Input/issues/84#issuecomment-1939572202 👍

Aspect/pragma Lock_Free cannot be used for lock-free programming because of the constraints given for a protected subprogram body. Lock-free programming heavily relies on retry loops, gotos, ...

The following very simple producer-consumer example can neither be implemented via standard Ada nor via System.Atomic_Operations.

...
Flag : Boolean := False;
Data : Integer := 0;
...
-- Producer task
Data := 42;
Flag := True;

-- Consumer task
R : Integer;
...
loop
   exit when Flag;
end loop;
R := Data;

Note that there is no data dependence between Flag and Data. So in the producer task, the compiler or the CPU may execute the assignment to Flag before the assignment to Data. Hence the consumer task may find Data = 0 and assign 0 to R. For similar reasons, in the consumer task the compiler or the CPU may execute the assignment to R before the loop statement.
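For comparison, in C++ this reordering is ruled out by publishing the payload through an acquire/release pair on the flag. A minimal sketch (variable names ours, not from the proposal):

```cpp
#include <atomic>
#include <thread>

int data = 0;                       // plain, non-atomic payload
std::atomic<bool> flag{false};

void producer() {
    data = 42;
    // Release store: the write to 'data' may not sink below it.
    flag.store(true, std::memory_order_release);
}

int consumer() {
    // Acquire load: the read of 'data' may not hoist above it.
    while (!flag.load(std::memory_order_acquire)) { }
    return data;  // guaranteed to observe 42
}

int run_demo() {
    std::thread t(producer);
    int r = consumer();
    t.join();
    return r;
}
```

Only the flag is atomic; the payload itself needs no special marking, which is the property the Ada discussion below revolves around.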

A correct version of this program can only be achieved if Flag : Boolean := False with Atomic; and Data : Integer := 0 with Atomic; are used, relying on the fact that both Data and Flag are volatile. This, however, does not work if the type of Data is more complex, such as a record or an array of records.

Note that System.Atomic_Operations do not improve the situation.

CPU ISAs provide memory fences (fences for short) to prohibit the CPU from reordering operations. C and C++ define a memory model that allows the programmer to state fences explicitly.

Setting such fences can be enforced by use of gcc's atomic intrinsics (which is done by our precompiler).

A solution to the producer-consumer example via our proposed language extensions is:

generic
   type Data is private;
package Generic_Release_Acquire is

   concurrent RA is
      procedure Write (D : Data);
      entry Get (D : out Data);
   private
      Ready : Boolean := False
         with Atomic,
              Memory_Order_Read  => Acquire,
              Memory_Order_Write => Release;
      Da : Data;
   end RA;

end Generic_Release_Acquire;

package body Generic_Release_Acquire is

concurrent body RA is

procedure Write (D: Data) is
begin
  Da := D;
  Ready := True;
end Write;

entry Get (D: out Data)
  when Ready is
begin
  D := Da;
end Get;

end RA;

end Generic_Release_Acquire;

More elaborate examples can be found in the paper above. Pages 17 to 22 show how a lock-free stack can be implemented via our proposed language extensions.

sttaft commented 4 months ago

Answer to #84 (comment) 👍

Aspect/pragma Lock_Free cannot be used for lock-free programming because of the constraints given for a protected subprogram body. Lock-free programming heavily relies on retry loops, gotos, ...

The following very simple producer-consumer example can neither be implemented via standard Ada nor via System.Atomic_Operations.

...
Flag : Boolean := False;
Data : Integer := 0;
... 
-- Producer task
Data := 42;
Flag := True;
-- Consumer task
R : Integer;
 ... 
loop 
    exit when Flag;
end loop;
R := Data;

Note that there is no data dependence between Flag and Data. So in the producer task, the compiler or the CPU may execute the assignment to Flag before the assignment to Data. Hence the consumer task may find Data = 0 and assign 0 to R. For similar reasons, in the consumer task the compiler or the CPU may execute the assignment to R before the loop statement.

A correct version of this program can only be achieved if Flag : Boolean := False with Atomic; and Data : Integer := 0 with Atomic; are used, relying on the fact that both Data and Flag are volatile. This, however, does not work if the type of Data is more complex, such as a record or an array of records.

Complex objects can be declared volatile, which would prevent the kind of reordering you mention, since volatile objects (which include all atomic objects) are subject to the requirements of C.6(16/3) and C.6(20/5):

(16/3) All tasks of the program (on all processors) that read or update volatile variables see the same order of updates to the variables. A use of an atomic variable or other mechanism may be necessary to avoid erroneous execution and to ensure that access to nonatomic volatile variables is sequential (see 9.10).

(20/5) The external effect of a program (see 1.1.3) is defined to include each read and update of a volatile or atomic object. The implementation shall not generate any memory reads or updates of atomic or volatile objects other than those specified by the program.

So reordering the operations on volatile variables would not be a legal optimization.

So I would think the following would work properly:

...
Flag : Boolean := False with Atomic;
Data : T := Initial_Value with Volatile;
... 
-- Producer task
Data := Some_Complex_Value;
Flag := True;

-- Consumer task
R : Integer;
... 
loop 
   exit when Flag;
end loop;
R := Data;

Clearly this involves an explicit loop, which your proposal would avoid. But it would still be helpful if you provided more such examples, so we can evaluate the amount of benefit being provided by the new feature. If there are solutions using existing Ada features, it is important to actually show them. Adding a new feature as complex as the one you propose needs some compelling benefits. It is also important to understand what sort of lock-free data structures, if any, cannot be implemented in a straightforward way using your proposal.

For example, it would be interesting to see how you would implement the double-ended queue that is part of the typical implementation of a work-stealing scheduler. There is an Ada implementation of such a lock-free double-ended queue using System.Atomic_Operations at https://github.com/parasail-lang/parasail/blob/main/lwt/lwt-generic_synchronized_deques.ads and https://github.com/parasail-lang/parasail/blob/main/lwt/lwt-generic_synchronized_deques.adb so it could provide a good comparison.

Note that System.Atomic_Operations do not improve the situation.

I presume the additional operations might make some solutions more robust, since you wouldn't have to rely merely on atomic variable read and write for lock-free synchronization -- you could also use compare-and-exchange and fetch-and-add. See for example the deque implementation mentioned above.

CPU ISAs provide memory fences (fences for short) to prohibit the CPU from reordering operations. C and C++ define a memory model that allows the programmer to state fences explicitly.

Using fences explicitly definitely adds complexity, so again, it is important to understand the benefits. The model provided by Ada with atomic and volatile variables is usually enough, so it would be interesting to understand the important situations where atomic and volatile are insufficient.
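For reference, the explicit-fence style under discussion looks like this in C++ (a sketch with our own illustrative names, using standalone std::atomic_thread_fence calls instead of per-operation orderings):

```cpp
#include <atomic>
#include <thread>

int payload = 0;
std::atomic<bool> published{false};

void fenced_producer() {
    payload = 7;
    // Explicit release fence: earlier writes may not move past it.
    std::atomic_thread_fence(std::memory_order_release);
    published.store(true, std::memory_order_relaxed);
}

int fenced_consumer() {
    while (!published.load(std::memory_order_relaxed)) { }
    // Explicit acquire fence: later reads may not move before it.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}

int run_fenced() {
    std::thread t(fenced_producer);
    int r = fenced_consumer();
    t.join();
    return r;
}
```

The fences are decoupled from any particular atomic operation, which is precisely the added complexity (and flexibility) being weighed here.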

Setting such fences can be enforced by use of gcc's atomic intrinsics (which is done by our precompiler).

A solution to the producer-consumer example via our proposed language extensions is:

generic
     type Data is private;
package Generic_Release_Acquire is

   concurrent RA is 
      procedure Write (D : Data);
         entry Get (D: out Data);
      private
         Ready: Boolean := False
             with Atomic,
                     Memory_Order_Read => Acquire,
                     Memory_Order_Write => Release;
        Da: Data;
   end RA;
end Generic_Release_Acquire;

package body Generic_Release_Acquire is
   concurrent body RA is
      procedure Write (D: Data) is
      begin
         Da := D;
         Ready := True;
      end Write;

      entry Get (D: out Data)
         when Ready is
      begin
         D := Da;
      end Get;
   end RA;

end Generic_Release_Acquire;

It is not obvious that this is easier to understand than the version above using an explicit loop, particularly given the Memory_Order_* aspect specifications.

More elaborate examples can be found in the paper above. Pages 17 to 22 show how a lock-free stack can be implemented via our proposed language extensions.

blieberger commented 4 months ago

Note that in the solution with concurrent objects, Data is not volatile and Flag is atomic but does not need to be volatile. Since volatile is expensive with respect to performance, the solution with COs will perform better.

Note also that it is always about performance. If you do not have performance problems, you may use a lock-based approach!

Even if one does not specify memory order, the default memory order is sequentially consistent and the CO solution will still perform better, because Data is not volatile. (Flag is then "volatile" because of the sequentially consistent memory order.)

ARG-Editor commented 4 months ago

Note that in the solution with concurrent objects, Data is not volatile and Flag is atomic but does not need to be volatile. Since volatile is expensive with respect to performance, the solution with COs will perform better.

This is very confused, at least in terms of Ada.

First of all, Volatile in Ada simply means that the order of operations is guaranteed within and between threads. Thus, if you have two operations A and B on an object V, another thread might see no operations done, just A done, or both A and B done, but will never see just B done. The effect of this is to prevent most optimizations on a Volatile object, but it otherwise has no effect on performance -- the code generated is the same as if it had not been volatile with optimizations off.

Second, there are no guarantees at all about objects that are not marked Volatile. Therefore, a compiler could optimize away all of the writes on an object if they appear dead in the current thread. Moreover, most accesses to non-Volatile objects from another thread are erroneous (see 9.10). To make them work, you essentially have to use a lock (like Flag in your example).

Finally, all Atomic objects are considered Volatile, since they have the same guarantees. Moreover, the value read from an Atomic object should be the same in all threads. It also guarantees the state of volatile objects. Both of these requirements mean that fences are often necessary to properly implement Atomic objects, and should be generated by the compiler as needed. That of course makes Atomic objects the ones that are "expensive" for performance. Of course, Ada is defined so the actions are predictable, rather than trying to squeeze out the last ounce of performance.

Synchronization either requires protected objects, or atomic objects that "protect" some volatile objects. To do something else would require completely going outside of the Ada model of memory, a massive job that would have to have a massive payoff.

                Randy.
ARG-Editor commented 4 months ago

Note also that it is always about performance. If you do not have performance problems, you may use a lock-based approach!

The following is my personal opinion (I'm not speaking in any official capacity here).

It's fun to try to squeeze the last cycle of performance out of something. But doing that is rarely of any practical value, and that is especially true for standardization. The problem is that such solutions are very target specific. And it is often the case that the next generation of processors changes the solutions needed. So to get the last cycle of performance requires changes every year or two.

But standardization occurs on long timeframes. It takes years for a proposal to go through vetting, prototyping, polishing, and then waiting for publication. By the time that such a proposal is actually standardized, it's highly likely that what's needed has changed.

I'm also very dubious about the performance of synchronization ever being the reason that something performs too slowly. Given the costs of creating, scheduling, and destroying threads, synchronization can only matter if there is far too much of it. Algorithms that are worth parallelizing necessarily have very little synchronization. Synchronization always implies wasted time (waiting for data even via a lock-free algorithm is still waiting!), so you want to do as little of it as possible.

Additionally, my understanding is that lock-free (and wait-free) algorithms are not so much about performance as about guarantees for progress. Lock-free algorithms necessarily include a busy-wait loop, and that can waste a lot of time that a regular mutex would have given to some other thread to do useful work. Lock-free algorithms probably have different performance, but not necessarily better performance -- it depends on the details on how they are used.

The point of standardizing something is so that it works portably in many environments. But performance necessarily is going to differ wildly in different environments. There is little advantage to standardizing behavior that isn't likely to work effectively in another environment, especially as it will probably be counter-productive by the time it is standardized.

People writing parallel code are best served writing it at as high a level as possible. They should be using parallel loops and blocks, and possibly a few protected objects. Compilers will generate those as efficiently as possible for a given target. If that isn't fast enough, most likely the process that they want to execute in parallel is inappropriate for such execution. A different algorithm is needed. Tinkering with the infrastructure can only add a few percent to the performance, which is unlikely to make enough difference. That's the standard advice for any performance problem, and it applies many times over here.

Ordinary people should never try to write synchronization with atomic objects -- it is way too easy to get subtly wrong -- and synchronization errors mean silently wrong answers. This sort of thing needs to be restricted to a few gurus (or maybe AI?) writing reusable libraries for the rest of us. After all, you can always get way better performance if you don't care whether the answer is right. :-)

I think you demonstrated this when you gave a number of incorrect examples of using existing Ada features for synchronization. This is extremely tricky code - one really needs to be an expert in the Ada memory model before even attempting it. What really would be valuable is tools to check if such code is correctly written -- before worrying about extending it further.

                          Randy.
blieberger commented 4 months ago

This is very confused, at least in terms of Ada.

First of all, Volatile in Ada simply means that the order of operations is guaranteed within and between threads. Thus, if you have two operations A and B on an object V, another thread might see no operations done, just A done, or both A and B done, but will never see just B done. The effect of this is to prevent most optimizations on a Volatile object, but it otherwise has no effect on performance -- the code generated is the same as if it had not been volatile with optimizations off.

This is easy to say. However, the CPU has to do a lot of things to guarantee Volatile, in particular for a multi-core with multi-level caches (cf. e.g. https://en.wikipedia.org/wiki/Cache_coherence and https://en.wikipedia.org/wiki/Consistency_model). This makes Volatile expensive.

Second, there are no guarantees at all about objects that are not marked Volatile. Therefore, a compiler could optimize away all of the writes on an object if they appear dead in the current thread. Moreover, most accesses to non-Volatile objects from another thread are erroneous (see 9.10). To make them work, you essentially have to use a lock (like Flag in your example).

I agree.

Finally, all Atomic objects are considered Volatile, since they have the same guarantees. Moreover, the value read from an Atomic object should be the same in all threads. It also guarantees the state of volatile objects. Both of these requirements mean that fences are often necessary to properly implement Atomic objects, and should be generated by the compiler as needed. That of course makes Atomic objects the ones that are "expensive" for performance. Of course, Ada is defined so the actions are predictable, rather than trying to squeeze out the last ounce of performance.

In my opinion it is the wrong approach to consider Atomic objects Volatile. Defining a memory model for Ada similar to those of C and C++, which were standardized in 2011 and have not changed fundamentally since then, would be a way to go. You can always do better, of course.

Synchronization either requires protected objects, or atomic objects that "protect" some volatile objects. To do something else would require completely going outside of the Ada model of memory, a massive job that would have to have a massive payoff.

Yes, indeed. But it would open up new application domains for Ada.

The following is my personal opinion (I'm not speaking in any official capacity here).

It's fun to try to squeeze the last cycle of performance out of something. But doing that is rarely of any practical value, and that is especially true for standardization. The problem is that such solutions are very target specific. And it is often the case that the next generation of processors changes the solutions needed. So to get the last cycle of performance requires changes every year or two.

But standardization occurs on long timeframes. It takes years for a proposal to go through vetting, prototyping, polishing, and then waiting for publication. By the time that such a proposal is actually standardized, it's highly likely that what's needed has changed.

The C/C++ memory model has been proven to work and has been in use for many years (cf. https://www.manning.com/books/c-plus-plus-concurrency-in-action-second-edition).

I'm also very dubious about the performance of synchronization ever being the reason that something performs too slowly. Given the costs of creating, scheduling, and destroying threads, synchronization can only matter if there is far too much of it. Algorithms that are worth parallelizing necessarily have very little synchronization. Synchronization always implies wasted time (waiting for data even via a lock-free algorithm is still waiting!), so you want to do as little of it as possible.

There are papers showing that lock-free programs are 10 to 100 times faster than lock-based programs.

Additionally, my understanding is that lock-free (and wait-free) algorithms are not so much about performance as about guarantees for progress. Lock-free algorithms necessarily include a busy-wait loop, and that can waste a lot of time that a regular mutex would have given to some other thread to do useful work. Lock-free algorithms probably have different performance, but not necessarily better performance -- it depends on the details on how they are used.

Guarantees for progress and prohibiting deadlocks are indeed reasons why lock-free algorithms are preferred. Writing lock-free or wait-free programs is indeed a difficult task and programming languages should support programmers as much as they can.

The point of standardizing something is so that it works portably in many environments. But performance necessarily is going to differ wildly in different environments. There is little advantage to standardizing behavior that isn't likely to work effectively in another environment, especially as it will probably be counter-productive by the time it is standardized.

People writing parallel code are best served writing it at as high a level as possible.

This is the reason why I think that programming languages should provide high-level features to support programmers in writing lock-free and wait-free algorithms.

They should be using parallel loops and blocks, and possibly a few protected objects. Compilers will generate those as efficiently as possible for a given target. If that isn't fast enough, most likely the process that they want to execute in parallel is inappropriate for such execution. A different algorithm is needed. Tinkering with the infrastructure can only add a few percent to the performance, which is unlikely to make enough difference. That's the standard advice for any performance problem, and it applies many times over here.

Ordinary people should never try to write synchronization with atomic objects -- it is way too easy to get subtly wrong -- and synchronization errors mean silently wrong answers. This sort of thing needs to be restricted to a few gurus (or maybe AI?) writing reusable libraries for the rest of us. After all, you can always get way better performance if you don't care whether the answer is right. :-)

And which programming languages shall these gurus use for writing the libraries?

I think you demonstrated this when you gave a number of incorrect examples of using existing Ada features for synchronization. This is extremely tricky code - one really needs to be an expert in the Ada memory model before even attempting it. What really would be valuable is tools to check if such code is correctly written -- before worrying about extending it further.

--Johann

blieberger commented 4 months ago

For example, it would be interesting to see how you would implement the double-ended queue that is part of the typical implementation of a work-stealing scheduler. There is an Ada implementation of such a lock-free double-ended queue using System.Atomic_Operations at https://github.com/parasail-lang/parasail/blob/main/lwt/lwt-generic_synchronized_deques.ads and https://github.com/parasail-lang/parasail/blob/main/lwt/lwt-generic_synchronized_deques.adb so it could provide a good comparison.

I concentrated on operations Pop and Steal which will be implemented by entries of concurrent objects in our approach. In addition, declarations of various atomic types and variables have been adopted (in particular, Atomic_Index_Type, Deque.Bottom, and Deque.Data). Note that in our approach "Atomic" does not imply "Volatile" and there is a strict distinction between "Atomic" and "Read_Modify_Write" variables. Only the latter give rise to read-modify-write operations when they are assigned a new value inside of an entry.

I have used defaults for memory order, i.e. Sequentially_Consistent. Changing this to Release/Acquire will give better performance, e.g. on ARM.

The original paper by Chase and Lev and the implementation cited above exclude retry loops and leave them to the user of the deque. (The claim of Chase and Lev that their deque implementation is wait-free is, for this reason, wrong.) Our approach includes the retry loops inside the concurrent objects and offers flexibility at the user side (cf. the end of this post).
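The contrast can be made concrete: in the library-based style, the retry loop around the compare-and-exchange on Top is written by hand, whereas the concurrent-object entry would hide it. A C++ sketch of the hand-written Steal loop (all names -- top_idx, bottom_idx, buffer, steal -- are ours, not from the proposal or the cited implementation):

```cpp
#include <atomic>
#include <array>

std::atomic<unsigned> top_idx{0};     // next to steal
std::atomic<unsigned> bottom_idx{0};  // next to fill
std::array<int, 256> buffer{};

constexpr int empty_marker = -1;

// Steal retries until the CAS on top_idx succeeds or the deque is
// empty -- exactly the loop a concurrent-object entry would absorb.
int steal() {
    for (;;) {
        unsigned t = top_idx.load(std::memory_order_acquire);
        unsigned b = bottom_idx.load(std::memory_order_acquire);
        if (t >= b)
            return empty_marker;              // deque empty: give up
        int elem = buffer[t % buffer.size()];
        if (top_idx.compare_exchange_weak(t, t + 1))
            return elem;                      // CAS won: element is ours
        // CAS lost against a concurrent Steal or Pop: retry.
    }
}

int demo() {
    buffer[0] = 7;
    bottom_idx.store(1);                      // one element available
    int first  = steal();                     // takes the single element
    int second = steal();                     // deque now empty
    return (first == 7 && second == empty_marker) ? 1 : 0;
}
```

This sketch omits the Pop/Steal race on a single remaining element that the full Chase-Lev algorithm (and the Ada code below) must handle; it only shows where the retry loop lives.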


generic
   type Element_Type is private;
   Empty : Element_Type;
package LWT.Generic_Synchronized_Deques is
   --  Provide a double-ended queue that is
   --  synchronized to allow use in a work-stealing
   --  scheduler.

type Deque is limited private; -- A double-ended queue with a LIFO end supporting Push/Pop -- and a FIFO end supporting only Steal.

concurrent type Sync_Deque_With_Steal is

  procedure Push (Element : Element_Type)
    with Pre => Element /= Empty;
  --  Add the Element to the "LIFO" end of the Deque.

  entry Pop (Element : out Element_Type);
  --  Remove an Element from the "LIFO" end of the Deque.
  --  Element will come back as Empty if the Deque is empty.

  entry Steal
    (Element : out Element_Type);
  --  Remove an Element from the "FIFO" end of the Deque.
  --  Element will come back as Empty if the Deque is empty.
  --  If the attempt to steal fails due to a concurrent Steal or Pop,
  --  Steal will retry until it succeeds or Deque is empty.

  function Is_Empty return Boolean;
  --  Returns true if the Deque is empty at the moment of the call of the
  --  function.  This can become False at any time if there is a thread still
  --  executing that might Push on the deque.

  function Length return Natural;
  --  Returns count of number of items in Deque, at time of the call.
  --  Can change immediately after the call.

  procedure Reset
    with Pre => Is_Empty (Q), Post => Is_Empty (Q); -- is this still possible?
  --  Reset Deque to minimal capacity, reclaiming any additional storage
  --  devoted to deque from growth due to past calls on Push.
  --  This should not be called while there are still threads running
  --  that might do a Push.

  type Element_Vector is array (Positive range <>) of Element_Type;
  function All_Elements return Element_Vector;
  --  Return a vector of all of the elements, in order from
  --  oldest to newest.
  --  This should not be called while items are being added or removed!

private

  Q : Deque;

end Sync_Deque_With_Steal;

private

  Initial_Size : constant := 2**8;
  --  Initial size of double-ended queue for a work-stealing server.

  type Index_Type is mod 2**32;
  --  This index type is allowed to wrap around
  --  We get a Actual_Index_Type value by taking the Index_Type
  --  modulo the size of the array.

  type Atomic_Index_Type is new Index_Type with Read_Modify_Write;
  --  Index type used for compare-and-exchange.
  --  Default memory order is Sequentially_Consistent

  type Actual_Index_Type is mod 2**32;
  --  This is the type that actually indexes into the Element_Array

  function To_Actual_Index
    (Index : Index_Type; Modulo : Actual_Index_Type)
    return Actual_Index_Type is (Actual_Index_Type (Index) mod Modulo);
  --  This function translates the "abstract" index into an "actual" index.

  type Element_Array is array (Actual_Index_Type range <>) of Element_Type;

  type Element_Data (Modulo_Minus_One : Actual_Index_Type) is record
     --  TBD: Recent_Top : Index_Type := 0;  --  no need to be atomic
     Elements : Element_Array (0 .. Modulo_Minus_One);
  end record;

  type Element_Data_Ptr is access Element_Data;

  --  Get an object that can be updated using compare-and-swap.

  type Deque is limited record
     Top    : aliased Atomic_Index_Type := 0;  --  next to steal
        --  Is aliased really needed ?
     Bottom : Index_Type := 0 with Atomic;     --  next to fill
        --  Memory order of Bottom is Sequentially_Consistent
        --  TBD: Move into Data to minimize cache conflicts?
     Data   : not null Element_Data_Ptr :=
       new Element_Data (Modulo_Minus_One => Initial_Size - 1);
     --  no need to be atomic since Atomic_Index_Type and Bottom
     --  being of memory order Sequentially_Consistent imply
     --  that fences are set accordingly.
     --
     --  Normally Top is less than or equal to Bottom (yes, not intuitive!),
     --  with Bottom - Top being the number of items in the deque.
     --  Pushing means +1 to Bottom, Popping means -1 to Bottom.
     --  Stealing means +1 to Top.
     --  During a Pop, Bottom can temporarily be Top - 1, so need to
     --  be aware of that situation in Steal.
  end record;

end LWT.Generic_Synchronized_Deques;


package body LWT.Generic_Synchronized_Deques is

concurrent body Sync_Deque_With_Steal is

  --  Push does not really interfere with Pop or Steal,
  --  its implementation is not changed
  --  and for this reason not shown here.

  entry Pop
     (Element : out Element_Type)
  until Q.Top = Q.Top'OLD is
  --  Remove an Element from the "LIFO" end of the Deque.
  --  Element will come back as Empty if the Deque is empty.
     Data : constant Element_Data_Ptr := Q.Data;
     Mod_Minus_One : constant Actual_Index_Type := Data.Modulo_Minus_One;
     Bot : constant Index_Type := Q.Bottom;     --  TBD: or Data.Bottom?
  begin
     --  Decrement Bottom now to prevent two "Steal"s causing trouble.
     --  NOTE: Temporarily it is possible that Bottom = Top - 1.
     Q.Bottom := Bot - 1;

     declare
        --  Now get Top, having decremented Bottom.
        Top    : aliased Atomic_Index_Type := Q.Top;
           --  Is aliased really needed?
        Length : constant Index_Type := Bot - Index_Type (Top);
     begin
        Element := Empty;
        if Length = 0 then
           --  Queue was already empty
           --  Restore Bottom
           Q.Bottom := Bot;
        else
           --  Queue not empty when we started, so fetch what was
           --  the bottom, but check whether it has been stolen
           --  in the mean time.
           Element :=
             Data.Elements (To_Actual_Index (Bot - 1, Mod_Minus_One + 1));
           if Length = 1 then
              --  We are competing with stealing,
              --  so we try to steal from ourself and then restore Bottom.
              Q.Top := Top + 1;
              --  Read-modify-write operation; if it fails, entry is retried

              --  Restore Bottom in any case, since we bumped Top
              --  if the RMW operation succeeded.
              Q.Bottom := Bot;
           --  else Length > 1, Element is correct, and Bottom already decr'd
           end if;
        end if;
     end;
  end Pop;

  entry Steal
     (Element : out Element_Type)
  until Q.Top = Q.Top'OLD is
  --  Remove an Element from the "FIFO" end of the Deque.
  --  Element will come back as Empty if the Deque is empty.
   --  If the attempt to steal fails due to a concurrent Steal or Pop,
   --  Steal will retry until it succeeds or the Deque is empty.
     Data   : constant Element_Data_Ptr := Q.Data;
     Bot    : constant Index_Type := Q.Bottom;    --  TBD: or Data.Bottom?
     Top    : aliased Atomic_Index_Type := Q.Top;
        --  Is aliased really needed?
     Length : constant Index_Type := Bot - Index_Type (Top);
     Mod_Minus_One : constant Actual_Index_Type := Data.Modulo_Minus_One;
  begin
     Element := Empty;
     if Length > 0 then
        --  There is at least one element in the deque.
        --  Try to steal it.
        Element := Data.Elements
                     (To_Actual_Index (Index_Type (Top), Mod_Minus_One + 1));

        Q.Top := Top + 1;
        --  Read-modify-write operation; if it fails, entry is retried
     end if;
  end Steal;

  --  The remaining operations are not meant to be used
  --  while several tasks are active, so their implementation
  --  is unchanged and not shown here.

end Sync_Deque_With_Steal;

end LWT.Generic_Synchronized_Deques;
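
For comparison with the library-based C++ approach the proposal criticizes, here is a sketch of the same Pop/Steal logic written with std::atomic and explicit compare-and-swap. This is my illustration in the style of the classic Chase-Lev work-stealing deque, not code from the proposal; all atomic operations use the default memory_order_seq_cst (a tuned version would relax some of them), and the ring buffer does not grow.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>
#include <vector>

struct WorkStealingDeque {
    std::atomic<long> top{0};     // next slot to steal (FIFO end)
    std::atomic<long> bottom{0};  // next slot to fill (LIFO end)
    std::vector<int> ring;        // int stands in for Element_Type

    explicit WorkStealingDeque(std::size_t n) : ring(n) {}

    int& slot(long i) { return ring[static_cast<std::size_t>(i) % ring.size()]; }

    void push(int v) {                       // owner only
        long b = bottom.load();
        slot(b) = v;
        bottom.store(b + 1);
    }

    std::optional<int> pop() {               // owner only, mirrors entry Pop
        long b = bottom.load() - 1;
        bottom.store(b);                     // decrement first; Bottom may
        long t = top.load();                 //   briefly equal Top - 1
        if (b < t) {                         // deque was already empty
            bottom.store(t);                 // restore Bottom
            return std::nullopt;
        }
        int v = slot(b);
        if (b > t) return v;                 // more than one element: no race
        // Last element: compete with thieves via compare-and-swap on Top.
        long expected = t;
        bool won = top.compare_exchange_strong(expected, t + 1);
        bottom.store(b + 1);                 // restore Bottom in any case
        return won ? std::optional<int>(v) : std::nullopt;
    }

    std::optional<int> steal() {             // any thread, mirrors entry Steal
        long t = top.load();
        long b = bottom.load();
        if (t >= b) return std::nullopt;     // empty
        int v = slot(t);
        if (!top.compare_exchange_strong(t, t + 1))
            return std::nullopt;             // lost a race; caller may retry
        return v;
    }
};
```

The point of the comparison: here the programmer must spell out the read-modify-write, the retry policy, and (in a tuned version) the memory orders by hand, whereas the proposal's entry barrier `until Q.Top = Q.Top'OLD` expresses the retry condition declaratively.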


The caller of Steal may use

(1) Steal (...);
    --  loops until deque is empty or an element is stolen

(2) select
       Steal (...);
    else
       ...;
    end select;
    --  the else branch is executed if Steal detects a race, i.e.,
    --  after one single failed execution of Steal (no retry)

(3) for I in 1 .. 42 loop
       select
          Steal (...);
          exit;
       else
          null;
       end select;
    end loop;
    --  up to 42 tries of Steal

(4) select
       delay 5.0;
    then abort
       Steal (...);
    end select;
    --  aborts after 5 seconds, if Steal did not succeed
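
These select-based idioms have no direct analogue in the library-based C++ style; a caller there must write the retry policy by hand. As a rough comparison, idiom (3) would become a bounded retry loop like the following sketch (my illustration, hypothetical names):

```cpp
#include <optional>

// Bounded retry: call attempt() up to max_tries times, mirroring
// idiom (3) above.  attempt() returns an empty optional when it
// detects a race and gives up without retrying internally.
template <typename F>
auto retry(F attempt, int max_tries) -> decltype(attempt()) {
    for (int i = 0; i < max_tries; ++i)
        if (auto r = attempt())
            return r;
    return {};  // all tries failed
}
```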

sttaft commented 4 months ago

Thanks for taking the time to implement the Deque example. We will review it to get a better understanding of your proposal.

ARG-Editor commented 6 days ago

Sorry about not getting back to this sooner, I've been very busy.

This is very confused, at least in terms of Ada.

First of all, Volatile in Ada simply means that the order of operations is guaranteed within and between threads. Thus, if you have two operations A and B on an object V, another thread might see no operations done, just A done, or both A and B done, but it will never see just B done. The effect of this is to prevent most optimizations on a Volatile object, but it otherwise has no effect on performance -- the code generated is the same as if the object had not been volatile and optimizations were off.

This is easy to say. However, the CPU has to do a lot of work to guarantee Volatile, in particular on a multi-core system with multi-level caches (cf. e.g. https://en.wikipedia.org/wiki/Cache_coherence and https://en.wikipedia.org/wiki/Consistency_model). This makes Volatile expensive.

Of course it is easy to say, because it is true. In Ada, there is no difference in the code generated between volatile and non-volatile objects, so how could the effect be different? The only place there is a difference is at a synchronization point, and uses of volatile objects are not themselves synchronization points -- see 9.10.

Note that Atomic objects ARE a synchronization point, so it is their uses that are more expensive (and that also carry guarantees).
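
[For illustration only, not part of the comment above: the C++ analogue of this guarantee is that operations on std::atomic default to memory_order_seq_cst, i.e. each use is a synchronization point, so concurrent updates are never lost. A minimal sketch:]

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each fetch_add below is sequentially consistent by default, so no
// increment is lost -- the cost is the per-use synchronization that
// the text above describes for Ada's Atomic.
int count_with_atomics(int n_threads, int per_thread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i)
        workers.emplace_back([&] {
            for (int j = 0; j < per_thread; ++j)
                counter.fetch_add(1);  // seq_cst: a synchronization point
        });
    for (auto& w : workers)
        w.join();
    return counter.load();
}
```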

It probably doesn't pay for me to go through this point-by-point. If you are going to make a proposal for changes to Ada, at a minimum you have to understand how Ada currently works, and in particular how the terminology is applied.

I'm going to do my best to keep my personal feelings out of the following, and stick simply to issues that I know as editor.

Ada defined the uses of the terms Atomic and Volatile in Ada 95 (I found the terms in a 1993 draft of the Standard). The rules have been tweaked since (particularly in AI05-0275-1, where the issue of memory barriers/fences was considered), but the basic model is unchanged since then. We've since added atomic operations packages so that it is possible without leaving Ada to correctly do atomic operations.

You seem to think that Atomic should not imply a synchronization point. Whether or not that should be the case in general, it would be impossible to change the Ada meaning of Atomic. Doing so would drop a guarantee that has been in Ada since Ada 95. That would be the worst kind of incompatibility: it would introduce potential erroneous execution into code that currently is perfectly safe. Not only would that incompatibility not be detectable at compile time or run time, it wouldn't even show up in testing the vast majority of the time. And it would make every piece of existing code, and every book, article, and blog post that uses Atomic, invalid. I can't imagine a scenario where we would do that.

You also seem to think that Volatile implies some sort of synchronization point. The case for changing that is not as obvious, but adding such points would make existing Ada code slower (for no benefit if it is currently correct), potentially introduce new race conditions, and would require extensive changes to existing compilers.

The existence of some other memory model is not relevant for the Ada Standard. We don't have the manpower to rewrite the entire Ada Standard to depend on some other standard AND also ensure that we haven't broken any of the existing behaviors. This is already a very tricky area (especially as all of the high-level and low-level constructs have to interact sensibly), and rewriting completely would be foolhardy.

Ada users currently can build lock-free programs using Atomic objects and the atomic operations library. The hole in this area is higher-level constructs. It would be nice to have an aspect Lock_Free that could be applied to protected objects to get lock-free behavior. But if lock-free is so much better, then compilers should already be implementing protected objects that way (there's no requirement to use an expensive implementation in Ada!). In which case there is no real need for an aspect (other than as a prod to get implementers to do more).

            Randy Brukardt, ARG Editor.