Intrepid / upc-specification

Automatically exported from code.google.com/p/upc-specification

Library: Atomic Memory Operations (AMO) #7

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The 1.2 UPC Spec does not currently specify atomic memory operations; however,
UPC AMO extensions have been implemented by LBL, MTU, and some vendors.  All
the implementations vary, so I have linked some of the available documentation
below.

The addition of UPC AMO extensions to the 1.3 Spec would be a big productivity 
enhancement and a big step toward having a standard set of AMOs that are 
supported across the reference and vendor implementations.

I think adoption of the BUPC AMO extensions would be sufficient for my 
interests.

References:
  - BUPC AMO Extensions:
    http://upc.lbl.gov/docs/user/index.shtml#atomics
  - MTU UPC AMO Extensions:
    http://www.cug.org/5-publications/proceedings_attendee_lists/2005CD/S05_Proceedings/pages/Authors/Merkey/Merkey_paper.pdf
  - Cray AMOs:
    http://docs.cray.com/cgi-bin/craydoc.cgi?mode=Show;q=upc;f=man/xt_libintm/80/cat3/amo.3i.html

Original issue reported on code.google.com by nspark.w...@gmail.com on 14 Mar 2012 at 3:40

GoogleCodeExporter commented 9 years ago
I'd like to see a bit of discussion on the pros and cons of the three 
approaches before we decide.

My perspective is that there are two issues of some degree of importance:

1) How full a collection of types and operations we want.  There are quite a 
few entry points, and our experience is that only a few are used.  But if 
implementers and documenters are not concerned, I don't think users will be.

2) How we "spell" them.  Again, I don't think users will be concerned as  they 
will likely use a macro in their code no matter what we do :)

Original comment by w...@uuuuc.us on 16 Mar 2012 at 11:23

GoogleCodeExporter commented 9 years ago
Is it acceptable if we come up with a subset of the BUPC functions?

(1) I dislike the "local" versions of the atomic functions (they feel like 
oversweet syntax sugar to me)

(2) I don't really "get" the mswap operator, and I can't identify any 
particular unique use of it.

Original comment by ga10...@gmail.com on 24 Apr 2012 at 9:07

GoogleCodeExporter commented 9 years ago
Gheorghe wrote:
(1) I dislike the "local" versions of the atomic functions (they feel like 
oversweet syntax sugar to me)
(2) I don't really "get" the mswap operator, and I can't identify any 
particular unique use of it.

In response to (1): How does one perform atomic operations on private pointers 
if one removes the "local" functions?  Manual "privatization" of 
pointer-to-shared is a common optimization, and the upc_cast() under 
consideration for the spec will make it MORE common.  Unless one also provides 
a way to convert private->shared [Ick!] then the "local" variants of the atomic 
operations will be needed to avoid potentially forcing the user to keep track 
of both private and shared pointers to the same datum.

In response to (2): The "mswap" (masked swap) is, I believe, intended to aide 
in implementation of atomic updates to "flag bits" (think bit-fields w/o the 
help of the syntax).  It is among the SHMEM atomics, and thus made its way on 
to the "short list" when collecting information from Lauren about her community 
of programmers.

Original comment by phhargr...@lbl.gov on 24 Apr 2012 at 9:22

GoogleCodeExporter commented 9 years ago
Hi Paul, in response to your argument about private pointers - I could argue 
(not excessively facetiously) that it is none of my damn business what you do 
with private pointers - UPC is about shared data and pointers. ... In fact I 
think Yili used this as one of his arguments to shut me down when I argued that 
collectives should take private pointers to data.

I agree with you that converting private pointers to "fat" pointers may not be 
such a hot idea. It so happens that in xlupc we could do it without breaking a 
sweat, but I cannot say whether that would be a path to suicide on other 
systems.

Damn me for seeing both sides of the issue. But in return I would have you 
acknowledge the essential awkwardness of having "local" versions of every UPC 
function.

Original comment by ga10...@gmail.com on 25 Apr 2012 at 2:41

GoogleCodeExporter commented 9 years ago
I *do* acknowledge that "local" versions of every function would be a mess, but 
I doubt that polymorphism as an alternative will get many supporters.  So, it 
comes down (in my mind) to what you LOSE if no local variant is included.  
One can argue against local versions of collectives by claiming one can always 
make a copy.  The act of copying some atomic datum sort of destroys its 
purpose.  So, I think an argument could be made for why this might be a special 
case.  However, I won't get too hung up on this, as BUPC will continue to 
support local atomics as an extension if they are not included in the spec.

So, are there any other opinions on the inclusion/exclusion of atomic operations 
on pointers-to-private?

Original comment by phhargr...@lbl.gov on 25 Apr 2012 at 8:09

GoogleCodeExporter commented 9 years ago
Going back to Bill's two issues:
  1) How full a collection of types and operations we want.
  2) How we "spell" them.

To (1): I think the minimal set of types that my users are interested in is 
T={int64_t, uint64_t}.  For operations, the primary interest is in fetch-and-OP 
and OP (no fetch), where OP={ADD, AND, OR, XOR}.  While there is interest in 
compare-and-swap, I think this is a good bit further down their list of 
priorities.  I am pretty sure that we only care about AMOs as relaxed shared 
accesses.

The interest in the non-fetching atomic OP is that it would be a non-blocking 
call for which completion is only guaranteed by the next fence.  The goal would 
be that one could issue a large set of atomic OPs for high throughput.
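
As a rough sketch of that throughput pattern (upc_amo_op_int64 follows the
spelling proposed later in this comment, and is hypothetical; UPC_ADD is the
existing upc_op_t value from the collectives library):

#include <upc_relaxed.h>
#include <upc_collective.h>   /* for upc_op_t and UPC_ADD */
#include <stdint.h>
#include <stddef.h>

#define BINS 1024
shared int64_t hist[BINS * THREADS];   /* shared histogram bins */

/* Hypothetical non-fetching atomic add; see the proposed spelling below. */
void upc_amo_op_int64( upc_op_t op, shared int64_t* p, int64_t v );

void count_keys( const size_t* key, size_t n )
{
    /* keys are assumed to be < BINS * THREADS */
    for ( size_t i = 0; i < n; i++ )
        upc_amo_op_int64( UPC_ADD, &hist[ key[i] ], 1 );  /* may complete lazily */
    upc_fence;   /* all of the adds above are guaranteed complete here */
}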

This is definitely a reduced subset of what BUPC and Cray offer.  Maybe this 
also answers George's mswap question as far as my position goes.  I understand that some 
implementers may want to expand the type and operation sets, but I think this 
is the minimal set that I care about.

To (2): While everyone will probably have their own desired flavor of spelling, 
I would probably go for something relatively short, like:
    TYPE upc_amo_fopT(OP, shared TYPE* p, TYPE v);
    void upc_amo_opT(OP, shared TYPE* p, TYPE v);
With this, we could use the existing upc_op_t definitions for OP (only 
accepting a subset of them, naturally).  This would bring up the "where should 
we put the upc_op_t enum?" issue, as it is currently part of the collectives 
library and sharing this with an AMO library (which would make sense) would 
mean they'd need some common header for these types.  This is just another 
version of the upc_flag_t discussion in Issue #10.

Original comment by nspark.w...@gmail.com on 28 Apr 2012 at 9:46

GoogleCodeExporter commented 9 years ago
1) I support Nick's request for having atomic OPs without fetch because they 
can have better performance when the fetch is not needed.  And I see at least one 
app (Graph500) that can benefit from atomic OPs without fetch.  I would like to 
propose extending OP to include MAX and MIN, i.e., OP={ADD, AND, OR, XOR, MAX, 
MIN}.
FYI, MPI_Accumulate is something similar.

2) Does UPC guarantee atomicity for basic ops with built-in types?
For example, assuming int64_t == long long in C99,
shared int64_t *p;
int64_t a;
Is there a difference between:
i) (*p) += a;
ii) upc_amo_op_int64(ADD, p, a);

3) For the discussion of AMOs with private/local pointers, if we want to 
include them in UPC spec, we should probably consider their compatibility 
and/or potential redundancy with C11 atomics.

Original comment by yzh...@lbl.gov on 29 Apr 2012 at 12:56

GoogleCodeExporter commented 9 years ago
I don't think my users really care about local atomics.  It might make more 
sense to address shared atomics now and save local atomics for the bigger 
discussion of whether UPC moves to C11.

Original comment by nspark.w...@gmail.com on 30 Apr 2012 at 8:49

GoogleCodeExporter commented 9 years ago
With regard to "Does UPC guarantee atomicity for basic ops with built-in 
types?", the answer is unequivocally no.  As far as the memory model is 
concerned, (*p) += a; becomes (*p) = (*p) + a; which becomes (in pseudo code, 
READ is either a relaxed or strict read of a shared object, WRITE is either a 
strict or relaxed write of a shared object):

READ( *p ) => t1
READ( a ) => t2
t1 + t2 => t3
t3 => WRITE( *p )

There is nothing to guarantee that some other thread doesn't come in and modify 
*p or a after the local thread reads it, but before it writes the new result 
back to *p.  UPC statements do not have transaction semantics (though it'd 
likely be a useful extension if anyone wants to come up with such a proposal!). 
 Assuming all strict accesses, the compiler/runtime must ensure that this race 
is consistent in that all threads observe the same ordering, but it doesn't 
need to do anything to prevent the race from occurring.  For relaxed accesses, 
it doesn't even need to do that, though local ordering must still be maintained.
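
A minimal sketch of the resulting lost-update race (the increment below
decomposes into exactly the READ/WRITE sequence shown above, so concurrent
increments can be lost):

#include <upc_relaxed.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

shared int64_t counter;   /* statically allocated, zero-initialized */

int main( void )
{
    for ( int i = 0; i < 100000; i++ )
        counter += 1;     /* READ(counter), add, WRITE(counter): not atomic */

    upc_barrier;
    if ( MYTHREAD == 0 )  /* with THREADS > 1 this usually prints less than 100000*THREADS */
        printf( "counter = %" PRId64 "\n", (int64_t)counter );
    return 0;
}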

Original comment by sdvor...@cray.com on 11 May 2012 at 4:22

GoogleCodeExporter commented 9 years ago
To amplify Yili's point about adds w/o a fetch. Does this make sense as far as 
semantics?

Level 1: basic atomic operation. Essentially, guaranteeing that e.g. a+=b 
happens atomically. Examples: atomic increment, atomic set, atomic or, xor, ...

Level 2: fetch + basic atomic operation. There is one for every operation 
defined in Level 1. The value *before* the operation is returned to the user.

Level 3: compare + fetch + op. The operation supplies two values - a "compare" 
value and an "update" value - and returns the "old" value. The operation is 
executed if the "old" value matches the "compare" value. The "old" value is 
returned in any case. Typical example: compare-and-swap, which is really a 
compare+fetch+set.

Is there anything you can think of that is not covered by this taxonomy?

Original comment by ga10...@gmail.com on 22 May 2012 at 1:41

GoogleCodeExporter commented 9 years ago
--- AMO Taxonomy & Hardware Support ---
I don't think this taxonomy covers the "masked swap" in the BUPC AMO 
extensions.  I don't know how strongly people feel about this particular AMO, 
but (and this may be a stupid reason), I would be inclined to leave it out for 
the sake of having a more concise set of function declarations.

Again, the spelling isn't /that/ important, but I think this would be a 
relatively terse set:
    void upc_amo_opT(   upc_op_t op, shared TYPE* ptr, TYPE val );
    TYPE upc_amo_fopT(  upc_op_t op, shared TYPE* ptr, TYPE val );
    TYPE upc_amo_cfopT( upc_op_t op, shared TYPE* ptr, TYPE cmp, TYPE val );
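
For illustration, a hypothetical call site using these spellings (the int64
instantiations and the UPC_SET op value are made up here; UPC_ADD is the
existing upc_op_t value from the collectives library):

#include <upc_relaxed.h>
#include <upc_collective.h>   /* existing upc_op_t values such as UPC_ADD */
#include <stdint.h>

/* Hypothetical int64 instantiations of the signatures above. */
int64_t upc_amo_fop_int64(  upc_op_t op, shared int64_t* ptr, int64_t val );
int64_t upc_amo_cfop_int64( upc_op_t op, shared int64_t* ptr, int64_t cmp, int64_t val );

#define UPC_SET ((upc_op_t)100)   /* hypothetical op value meaning plain "set" */

shared int64_t counter;           /* shared accumulator              */
shared int64_t head;              /* head value for a lock-free push */

void example( int64_t my_node )
{
    /* Level 2: fetch-and-add, returning the value before the add. */
    int64_t old_count = upc_amo_fop_int64( UPC_ADD, &counter, 1 );

    /* Level 3 with op == "set", i.e. compare-and-swap: install my_node only
     * if head still holds the value we last observed. */
    int64_t observed = head;
    int64_t previous = upc_amo_cfop_int64( UPC_SET, &head, observed, my_node );

    (void)old_count; (void)previous;
}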

From what I can see, there seem to be compare-and-swap extensions, but not the 
more-general compare+fetch+op.  For the implementers, would this general 'Level 
3' AMO be an implementation challenge?  More specifically, would it not see the 
same level of hardware support that the others do?  Or, would a lack of 
hardware support for a general 'Level 3' AMO constrain the performance of 
compare-and-swap (or Level 1 & 2 AMOs) in order to guarantee atomicity?

--- Local AMO Support ---
Thinking back to the issue of local AMO support, it seems from existing 
extensions that local-pointer AMOs are generally not atomic with respect to 
shared-pointer AMOs.  It's a somewhat confusing point, so (maybe I'm beating a 
dead horse) I'd probably be inclined to leave out the local AMOs to prevent 
this sort of confusion.  I expect the typical use case for AMOs to be on shared 
memory anyway, but maybe that's an incorrect assumption.

--- Strict vs. Relaxed ---
One item not yet discussed here is whether the AMO function definitions should 
explicitly address whether the accesses are strict or relaxed (as in BUPC) or 
elide the distinction (as I think is the case in Cray UPC).  I think I'd prefer 
to leave out the relaxed/strict distinction in the AMO function definition and 
leave the access to be determined by the reference-type qualifier (or the 
associated pragma).

Original comment by nspark.w...@gmail.com on 22 May 2012 at 2:59

GoogleCodeExporter commented 9 years ago
Yes, oops - I forgot the masked-swap operation. I support Nick's motion to 
leave it out *unless* someone can think of a "killer app" for this. Please 
speak up :)

--- Hardware support ---

'Level 3' would obviously not be a challenge for IBM - I would not have 
suggested it otherwise [insert evil grin here]. But you bring up an important 
point. All these operations can be emulated given a set of basic primitives - 
and those primitives are different on every vendor's HW. 

* Is there a canonical subset of these operations that will have "native" 
performance on most vendors' HW?

* If this canonical subset can be identified, maybe we should highlight this 
subset in some way in the AMO specification?

---- Local AMO support ----
If we add local AMOs they should be interoperable with shared ones - or else a 
lot of user confusion will result. So binary decision: either guarantee 
interoperability or leave them out completely (not UPC's concern).

Original comment by ga10...@gmail.com on 23 May 2012 at 2:05

GoogleCodeExporter commented 9 years ago
Gheorghe wrote:
> Yes, oops - I forgot the masked-swap operation. I support Nick's motion to leave
> it out *unless* someone can think of a "killer app" for this. Please speak up :)

Tracker issue #35 discusses writes to shared bit-fields without disrupting 
adjacent ones.  Providing that assurance would require that a masked-swap 
operation exist within the runtime implementation.  If that is the case, then the 
question becomes whether one exposes this capability to the UPC user as a part 
of the atomics library as well.

Original comment by phhargr...@lbl.gov on 23 May 2012 at 8:23

GoogleCodeExporter commented 9 years ago
Nick wrote:
> I don't think this taxonomy covers the "masked swap" in the BUPC AMO extensions.
> I don't know how strongly people feel about this particular AMO, but (and this
> may be a stupid reason), I would be inclined to leave it out for the sake of
> having a more concise set of function declarations.

Berkeley includes the masked-swap due to input we received from Lauren Smith.  
We are quite willing to leave it out of the spec and retain it as only a 
Berkeley extension.

> I think I'd prefer to leave out the relaxed/strict distinction in the AMO function
> definition and leave the access to be determined by the reference-type qualifier
> (or the associated pragma).

Unless I am missing something important, what Nick requests above is not 
possible in a library function.  Neither the relaxed/strict qualification of 
the pointer nor the pragma in effect at the call site can be known inside the 
called function.  Now if this were UPC++ we might have a chance via 
polymorphism assuming relaxed/strict are significant in the type matching.

If support were "deeper" than a library of functions (including some compiler 
support), then what Nick requests would become possible.  That would make 
atomic operations more along the lines of "compiler intrinsics" than functions. 
 I don't have any strong objection to that, but it may significantly raise the 
burden on an implementer (the actual burden being very implementation specific 
already).

Original comment by phhargr...@lbl.gov on 23 May 2012 at 8:33

GoogleCodeExporter commented 9 years ago
Leaving aside the issue of compiler support, is there any implementation of UPC 
where the difference between strict and relaxed is *not* a UPC fence?

Could we leave strict AMOs out of the picture and rely on users being able to 
bracket the AMOs with fences?

Original comment by ga10...@gmail.com on 30 May 2012 at 11:43

GoogleCodeExporter commented 9 years ago
Gheorghe asked:
> Leaving aside the issue of compiler support, is there any implementation of UPC
> where the difference between strict and relaxed is *not* a UPC fence?
>
> Could we leave strict AMOs out of the picture and rely on users being able to
> bracket the AMOs with fences?

It is not as simple as that...

In the BUPC implementation of "upc_fence" we need to include both architectural 
memory fences and a compiler optimization fence.  In the AMOs on some 
architectures the atomic instructions already imply the architectural memory 
fence (the LOCK prefix on x86/x86-64 being the most important example to those 
outside of IBM).  So, asking a user on such an architecture to use BOTH an AMO 
and a upc_fence would result in TWO (or more, see below) memory fences.

Additionally, what is the user expected to use:
  Option 1)   upc_fence; relaxed_AMO(); 
  Option 2)   relaxed_AMO(); upc_fence; 
  Option 3)   upc_fence; relaxed_AMO(); upc_fence;

In Option 1 it is possible for shared accesses after the AMO to move "up" and 
take place between the AMO and the fence.  This is OK for "release" semantics.

Conversely, in Option 2 shared accesses before the AMO might "move down" and 
take place after the AMO.  This is OK for "acquire" semantics.

Only with Option 3 do we get the property that the name "strict AMO" implies to 
me: all shared accesses issued before the AMO complete, then the AMO completes 
before any later references can begin.  That is what I believe 5.1.2.3 of the 
UPC 1.2 spec says for a strict access, and is therefore what I think we should 
provide for a "strict AMO".

BUPC's strict AMOs are intended to "work like" Option 3, but typically w/o 
incurring THREE architectural memory fences.

Original comment by phhargr...@lbl.gov on 1 Jun 2012 at 4:28

GoogleCodeExporter commented 9 years ago

Original comment by phhargr...@lbl.gov on 1 Jun 2012 at 6:06

GoogleCodeExporter commented 9 years ago
I see your point. I withdraw my proposal about no strict AMOs. So it boils down 
to a choice between:

* Move AMOs deep inside the compiler just to help figure out whether AMOs are 
strict or relaxed, based on whether we are in strict or relaxed mode, whether the 
variable is denoted as strict or relaxed, etc.

* Make AMO strictness/relaxedness explicit and double our namespace complexity.

This is similar to the "strict library approach vs. get the language involved" 
dichotomy that also plagues issues 41 (nonblocking memory copies) and 42 
(nonblocking collectives). I sense that we will have to take a unified approach 
to decide all three of these.

Original comment by ga10...@gmail.com on 15 Jun 2012 at 3:26

GoogleCodeExporter commented 9 years ago
As suggested by Nick, I'd like to have part-time ownership of this issue as we 
write it up. I'm willing to take full-time ownership, but I certainly don't want 
to muscle anyone else out. -- George

Original comment by ga10...@gmail.com on 15 Jun 2012 at 5:18

GoogleCodeExporter commented 9 years ago
In issue #41 I am backing down from my position that changes to upc_fence()'s 
implementation are unacceptable.  So, in this issue, perhaps we should poll the 
implementations to determine whether "getting the compiler involved" is a reasonable 
possibility before assuming that it is not.  While that would mean that the 
proposed extension is not strictly (pun totally intentional) a pure library, it 
would avoid the doubled namespace.

So, the question is:

Does your implementation (or could it w/o excessive burden) have sufficient 
"smarts" to distinguish calls to an AMO in which a dereference of the pointer 
argument is strict vs relaxed?

For Berkeley UPC, the answer is YES.
As a source-to-source translator we generate different calls to our 
communication library for strict and non-strict accesses.  By treating AMOs as 
compiler intrinsics, rather than as calls to arbitrary C functions, we could 
leverage the same internal mechanism(s) to implement distinct relaxed/strict 
versions INTERNALLY, while using only a "generic" name in the user's code.

So, are other implementers able/willing to consider AMOs that have a 
polymorphic aspect with respect to relaxed-vs-strict?

Original comment by phhargr...@lbl.gov on 16 Jun 2012 at 1:14

GoogleCodeExporter commented 9 years ago
I don't really see a doubling of the interface for strict/relaxed as that 
much of a problem. Yes, it makes our header file a little bit longer, but the 
documentation can be written in a generic way (as in the BUPC AMO spec) to 
cover both cases and avoid page bloat of the spec. This seems preferable to 
creating a large number of compiler intrinsics (which will then be harder to 
change as the spec evolves) or trying to explain to the user how this is a 
library but has magical extra properties. It also allows third party 
implementations of atomics (eg proof of concept prototypes, open source 
reference implementations) which would otherwise be prohibited.

There are already examples of this type of interface doubling in the C spec, 
for a similar reason (lack of argument polymorphism): see the wide character 
library in C99 7.24 and wchar.h (which basically duplicates stdio.h, string.h, 
time.h, and ctype.h in their entirety).

Original comment by danbonachea on 16 Jun 2012 at 7:19

GoogleCodeExporter commented 9 years ago
I think I'm okay with the interface doubling from using a suffix for strict or 
relaxed AMOs.  As Dan points out, it doesn't necessarily ruin the 
documentation.  I didn't realize at first how this would affect the compiler or 
the pure library approach.  I'd also like to be part of writing the spec text, 
along with George (and Yili, I think).

I am curious as to what Cray does with their current global AMOs with regard to 
strict vs. relaxed accesses in their extensions.

(Updated 'Type' to "Enhancement")

Original comment by nspark.w...@gmail.com on 18 Jun 2012 at 9:16

GoogleCodeExporter commented 9 years ago
Our global AMOs are essentially treated as relaxed updates.  This is true even 
when forcing relaxed accesses to be strict via pragma or the inclusion of 
upc_strict.h (which is probably a bug now that I think about it).

Original comment by sdvor...@cray.com on 18 Jun 2012 at 9:36

GoogleCodeExporter commented 9 years ago
Steven wrote:
> Our global AMOs are essentially treated as relaxed updates.

Does this mean that Cray AMOs cannot be used, for instance, to implement a 
semaphore (because the UP lacks release semantics and the DOWN lacks acquire 
semantics) without adding an additional strict reference (such as a 
upc_fence())?

I am asking because I want to better understand what users expect to DO with 
AMOs.

For the case where the value of the atomic variable is of importance by itself 
(as an accumulator, for instance) the relaxed access is sufficient.  However, 
once you use the atomic variable's value to control when/if one accesses 
additional locations (spinlock, semaphore, etc.) there needs to be a "strict" 
somewhere.  As I illustrated for George, there is a strong motivation to avoid 
making the user insert fences for this purpose.  Does anybody have users that 
use atomics in this way?
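
To make the spinlock/semaphore concern concrete, here is a minimal sketch
assuming hypothetical strict AMOs (the names upc_amo_cswap_strict and
upc_amo_set_strict are illustrative only, not part of any existing extension);
with only relaxed AMOs, each side would additionally need a upc_fence to get
the acquire/release ordering:

#include <upc.h>
#include <stdint.h>

/* Hypothetical strict AMOs -- names and signatures are illustrative only. */
int64_t upc_amo_cswap_strict( shared int64_t* p, int64_t cmp, int64_t val );
void    upc_amo_set_strict(   shared int64_t* p, int64_t val );

shared int64_t lock_word;   /* 0 = free, 1 = held */

void spin_acquire( void )
{
    /* Acquire side: the CAS must complete before any protected accesses begin. */
    while ( upc_amo_cswap_strict( &lock_word, 0, 1 ) != 0 )
        ;   /* spin until we swap 0 -> 1 */
}

void spin_release( void )
{
    /* Release side: all protected accesses must complete before the store of 0. */
    upc_amo_set_strict( &lock_word, 0 );
}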

Original comment by phhargr...@lbl.gov on 18 Jun 2012 at 10:06

GoogleCodeExporter commented 9 years ago
I think I'll have to expand a little--they don't really fit in with the current 
UPC memory model right now.

The global AMOs are "relaxed" in the sense that they do not provide a full 
fence like strict accesses do.  They do provide acquire semantics, so relaxed 
accesses issued after an AMO will be ordered "correctly".  You still 
technically need a strict write (or a fence followed by a relaxed write) for 
release semantics.  However, many users have noticed that a relaxed write alone 
works in most cases--and is much faster--and therefore leave the fence out 
until something breaks.

With regard to what users do with them, I can't really answer that because we 
typically don't get to see source code from the customers that use them.  That 
said, I'd guess that it's more the former (updating a value) than the latter 
(synchronization) at this point given the bugs we've seen to date.

Original comment by sdvor...@cray.com on 18 Jun 2012 at 11:45

GoogleCodeExporter commented 9 years ago
I'm coming a bit late to this discussion, but I really like that we're 
exploring passing an atomic op enum to a few functions instead of having one 
function per operation.  Cray has been stuck with supporting a variety of 
_amo_* functions because that was how it was originally implemented, but 
internally we use an enum passed to just a few functions, very much like 
Comment #11.  The legacy support has caused numerous headaches when adding 
support for new AMO operations just due to entry point explosion.

Also, for historical reasons, the Cray AMO extensions work on either local or 
shared data.  Aside from these extensions, our users have the option of using 
the same builtin syntax that GCC provides for local AMOs in C; however, our 
GCC-style builtins and the Cray AMO extensions are not atomic with respect to 
each other due to the way the hardware works.  Therefore, I can fully 
sympathize with not wanting to provide local AMOs in UPC because if we did so, 
it would be natural for users to expect the local UPC AMOs to be atomic with 
respect to the global UPC AMOs...and some systems may not be able to support 
that.

Original comment by johnson....@gmail.com on 19 Jun 2012 at 2:57

GoogleCodeExporter commented 9 years ago
Troy said:
> I really like that we're exploring passing an atomic op enum to a few functions
> instead of having one function per operation.

Would the implementers here be interested in reducing the interface size by 
including the TYPE as a function parameter?  I had thought about that, but it 
does not seem to be common in UPC (including extensions -- except for the BUPC 
Value-Based Collectives interface).

George generalized compare-and-swap into compare-fetch-op, noting with an evil 
grin that IBM could support the general case.  Is this general case of interest 
to other vendors (and would they be hardware-supported)?  Or is CAS the common 
subset of this class that is supported by most vendors?

From a spec-writing perspective, would it make sense for the spec to include 
compare-fetch-op with "set" as the only required op and leave other operations 
as vendor-supported options?  This could allow us to potentially expand the 
list of required operations in future releases if multiple networks increased 
hardware AMO support without drastically changing the AMO spec.

Original comment by nspark.w...@gmail.com on 19 Jun 2012 at 3:32

GoogleCodeExporter commented 9 years ago
Paul wrote:
> BUPC's strict AMOs are intended to "work like" Option 3, but typically w/o
> incurring THREE architectural memory fences.

Why does option 3 incur three architectural memory fences?  It seems like a 
trivial peephole optimization for the compiler to throw away the superfluous 
fences, assuming the compiler has sufficient knowledge of how the target 
runtime works.

Original comment by sdvor...@cray.com on 19 Jun 2012 at 3:46

GoogleCodeExporter commented 9 years ago
I haven't read the AMO spec in detail, but would like to note that it would be 
convenient for the compare-swap operation to support 128-bit data types 
(presumably aligned on at least a 64-bit boundary).  This comes up in UPC 
applications (and UPC runtimes) when there is a need to compare-swap a 
pointer-to-shared value.  For GUPC, using the "struct" PTS representation on a 
64-bit host, a fully general PTS is stored in a 128-bit container.  Perhaps a 
feature macro is needed that indicates whether the AMO implementation supports 
compare-swap on 128-bit sized values.  Also, perhaps, the minimum alignment 
needs to be indicated via a pre-processor macro.

Original comment by gary.funck on 19 Jun 2012 at 3:50

GoogleCodeExporter commented 9 years ago
Going back to the memory semantics, perhaps we should consider providing more 
fence options than simply upc_fence?  This would benefit both the AMOs and the 
non-blocking proposal (our non-blocking proposal includes acquire semantics for 
the completion of non-blocking operations).  Should that be split out to a 
separate issue?

Original comment by sdvor...@cray.com on 19 Jun 2012 at 4:11

GoogleCodeExporter commented 9 years ago
In comment #28 Steven wrote:
> Paul wrote:
>> BUPC's strict AMOs are intended to "work like" Option 3, but typically w/o
>> incurring THREE architectural memory fences.
>
> Why does option 3 incur three architectural memory fences?  It seems like a trivial
> peephole optimization for the compiler to throw away the superfluous fences,
> assuming the compiler has sufficient knowledge of how the target runtime works.

I agree that this is a trivial optimization if atomics are "known" to the 
compiler.  But the current implementation is a LIBRARY and the compiler doesn't 
know a call to an AMO from any other function call.

Original comment by phhargr...@lbl.gov on 19 Jun 2012 at 5:09

GoogleCodeExporter commented 9 years ago
In comment #27 Nick asks:
> Would the implementers here be interested in reducing the interface size by
> including the TYPE as a function parameter?

This would not work for any function which returns a value.  So we would need 
to pass a pointer to the result in any function generating a result.  For this 
reason I dislike passing the type.

> George generalized compare-and-swap into compare-fetch-op, noting with an evil
> grin that IBM could support the general case.  Is this general case of interest
> to other vendors (and would they be hardware-supported)?  Or is CAS the common
> subset of this class that is supported by most vendors?

If even ONE required operation lacks h/w support, then we risk requiring ALL 
operations being implemented via software just to ensure they are all atomic 
with respect to each other.  Therefore I strongly support the idea that 
compare-and-swap be required but nothing more general.  I would actually go so 
far as to discourage documenting OPTIONAL atomics in the spec text because this 
would encourage writing of non-portable code.

What I *would* encourage is that vendors providing extensions to the atomics 
(more operations, more types, support for "private", etc) all agree OUTSIDE OF 
THE SPEC on the "spelling" of their extensions.  This paves a smooth(er) path 
to their later addition to the spec, and eases their use.

Original comment by phhargr...@lbl.gov on 19 Jun 2012 at 5:27

GoogleCodeExporter commented 9 years ago
Paul wrote:
> Therefore I strongly support the idea that compare-and-swap be required but nothing
> more general.  I would actually go so far as to discourage documenting OPTIONAL
> atomics in the spec text because this would encourage writing of non-portable code.
>
> What I *would* encourage is that vendors providing extensions to the atomics
> (more operations, more types, support for "private", etc) all agree OUTSIDE OF
> THE SPEC on the "spelling" of their extensions.  This paves a smooth(er) path to
> their later addition to the spec, and eases their use.

If you're going to go that far, I'd say let's just abandon the AMOs in the spec 
altogether.  Putting only compare-and-swap in the spec would encourage users to 
only use compare-and-swap in portable codes.  So, for example, if they needed 
to do an atomic fetch-and-add in a portable fashion (say, to atomically reserve 
array elements...), they'd need to do something like:

do {
  old = last;
  new = old + reservation_size;
} while( upc_amo_cas( &last, old, new ) != old );

This is going to perform terribly on most systems, particularly in the presence 
of contention, which will only get worse as you scale up the number of threads. 
 The point of adding atomics to the spec is to make codes run faster, not 
slower.

Original comment by sdvor...@cray.com on 19 Jun 2012 at 6:54

GoogleCodeExporter commented 9 years ago
Perhaps we could add a query function (macro? intrinsic?) that could be used to 
figure out which AMOs an implementation supports.  Then users could do 
something to the effect of:

if ( UPC_AMO_SUPPORTED( UPC_OP_FADD, UPC_TYPE_LONG ) ) {
  myidx = upc_amo_fadd( &last, reservation_size );
}
else if ( UPC_AMO_SUPPORTED( UPC_OP_CAS, UPC_TYPE_LONG ) ) {
  do {
    old = last;
    new = old + reservation_size;
  } while( upc_amo_cas( &last, old, new ) != old );
  myidx = old;
}

Original comment by sdvor...@cray.com on 19 Jun 2012 at 7:12

GoogleCodeExporter commented 9 years ago
Steven wrote in comment #33:
> Paul wrote:
>> Therefore I strongly support the idea that compare-and-swap be required but nothing
>> more general.
> [...]
> If you're going to go that far, I'd say let's just abandon the AMOs in the
> spec altogether. 

Sorry if I was unclear about what I was objecting to.

I DO WANT the "Level 2" fetch-and-op AMO's, such as the "upc_amo_fadd" in 
Steven's example.
I DO NOT WANT George's "Level 3" COMPARE-fetch-op for op != "set"
[see comment 10 for Level 1,2,3 descriptions]

Now that I think more about it, I actually don't see how "compare-fetch-OP" is 
more useful than compare-and-swap.  Specifically, if the OP is only going to 
take place if the comparison is TRUE, then I must have KNOWN the previous 
value and could have used compare-and-swap, having computed the OP against the 
KNOWN previous value.  So, I guess I've just debunked my own original 
implementability argument against these ops, and replaced it with a 
they-are-just-syntactic-sugar argument.

Original comment by phhargr...@lbl.gov on 19 Jun 2012 at 7:34

GoogleCodeExporter commented 9 years ago

Original comment by gary.funck on 3 Jul 2012 at 6:07

GoogleCodeExporter commented 9 years ago

Original comment by gary.funck on 3 Jul 2012 at 6:10

GoogleCodeExporter commented 9 years ago
Cray UPC supports the following AMO extensions.  c = currently supported, f = 
supported on future hardware.  If supported, both a fetching and a non-fetching 
version exist.  (For the bitwise ops, type doesn't really matter, so you can 
typecast fp32 and fp64 pointers to int32 and int64 pointers and use the integer 
AMO extensions, but we don't _directly_ provide them.)

                     int32    int64    fp32    fp64
add                    f        c       f       f
and                    f        c
and-with-xor           f        c
compare-and-swap       f        c
min                    f        f       f       f
max                    f        f       f       f
swap                   f        c
or                     f        c
xor                    f        c

Observations:

1) Existing Cray network hardware does not support atomic 32-bit, 
floating-point, or min/max operations.  Cray UPC does not support unsigned 
integer AMOs, whereas BUPC does.  Therefore, I strongly believe that there 
needs to be a query mechanism, suggested in Comment #34, for users to figure 
out if an AMO is supported.  It is NOT acceptable to say that an implementation 
must use software to emulate the operations that it does not support in 
hardware because in order to keep the operations atomic with respect to each 
other it would be necessary to implement them all in software, negating any 
benefit from the hardware.  Furthermore, I think the query mechanism needs to 
be a function call because otherwise it will not be possible to compile a code 
once and run the same executable on two different platforms that differ only in 
the supported flavors of AMOs.

2) Providing entry points for {types} x {operations} x {fetching} is unwieldy 
for everyone...users, implementers, specification writers.  It gets even worse 
if you add {blocking/non-blocking} to the mix.

Straw Proposal:

/** Returns 1 if the specified AMO is supported or 0 otherwise.  /fetching/ is non-zero to request a fetching AMO.
 * /type/ and /op/ specify the data type and operation to be performed.
 */
int upc_amo_exists( int fetching, upc_amo_type_t type, upc_amo_op_t op );

/** Atomically performs operation /op/ on the memory pointed to by /target/.  The data type of the operation is specified by
 * /type/.  If /fetched/ != NULL, then the previous value is fetched and stored in the memory pointed to by /fetched/.
 * Operands for the operation are pointed to by /operand1/ and /operand2/; /operand2/ may be NULL for some operations.
 *
 * Warning: Operations are not guaranteed to be atomic with respect to non-UPC AMO operations.
 */
void upc_amo( void* fetched, upc_amo_type_t type, upc_amo_op_t op, shared void* target,
              void* operand1, void* operand2 );

Example:

shared long x;
upc_lock_t *x_lock = upc_all_lock_alloc();
...
if ( upc_amo_exists( 0, UPC_AMO_TYPE_LONG, UPC_AMO_OP_ADD ) ) {
    long one = 1L;
    upc_amo( NULL, UPC_AMO_TYPE_LONG, UPC_AMO_OP_ADD, &x, &one, NULL );
}
else {
    upc_lock( x_lock );
    x += 1L;
    upc_unlock( x_lock );
}

Original comment by johnson....@gmail.com on 25 Jul 2012 at 6:55

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
I haven't been following the details of the proposed AMO library, but am 
wondering whether operations (esp. compare-and-swap) on 128 bit data types are 
planned/proposed?

As a use case, consider a double buffering scheme with two buffer pointers, 
where it is convenient and efficient to swap the pointers atomically when 
switching buffers. Here, the buffer indexes might be two 64-bit indexes, or 
perhaps two PTS's represented in a 64-bit packed format.

Or simply, swapping a single fully general PTS which is represented as a 
128-bit quantity on 64-bit targets.

Original comment by gary.funck on 3 Aug 2012 at 5:20

GoogleCodeExporter commented 9 years ago
I don't like the idea proposed in comment #38 of query functions for which AMOs 
are supported. It complicates the user code, but more importantly just passes 
the buck of lowered performance to the application. Specifically, instead of 
the runtime emulating the required AMO in software, now that emulation is 
happening in the user application (where it's likely to be slower and more 
error-prone). With this approach, any portable UPC app using the AMO's would 
also need to fold in code for a fully software implementation of every AMO it 
uses and switch to that implementation if any of the queries fail. This defeats 
the code-factorization goal of having a library.

It seems better to restrict the AMO's to a core subset that all UPC 
implementations must support, and choose that subset wisely to allow hardware 
implementation on platforms of interest.  

It's worth noting that a 64-bit compare-and-swap is sufficient to implement 
EVERY operation in the current AMO proposal (including 32-bit, unsigned, 
min/max, and float/double), although it implies an additional read and a 
possible retry under heavy contention (which should still be 
significantly cheaper than a fully software implementation using upc_locks). 
Since Cray supports that operation in hardware, why not use that to perform the 
operations lacking direct hardware support?
(Note I'm proposing this rewriting be done within the implementation, rather 
than in the UPC program as proposed in comment #34).
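
As a rough illustration of such rewriting inside the implementation, a 64-bit
fetch-and-min could be layered on a native compare-and-swap along these lines;
hw_cswap64 is only a placeholder for whatever CAS primitive the runtime
actually has:

#include <upc.h>
#include <stdint.h>

/* Placeholder for the runtime's native 64-bit compare-and-swap primitive. */
int64_t hw_cswap64( shared int64_t* p, int64_t cmp, int64_t val );

/* Runtime-internal sketch: fetch-and-min built on CAS.  It costs one extra
 * read and may retry under contention, as noted above. */
int64_t amo_fetch_min_via_cas( shared int64_t* p, int64_t v )
{
    int64_t old, desired;
    do {
        old = *p;                          /* the additional read */
        desired = ( v < old ) ? v : old;
        if ( desired == old )
            return old;                    /* nothing to change; no CAS needed */
    } while ( hw_cswap64( p, old, desired ) != old );
    return old;
}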

Original comment by danbonachea on 3 Aug 2012 at 12:10

GoogleCodeExporter commented 9 years ago
"It's worth noting that a 64-bit compare-and-swap is sufficient to implement 
EVERY operation in the current AMO proposal (including 32-bit, unsigned, 
min/max, and float/double), although it implies an additional read and a 
possible retry under heavy contention (which should still be 
significantly cheaper than a fully software implementation using upc_locks). 
Since Cray supports that operation in hardware, why not use that to perform the 
operations lacking direct hardware support?"

Because it is NOT significantly cheaper than a fully software implementation 
using upc_locks.  It's fine if you have a couple dozen threads, but once the 
number of threads goes beyond a certain point, the network (and more 
importantly, bus) contention degrades performance way past that of a scalable 
lock algorithm.  This gets even worse as you add more and more threads (cores) 
to a single network endpoint.  Using compare-and-swap is only a tolerable 
workaround in the absence of contention.

Original comment by sdvor...@cray.com on 3 Aug 2012 at 2:48

GoogleCodeExporter commented 9 years ago
Here's my attempt at a summary of today's discussion on the telecon.

Points of Motivation:
  - UPC users want high-performance AMOs.
  - A UPC library should be robust (over types and operations), portable, and vendor-independent.

The current proposal (as of SVN r61) would restrict some vendors to implementations that /may/
limit performance.  Also, there is / may be interest in expanding the types and operations specified
by the current proposal.

Proposed Solutions (as I saw and recall them):

 1) Provide a standard interface to the AMOs (e.g., upc_amo() in Comment #38) that supports a
    robust range of types and operations.  Also, provide a function that allows a user to specify
    which types and operations the user will call, from which the library determines whether it
    will use a software-based or hardware-based implementation.  Thus, all the possible AMOs are
    always available to the user; however, the implementation may use hardware acceleration either
    by default (if all types/operations are supported) or from a user's hinting.  The hinting
    function would likely need to be called before any AMO calls are made in order for the 
    implementation to choose the right "mode".

 2) Expanding on (1), the hinting function can also accept (in addition to desired types and
    operations), some provided use-case parameter (e.g., high throughput or low latency), from which
    the library may select one of potentially-many possible implementations (either in hardware or
    software).

 3) Expanding on (2), the hinting function is replaced by an atomicity-domain function that returns
    a handle for AMOs that supports a user-specified set of types and operations for a particular
    use-case.  AMOs would only be guaranteed to be atomic with respect to multiple calls using the
    same handle.  Atomicity would not be guaranteed for AMO calls using separate handles.

For (1-3), I think it was expressed that it could be good to have:
  - A query function that can say what the hardware is capable of supporting (so that a user may 
    restrict his type/operation choice to guarantee hardware support)
  - A priority function in which a user could specify types, operations, and a use-case of varying 
    priority such that the implementation chooses the best AMO implementation and specifies the
    supported types and operations.

My understanding of (1) and (2) is that they would both set the atomicity mode 
(i.e., hardware or
some software implementation) at the hinting call (or elision thereof) and it 
would be henceforth
fixed.  This is problematic for libraries that may use AMOs not specified by 
the hinting call or 
that may make a hinting call before the user's call.  I think that (3) is the 
only library-friendly 
(or library-general) approach.

I think the query function definitely makes sense, however, I'm not really sure 
I understand how
one would effectively use the priority function.  It would seem that this could 
lead back to
heavily-IFDEF'd code (if the priority function selects an implementation that 
doesn't support
the lower-priority types/operations), which I think would be avoided without it.

It is my preference (and I think this was part of all three proposed solutions) 
that the AMOs can
operate on ALL shared addresses.  Any restriction of use would be at the user's 
discretion and NOT
be specified by either the hinting or handle functions.

Ideally, I think it would be good for the less performance-minded users if 
there were either a default handle that supported all specified types and 
operations in an implementation-defined way or some default parameter that 
returns a "universal" handle.

Please add or correct anything here that I may have misrepresented (and please 
respond with your
comments and feedback!).

From this discussion, I'll draft up another AMO proposal, which will be posted 
for comments and
a follow-up telecon discussion (for whoever is interested).

Original comment by nspark.w...@gmail.com on 3 Aug 2012 at 10:17

GoogleCodeExporter commented 9 years ago
All "brand new" library proposals are targeted for starting in the "Optional" 
library document. Promotion to the "Required" document comes later after at 
least 6 months residence in the ratified Optional document, and other 
conditions described in the Appendix A spec process.

Original comment by danbonachea on 17 Aug 2012 at 5:53

GoogleCodeExporter commented 9 years ago
Set default Consensus to "Low".

Original comment by gary.funck on 19 Aug 2012 at 11:26

GoogleCodeExporter commented 9 years ago
At the last call, Bill asked for someone to pick a "big issue" for discussion 
at the next call.
Considering the importance of AMOs to many of the UPC users, and the very 
undecided state that it
was left in after the call-before-last, Gary, Yili, and I thought it would be 
good to discuss
AMOs on the next call.  Here is something that I hope will re-light the fire of 
discussion.

Based on my summary of the last discussion in Comment 43, I propose the 
following based on
Option 3 with Query (but no Priority):

### A Proposed Usage Scenario ###

A user creates an atomicity domain object by specifying a set of operations, a 
set of types, and
an implementation mode.  This object is a handle to some AMO implementation.  
Depending on the
specified mode, the implementation may be in hardware or software.

A user makes AMO calls as either upc_amo_relaxed() or upc_amo_strict() that 
otherwise look very
much like Troy's upc_amo() proposal in Comment 38.  Atomicity is only 
guaranteed for accesses
using the same domain.  

The library also provides upc_amo_query() so that a user can test whether a set 
of operations
and types is supported for a given mode.  There will be a default mode that 
supports all of a
Spec-specified set of ops and types.

### A Bit More Detail ###

 * Define a type (e.g., upc_amo_domain_t) to represent an atomicity domain, which specifies a set
   of operations and datatypes over which access to a memory location in a given synchronization 
   phase is guaranteed to be atomic if and only if no other mechanisms or atomicity domains are 
   used to access the same memory location in the same synchronization phase.

 * Define a type (e.g., upc_amo_mode_t) that a user can use to indicate the AMO implementation
   "mode" desired with the following acceptable constants UPC_AMO_DEFAULT, UPC_AMO_HARDWARE,
   UPC_AMO_LATENCY, UPC_AMO_BANDWIDTH.

    * UPC_AMO_DEFAULT would be an implementation-defined default mode that will support ALL of a
      specified set of types and operations for AMOs.  This would almost surely be a software-
      based implementation.  Using this mode, AMOs would always be portable, but not necessarily
      high performance.

    * UPC_AMO_HARDWARE would force the use of hardware-supported AMOs.  It is likely that every
      implementation would vary in the set of types and operations supported for AMOs under
      this mode.  I suspect that users who use this mode would do so out of performance reasons
      and would not, in general, expect cross-platform compatibility.

    * UPC_AMO_LATENCY and UPC_AMO_BANDWIDTH would indicate a user preference for low-latency or
      high-bandwidth (is "throughput" a better term here?) atomics, respectively.  IIRC, on the
      last AMO-centric call, either Steven or Troy noted at least a few times that a user
      favoring high throughput of atomic memory accesses may not necessarily want the hardware
      implementation.  

 * Initially, I envisioned atomicity domain initialization (and destruction) as happening
   similarly to how it's done with UPC locks; so I had upc_amo_domain_alloc/free() functions.
   Yili (and a user) suggested making this a static initializer and strongly encouraging
   compiler optimizations.  I think that this static initialization approach makes more sense.
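
A rough header sketch of the API shape described above (every name here is
provisional, and the type/op enums are only placeholders standing in for
whatever the spec would eventually define):

typedef enum { UPC_AMO_TYPE_INT64, UPC_AMO_TYPE_UINT64 /* , ... */ } upc_amo_type_t;
typedef enum { UPC_AMO_OP_ADD, UPC_AMO_OP_AND, UPC_AMO_OP_OR,
               UPC_AMO_OP_XOR, UPC_AMO_OP_CSWAP /* , ... */ } upc_amo_op_t;

typedef enum {
    UPC_AMO_DEFAULT,      /* implementation-defined; supports all spec'd types/ops */
    UPC_AMO_HARDWARE,     /* only what the hardware supports natively              */
    UPC_AMO_LATENCY,      /* prefer a low-latency implementation                   */
    UPC_AMO_BANDWIDTH     /* prefer a high-throughput implementation               */
} upc_amo_mode_t;

typedef struct upc_amo_domain upc_amo_domain_t;   /* opaque atomicity domain */

/* Query whether a mode supports a given (fetching, type, op) combination. */
int upc_amo_query( upc_amo_mode_t mode, int fetching,
                   upc_amo_type_t type, upc_amo_op_t op );

/* AMO calls name the domain; atomicity is only guaranteed among calls that
 * use the same domain. */
void upc_amo_relaxed( upc_amo_domain_t* dom, void* fetched,
                      upc_amo_type_t type, upc_amo_op_t op,
                      shared void* target, void* operand1, void* operand2 );
void upc_amo_strict(  upc_amo_domain_t* dom, void* fetched,
                      upc_amo_type_t type, upc_amo_op_t op,
                      shared void* target, void* operand1, void* operand2 );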

### Issues or Problems That I Foresee ###

 * This is definitely more complex than what most of my users would like to see.  They would
   most likely only need a short list of operations on 64-bit integer types and only ever use
   one domain.

 * In the scheme above, what happens with the static initialization call when a user specifies
   the hardware-supported mode, but specifies types or operations not supported by the hardware?

 * Code Portability: Performance-minded users might likely use the UPC_AMO_HARDWARE mode,
   however, this code would almost surely NOT be portable.  One user strongly encouraged
   identifying a common subset of types/operations that could be supported across the
   implementation space (but there were concerns, at least initially in our discussions, that
   this intersection is the empty set).  Speaking to vendors' capabilities, Cray has posted what
   they support; IBM has expressed (I think) that they're quite flexible in support; and it's
   not clear to me what SGI, HP, and InfiniBand support as hardware atomics.

 * Implementer/Implementation Burden: This proposal likely requires quite a bit of work on the
   part of the implementers to implement software and, where applicable, hardware support, as
   well as a whole new API.

-----------------------

Hopefully this is something to get the discussion restarted.  I have a few more 
thoughts and
comments from users that I'll post tomorrow.

Original comment by nspark.w...@gmail.com on 28 Aug 2012 at 9:29

GoogleCodeExporter commented 9 years ago
I apologize for being late and stupidly responding to the email list instead of 
posting here.  I will learn.

My comments regarding the last post are below.  I tried to read everything said 
already but might have missed a few details.  It is not my intent to repeat 
previously stated points.

1. For remote atomics, we need to be more explicit about what we mean by 
hardware vs. software.  Here is a partial spectrum of options:

- a single packet injected into the network induces a hardware atomic operation 
in remote memory (example: Blue Gene/Q can do this, at least in one direction)
- an active message that executes atomically in software on the remote side 
does an atomic instruction in hardware (example: Blue Gene/P DCMF_Send w/ 
handler that does lwarx+stwcx)
- an active message that does not execute atomically in software grabs a lock, 
does the update, then releases the lock
- three active messages acquire a remote lock, perform the update, then 
release the lock

I'm not saying that the good options here are portable or that the bad ones 
made sense in any situation, but I hope to convince people that our terminology 
is inadequate.

2. UPC_AMO_HARDWARE should not be an option.  It is meaningless to talk about 
hardware or software implementations on their own.  What a user wants to know 
is fast or slow for a particular usage.  Whether or not the implementation uses 
hardware for that shouldn't be visible to the user.

I'd like to know the use-case where the user specifying UPC_AMO_LATENCY or 
UPC_AMO_BANDWIDTH is actually going to help. Shouldn't the compiler have enough 
semantic information already to
know whether or not to dump a bunch of nonblocking atomics into the network and 
flush them all at once versus using a blocking atomic operation and waiting on 
the round-trip?

3. Regarding any software-based implementations of AMOs, I don't see how 
portable and slow has any benefit over the user doing atomics themselves with 
the existing UPC lock functionality.

Furthermore, while it's rather ambiguous what is meant by a software 
implementation, I would love to know what reasonable architecture can do remote 
atomics as active messages faster than it can in hardware, just from a network 
architecture curiosity perspective, as noted previously by someone on a call 
that I wasn't part of.

4. Regarding Yili's desire for static allocation only...

Can someone describe how the compiler can make use of this?  I'm not aware of 
any architecture where statically allocating memory for remote atomics has any 
performance benefit.  Requiring static
allocation is really unpleasant from a usability perspective.  It reminds me of 
FORTRAN77, which is hardly a language we want to hold in high esteem, except 
perhaps for writing math libraries.

5. Here is my proposal:

Why not just define upc_atomic_int_t without specifying the size?  The header 
can define UPC_ATOMIC_INT_MAX appropriately.  On a 32b system like BGP, 
UPC_ATOMIC_INT_MAX=2^31 while on BGQ or anywhere else that is x86_64 or IA64, 
it's going to be >2^50 (One can imagine a few bits being reserved to enable 
hardware atomics).  The operations that are provided by reasonable hardware 
have already been discussed at length.
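
For illustration only, the header fragment described here might look something
like the following on a 64-bit target (the width and any reserved bits are
implementation choices, not spec text):

#include <stdint.h>

/* Width chosen by the implementation to match what it can do atomically. */
typedef int64_t upc_atomic_int_t;

/* Largest value the atomic type is guaranteed to hold; a 32-bit BG/P-class
 * implementation might instead define this near 2^31. */
#define UPC_ATOMIC_INT_MAX INT64_MAX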

If these and only these operations were supported, UPC would feel a lot like C, 
which seems like a reasonable guiding principle.  Using upc_atomic_int_t solves 
all of the major issues with 32b vs. 64b.  I think that supporting 128b atomics is 
not a good idea because it's getting ahead of where the hardware is and forces 
a slow implementation in software.

I really don't understand why UPC can't support just these operations and let 
the user implement others in software on their own.  The performance hit 
associated with forcing software emulation of simple operations when hardware 
exists is far greater than the benefit of allowing non-simple operations.

Original comment by jeff.science@gmail.com on 30 Aug 2012 at 3:43

GoogleCodeExporter commented 9 years ago
Here are some inlined comments/responses to both Comment 47 and Jeff's previous 
email, which had
a few additional comments.

Jeff said (in Comment 47):
> 1. For remote atomics, we need to be more explicit about what we mean by hardware vs. software.
> Here is a partial spectrum of options: [...]

I'm sure this only illustrates my overly-simplified understanding of things, 
but I see the options
you present as representing one "hardware" implementation (to me, this means 
that the NIC does the
work) and three "software" solutions (the exact implementation of which 
shouldn't be specified by
the UPC Spec).  In general, I think that my users want to be able to get to the 
NIC's atomic
capabilities.

Jeff said (in his initial email):
> So we're talking about atomics that exceed the atomicity of the particular hardware, thereby
> forcing a general implementation to use software despite the fact that all reasonable
> supercomputing hardware provides a reasonable set of remote hardware atomics or at least enables
> something close to them, with the caveat that not all capability is available on all systems
> (e.g. BGP can only do 32b atomics in hardware, while DMAPP only exposes 64b atomics).

I think the caveat you present is the basis for the challenge in defining a UPC 
atomics
specification.  From the last call and the discussion on Issue 7, I think it's 
reasonably clear
that the goals of (1) a "robust" AMO library that supports a wide range of types and operations
and (2) a "hardware-accelerated" library (again, I see this as the NIC doing the work) generally
pull in opposition to each other.

I think you are misunderstanding the intent of the "atomicity domain" (or I did 
a terrible job
explaining my intent).  The use of atomicity domains is intended such that a 
user is *not* forced
into a low-performance software implementation, but has the option to use the 
network hardware to
accelerate shared-pointer AMOs if the requested types/operations are supported 
by the hardware.

Jeff said (email):
> I disagree that trying to use hardware as much as possible leads to lack of portability.  There
> is a set of atomics that are widely portable.  One need only inspect Intel TBB's atomics or GCC
> atomic extensions for examples.  TBB atomics are not just available on Intel processors.  We've
> ported TBB to PPC32, PPC64 and POWER7 and I believe my collaborator who did the PPC ports
> previously did DEC Alpha, Sparc and a whole bunch of weirdo processors with non-x86 memory models.

My interest in the/a UPC AMO specification is to have AMOs on UPC shared 
pointers.  I think the
field of network hardware support for atomics is a little less even.  In the 
long term, I would love
to see a standard base of network atomics supported by all the major HPC 
vendors.  The intersection
may be non-empty at present, but specifying the minimal common set doesn't 
quite satisfy the
interested parties that want to also provide a robust AMO API in UPC.

Jeff said (email):
> Does anyone know of hardware that does not support atomic load, store, 
fetch-and-add and
> compare-and-swap for either 32b or 64b integers?  Do we want features beyond 
these?

I definitely want to see fetch-and-{and, or, xor} added to your list.  
Otherwise, I don't see the
point in defining an AMO spec.  I also want to see the non-fetch versions; 
i.e., atomic-{add, and,
or, xor}.  You've already noted that DMAPP only exposes 64-bit atomics.  I 
would be okay with only
64-bit atomics, but I'm not sure everyone else would.

Jeff said (Comment 47):
> 4. Regarding Yili's desire for static allocation only... [...]

It's funny how perspectives differ between users.  One of my users said he 
*wouldn't* want to
dynamically allocate domains.  He wants a statically-allocated domain in a 
library he'll use across
his applications that supports exactly what he wants on the hardware he uses.

I'm not a compiler person, so I probably make too many assumptions about what 
magic can happen behind the scenes; however, I can imagine a scenario in which 
the compiler, after 
seeing that only one
"hardware-only" domain is used, does something like replace the upc_amo() calls 
with inlined calls to
the network API.  I suspect this would provide some performance benefit on some 
systems where 
pass-through function calls might be costly, especially if one was making a lot 
of atomic accesses.

Jeff said (Comment 47):
> Why not just define upc_atomic_int_t without specifying the size?  The header 
can define 
> UPC_ATOMIC_INT_MAX appropriately.  On a 32b system like BGP, 
UPC_ATOMIC_INT_MAX=2^31 while on BGQ or 
> anywhere else that is x86_64 or IA64, it's going to be >2^50 (One can imagine 
a few bits being 
> reserved to enable hardware atomics).  The operations that are provided by 
reasonable hardware have 
> already been discussed at length.

I don't think that restricting an atomic datatype to something less than the 
full 64 bits is a good
idea.  I'm not pushing for 128-bit atomics, but I think 64-bit and 32-bit 
atomics make sense.  If your
solution is to give users access to ~50 bits of a 64-bit integer (and you seem 
to feel strongly about
exposing hardware performance), how does the NIC properly handle bitwise 
operations on only the
"user bits"?  And how much of an application already using a vendor's 
extensions for 64-bit AMOs would
have to be rewritten to only use the "user bits."  I don't see how your 
solution /wouldn't/ force a
software implementation.

Jeff said (Comment 47):
> I really don't understand why UPC can't support just these operations and let 
the user implement
> others in software on their own.  The performance hit associated with forcing 
software emulation of
> simple operations when hardware exists is far greater than the benefit of 
allowing non-simple
> operations.

Again, by creating a domain with the UPC_AMO_HARDWARE flag, the intent is that 
one would *not* be
forced into a software implementation, but would be explicitly selecting a 
hardware-only implementation.
The last thing that I want is to force software emulation of simple AMOs in all 
cases.  But I also
appreciate the view that a UPC AMO library could benefit from providing a 
vendor-independent, wide
base of types/ops for users who might not demand the highest level of 
performance.

Jeff said (email):
> Another issue here is ordering.  What's the ordering requirement in UPC?  So 
I dump 100 atomics
> into the network targeting the same remote address.  Are they going to 
complete in-order?  That's
> probably a much bigger performance hit on Cray Gemini or PERCS than anything 
entailed by latency
> vs. bandwidth.  Forgive me for being a UPC noob if there's something in the 
language spec. about
> ordering of load-store already, but I would be surprised if UPC can say more 
than C, which is going
> to depend on the architecture for ordering semantics.  Requiring order in 
remote atomics when
load-store aren't necessarily ordered on various processors seems unreasonable. 

Ordering would necessarily be controlled by whether one makes a relaxed atomic 
access or a strict
atomic access.  I expect that my users would almost exclusively use relaxed 
atomic accesses, so no ordering of the atomics would be expected (until one 
uses a upc_fence).
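
For illustration, a minimal sketch of what I have in mind; the relaxed 
fetch-and-add here is only a placeholder name (the actual AMO interface is 
still being discussed in this thread), while upc_fence is the standard UPC 
statement:

#include <upc.h>
#include <stdint.h>

/* Hypothetical placeholder for a relaxed 64-bit fetch-and-add. */
int64_t upc_relaxed_fetchadd64(shared int64_t *target, int64_t value);

shared int64_t hits;

void tally(void)
{
    for (int i = 0; i < 100; i++)
        upc_relaxed_fetchadd64(&hits, 1);  /* relaxed: may complete in any order */
    upc_fence;  /* ordering point: all earlier shared accesses complete here */
}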

Original comment by nspark.w...@gmail.com on 30 Aug 2012 at 5:05

GoogleCodeExporter commented 9 years ago
Not sure if anyone else has looked at C++11 for inspiration yet, but the way it 
provides atomics is that there are separate atomic types (e.g., 
std::atomic<int>) and each type provides an is_lock_free() member function to 
query whether AMOs on that type are implemented using locks or not.  An 
implementation could have is_lock_free() return true for an int type but false 
for a long type. The fact that one uses locks and the other doesn't use locks 
is not an issue because int and long are essentially in separate atomicity 
domains, to use the terminology of our recent discussion.  It differs from our 
atomicity domain concept in that is_lock_free() covers all required atomic 
operations on the type, so if there is hardware support for an atomic add but 
an atomic xor requires a lock, then the implementation would report that the 
atomic type is not lock free and it would use locks for both.  Thus, the user 
that cares only about fast atomic adds is out of luck.  The idea of creating an 
atomicity domain in UPC and specifying which operations actually will be used 
in the program is one way out of this problem, but it is more complicated.  So 
I'm undecided whether a UPC atomicity domain concept should cover just the type 
or the type and the possible operations.
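
For reference, C11 exposes essentially the same query through <stdatomic.h>; 
the following is just a plain C sketch of that mechanism (UPC being C-based), 
not a proposal for the UPC library:

#include <stdatomic.h>
#include <stdio.h>

int main(void)
{
    atomic_int   ai = ATOMIC_VAR_INIT(0);   /* roughly std::atomic<int> */
    atomic_llong al = ATOMIC_VAR_INIT(0);   /* roughly std::atomic<long long> */

    /* The answer may differ per type; as with is_lock_free(), it covers all
       operations on that type, not individual operations. */
    printf("int lock-free:       %d\n", atomic_is_lock_free(&ai));
    printf("long long lock-free: %d\n", atomic_is_lock_free(&al));
    return 0;
}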

Original comment by johnson....@gmail.com on 6 Sep 2012 at 5:59

GoogleCodeExporter commented 9 years ago
This is in response to Jeff's comment 47:
"4. Regarding Yili's desire for static allocation only...

Can someone describe how the compiler can make use of this?  I'm not aware of 
any architecture where statically allocating memory for remote atomics has any 
performance benefit.  Requiring static
allocation is really unpleasant from a usability perspective.  It reminds me of 
FORTRAN77, which is hardly a language we want to hold in high esteem, except 
perhaps for writing math libraries."

YZ: I guess there is some confusion between Atomic Domains and Atomic Variables, 
mostly because we haven't formally defined what an Atomic Domain is yet :-).  
Nick has a draft document about the definition of Atomic Domains and 
the related API but it's not published yet.

The suggested static allocation/initialization is only for Atomic Domains. An 
atomic domain is not an atomic variable nor an address to such a variable. 
Roughly speaking, an atomic domain defines a scope within which the atomic 
operations are atomic with respect to one another.

Atomic Domains were motivated by some observations about the practical usage of 
atomic operations and by earlier discussions on the topic.

1) There can be more than one hardware component in a compute node that can 
access/modify the same memory.  An example would be a Cray XK6 node with a 
Gemini NIC and a GPU -- the CPU, the NIC (connected by HyperTransport) and the 
GPU (connected by PCI) all support "hardware atomic operations" to the host 
memory but the atomic operations issued from one component are not atomic with 
respect to those from other components.

2) In many practical cases, an app only needs a particular atomic op type in a 
small region of code or data.  For example, one may need compare-and-swap when 
implementing a lock-free queue. Another may need fetch-and-add for implementing 
a Particle-In-Cell code.  But it's uncommon to apply both compare-and-swap and 
fetch-and-add to a single memory location.

Therefore, we attempt to provide a mechanism (through Atomic Domains) for users 
to specify the intended usage pattern of atomic ops, and thus give the 
implementation more information and opportunities to optimize performance.

Proposed API for Atomic Domains (by Nick):

upc_domain_t *upc_global_domain_alloc(upc_op_t ops, upc_type_t types,
                                      upc_amo_mode_t mode);

upc_domain_t *upc_all_domain_alloc(upc_op_t ops, upc_type_t types,
                                   upc_amo_mode_t mode);

void upc_domain_free(upc_domain_t *ptr);

void upc_amo_strict(upc_domain_t *domain, void *fetch_ptr, upc_op_t op,
                    upc_type_t type, shared void *target,
                    void *operand1, void *operand2);

void upc_amo_relaxed(upc_domain_t *domain, void *fetch_ptr, upc_op_t op,
                     upc_type_t type, shared void *target,
                     void *operand1, void *operand2);

What I suggested for static allocation/initialization for Atomic Domains is 
something like:

upc_domain_t global_domain = DOMAIN_INITIALIZER(UPC_CSWAP, UPC_INT64 | UPC_UINT64);

This permits but doesn't require compiler optimization because the intended 
atomic usage pattern is known at compile time.  It also makes common usage 
easier to write (no worry about alloc and free).
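
For illustration only, a minimal usage sketch of the proposed API; the op/type 
names (UPC_ADD, UPC_INT64) and the exact DOMAIN_INITIALIZER/upc_amo_relaxed 
behavior are just assumptions here, since nothing is specified yet:

#include <upc.h>
#include <stdint.h>

/* Statically initialized domain: fetch-and-add on 64-bit signed ints only. */
upc_domain_t add_domain = DOMAIN_INITIALIZER(UPC_ADD, UPC_INT64);

shared int64_t counter;   /* has affinity to thread 0 */

void bump(void)
{
    int64_t one = 1;
    int64_t old;
    /* Relaxed fetch-and-add on the shared counter; 'old' receives the
       previous value.  operand2 is unused for an add, so pass NULL. */
    upc_amo_relaxed(&add_domain, &old, UPC_ADD, UPC_INT64,
                    &counter, &one, NULL);
}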

Original comment by yzh...@lbl.gov on 6 Sep 2012 at 6:08