I'd like to see a bit of discussion on the pros and cons of the three
approaches before we decide.
My perspective is that there are two issues of some degree of importance:
1) How full a collection of types and operations we want. There are quite a
few entry points and our experience is that only a few are used. But if
implementers and documenters are not concerned, I don't think users will be.
2) How we "spell" them. Again, I don't think users will be concerned as they
will likely use a macro in their code no matter what we do :)
Original comment by w...@uuuuc.us
on 16 Mar 2012 at 11:23
Is it acceptable if we came up with a subset of the BUPC functions?
(1) I dislike the "local" versions of the atomic functions (they feel like
oversweet syntax sugar to me)
(2) I don't really "get" the mswap operator, and I can't identify any
particular unique use of it.
Original comment by ga10...@gmail.com
on 24 Apr 2012 at 9:07
Gheorghe wrote:
(1) I dislike the "local" versions of the atomic functions (they feel like
oversweet syntax sugar to me)
(2) I don't really "get" the mswap operator, and I can't identify any
particular unique use of it.
In response to (1): How does one perform atomic operations on private pointers
if one removes the "local" functions? Manual "privatization" of
pointer-to-shared is a common optimization, and the upc_cast() under
consideration for the spec will make it MORE common. Unless one also provides
a way to convert private->shared [Ick!] then the "local" variants of the atomic
operations will be needed to avoid potentially forcing the user to keep track
of both private and shared pointers to the same datum.
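For context, a hedged sketch of the usage pattern being defended (the atomic
function names below are placeholders, not the actual BUPC spellings):
shared int64_t *sp = ...;        /* pointer-to-shared with affinity to this thread */
int64_t *lp = (int64_t *)sp;     /* manual privatization of the local part         */
/* with only shared variants, sp must be kept around solely for the atomics: */
atomic_fetchadd_shared(sp, 1);
/* a "local" variant lets the privatized pointer serve both roles:           */
atomic_fetchadd_local(lp, 1);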
In response to (2): The "mswap" (masked swap) is, I believe, intended to aide
in implementation of atomic updates to "flag bits" (think bit-fields w/o the
help of the syntax). It is among the SHMEM atomics, and thus made its way on
to the "short list" when collecting information from Lauren about her community
of programmers.
Original comment by phhargr...@lbl.gov
on 24 Apr 2012 at 9:22
Hi Paul, in response to your argument about private pointers - I could argue
(not excessively facetiously) that it is none of my damn business what you do
with private pointers - UPC is about shared data and pointers. ... In fact I
think Yili used this as one of his arguments to shut me down when I argued that
collectives should take private pointers to data.
I agree with you that converting private pointers to "fat" pointers may not be
such a hot idea. It so happens that in xlupc we could do it without breaking a
sweat, but I cannot say whether that would be a path to suicide on other
systems.
Damn me for seeing both sides of the issue. But in return I would have you
acknowledge the essential awkwardness of having "local" versions of every UPC
function.
Original comment by ga10...@gmail.com
on 25 Apr 2012 at 2:41
I *do* acknowledge that "local" versions of every function would be a mess, but
I doubt that polymorphism as an alternative will get many supporters. So, it
comes down (in my mind) to what do you LOSE if no local variant is included.
One can argue against local versions of collectives by claiming one can always
make a copy. The act of copying some atomic datum sort of destroys its
purpose. So, I think an argument could be made for why this might be a special
case. However, I won't get too hung up on this, as BUPC will continue to
support local atomics as an extension if they are not included in the spec.
So, are there any other opinions on the inclusion/exclusion of atomic operations
on pointer-to-private?
Original comment by phhargr...@lbl.gov
on 25 Apr 2012 at 8:09
Going back to Bill's two issues:
1) How full a collection of types and operations we want.
2) How we "spell" them.
To (1): I think the minimal set of types that my users are interested in is
T={int64_t, uint64_t}. For operations, the primary interest is in fetch-and-OP
and OP (no fetch), where OP={ADD, AND, OR, XOR}. While there is interest in
compare-and-swap, I think this is a good bit further down their list of
priorities. I am pretty sure that we only care about AMOs as relaxed shared
accesses.
The interest in the non-fetching atomic OP is that it would be a non-blocking
call for which completion is only guaranteed by the next fence. The goal would
be that one could issue a large set of atomic OPs for high throughput.
This is definitely a reduced subset from what BUPC and Cray offer. Maybe this
explains my position on George's mswap question. I understand that some
implementers may want to expand the type and operation sets, but I think this
is the minimal set that I care about.
To (2): While everyone will probably have their own desired flavor of spelling,
I would probably go for something relatively short, like:
TYPE upc_amo_fopT(OP, shared TYPE* p, TYPE v);
void upc_amo_opT(OP, shared TYPE* p, TYPE v);
With this, we could use the existing upc_op_t definitions for OP (only
accepting a subset of them, naturally). This would bring up the "where should
we put the upc_op_t enum?" issue, as it is currently part of the collectives
library and sharing this with an AMO library (which would make sense) would
mean they'd need some common header for these types. This is just another
version of the upc_flag_t discussion in Issue #10.
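Purely as illustration, a hypothetical 64-bit instantiation of that spelling
(reusing the collectives' UPC_ADD constant as suggested above) might be used as:
shared int64_t counter;
int64_t prev;
upc_amo_op_int64(UPC_ADD, &counter, 1);          /* non-fetching: completion by next fence */
prev = upc_amo_fop_int64(UPC_ADD, &counter, 1);  /* fetching: returns the prior value       */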
Original comment by nspark.w...@gmail.com
on 28 Apr 2012 at 9:46
1) I support Nick's request for having atomic OPs without fetch because they
can have better performance when fetch is not needed. And I see at least one
app (Graph500) that can benefit from atomic OPs without fetch. I would like to
propose to extend OP to include MAX and MIN, i.e., OP={ADD, AND, OR, XOR, MAX,
MIN}.
FYI, MPI_Accumulate is something similar.
2) Does UPC guarantee atomicity for basic ops with built-in types?
For example, assuming int64_t == long long in C99,
shared int64_t *p;
int64_t a;
Is there a difference between:
i) (*p) += a;
ii) upc_amo_op_int64(ADD, p, a);
3) For the discussion of AMOs with private/local pointers, if we want to
include them in UPC spec, we should probably consider their compatibility
and/or potential redundancy with C11 atomics.
Original comment by yzh...@lbl.gov
on 29 Apr 2012 at 12:56
I don't think my users really care about local atomics. It might make more
sense to address shared atomics now and save local atomics for the bigger
discussion of whether UPC moves to C11.
Original comment by nspark.w...@gmail.com
on 30 Apr 2012 at 8:49
With regard to "Does UPC guarantee atomicity for basic ops with built-in
types?", the answer is unequivocally no. As far as the memory model is
concerned, (*p) += a; becomes (*p) = (*p) + a; which becomes (in pseudo code,
READ is either a relaxed or strict read of a shared object, WRITE is either a
strict or relaxed write of a shared object):
READ( *p ) => t1
READ( a ) => t2
t1 + t2 => t3
t3 => WRITE( *p )
There is nothing to guarantee that some other thread doesn't come in and modify
*p or a after the local thread reads it, but before it writes the new result
back to *p. UPC statements do not have transaction semantics (though it'd
likely be a useful extension if anyone wants to come up with such a proposal!).
Assuming all strict accesses, the compiler/runtime must ensure that this race
is consistent in that all threads observe the same ordering, but it doesn't
need to do anything to prevent the race from occurring. For relaxed accesses,
it doesn't even need to do that, though local ordering must still be maintained.
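To make the race concrete (reusing the hypothetical upc_amo_op_int64() spelling
from the question above):
shared int64_t hits;
/* racy: two threads may both READ the same value and both WRITE value+1,
   silently losing one of the increments */
hits += 1;
/* atomic: the read-modify-write is indivisible, so no increment is lost */
upc_amo_op_int64(ADD, &hits, 1);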
Original comment by sdvor...@cray.com
on 11 May 2012 at 4:22
To amplify Yili's point about adds w/o a fetch. Does this make sense as far as
semantics?
Level 1: basic atomic operation. Essentially, guaranteeing that e.g. a+=b
happens atomically. Examples: atomic increment, atomic set, atomic or, xor, ...
Level 2: fetch + basic atomic operation. There is one for every operation
defined in Level 1. The value *before* the operation is returned to the user.
Level 3: compare + fetch + op. The operation supplies two values - a "compare"
value and an "update" value - and returns the "old" value. The operation is
executed if the "old" value matches the "compare" value. The "old" value is
returned in any case. Typical example: compare-and-swap, which is really a
compare+fetch+set.
Is there anything you can think of that is not covered by this taxonomy?
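For reference, a minimal sketch of the three levels as C prototypes (hypothetical
names; 64-bit integers used as the example type):
/* Level 1: atomic op, nothing returned */
void    upc_atomic_add_i64  (shared int64_t *p, int64_t v);
/* Level 2: fetch + op, returns the value held *before* the operation */
int64_t upc_atomic_fadd_i64 (shared int64_t *p, int64_t v);
/* Level 3: compare + fetch + op; here op = set, i.e. compare-and-swap */
int64_t upc_atomic_cswap_i64(shared int64_t *p, int64_t cmp, int64_t v);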
Original comment by ga10...@gmail.com
on 22 May 2012 at 1:41
--- AMO Taxonomy & Hardware Support ---
I don't think this taxonomy covers the "masked swap" in the BUPC AMO
extensions. I don't know how strongly people feel about this particular AMO,
but (and this may be a stupid reason), I would be inclined to leave it out for
the sake of having a more concise set of function declarations.
Again, the spelling isn't /that/ important, but I think this would be a
relatively terse set:
void upc_amo_opT( upc_op_t op, shared TYPE* ptr, TYPE val );
TYPE upc_amo_fopT( upc_op_t op, shared TYPE* ptr, TYPE val );
TYPE upc_amo_cfopT( upc_op_t op, shared TYPE* ptr, TYPE cmp, TYPE val );
From what I can see, there seem to be compare-and-swap extensions, but not the
more-general compare+fetch+op. For the implementers, would this general 'Level
3' AMO be an implementation challenge? More specifically, would it not see the
same level of hardware support that the others do? Or, would a lack of
hardware support for a general 'Level 3' AMO constrain the performance of
compare-and-swap (or Level 1 & 2 AMOs) in order to guarantee atomicity?
--- Local AMO Support ---
Thinking back to the issue of local AMO support, it seems from existing
extensions that local-pointer AMOs are generally not atomic with respect to
shared-pointer AMOs. It's a somewhat confusing point, so (maybe I'm beating a
dead horse) I'd probably be inclined to leave out the local AMOs to prevent
this sort of confusion. I expect the typical use case for AMOs to be on shared
memory anyway, but maybe that's an incorrect assumption.
--- Relaxed vs. Strict ---
One item not yet discussed here is whether the AMO function definitions should
explicitly address whether the accesses are strict or relaxed (as in BUPC) or
elide the distinction (as I think is the case in Cray UPC). I think I'd prefer
to leave out the relaxed/strict distinction in the AMO function definition and
leave the access to be determined by the reference-type qualifier (or the
associated pragma).
Original comment by nspark.w...@gmail.com
on 22 May 2012 at 2:59
Yes, oops - I forgot the masked-swap operation. I support Nick's motion to
leave it out *unless* someone can think of a "killer app" for this. Please
speak up :)
--- Hardware support ---
'Level 3' would obviously not be a challenge for IBM - I would not have
suggested it otherwise [insert evil grin here]. But you bring up an important
point. All these operations can be emulated given a set of basic primitives -
and those primitives are different on every vendor's HW.
* Is there a canonical subset of these operations that will have "native"
performance on most vendors' HW?
* If this canonical subset can be identified, maybe we should highlight this
subset in some way in the AMO specification?
--- Local AMO support ---
If we add local AMOs they should be interoperable with shared ones - or else a
lot of user confusion will result. So binary decision: either guarantee
interoperability or leave them out completely (not UPC's concern).
Original comment by ga10...@gmail.com
on 23 May 2012 at 2:05
Gheorghe wrote:
> Yes, oops - I forgot the masked-swap operation. I support Nick's motion to leave
> it out *unless* someone can think of a "killer app" for this. Please speak up :)
Tracker issue #35 discusses writes to shared bit fields without disrupting
adjacent ones. Providing that assurance would require that a masked-swap operation
exist within the runtime implementation. If that is the case, then the
question becomes whether one exposes this capability to the UPC user as a part
of the atomics library as well.
Original comment by phhargr...@lbl.gov
on 23 May 2012 at 8:23
Nick wrote:
> I don't think this taxonomy covers the "masked swap" in the BUPC AMO extensions.
> I don't know how strongly people feel about this particular AMO, but (and this
> may be a stupid reason), I would be inclined to leave it out for the sake of
> having a more concise set of function declarations.
Berkeley includes the masked-swap due to input we received from Lauren Smith.
We are quite willing to leave it out of the spec and retain it as only a
Berkeley extension.
> I think I'd prefer to leave out the relaxed/strict distinction in the AMO function
> definition and leave the access to be determined by the reference-type qualifier
> (or the associated pragma).
Unless I am missing something important, what Nick requests above is not
possible in a library function. Neither the relaxed/strict qualification of
the pointer nor the pragma in effect at the call site can be known inside the
called function. Now if this were UPC++ we might have a chance via
polymorphism assuming relaxed/strict are significant in the type matching.
If support were "deeper" than a library of functions (including some compiler
support), then what Nick requests would become possible. That would make
atomic operations more along the lines of "compiler intrinsics" than functions.
I don't have any strong objection to that, but it may significantly raise the
burden on an implementer (the actual burden being very implementation specific
already).
Original comment by phhargr...@lbl.gov
on 23 May 2012 at 8:33
Leaving aside the issue of compiler support, is there any implementation of UPC
where the difference between strict and relaxed is *not* a UPC fence?
Could we leave strict AMOs out of the picture and rely on users being able to
bracket the AMOs with fences?
Original comment by ga10...@gmail.com
on 30 May 2012 at 11:43
Gheorghe asked:
> Leaving aside the issue of compiler support, is there any implementation of UPC
> where the difference between strict and relaxed is *not* a UPC fence?
>
> Could we leave strict AMOs out of the picture and rely on users being able to
> bracket the AMOs with fences?
It is not as simple as that...
In the BUPC implementation of "upc_fence" we need to include both architectural
memory fences and a compiler optimization fence. In the AMOs on some
architectures the atomic instructions already imply the architectural memory
fence (the LOCK prefix on x86/x86-64 being the most important example to those
outside of IBM). So, asking a user on such an architecture to use BOTH an AMO
and a upc_fence would result in TWO (or more, see below) memory fences.
Additionally, what is the user expected to use:
Option 1) upc_fence(); relaxed_AMO();
Option 2) relaxed_AMO(); upc_fence();
Option 3) upc_fence(); relaxed_AMO(); upc_fence();
In Option 1 it is possible for shared accesses after the AMO to move "up" and
take place between the AMO and the fence. This is OK for "release" semantics.
Conversely, in Option 2 shared accesses before the AMO might "move down" and
take place after the AMO. This is OK for "acquire" semantics.
Only with Option 3 do we get the property that the name "strict AMO" implies to
me: all shared accesses issued before the AMO complete, then the AMO completes
before any later references can begin. That is what I believe 5.1.2.3 of the
UPC 1.2 spec says for a strict access, and is therefore what I think we should
provide for a "strict AMO".
BUPC's strict AMOs are intended to "work like" Option 3, but typically w/o
incurring THREE architectural memory fences.
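To make the acquire/release distinction concrete, here is a hedged sketch using
placeholder names in the same spirit as relaxed_AMO() above (strict_AMO_set() and
strict_AMO_fetch() are not real functions):
/* Thread 0: fill the buffer, then atomically publish it via the flag */
buf[i] = ...;                          /* relaxed shared writes                  */
strict_AMO_set(&flag, 1);              /* needs "release": writes must not sink  */
/* Thread 1: atomically observe the flag, then consume the buffer */
while (strict_AMO_fetch(&flag) == 0) ; /* needs "acquire": reads must not hoist  */
... = buf[i];                          /* relaxed shared reads                   */
Option 1 alone covers the publishing side (release) but not the consuming side,
and Option 2 is the reverse; only Option 3 gives a single definition that is safe
in both roles.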
Original comment by phhargr...@lbl.gov
on 1 Jun 2012 at 4:28
I see your point. I withdraw my proposal about no strict AMOs. So it boils down
to a choice between:
* Move AMOs deep inside the compiler just to help figure out whether AMOs are
strict or relaxed, based on whether we are in strict or relaxed mode, whether the
variable is denoted as strict or relaxed, etc.
* Make AMO strictness/relaxedness explicit and double our namespace complexity.
This is similar to the "strict library approach vs. get the language involved"
dichotomy that also plagues issues 41 (nonblocking memory copies) and 42
(nonblocking collectives). I sense that we will have to take a unified approach
to decide all three of these.
Original comment by ga10...@gmail.com
on 15 Jun 2012 at 3:26
As suggested by Nick, I'd like to have part-time ownership of this issue as we
write it up. I'm willing to take full-time ownership, but I certainly don't want
to muscle anyone else out. -- George
Original comment by ga10...@gmail.com
on 15 Jun 2012 at 5:18
In issue #41 I am backing down from my position that changes to upc_fence()'s
implementation are unacceptable. So, in this issue, perhaps we should poll the
implementations to determine if "getting the compiler involved" is a reasonable
possibility before assuming that it is not. While that would mean that the
proposed extension is not strictly (pun totally intentional) a pure library, it
would avoid the doubled namespace.
So, the question is:
Does your implementation (or could it w/o excessive burden) have sufficient
"smarts" to distinguish calls to an AMO in which a dereference of the pointer
argument is strict vs relaxed?
For Berkeley UPC, the answer is YES.
As a source-to-source translator we generate different calls to our
communication library for strict and non-strict accesses. By treating AMOs as
compiler intrinsics, rather than as calls to arbitrary C functions, we could
leverage the same internal mechanism(s) to implement distinct relaxed/strict
versions INTERNALLY, while using only a "generic" name in the user's code.
So, are other implementers able/willing to consider AMOs that have a
polymorphic aspect with respect to relaxed-vs-strict?
Original comment by phhargr...@lbl.gov
on 16 Jun 2012 at 1:14
I don't really see a doubling of the interface for strict/relaxed to be that
much of a problem. Yes, it makes our header file a little bit longer, but the
documentation can be written in a generic way (as in the BUPC AMO spec) to
cover both cases and avoid page bloat of the spec. This seems preferable to
creating a large number of compiler intrinsics (which will then be harder to
change as the spec evolves) or trying to explain to the user how this is a
library but has magical extra properties. It also allows third party
implementations of atomics (e.g., proof-of-concept prototypes, open-source
reference implementations), which would otherwise be prohibited.
There are already examples of this type of interface doubling in the C spec,
for a similar reason (lack of argument polymorphism): see the wide-character
library in C99 7.24 and wchar.h (which basically duplicates stdio.h, string.h,
time.h, and ctype.h in their entirety).
Original comment by danbonachea
on 16 Jun 2012 at 7:19
I think I'm okay with the interface doubling from using a suffix for strict or
relaxed AMOs. As Dan points out, it doesn't necessarily ruin the
documentation. I didn't realize at first how this would affect the compiler or
the pure library approach. I'd also like to be part of writing the spec text,
along with George (and Yili, I think).
I am curious as to what Cray does with their current global AMOs with regard to
strict vs. relaxed accesses in their extensions.
(Updated 'Type' to "Enhancement")
Original comment by nspark.w...@gmail.com
on 18 Jun 2012 at 9:16
Our global AMOs are essentially treated as relaxed updates. This is true even
when forcing relaxed accesses to be strict via pragma or the inclusion of
upc_strict.h (which is probably a bug now that I think about it).
Original comment by sdvor...@cray.com
on 18 Jun 2012 at 9:36
Steven wrote:
> Our global AMOs are essentially treated as relaxed updates.
Does this mean that Cray AMOs cannot be used, for instance, to implement a
semaphore (because the UP lacks release semantics and the DOWN lacks acquire
semantics) without the addition of a strict reference (such as a upc_fence)?
I am asking because I want to better understand what users expect to DO with
AMOs.
For the case where the value of the atomic variable is of importance by itself
(as an accumulator, for instance) the relaxed access is sufficient. However,
once you use the atomic variable's value to control when/if one accesses
additional locations (spinlock, semaphore, etc.) there needs to be a "strict"
somewhere. As I illustrated for George, there is a strong motivation to avoid
making the user insert fences for this purpose. Does anybody have users that
use atomics in this way?
Original comment by phhargr...@lbl.gov
on 18 Jun 2012 at 10:06
I think I'll have to expand a little--they don't really fit in with the current
UPC memory model right now.
The global AMOs are "relaxed" in the sense that they do not provide a full
fence like strict accesses do. They do provide acquire semantics, so relaxed
accesses issued after an AMO will be ordered "correctly". You still
technically need a strict write (or a fence followed by a relaxed write) for
release semantics. However, many users have noticed that a relaxed write alone
works in most cases--and is much faster--and therefore leave the fence out
until something breaks.
With regard to what users do with them, I can't really answer that because we
typically don't get to see source code from the customers that use them. That
said, I'd guess that it's more the former (updating a value) than the latter
(synchronization) at this point given the bugs we've seen to date.
Original comment by sdvor...@cray.com
on 18 Jun 2012 at 11:45
I'm coming a bit late to this discussion, but I really like that we're
exploring passing an atomic op enum to a few functions instead of having one
function per operation. Cray has been stuck with supporting a variety of
_amo_* functions because that was how it was originally implemented, but
internally we use an enum passed to just a few functions, very much like
Comment #11. The legacy support has caused numerous headaches when adding
support for new AMO operations just due to entry point explosion.
Also, for historical reasons, the Cray AMO extensions work on either local or
shared data. Aside from these extensions, our users have the option of using
the same builtin syntax that GCC provides for local AMOs in C; however, our
GCC-style builtins and the Cray AMO extensions are not atomic with respect to
each other due to the way the hardware works. Therefore, I can fully
sympathize with not wanting to provide local AMOs in UPC because if we did so,
it would be natural for users to expect the local UPC AMOs to be atomic with
respect to the global UPC AMOs...and some systems may not be able to support
that.
Original comment by johnson....@gmail.com
on 19 Jun 2012 at 2:57
Troy said:
> I really like that we're exploring passing an atomic op enum to a few functions
> instead of having one function per operation.
Would the implementers here be interested in reducing the interface size by
including the TYPE as a function parameter? I had thought about that, but it
does not seem to be common in UPC (including extensions -- except for the BUPC
Value-Based Collectives interface).
George generalized compare-and-swap into compare-fetch-op, noting with an evil
grin that IBM could support the general case. Is this general case of interest
to other vendors (and would they be hardware-supported)? Or is CAS the common
subset of this class that is supported by most vendors?
From a spec-writing perspective, would it make sense for the spec to include
compare-fetch-op with "set" as the only required op and leave other operations
as vendor-supported options? This could allow us to potentially expand the
list of required operations in future releases if multiple networks increased
hardware AMO support without drastically changing the AMO spec.
Original comment by nspark.w...@gmail.com
on 19 Jun 2012 at 3:32
Paul wrote:
> BUPC's strict AMOs are intended to "work like" Option 3, but typically w/o
> incurring THREE architectural memory fences.
Why does option 3 incur three architectural memory fences? It seems like a
trivial peephole optimization for the compiler to throw away the superfluous
fences, assuming the compiler has sufficient knowledge of how the target
runtime works.
Original comment by sdvor...@cray.com
on 19 Jun 2012 at 3:46
I haven't read the AMO spec in detail, but would like to note that it would be
convenient if the compare-and-swap operation supported 128-bit data types
(presumably aligned on at least a 64-bit boundary). This comes up in UPC
applications (and UPC runtimes) when there is a need to compare-swap a
pointer-to-shared value. For GUPC, using the "struct" PTS representation on a
64 bit host, a fully general PTS is stored in a 128-bit container. Perhaps a
feature macro is needed that indicates whether the AMO implementation supports
compare-swap on 128-bit sized values. Also, perhaps, the minimum alignment
needs to be indicated via a pre-processor macro.
Original comment by gary.funck
on 19 Jun 2012 at 3:50
Going back to the memory semantics, perhaps we should consider providing more
fence options than simply upc_fence? This would benefit both the AMOs and the
non-blocking proposal (our non-blocking proposal includes acquire semantics for
the completion of non-blocking operations). Should that be split out to a
separate issue?
Original comment by sdvor...@cray.com
on 19 Jun 2012 at 4:11
In comment #28 Steven wrote:
> Paul wrote:
>> BUPC's strict AMOs are intended to "work like" Option 3, but typically w/o
>> incurring THREE architectural memory fences.
>
> Why does option 3 incur three architectural memory fences? It seems like a trivial
> peephole optimization for the compiler to throw away the superfluous fences,
> assuming the compiler has sufficient knowledge of how the target runtime works.
I agree that this is a trivial optimization if atomics are "known" to the
compiler. But the current implementation is a LIBRARY and the compiler doesn't
know a call to an AMO from any other function call.
Original comment by phhargr...@lbl.gov
on 19 Jun 2012 at 5:09
In comment #27 Nick asks:
> Would the implementers here be interested in reducing the interface size by
> including the TYPE as a function parameter?
This would not work for any function which returns a value. So we would need
to pass a pointer to the result in any function generating a result. For this
reason I dislike passing the type.
> George generalized compare-and-swap into compare-fetch-op, noting with an evil
> grin that IBM could support the general case. Is this general case of interest
> to other vendors (and would they be hardware-supported)? Or is CAS the common
> subset of this class that is supported by most vendors?
If even ONE required operation lacks h/w support, then we risk requiring ALL
operations being implemented via software just to ensure they are all atomic
with respect to each other. Therefore I strongly support the idea that
compare-and-swap be required but nothing more general. I would actually go so
far as to discourage documenting OPTIONAL atomics in the spec text because this
would encourage writing of non-portable code.
What I *would* encourage is that vendors providing extensions to the atomics
(more operations, more types, support for "private", etc) all agree OUTSIDE OF
THE SPEC on the "spelling" of their extensions. This paves a smooth(er) path
to their later addition to the spec, and eases their use.
Original comment by phhargr...@lbl.gov
on 19 Jun 2012 at 5:27
Paul wrote:
> Therefore I strongly support the idea that compare-and-swap be required but nothing
> more general. I would actually go so far as to discourage documenting OPTIONAL
> atomics in the spec text because this would encourage writing of non-portable code.
>
> What I *would* encourage is that vendors providing extensions to the atomics
> (more operations, more types, support for "private", etc) all agree OUTSIDE OF
> THE SPEC on the "spelling" of their extensions. This paves a smooth(er) path to
> their later addition to the spec, and eases their use.
If you're going to go that far, I'd say let's just abandon the AMOs in the spec
altogether. Putting only compare-and-swap in the spec would encourage users to
only use compare-and-swap in portable codes. So, for example, if they needed
to do an atomic fetch-and-add in a portable fashion (say, to atomically reserve
array elements...), they'd need to do something like:
do {
    old = last;
    new = old + reservation_size;
} while( upc_amo_cas( &last, old, new ) != old );
This is going to perform terribly on most systems, particularly in the presence
of contention, which will only get worse as you scale up the number of threads.
The point of adding atomics to the spec is to make codes run faster, not
slower.
Original comment by sdvor...@cray.com
on 19 Jun 2012 at 6:54
Perhaps we could add a query function (macro? intrinsic?) that could be used to
figure out which amos an implementation supports. Then users could do
something to the effect of:
if ( UPC_AMO_SUPPORTED( UPC_OP_FADD, UPC_TYPE_LONG ) ) {
    myidx = upc_amo_fadd( &last, reservation_size );
}
else if ( UPC_AMO_SUPPORTED( UPC_OP_CAS, UPC_TYPE_LONG ) ) {
    do {
        old = last;
        new = old + reservation_size;
    } while( upc_amo_cas( &last, old, new ) != old );
    myidx = old;
}
Original comment by sdvor...@cray.com
on 19 Jun 2012 at 7:12
Steven wrote in comment #33:
> Paul wrote:
>> Therefore I strongly support the idea that compare-and-swap be required but nothing
>> more general.
> [...]
> If you're going to go that far, I'd say let's just abandon the AMOs in the
> spec altogether.
Sorry if I was unclear about what I was objecting to.
I DO WANT the "Level 2" fetch-and-op AMO's, such as the "upc_amo_fadd" in
Steven's example.
I DO NOT WANT George's "Level 3" COMPARE-fetch-op for op != "set"
[see comment 10 for Level 1,2,3 descriptions]
Now that I think more about it, I actually don't see how "compare-fetch-OP" is
more useful than compare-and-swap. Specifically, if the OP is going to
take place only if the comparison is TRUE, then I must have KNOWN the previous
value and could have used compare-and-swap having computed the OP against the
KNOWN previous value. So, I guess I've just debunked my own original
implementability argument against these ops, and replaced it with a
they-are-just-syntactic-sugar argument.
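A hedged sketch of that reduction (placeholder names only, following the Level
2/3 terminology above): any compare-fetch-OP can be phrased as a plain
compare-and-swap because the caller already holds the expected old value:
/* compare-fetch-add: add v only if *p == expected; old value returned either way */
old = cfop_add(p, expected, v);
/* equivalent via compare-and-swap, since the new value is computable up front */
old = cswap(p, expected, expected + v);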
Original comment by phhargr...@lbl.gov
on 19 Jun 2012 at 7:34
Cray UPC supports the following AMO extensions. c = currently supported, f =
supported on future hardware. If supported, both a fetching and a non-fetching
version exist. (For the bitwise ops, type doesn't really matter, so you can
typecast fp32 and fp64 pointers to int32 and int64 pointers and use the integer
AMO extensions, but we don't _directly_ provide them.)
                  int32  int64  fp32  fp64
add                 f      c      f     f
and                 f      c
and-with-xor        f      c
compare-and-swap    f      c
min                 f      f      f     f
max                 f      f      f     f
swap                f      c
or                  f      c
xor                 f      c
Observations:
1) Existing Cray network hardware does not support atomic 32-bit,
floating-point, or min/max operations. Cray UPC does not support unsigned
integer AMOs, whereas BUPC does. Therefore, I strongly believe that there
needs to be a query mechanism, suggested in Comment #34, for users to figure
out if an AMO is supported. It is NOT acceptable to say that an implementation
must use software to emulate the operations that it does not support in
hardware because in order to keep the operations atomic with respect to each
other it would be necessary to implement them all in software, negating any
benefit from the hardware. Furthermore, I think the query mechanism needs to
be a function call because otherwise it will not be possible to compile a code
once and run the same executable on two different platforms that differ only in
the supported flavors of AMOs.
2) Providing entry points for {types} x {operations} x {fetching} is unwieldy
for everyone... users, implementers, specification writers. It gets even worse
if you add {blocking/non-blocking} to the mix.
Straw Proposal:
/** Returns 1 if the specified AMO is supported or 0 otherwise. /fetching/ is
 *  non-zero to request a fetching AMO. /type/ and /op/ specify the data type
 *  and operation to be performed. */
int upc_amo_exists( int fetching, upc_amo_type_t type, upc_amo_op_t op );

/** Atomically performs operation /op/ on the memory pointed to by /target/.
 *  The data type of the operation is specified by /type/. If /fetched/ != NULL,
 *  then the previous value is fetched and stored in the memory pointed to by
 *  /fetched/. Operands for the operation are pointed to by /operand1/ and
 *  /operand2/; /operand2/ may be NULL for some operations.
 *
 *  Warning: Operations are not guaranteed to be atomic with respect to non-UPC
 *  AMO operations. */
void upc_amo( void* fetched, upc_amo_type_t type, upc_amo_op_t op,
              shared void* target, void* operand1, void* operand2 );

Example:

shared long x;
upc_lock_t *x_lock;   /* assume allocated elsewhere with upc_all_lock_alloc() */
...
if ( upc_amo_exists( 0, UPC_AMO_TYPE_LONG, UPC_AMO_OP_ADD ) ) {
    long one = 1L;
    upc_amo( NULL, UPC_AMO_TYPE_LONG, UPC_AMO_OP_ADD, &x, &one, NULL );
}
else {
    upc_lock( x_lock );
    x += 1L;
    upc_unlock( x_lock );
}
Original comment by johnson....@gmail.com
on 25 Jul 2012 at 6:55
I haven't been following the details of the proposed AMO library, but am
wondering whether operations (esp. compare-and-swap) on 128 bit data types are
planned/proposed?
As a use case, consider a double buffering scheme with two buffer pointers,
where it is convenient and efficient to swap the pointers atomically when
switching buffers. Here, the buffer indexes might be two 64-bit indexes, or
perhaps two PTS's represented in a 64-bit packed format.
Or simply, swapping a single fully general PTS which is represented as a
128-bit quantity on 64-bit targets.
Original comment by gary.funck
on 3 Aug 2012 at 5:20
I don't like the idea proposed in comment #38 of query functions for which AMOs
are supported. It complicates the user code, but more importantly just passes
the buck of lowered performance to the application. Specifically, instead of
the runtime emulating the required AMO in software, now that emulation is
happening in the user application (where it's likely to be slower and more
error-prone). With this approach, any portable UPC app using the AMO's would
also need to fold in code for a fully software implementation of every AMO it
uses and switch to that implementation if any of the queries fail. This defeats
the code-factorization goal of having a library.
It seems better to restrict the AMO's to a core subset that all UPC
implementations must support, and choose that subset wisely to allow hardware
implementation on platforms of interest.
It's worth noting that a 64-bit compare-and-swap is sufficient to implement
EVERY operation in the current AMO proposal (including 32-bit, unsigned,
min/max, and float/double), although it implies an additional read and a
possible retry under heavy contention (which should still be
significantly cheaper than a fully software implementation using upc_locks).
Since Cray supports that operation in hardware, why not use that to perform the
operations lacking direct hardware support?
(Note I'm proposing this rewriting be done within the implementation, rather
than in the UPC program as proposed in comment #34).
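As an illustration of the kind of internal rewriting being suggested (a minimal
sketch only; upc_amo_cas64() and the surrounding names are hypothetical, and this
would live inside the runtime, not in user code):

#include <stdint.h>
#include <string.h>

double amo_fetchmin_double(shared double *p, double v) {
    uint64_t old_bits, new_bits;
    double old_val;
    do {
        old_val = *p;                         /* relaxed read of current value  */
        if (old_val <= v) return old_val;     /* already the minimum: no update */
        memcpy(&old_bits, &old_val, sizeof old_bits);  /* reinterpret as bits   */
        memcpy(&new_bits, &v,       sizeof new_bits);
        /* retry if another thread changed *p between the read and the CAS */
    } while (upc_amo_cas64((shared uint64_t *)p, old_bits, new_bits) != old_bits);
    return old_val;
}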
Original comment by danbonachea
on 3 Aug 2012 at 12:10
"It's worth noting that a 64-bit compare-and-swap is sufficient to implement
EVERY operation in the current AMO proposal (including 32-bit, unsigned,
min/max, and float/double), although it implies an additional read and a
possible retry under heavy contention (which should still be
significantly cheaper than a fully software implementation using upc_locks).
Since Cray supports that operation in hardware, why not use that to perform the
operations lacking direct hardware support?"
Because it is NOT significantly cheaper than a fully software implementation
using upc_locks. It's fine if you have a couple dozen threads, but once the
number of threads goes beyond a certain point, the network (and more
importantly, bus) contention degrades performance way past that of a scalable
lock algorithm. This gets even worse as you add more and more threads (cores)
to a single network endpoint. Using compare-and-swap is only a tolerable
workaround in the absence of contention.
Original comment by sdvor...@cray.com
on 3 Aug 2012 at 2:48
Here's my attempt at a summary of today's discussion on the telecon.
Points of Motivation:
- UPC users want high-performance AMOs.
- A UPC library should be robust (over types and operations), portable, and vendor-independent.
The current proposal (as of SVN r61) would restrict some vendors into
implementations that /may/ limit performance. Also, there is / may be interest in
expanding the types and operations specified by the current proposal.
Proposed Solutions (as I saw and recall them):
1) Provide a standard interface to the AMOs (e.g., upc_amo() in Comment #38) that supports a
robust range of types and operations. Also, provide a function that allows a user to specify
which types and operations the user will call, from which the library determines whether it
will use a software-based or hardware-based implementation. Thus, all the possible AMOs are
always available to the user, however, the implementation may use hardware acceleration either
by default (if all types/operations are supported) or from a user's hinting. The hinting
function would likely need to be called before any AMO calls are made in order for the
implementation to choose the right "mode".
2) Expanding on (1), the hinting function can also accept (in addition to desired types and
operations), some provided use-case parameter (e.g., high throughput or low latency), from which
the library may select one of potentially-many possible implementations (either in hardware or
software).
3) Expanding on (2), the hinting function is replaced by an atomicity-domain function that returns
a handle for AMOs that supports a user-specified set of types and operations for a particular
use-case. AMOs would only be guaranteed to be atomic with respect to multiple calls using the
same handle. Atomicity would not be guaranteed for AMO calls using separate handles.
For (1-3), I think it was expressed that it could be good to have:
- A query function that can say what the hardware is capable of supporting (so that a user may
restrict his type/operation choice to guarantee hardware support)
- A priority function in which a user could specify types, operations, and a use-case of varying
priority such that the implementation chooses the best AMO implementation and specifies the
supported types and operations.
My understanding of (1) and (2) is that they would both set the atomicity mode
(i.e., hardware or some software implementation) at the hinting call (or elision
thereof) and it would be henceforth fixed. This is problematic for libraries that
may use AMOs not specified by the hinting call or that may make a hinting call
before the user's call. I think that (3) is the only library-friendly (or
library-general) approach.
I think the query function definitely makes sense; however, I'm not really sure I
understand how one would effectively use the priority function. It would seem that
this could lead back to heavily-IFDEF'd code (if the priority function selects an
implementation that doesn't support the lower-priority types/operations), which I
think would be avoided without it.
It is my preference (and I think this was part of all three proposed solutions)
that the AMOs can operate on ALL shared addresses. Any restriction of use would be
at the user's discretion and NOT be specified by either the hinting or handle
functions.
Ideally, I think it would be good for the less performance-minded users if there
was either a default handle that supported all specified types and operations in
an implementation-defined way or some default parameter that returns a "universal"
handle.
Please add or correct anything here that I may have misrepresented (and please
respond with your comments and feedback!).
From this discussion, I'll draft up another AMO proposal, which will be posted for
comments and a follow-up telecon discussion (for whoever is interested).
Original comment by nspark.w...@gmail.com
on 3 Aug 2012 at 10:17
All "brand new" library proposals are targeted for starting in the "Optional"
library document. Promotion to the "Required" document comes later after at
least 6 months residence in the ratified Optional document, and other
conditions described in the Appendix A spec process.
Original comment by danbonachea
on 17 Aug 2012 at 5:53
Set default Consensus to "Low".
Original comment by gary.funck
on 19 Aug 2012 at 11:26
At the last call, Bill asked for someone to pick a "big issue" for discussion
at the next call.
Considering the importance of AMOs to many of the UPC users, and the very
undecided state that it was left in after the call-before-last, Gary, Yili, and I
thought it would be good to discuss AMOs on the next call. Here is something that
I hope will re-light the fire of discussion.
Based on my summary of the last discussion in Comment 43, I propose the following
based on Option 3 with Query (but no Priority):
### A Proposed Usage Scenario ###
A user creates an atomicity domain object by specifying a set of operations, a
set of types, and an implementation mode. This object is a handle to some AMO
implementation. Depending on the specified mode, the implementation may be in
hardware or software.
A user makes AMO calls as either upc_amo_relaxed() or upc_amo_strict() that
otherwise look very much like Troy's upc_amo() proposal in Comment 38. Atomicity
is only guaranteed for accesses using the same domain.
The library also provides upc_amo_query() so that a user can test whether a set
of operations and types is supported for a given mode. There will be a default
mode that supports all of a Spec-specified set of ops and types.
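A hedged sketch of what that flow might look like at a call site (every name here
is hypothetical and the parameter order is illustrative only, echoing Troy's
upc_amo() from Comment 38):

shared int64_t counter;

/* can the requested ops/types be served in the hardware mode? */
if (upc_amo_query(UPC_AMO_HARDWARE, UPC_ADD, UPC_INT64)) {
    upc_amo_domain_t *dom = upc_amo_domain_alloc(UPC_ADD, UPC_INT64,
                                                 UPC_AMO_HARDWARE);
    int64_t one = 1, old;
    /* relaxed, fetching add; atomic only w.r.t. other AMOs issued through dom */
    upc_amo_relaxed(dom, &old, UPC_INT64, UPC_ADD, &counter, &one, NULL);
    upc_amo_domain_free(dom);
}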
### A Bit More Detail ###
* Define a type (e.g., upc_amo_domain_t) to represent an atomicity domain, which specifies a set
of operations and datatypes over which access to a memory location in a given synchronization
phase is guaranteed to be atomic if and only if no other mechanisms or atomicity domains are
used to access the same memory location in the same synchronization phase.
* Define a type (e.g., upc_amo_mode_t) that a user can use to indicate the AMO implementation
"mode" desired with the following acceptable constants UPC_AMO_DEFAULT, UPC_AMO_HARDWARE,
UPC_AMO_LATENCY, UPC_AMO_BANDWIDTH.
* UPC_AMO_DEFAULT would be an implementation-defined default mode that will support ALL of a
specified set of types and operations for AMOs. This would almost surely be a software-
based implementation. Using this mode, AMOs would always be portable, but not necessarily
high performance.
* UPC_AMO_HARDWARE would force the use of hardware-supported AMOs. It is likely that every
implementation would vary in the set of types and operations supported for AMOs under
this mode. I suspect that users who use this mode would do so out of performance reasons
and would not, in general, expect cross-platform compatibility.
* UPC_AMO_LATENCY and UPC_AMO_BANDWIDTH would indicate a user preference for low-latency or
high-bandwidth (is "throughput" a better term here?) atomics, respectively. IIRC, on the
last AMO-centric call, either Steven or Troy noted at least a few times that a user
favoring high throughput of atomic memory accesses may not necessarily want the hardware
implementation.
* Initially, I envisioned atomicity domain initialization (and destruction) as happening
similarly to how it's done with UPC locks; so I had upc_amo_domain_alloc/free() functions.
Yili (and a user) suggested making this a static initializer and strongly encouraging
compiler optimizations. I think that this static initialization approach makes more sense.
### Issues or Problems That I Foresee ###
* This is definitely more complex than what most of my users would like to see. They would
most likely only need a short list of operations on 64-bit integer types and only ever use
one domain.
* In the scheme above, what happens with the static initialization call when a user specifies
the hardware-supported mode, but specifies types or operations not supported by the hardware?
* Code Portability: Performance-minded users might likely use the UPC_AMO_HARDWARE mode,
however, this code would almost surely NOT be portable. One user strongly encouraged
identifying a common subset of types/operations that could be supported across the
implementation space (but there were concerns, at least initially in our discussions, that
this intersection is the empty set). Speaking to vendors' capabilities, Cray has posted what
they support; IBM has expressed (I think) that they're quite flexible in support; and it's
not clear to me what SGI, HP, and InfiniBand support as hardware atomics.
* Implementer/Implementation Burden: This proposal likely requires quite a bit of work on the
part of the implementers to implement software and, where applicable, hardware support, as
well as a whole new API.
-----------------------
Hopefully this is something to get the discussion restarted. I have a few more
thoughts and comments from users that I'll post tomorrow.
Original comment by nspark.w...@gmail.com
on 28 Aug 2012 at 9:29
I apologize for being late and stupidly responding to the email list instead of
posting here. I will learn.
My comments regarding the last post are below. I tried to read everything said
already but might have missed a few details. It is not my intent to repeat
previously stated points.
1. For remote atomics, we need to be more explicit about what we mean by
hardware vs. software. Here is a partial spectrum of options:
- a single packet injected into the network induces a hardware atomic operation
in remote memory (example: Blue Gene/Q can do this, at least in one direction)
- an active message that executes atomically in software on the remote side
does an atomic instruction in hardware (example: Blue Gene/P DCMF_Send w/
handler that does lwarx+stwcx)
- an active message that does not execute atomically in software grabs a lock,
does the update, then releases the lock
- three active messages acquire a remote lock, perform the update, then
release the lock
I'm not saying that the good options here are portable or that the bad ones
made sense in any situation, but I hope to convince people that our terminology
is inadequate.
2. UPC_AMO_HARDWARE should not be an option. It is meaningless to talk about
hardware or software implementations on their own. What a user wants to know
is fast or slow for a particular usage. Whether or not the implementation uses
hardware for that shouldn't be visible to the user.
I'd like to know the use-case where the user specifying UPC_AMO_LATENCY or
UPC_AMO_BANDWIDTH is actually going to help. Shouldn't the compiler have enough
semantic information already to
know whether or not to dump a bunch of nonblocking atomics into the network and
flush them all at once versus using a blocking atomic operation and waiting on
the round-trip?
3. Regarding any software-based implementations of AMOs, I don't see how
portable and slow has any benefit over the user doing atomics themselves with
the existing UPC lock functionality.
Furthermore, while it's rather ambiguous what is meant by a software
implementation, I would love to know what reasonable architecture can do remote
atomics as active messages faster than it can in hardware, just from a network
architecture curiosity perspective, as noted previously by someone on a call
that I wasn't part of.
4. Regarding Yili's desire for static allocation only...
Can someone describe how the compiler can make use of this? I'm not aware of
any architecture where statically allocating memory for remote atomics has any
performance benefit. Requiring static
allocation is really unpleasant from a usability perspective. It reminds me of
FORTRAN77, which is hardly a language we want to hold in high esteem, except
perhaps for writing math libraries.
5. Here is my proposal:
Why not just define upc_atomic_int_t without specifying the size? The header
can define UPC_ATOMIC_INT_MAX appropriately. On a 32b system like BGP,
UPC_ATOMIC_INT_MAX=2^31 while on BGQ or anywhere else that is x86_64 or IA64,
it's going to be >2^50 (One can imagine a few bits being reserved to enable
hardware atomics). The operations that are provided by reasonable hardware
have already been discussed at length.
If these and only these operations were supported, UPC would feel a lot like C,
which seems like a reasonable guiding principle. Using upc_atomic_int_t solves
all of the major issues with 32b vs. 64b. I think that supporting 128b atomics is
not a good idea because it's getting ahead of where the hardware is and forces
a slow implementation in software.
I really don't understand why UPC can't support just these operations and let
the user implement others in software on their own. The performance hit
associated with forcing software emulation of simple operations when hardware
exists is far greater than the benefit of allowing non-simple operations.
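A hedged sketch of what the header side of this proposal might look like (the
typedef widths and the exact UPC_ATOMIC_INT_MAX values are illustrative
assumptions, not part of any existing implementation):

/* 64-bit platforms (x86_64, BGQ, IA64): a few high bits may be reserved */
typedef int64_t upc_atomic_int_t;
#define UPC_ATOMIC_INT_MAX ((upc_atomic_int_t)1 << 50)

/* a 32b platform such as BGP might instead ship:
 *   typedef int32_t upc_atomic_int_t;
 *   #define UPC_ATOMIC_INT_MAX 0x7fffffff
 */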
Original comment by jeff.science@gmail.com
on 30 Aug 2012 at 3:43
Here are some inlined comments/responses to both Comment 47 and Jeff's previous
email, which had a few additional comments.
Jeff said (in Comment 47):
> 1. For remote atomics, we need to be more explicit about what we mean by
> hardware vs. software. Here is a partial spectrum of options: [...]
I'm sure this only illustrates my overly-simplified understanding of things, but
I see the options you present as representing one "hardware" implementation (to
me, this means that the NIC does the work) and three "software" solutions (the
exact implementation of which shouldn't be specified by the UPC Spec). In
general, I think that my users want to be able to get to the NIC's atomic
capabilities.
Jeff said (in his initial email):
> So we're talking about atomics that exceed the atomicity of the particular hardware, thereby
> forcing a general implementation to use software despite the fact that all reasonable
> supercomputing hardware provides a reasonable set of remote hardware atomics or at least enables
> something close to them, with the caveat that not all capability is available on all systems
> (e.g. BGP can only do 32b atomics in hardware, while DMAPP only exposes 64b atomics).
I think the caveat you present is the basis for the challenge in defining a UPC
atomics specification. From the last call and the discussion on Issue 7, I think
it's reasonably clear that the goals of (1) a "robust" AMO library that supports
a wide range of types and operations and (2) a "hardware-accelerated" library
(again, I see this as the NIC doing the work) generally pull in opposition to
each other.
I think you are misunderstanding the intent of the "atomicity domain" (or I did a
terrible job explaining my intent). The use of atomicity domains is intended such
that a user is *not* forced into a low-performance software implementation, but
has the option to use the network hardware to accelerate shared-pointer AMOs if
the requested types/operations are supported by the hardware.
Jeff said (email):
> I disagree that trying to use hardware as much as possible leads to lack of portability. There
> is a set of atomics that are widely portable. One need only inspect Intel TBB's atomics or GCC
> atomic extensions for examples. TBB atomics are not just available on Intel processors. We've
> ported TBB to PPC32, PPC64 and POWER7 and I believe my collaborator who did the PPC ports
> previously did DEC Alpha, Sparc and a whole bunch of weirdo processors with non-x86 memory models.
My interest in the/a UPC AMO specification is to have AMOs on UPC shared
pointers. I think the field of network hardware support for atomics is a little
less even. In the long term, I would love to see a standard base of network
atomics supported by all the major HPC vendors. The intersection may be non-empty
at present, but specifying the minimal common set doesn't quite satisfy the
interested parties that want to also provide a robust AMO API in UPC.
Jeff said (email):
> Does anyone know of hardware that does not support atomic load, store, fetch-and-add and
> compare-and-swap for either 32b or 64b integers? Do we want features beyond these?
I definitely want to see fetch-and-{and, or, xor} added to your list. Otherwise,
I don't see the point in defining an AMO spec. I also want to see the non-fetch
versions, i.e., atomic-{add, and, or, xor}. You've already noted that DMAPP only
exposes 64-bit atomics. I would be okay with only 64-bit atomics, but I'm not
sure everyone else would.
Jeff said (Comment 47):
> 4. Regarding Yili's desire for static allocation only... [...]
It's funny how perspectives differ between users. One of my users said he
*wouldn't* want to dynamically allocate domains. He wants a statically-allocated
domain in a library he'll use across his applications that supports exactly what
he wants on the hardware he uses.
I'm not a compiler person, so I probably make too many assumptions about what
magic can happen behind the scenes; however, I can imagine a scenario in which
the compiler, after seeing that only one "hardware-only" domain is used, does
something like replace the upc_amo() calls with inlined calls to the network API.
I suspect this would provide some performance benefit on some systems where
pass-through function calls might be costly, especially if one was making a lot
of atomic accesses.
Jeff said (Comment 47):
> Why not just define upc_atomic_int_t without specifying the size? The header can define
> UPC_ATOMIC_INT_MAX appropriately. On a 32b system like BGP, UPC_ATOMIC_INT_MAX=2^31 while on BGQ or
> anywhere else that is x86_64 or IA64, it's going to be >2^50 (One can imagine a few bits being
> reserved to enable hardware atomics). The operations that are provided by reasonable hardware have
> already been discussed at length.
I don't think that restricting an atomic datatype to something less than the full
64 bits is a good idea. I'm not pushing for 128-bit atomics, but I think 64-bit
and 32-bit atomics make sense. If your solution is to give users access to ~50
bits of a 64-bit integer (and you seem to feel strongly about exposing hardware
performance), how does the NIC properly handle bitwise operations on only the
"user bits"? And how much of an application already using a vendor's extensions
for 64-bit AMOs would have to be rewritten to only use the "user bits"? I don't
see how your solution /wouldn't/ force a software implementation.
Jeff said (Comment 47):
> I really don't understand why UPC can't support just these operations and let the user implement
> others in software on their own. The performance hit associated with forcing software emulation of
> simple operations when hardware exists is far greater than the benefit of allowing non-simple
> operations.
Again, by creating a domain with the UPC_AMO_HARDWARE flag, the intent is that
one would *not* be forced into a software implementation, but be explicitly
selecting a hardware-only implementation. The last thing that I want is to force
software emulation of simple AMOs in all cases. But I also appreciate the view
that a UPC AMO library could benefit from providing a vendor-independent, wide
base of types/ops for users who might not demand the highest level of
performance.
Jeff said (email):
> Another issue here is ordering. What's the ordering requirement in UPC? So I dump 100 atomics
> into the network targeting the same remote address. Are they going to complete in-order? That's
> probably a much bigger performance hit on Cray Gemini or PERCS than anything entailed by latency
> vs. bandwidth. Forgive me for being a UPC noob if there's something in the language spec. about
> ordering of load-store already, but I would be surprised if UPC can say more than C, which is going
> to depend on the architecture for ordering semantics. Requiring order in remote atomics when
> load-store aren't necessarily ordered on various processors seems unreasonable.
Ordering would necessarily be controlled by whether one makes a relaxed atomic
access or a strict atomic access. I expect that my users would almost only ever
use relaxed atomic accesses, so no ordering of the atomics would be expected
(until one uses a upc_fence).
Original comment by nspark.w...@gmail.com
on 30 Aug 2012 at 5:05
Not sure if anyone else has looked at it for inspiration yet, but the way that
C++11 provides atomics is that there are separate atomic types (e.g.,
std::atomic<int>) and each type provides an is_lock_free() member function to
query whether AMOs on that type are implemented using locks or not. An
implementation could have is_lock_free() return true for an int type but false
for a long type. The fact that one uses locks and the other doesn't use locks
is not an issue because int and long are essentially in separate atomicity
domains, to use the terminology of our recent discussion. It differs from our
atomicity domain concept in that is_lock_free() covers all required atomic
operations on the type, so if there is hardware support for an atomic add but
an atomic xor requires a lock, then the implementation would report that the
atomic type is not lock free and it would use locks for both. Thus, the user
that cares only about fast atomic adds is out of luck. The idea of creating an
atomicity domain in UPC and specifying which operations actually will be used
in the program is one way out of this problem, but it is more complicated. So
I'm undecided whether a UPC atomicity domain concept should cover just the type
or the type and the possible operations.
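For reference, the C11 <stdatomic.h> analog of this per-type query (shown purely
as a point of comparison with the C++11 interface described above, not as a UPC
proposal):

#include <stdatomic.h>
#include <stdio.h>

int main(void) {
    atomic_int  ai = ATOMIC_VAR_INIT(0);
    atomic_long al = ATOMIC_VAR_INIT(0);
    /* analogous to std::atomic<T>::is_lock_free(): answered per object/type */
    printf("int  lock-free: %d\n", (int)atomic_is_lock_free(&ai));
    printf("long lock-free: %d\n", (int)atomic_is_lock_free(&al));
    return 0;
}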
Original comment by johnson....@gmail.com
on 6 Sep 2012 at 5:59
This is in response to Jeff's comment 47:
"4. Regarding Yili's desire for static allocation only...
Can someone describe how the compiler can make use of this? I'm not aware of
any architecture where statically allocating memory for remote atomics has any
performance benefit. Requiring static
allocation is really unpleasant from a usability perspective. It reminds me of
FORTRAN77, which is hardly a language we want to hold in high esteem, except
perhaps for writing math libraries."
YZ: I guess there is a confusion between Atomic Domains and Atomic Variables.
And this is mostly because we haven't formally defined what is an Atomic Domain
yet :-). Nick had a draft document about the definition of Atomic Domains and
the related API but it's not published yet.
The suggested static allocation/initialization is only for Atomic Domains. An
atomic domain is not an atomic variable nor an address to such a variable.
Roughly speaking, an atomic domain defines a domain in which the atomic
operations are atomic with respect to each other in the same domain.
The motivation of Atomic Domains was inspired by some observations about
practical usage of atomic operations and earlier discussions on the topic.
1) There can be more than one hardware component in a compute node that can
access/modify the same memory. An example would be a Cray XK6 node with a
Gemini NIC and a GPU -- the CPU, the NIC (connected by HyperTransport) and the
GPU (connected by PCI) all support "hardware atomic operations" to the host
memory but the atomic operations issued from one component are not atomic with
respect to those from other components.
2) In many practical cases, an app only needs a particular atomic op type in a
small region of code or data. For example, one may need compare-and-swap when
implementing a lock-free queue. Another may need fetch-and-add for implementing
a Particle-In-Cell code. But it's uncommon to apply both compare-and-swap and
fetch-and-add to a single memory location.
Therefore, we attempt to provide a mechanism (through Atomic Domains) for users
to specify the intended usage pattern of atomic ops and thus give the
implementation more information and opportunities to optimize performance.
Proposed API for Atomic Domains (by Nick):
upc_domain_t *upc_global_domain_alloc(upc_op_t ops, upc_type_t types,
                                      upc_amo_mode_t mode);
upc_domain_t *upc_all_domain_alloc(upc_op_t ops, upc_type_t types,
                                   upc_amo_mode_t mode);
void upc_domain_free(upc_domain_t *ptr);
void upc_amo_strict(upc_domain_t *domain, void *fetch_ptr, upc_op_t op,
                    upc_type_t type, shared void *target, void *operand1,
                    void *operand2);
void upc_amo_relaxed(upc_domain_t *domain, void *fetch_ptr, upc_op_t op,
                     upc_type_t type, shared void *target, void *operand1,
                     void *operand2);
What I suggested for static allocation/initialization for Atomic Domains is
something like:
upc_domain_t global_domain = DOMAIN_INITIALIZER(UPC_CSWAP, UPC_INT64 | UPC_UINT64);
This permits but doesn't require compiler optimization because the intended
atomic usage pattern is known at compile time. It also makes common usage
easier to write (no worry about alloc and free).
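A hedged usage sketch of the statically initialized domain (the CSWAP operand
ordering and the surrounding code are illustrative assumptions on top of the API
listed above):

upc_domain_t cas_domain = DOMAIN_INITIALIZER(UPC_CSWAP, UPC_INT64 | UPC_UINT64);
shared int64_t head;

void try_update(int64_t expected, int64_t desired) {
    int64_t old;
    /* atomic only with respect to other accesses made through cas_domain */
    upc_amo_relaxed(&cas_domain, &old, UPC_CSWAP, UPC_INT64,
                    &head, &expected, &desired);
    if (old == expected) {
        /* the swap took effect */
    }
}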
Original comment by yzh...@lbl.gov
on 6 Sep 2012 at 6:08
Original issue reported on code.google.com by
nspark.w...@gmail.com
on 14 Mar 2012 at 3:40