Closed GoogleCodeExporter closed 9 years ago
What seems to be shaping up is a situation analogous to MPI communicators and
MPI collective implementation choices. Atomic domains are in some sense similar
to MPI communicators - certainly from a syntax point of view. Building on the
analogy:
* Have we decided what happens when the same shared pointer is used in two
different atomic operations in two different domains? are the results supposed
to be defined?
(My answer: heck no, you are on your own)
* Is there an ordering constraint between atomic operations executed on
different domains?
(My answer: usual UPC ordering semantics regardless of domains - unlike MPI
matching semantics w.r.t. communicators)
* Picking a domain implementation policy is akin to picking a collective
implementation in PAMI. The QoS description of each collective is hairy, and
the matrix of "best implementation for a particular situation" is a complex
one. Some examples that come to mind:
"-> Implementation 1 is HW only, requires no progress on the receiver, has
reasonable latency and BW but has nonscalable behavior beyond 1000 tasks
exerting pressure on the same pointer.
-> Implementation 2 is SW only, completely scalable to the limit of the system
and with reasonable latency, but uses twice the bandwidth of implementation 1
and requires an active message handler to run on the thread affine to the
pointer."
These kinds of implementations defy a short description. Facing a similar
situation with collectives (accelerated vs. restricted to tori vs restricted to
fixed point operations etc etc), the PAMI designers have opted for giving the
users complete choice by providing a listing of names on demand, and allowing
the user to make choices. In addition, "choice 0" is always the most general
choice. Can we do something similar here?
Original comment by ga10...@gmail.com
on 7 Sep 2012 at 2:58
As mentioned in an earlier comment by Troy, the C11 spec adds a complete AMO
library that exposes functionality that is very similar to what we're trying to
expose with our library. I think we need some high-level community discussion
regarding whether C11 compatibility is ever a goal for UPC; but regardless of
how that matter is resolved (and assuming C11 gains wide acceptance and
implementation), I think whatever we come up with for AMOs will generate a very
obvious question from every C11-savvy user. Namely, "Why doesn't UPC just
support C11 AMO's with a pointer-to-shared?" and "Why can't I use C11 and UPC
AMOs together on the same locations?". To play devil's advocate, a minimalistic
way we could define AMO's for UPC would be to simply insert (shared _Atomic
int) and (shared _Atomic long) into the atomic types specified in C11 7.17.6,
and import that library wholesale. I'm NOT necessarily advocating that we
should do this, but assuming that we intentionally disregard C11's
<stdatomic.h> and define a whole new library with a wildly different interface,
we should at least have a convincing "story" of why we diverged from an
already-specified library with a nearly-identical target usage in our "base
language" and motivate that doing so is necessary for UPC.
Towards this end I've done a careful audit of the C11 <stdatomic.h> and
compiled a summary of interesting observations relative to the UPC AMO ideas
floating around:
- C11 AMO's operate only on types specifically designed for atomic access, eg
atomic_int and atomic_long, which can represent the same values as their base
type, but are NOT guaranteed to have the same size or representation as their
base type (eg they may include a lock data structure). They include all the
signed and unsigned integral types and boolean, no floating point types.
- The atomic operations they support:
swap and compare-swap
fetch-and-... add, sub, bitwise or, bitwise xor, bitwise and
store (aka write) and fetch (aka read)
test-and-set and clear (only for a special atomic_flag boolean type)
static and dynamic init (non-atomic)
Notably OMITS min/max.
- ALL the atomic operations are supported on ALL the atomic types (excepting
non-meaningful combinations), although not necessarily guaranteed to be
implemented in hardware
- Macros query whether all the AMOs on a particular TYPE are "lock free" (ie
fast hardware implementation), where the answer can be "always", "never" or
"location dependent"
- A runtime query function can ask whether all AMO's on a particular location
are "lock free" (yes or no)
- The actual AMO calls are declared using C11's new type-generic facility
(_Generic), so the function the user invokes does NOT contain the type name -
they invoke a type-independent symbol like:
result = atomic_fetch_add(&my_amo, 2)
and the implementation magically does the "right thing" based on the type of
my_amo. If UPC ever decides to adopt any part C11, this simple feature could
make both the AMO and collective library interfaces look much cleaner to the
user.
- The "consistency" of the AMO's (what we would call strict and relaxed) is an
optional final enum argument to each operation, where the default basically
behaves like UPC strict (AMO includes acquire and release fences). This defines
the ordering constraints of AMO's wrt other AMO's and unrelated operations. The
enum defines six different consistency "modes" for the AMO, of which at most
four seem to have distinct meanings for any given AMO (relaxed, acquire,
release, acquire-and-release).
- AMO's atomicity is guaranteed at the actual memory location, and in
particular still "works" even if you mmap the location into several places in
virtual memory and invoke AMO's on all of them. I haven't heard anyone mention
this issue wrt UPC AMOs, but it's one we need to somehow address.
- They include both "strong" and "weak" versions of compare-swap, where the
latter may fail spuriously, eg on load-linked/store-conditional architectures.
I *think* the issue here is that in cases with poor cache alignment it's
possible to construct cases that livelock on these architectures if you blindly
spin on LL/SC to implement a "strong" compare-swap on an unlucky location. I
haven't heard anyone mention this issue wrt UPC AMOs, but its one we may need
to address.
- AMO's and regular C reads and writes do not conflict in memory because this
is prohibited by the type system.
- Similarly, atomicity is on a typed basis - one cannot "carve up" an atomic
long type using a cast or union and try to call AMOs on the constituent bytes.
(We should probably prohibit that as well).
A few misc comments:
The C11 AMO library provides most/all of the functionality we're trying to
specify, although they may have chosen slightly different points in the
requirements/guarantees design space than we'd have tailor-made for UPC. Which
of these do we feel are important to change for our library?
IMHO the most significant divergence is C11 defines T and atomic_T as separate,
incompatible types. How much does this matter to UPC programmers?
C11 Implementations are "encouraged" (but not required) to define atomic_T to
have the same size and representation as T, which would permit casting them to
regular types for non-atomic access in a separate synchronization phase. If we
decided this feature was sufficiently important to UPC programmers, we could
presumably still adopt C11 AMO's in UPC and add this additional implementation
restriction for all UPC implementations (ie change "encouraged" to "required").
The implementation "trick" that allows that representation restriction to work
for types requiring software emulation is to stash the hidden lock
datastructure(s) in the runtime system rather than in user memory space. This
could be as stupid as a "One big lock" protecting all software-emulated AMO's
(of a given type), or could be tuned for improved concurrency by introducing a
(probably fixed-size) table of locks indexed by a hash on the AMO location
address. The latter would need to hash on physical addresses to retain the mmap
atomicity guarantee.
The other major way in which our current UPC discussion regarding AMOs differs
is the introduction of more detailed performance queries and hints regarding
the implementation of given AMO/type combinations. We should consider if it's
possible to get a similar effect by adding some query/hint functions to what
C11 specifies.
Incidentally, C11 also standardizes a thread library <threads.h> that is
basically POSIX threads in a thin disguise. Dynamic fork/join parallelism in a
fully-shared memory space, mutexes, condition variables, thread-local storage.
The memory model is roughly UPC relaxed, with fences provided by the AMO's and
sync ops in <threads.h>. Dynamic threading doesn't play particularly well with
UPC features, but luckily C11 threading is an optional feature, so if we ever
adopted C11 we could take the easy out and simply recommend UPC implementations
define __STDC_NO_THREADS__, to disable it entirely.
Original comment by danbonachea
on 7 Sep 2012 at 4:48
George: I think AMO domains are a lot more like MPI_Win than MPI_Comm. Your
first example sounds the same as a memory location being in two different
windows.
In any case, the conclusions do not change.
Regarding whether or not UPC should be like PAMI, the answer should be no for
many reasons. If someone wants all the dials that PAMI has, then they should
just use PAMI. You should be able to manually switch UPC implementation
semantics in software any more than you should be able to switch between the
x86 and PowerPC memory model in a C program.
Dan: I completely agree with your point that doing anything deviant from C11
requires a very strong justification. I certainly hope that UPC grows to be a
superset of C11 (perhaps without optional features) some day.
Does anyone know what compilers (if any) implement C11 atomics?
Original comment by jeff.science@gmail.com
on 7 Sep 2012 at 5:05
"- AMO's atomicity is guaranteed at the actual memory location, and in
particular still "works" even if you mmap the location into several places in
virtual memory and invoke AMO's on all of them. I haven't heard anyone mention
this issue wrt UPC AMOs, but it's one we need to somehow address."
Is mmap() added to C11? I thought that came from POSIX. Does plain C11
(without any extensions such as POSIX) actually permit access to a single
object via pointers that don't compare as equal?
"The latter would need to hash on physical addresses to retain the mmap
atomicity guarantee."
This will be *very* difficult to implement on many common systems, as
user-level software generally doesn't have knowledge of the physical address
(for good reasons), and UPC implementations aren't generally done in the kernel.
Original comment by sdvor...@cray.com
on 7 Sep 2012 at 5:05
Correction: "You should <NOT> be able to manually switch UPC implementation
semantics in software any more than you should be able to switch between the
x86 and PowerPC memory model in a C program."
Original comment by jeff.science@gmail.com
on 7 Sep 2012 at 5:06
A part of the talk of C11 makes me uneasy. The UPC AMO started out as a
*library only*.
My $0.02: can we define a near-term goal of just coming up with a usable
library? without prejudicing a larger effort that will blend atomic types
seemlessly into the language? I have of course no objection to looking at C11
for inspiration. I just don't feel the urgency to resolve the whole problem
*right now*.
Done ranting. As to the substance of "being like PAMI", the choice is fairly
obvious:
* no allowance for decision whatsoever. You get a single choice for
implementation, and you live with it. This is the simplest, but I have the
feeling that most of us are not in this camp.
* *some* allowance for a decision, but without clear performance guarantees.
Thus, the choice is by design a bit nebulous "if you pick me you *may* get HW
to execute your AMOs. Nasty.
* An raw list of capabilities without commentary on what each capability is
actually good for. This is the "PAMI way", and I can see why Jeff likes it not:
the capability list is by definition non-portable.
* Expert system: the system picks an implementation based on criteria you state
when you create the domain. I'd love to have such a thing. We tried to do this
for PAMI - there are endless published papers about it - just not a system that
I trust to work. Yet.
* Nirvana: the system guesses/autotunes the implementation you really wanted,
no questions asked.
Original comment by ga10...@gmail.com
on 7 Sep 2012 at 5:50
> Is mmap() added to C11? I thought that came from POSIX. Does plain C11
(without any extensions such as POSIX) actually permit access to a single
object via pointers that don't compare as equal?
Here is the exact C11 verbiage I'm referring to:
"NOTE Operations that are lock-free should also be address-free. That is,
atomic operations on the same memory location via two different addresses will
communicate atomically. The implementation should not depend on any per-process
state. This restriction enables communication via memory mapped into a process
more than once and memory shared between two processes."
AFAIK there is no way within strictly-conforming C11 to actually accomplish
this mapping, so I think this is just a nod towards POSIX. We could conceivably
"solve" this in UPC by explicitly prohibiting such user-level mapping in the
first place.
Original comment by danbonachea
on 7 Sep 2012 at 6:01
"The latter would need to hash on physical addresses to retain the mmap
atomicity guarantee."
I am not certain that we want to allow the UPC *user* to mmap() the shared heap
multiple times, making the statement above a "read herring" from the point of
view of a UPC-only implementation.
Multiple mappings of the UPC shared heap for use of shared memory within a
multi-core node would be fully controlled by the UPC runtime and thus easier to
address (no pun intended) than a fully general case. In particular I doubt
that arbitrary alignments are used, meaning that hashing on the offset within
the page would be sufficient to correctly hash any/all pointers to a given
object to the same slot in the table of locks (or equivalent).
I agree w/ George that we should NOT look to pickup C11 language features
(_Generic in particular could be problematic), and I don't think that was Dan's
intent. I think we should look at C11 for a "design" that our future users may
already know and which a larger (I hope) community of experts has already
vetted. However the "binding" of that design for UPC would need to be a
LIBRARY - hopefully one that was nearly trivial to implement in terms of C11's
stdatomic.h when available.
Original comment by phhargr...@lbl.gov
on 7 Sep 2012 at 6:03
I think an atomics library-only implementation is fine as long as this permits
a typedef atomic_T-like entity rather than requiring C89 build-ins and some
registration garbage. It is effectively impossible to do atomics portably
without a typedef because of alignment issues that we cannot assume from an
arbitrary C compiler (within the src2src model of BUPC, for example).
Sometimes one has to ram the stupid int in a struct to make atomics work in a
reasonable way (see the BGP atomics API, for example).
Original comment by jeff.science@gmail.com
on 7 Sep 2012 at 9:02
"It is effectively impossible to do atomics portably without a typedef because
of alignment issues that we cannot assume from an arbitrary C compiler "
This is an excellent point, and probably another reason why C11's library
requires the use of typedefs like atomic_int, rather than operating on objects
declared with raw basic types as we seem to be leaning towards. If we go with a
raw types design, this is a problem we will need to address. For example, if
the UPC user does something like this:
typedef struct {
char c;
int64_t i;
} S_t;
shared S_t *S = upc_alloc(sizeof(S_t));
upc_amo_fetchincrement{&(S->i));
There are ABIs where the long field won't be 8-byte aligned, and where issuing
a hardware atomic instruction on such a field could generate a bus error. One
possible solution would be to place the burden upon the user and state that if
the pointer argument to the AMO is not suitably aligned, behavior is undefined.
This resolution dodges the issue, but doesn't provide a solution if the user
really needs to perform AMOs on struct elements. The only option open to such a
user would be to use non-portable (but widely available) alignment built-ins
like __align(), which is incidentally standardized in C11-6.7.5 with the
_Alignas() specifier. If we're concerned that this will be a frequent
portability problem for our AMO users, we could decide to specify some typedefs
for suitably-aligned versions of the basic types (with identical size and
representation) and recommend their (optional) usage when declaring struct
fields that will be passed to the AMO library.
Alternatively, we could just abandon the effort to support AMO on raw basic
types and take the C11 approach of only supporting AMOs on the provided
typedefs. This also solves the alignment portability problem, but makes it more
tedious to incrementally add AMO's to existing applications, and arguably
burdens the common usage to prevent problems in uncommon usage.
Original comment by danbonachea
on 8 Sep 2012 at 10:08
As a follow-up to the discussion relating to C11 support for atomics and how/if
this might apply to UPC AMO's, I found an interesting discussion here:
"Emulating C11 compiler features with gcc: _Atomic"
http://gustedt.wordpress.com/2012/01/17/emulating-c11-compiler-features-with-gcc
-_atomic/
In C11, _Atomic comes in two flavors, as a type qualifier (like const or
volatile) or as type specifier (like in struct or array declarations). [...]
The author goes on to describe how he handles atomic types in his P99 macro
pre-processor package:
http://p99.gforge.inria.fr/
"P99 - Preprocessor macros and functions for C99"
P99 is a suite of macro and function definitions that ease the programming in
modern C, aka C99. By using new tools from C99 we implement default arguments
for functions, scope bound resource management, transparent allocation and
initialization, ...
By using special features of some compilers and operating systems, we also are
able to provide an emulation of a large part of the new C standard, C11.
P99 heavily depends on a decent support for C99 of compilers. We have set up a
test program that may be used as a first indication if a compiler is compatible
with that. Please see the directory c99-conformance for some results of such
tests."
Although it is unlikely that we will adopt the P99 approach re: AMO's or C11
atomics, I am passing along the reference for consideration. There are a
number of other potentially useful aspects of the P99 project, unrelated to
AMO's.
1) C99 conformance test results, for example:
http://p99.gforge.inria.fr/c99-conformance/c99-conformance-gcc-4.5.html
2) The P99 macros use C99 features in some novel ways. Given that UPC is
derived from C99, it is both possible and likely that current UPC programmers
do not make use of some C99 features that might improvement the maintainability
and reliability of their UPC programs, and that might perhaps improve the
productivity of programming in UPC.
Original comment by gary.funck
on 8 Sep 2012 at 6:38
"One possible solution would be to place the burden upon the user and state
that if the pointer argument to the AMO is not suitably aligned, behavior is
undefined."
How does one write portable UPC that meets the alignment requirements of all
possible ABIs? Is this not significantly more onerous than using a new type?
What if UPC is going to run in a heterogenous environment such as may exist in
Cray Cascade? Do we think that alignment alone is going to solve the problems
of atomics from distributed UPC threads that live on Intel Xeon, Intel MIC
and/or NVIDIA processors? I don't recall what CUDA does for atomics but I can
imagine in some future system that the atomics live in special hardware (e.g.
Cray Gemini atomics living in the NIC) that requires a typedef.
I really don't see the use of atomic_int, for example, as "tedious to
incrementally add AMO's to existing applications". Is the tedium coming from
codes that want to replace lock/update/unlock with atomic operations and it is
not obvious to the user which declarations correspond to variables that will
require this treatment?
Perhaps it could be permitted - only when possible - to cast T to atomic_T.
This allows the common case where typedef is required to support the evolution
of existing codes, although it would not be portable.
Original comment by jeff.science@gmail.com
on 9 Sep 2012 at 5:49
"How does one write portable UPC that meets the alignment requirements of all
possible ABIs? Is this not significantly more onerous than using a new type?"
Not really. I believe the alignment issue only arises when the target of the
AMO is located inside an aggregate type (a struct field), which may be an
uncommon or less important case for typical HPC apps. For the simple case of
AMO's on elements of an array of longs (the common case?), alignment should not
be a problem on any reasonable architecture. If we believe that is the common
case and the common case does not suffer from alignment problems, then it would
be unfortunate to burden the common case with additional syntax that only
affects the portability of the uncommon case.
"What if UPC is going to run in a heterogenous environment ... I can imagine in
some future system that the atomics live in special hardware (e.g. Cray Gemini
atomics living in the NIC) that requires a typedef"
This is a red herring. I believe hardware that *requires* AMO targets to live
in special memory locations (where "special" means something more than UPC
shared heap) are firmly beyond the scope of what we're trying to handle with
the UPC AMO library. The C11 AMO library also does not handle implementations
that require memory to be allocated "specially", although it does permit AMO's
to be "faster" ("lock-free") on certain locations. In any case, a typedef alone
is not enough to implement "special" memory, because it does not handle dynamic
allocation.
"I really don't see the use of atomic_int, for example, as "tedious to
incrementally add AMO's to existing applications". Is the tedium coming from
codes that want to replace lock/update/unlock with atomic operations and it is
not obvious to the user which declarations correspond to variables that will
require this treatment?"
It's tedious because large portions of the code may already written that
allocate and pass around long-lived application data structures as base types
(eg arrays of longs). Adding an AMO operation on an element in a
performance-critical loop can be a localized and incremental change if the AMO
supports basic types, otherwise adding the AMO may require adjusting many type
expressions globally throughout the application to accommodate the AMO call
being added to one corner of the program.
"Perhaps it could be permitted - only when possible - to cast T to atomic_T.
This allows the common case where typedef is required to support the evolution
of existing codes, although it would not be portable."
My point is that the typedef is NOT required in the common case. We've already
decided we don't want to allow AMO library metadata inside user data space, so
we're only talking about alignment requirements here. Many architectures never
have alignment problems for AMOs, and the rest only have a problem for struct
fields.
I was suggesting that the AMO's statically accept memory arguments of either
atomic_T or T, and in the latter case be permitted to abort at runtime in the
rare situation that alignment requirements are violated. This supports easy
(cast-free) incremental insertion of AMO operations, but also provides the
mechanism to write fully-portable programs with AMOs that are guaranteed to
satisfy alignment constraints.
If we feel supporting both atomic_T and T makes the API too "wide", we could
just support the former and recommend the use of casts as you suggest, which is
basically how C11 AMO's work. If we go this route then no special verbiage is
requires for those casts, because that's already covered by C99-6.3.2.3: "A
pointer to an object or incomplete type may be converted to a pointer to a
different
object or incomplete type. If the resulting pointer is not correctly aligned
for the pointed-to type, the behavior is undefined."
Original comment by danbonachea
on 10 Sep 2012 at 12:05
Dan wrote:
> Not really. I believe the alignment issue only arises when the target of the
AMO is located inside an aggregate
> type (a struct field), which may be an uncommon or less important case for
typical HPC apps. For the simple
> case of AMO's on elements of an array of longs (the common case?), alignment
should not be a problem on
> any reasonable architecture.
AIX on PPC64 is not reasonable by Dan's definition, because the ABI ensures
8-byte alignment for "double", but *not* for (void *) or long. The rules for
8-byte types inside structs on AIX is bizarre and not worth describing here.
Original comment by phhargr...@lbl.gov
on 10 Sep 2012 at 12:53
"AIX on PPC64 is not reasonable by Dan's definition, because the ABI ensures
8-byte alignment for "double", but *not* for (void *) or long. The rules for
8-byte types inside structs on AIX is bizarre and not worth describing here."
I don't think we're proposing to allow UPC AMO's on (void *), because
pointer-to-local living in shared space is already a "dicey" programming
construct. I haven't heard anyone mention AMO's on pointer-to-shared, but the
alignment of that type is already under the exclusive control of the UPC
compiler.
The lack of alignment guarantees for arrays of longs in AIX-PPC64 (and possibly
for other arrays of basic types elsewhere) is not likely to be an issue for
UPC, because the AMO library already requires the target to live in shared
space. In the case of static allocation of a shared array of longs, the UPC
implementation can (easily?) provide 8-byte alignment of that shared object to
make this problem go away (but this would not be required by the AMO library
spec). In the case of dynamic allocation, upc_alloc and friends are already
guaranteed to provide the necessary alignment for arrays of longs due to ISO C
7.20.3: "The pointer returned if the allocation succeeds is suitably aligned so
that it may be assigned to a pointer to any type of object and then used to
access such an object or an array of such objects in the space allocated",
because "any type of object" includes atomic_long.
Now that we've broached the subject, I think we should consider eventually
supporting compare-and-swap on a pointer-to-shared, since that would enable UPC
programmers to portably implement lock-free data structures in shared memory.
This usage case wasn't mentioned by our UPC users on the call, but I believe
this is an important class of AMO clients, judging by the enormous literature
on lock-free programming for non-UPC languages. This is something that could
possibly be added as an enhancement in 1.4.
I'm aware that many UPC platforms use a 128-bit PTS representation and
therefore probably cannot provide atomics on PTS exclusively in
instruction-level hardware. However, there are also platforms using a 64-bit
PTS representation, and also many cases where an implementation could do
significantly better than upc_lock/pointer_swap/upc_unlock, even for 128-bit
PTS. It's also worth noting that an implementation of the UPC AMO library
should be permitted to use software-based atomics for one type (eg PTS) and
fully-hardware atomics for a different, incompatible type (eg long) in the same
"domain" without creating a problem. Hopefully the AMO spec being drafted will
already prohibit touching the same memory location using two differently-typed
AMO operations within a synchronization phase.
Original comment by danbonachea
on 10 Sep 2012 at 2:14
"I don't think we're proposing to allow UPC AMO's on (void *)..."
So no one ever does pointer arithmetic with atomics? I don't know that "shared
pointer to local" is the only use case here either. I am a UPC noob but isn't
"shared pointer to shared" legal (and potentially useful)?
Original comment by jeff.science@gmail.com
on 10 Sep 2012 at 2:17
"I am a UPC noob but isn't "shared pointer to shared" legal (and potentially
useful)?"
Yes it is, which is the topic of my last two paragraphs that recommend
including AMO library support for compare-and-swap on a pointer-to-shared
(PTS). I wouldn't push for full-blown PTS arithmetic using AMOs, not because it
would be impossible but because this would complicate the interface with the
blocksize information required to perform PTS arithmetic, and lock-free
algorithms generally only require pointer CAS, not pointer arithmetic (and the
latter can be implemented using the former with an additional read). Adding PTS
CAS support would subsume the need for AMOs on (void *) in shared space, which
as I mentioned are only meaningful in very restricted cases (and I don't think
anyone is pushing for them).
Original comment by danbonachea
on 10 Sep 2012 at 4:15
As promised, and hopefully without too much delay, here's is a straw proposal
for an
"atomicity domain"-based AMO library.
-- upc_domain_t --
An object of type upc_domain_t represents an atomicity domain, which specifies
a set of
operations and datatypes over which access to a memory location in a given
synchronization
phase is guaranteed to be atomic if and only if no other mechanisms or
atomicity domains
are used to access the same memory location in the same synchronization phase.
-- upc_amo_mode_t --
The following constants of type upc_amo_mode_t shall be defined to allow
user-provided
indication of a preferred usage mode to the implementation.
* UPC_AMO_DEFAULT, An implementation-defined default mode
* UPC_AMO_HARDWARE, Use only hardware-supported atomic memory operations
-- Supported Operations & Types --
The atomicity mode UPC_AMO_DEFAULT will support the following combination of
operations
and data types:
AND OR XOR NOT SET CSWAP ADD MIN MAX
I, I32 X X X X X X X X X
L, I64 X X X X X X X X X
UI, U32 X X X X X X X X X
UL, U64 X X X X X X X X X
F32 X X X
F64 X X X
The atomicity mode UPC_AMO_HARDWARE will provide support for an
implementation-defined
subset of the data types and operations supported by UPC_AMO_DEFAULT.
-- Domain Initialization --
upc_domain_t UPC_DOMAIN_INIT(upc_op_t ops, upc_type_t types, upc_amo_mode_t mode);
A static initializer that creates a upc_domain_t object that supports the
operations
and types specified by ops and types, respectively, in the AMO implementation
mode, mode.
int upc_all_domain_init(upc_domain_t *domain, upc_op_t ops, upc_type_t types,
upc_amo_mode_t mode);
A dynamic initializer for a upc_domain_t object that initializes the specified
object
to support the operations and types specified by ops and types, respectively,
in the AMO
implementation mode, mode.
-- Support Query --
int upc_amo_query(upc_amo_mode_t mode, upc_op_t ops, upc_type_t types);
Returns 1 if the AMO implementation mode, mode, supports all pair-wise
combinations of
types and operations specified by types and ops, respectively; returns 0
otherwise.
-- Atomic Operations --
void upc_amo_strict(upc_domain_t *domain, void *fetch_ptr,
upc_op_t op, upc_type_t type,
shared void *target, void *operand1, void *operand2);
void upc_amo_relaxed(upc_domain_t *domain, void *fetch_ptr,
upc_op_t op, upc_type_t type,
shared void *target, void *operand1, void *operand2);
where:
- *target = *target OP *operand1, for OP in {AND, OR, XOR, ADD, MIN, MAX}
- *target = (*target == *operand1) ? *operand2 : *target, for CSWAP
- *target = *operand1, for SET
- *target = ~(*target), for NOT
and the original value of target is stored in fetch_ptr if and only if
fetch_ptr != NULL.
Original comment by nspark.w...@gmail.com
on 10 Sep 2012 at 4:35
"and also many cases where an implementation could do significantly better than
upc_lock/pointer_swap/upc_unlock, even for 128-bit PTS."
Could you be more specific? I'm having trouble coming up with anything better,
given that a programmer could use the lock to protect more than just the
pointer swap. Also, note that implementations that have 128-bit PTS
representation probably also have the restriction that 128-bit reads and writes
are not atomic at the hardware level, so you'll need 128-bit load and store
atomics as well, and these wouldn't play nice with strict reads/writes...
Original comment by sdvor...@cray.com
on 10 Sep 2012 at 4:56
Why no atomic SET for FP types? Don't we need to support atomic SET/GET for
all types? Load/store atomicity is not necessarily at word granularity.
Original comment by jeff.science@gmail.com
on 10 Sep 2012 at 5:33
"Why no atomic SET for FP types? Don't we need to support atomic SET/GET for
all types? Load/store atomicity is not necessarily at word granularity."
Note that the memory model implies sequential consistency for strict
references. I'm pretty sure that requires that strict memory operations are
atomic (or at least appear to be). Admittedly, the spec is extremely vague on
the mapping from source expression to memory model operations, so this may not
be the full story. I'll defer to others more knowledgeable on the intricacies
of the memory model.
Original comment by sdvor...@cray.com
on 10 Sep 2012 at 5:49
"Why no atomic SET for FP types? Don't we need to support atomic SET/GET for
all types? Load/store atomicity is not necessarily at word granularity."
It certainly wouldn't bother me to have atomic SET for floating-point types.
In general, I think a lot less about floating-point types and what one would do
with them. If we also need an atomic GET, we could probably just add a NO-OP
operation, so you'd do a fetching NO-OP and get the value without any
modification.
Original comment by nspark.w...@gmail.com
on 10 Sep 2012 at 5:53
I assume that AMOs are going to work in relaxed mode, too, which is why atomic
GET/SET are required, or should a user instead declare these references as
strict? It might at least be convenient to define atomic GET/SET, if only as
another way to declare a reference to be strict.
Original comment by jeff.science@gmail.com
on 10 Sep 2012 at 5:54
The more I think about the straw proposal in Comment 68, the more I'd like to
coalesce the domain and "mode" types and have an implementation provide
"everything" (UPC_AMO_DEFAULT, likely in software) or just what the hardware
can support (UPC_AMO_HARDWARE). Regarding all the different ways one could
implement atomics "in hardware", I think that would fall in as one of the
"implementation defined" aspects of the hardware-only domain. For Cray and
IBM, I would /expect/ that they would let their NICs do the work, but any
implementation could do whatever the implementation chose to do.
I also think it would be worth re-working the query to have separate types/ops
parameters for integer and floating-point types. The support matrix I
suggested doesn't work in the pair-wise matching of types and ops, if integer
and FP types are considered together.
Original comment by nspark.w...@gmail.com
on 10 Sep 2012 at 5:59
Comment 73 was about comment 71, btw.
Regarding comment 72, FETCH-AND-OP with NO-OP is exactly how MPI-3 does atomic
GET, btw.
I wonder if there is an advantage to having a separation. Unless the compiler
can remove both the operation and the input operand, there is going to be
packet overhead in doing GET via FETCH-AND-OP. Some systems have special
packets for 8-byte communication, which could work for atomic GET (assuming
perhaps that all GETs were atomic and thus nothing special was required to
achieve this). In this case we want to force to user to declare their intent
to do GET specifically.
The other issue is that the compiler can probably do more to optimize if it
doesn't have to inspect the input operand of FETCH-AND-OP to distinguish a
"GET" from a GET.
Original comment by jeff.science@gmail.com
on 10 Sep 2012 at 6:00
"access to a memory location in a given synchronization phase is guaranteed to
be atomic if and only if NO OTHER MECHANISMS or atomicity domains are used to
access the same memory location in the same synchronization phase."
This part of the straw spec answers all the questions about concurrent strict
accesses - basically, touching the same memory location with an AMO and a
strict operation in the same synchronization phase means atomicity is no longer
guaranteed. I would go even stronger and state that behavior is completely
undefined if you do this. Every atomic type needs to have GET and SET via the
AMO library, which is the "right" way to perform concurrent read/writes during
an AMO synchronization phase.
"Note that the memory model implies sequential consistency for strict
references. I'm pretty sure that requires that strict memory operations are
atomic (or at least appear to be)."
As discussed elsewhere in other threads, the memory model does NOT guarantee
freedom from word-tearing, which is a related but orthogonal issue. It would be
impossible to implement strict if we required strict operations to be
tearing-free, because you can perform a "single" strict write of arbitrarily
large size using a struct type. Programs with race conditions need to be aware
of the word-tearing issue, even if the races use only strict operations. Most
architectures won't word tear for 1,2,4,8 byte accesses, but this is
platform-specific.
In any case, we are specifically prohibiting concurrent access with non-AMO
read/writes (strict or otherwise), both to ensure atomicity can be implemented
when hardware support is absent, and to prevent word-tearing from being an
issue.
In resp to comment 69:
""and also many cases where an implementation could do significantly better
than upc_lock/pointer_swap/upc_unlock, even for 128-bit PTS."
Could you be more specific? I'm having trouble coming up with anything better,
given that a programmer could use the lock to protect more than just the
pointer swap."
An active message is sent, upon arrival software grabs a lock, does the update,
then releases the lock and sends a reply. For a OP-and-fetch that's one network
round-trip of latency. The naive UPC-level implementation of
upc_lock/pointer-read/pointer-write/upc_unlock in general will induce four
network round trips of latency, for approximately a 4x slowdown on a
latency-dominated network.
Original comment by danbonachea
on 10 Sep 2012 at 8:17
A few comments on the actual proposal:
int upc_all_domain_init(upc_domain_t *domain, upc_op_t ops, upc_type_t types,
upc_amo_mode_t mode);
This function should not return a value, because it should never fail (unless
the arguments are totally bogus values, which should be a fatal error). The
user is stating he wants this set of ops and types, give me a domain with the
"fastest" implementation that supports all the pairwise combos.
The mode argument makes no sense here - the user always wants the "fastest"
implementation you can provide. I would change that argument to be a
upc_amohint_t (or something similar) where the values are something like
UPC_AMO_LATENCY, UPC_AMO_THROUGHPUT and allow for additional
implementation-defined hints.
Finally, if we're going to support statically-initialized domains, I don't see
a reason to also provide a dynamic one. However, we DO need to think about how
these domain_t types are handled by the user, specifically whether they can be
passed around by value (like a upc_handle_t) or whether they must be handed by
reference (like a upc_lock_t *). The latter allows the implementation to store
state directly in the upc_domain_t, the former requires a level of indirection
in the implementation and opens the door to memory leaks inside the
implementation lacking a destructor, and potentially makes it more difficult to
implement the static constructor. Whichever we settle on should be used
consistently throughout - if constructors return a domain_t by value, then the
AMO's should also take the domain by value. Otherwise drop the static
initializer and work everything by reference. Finally, we should state that
domain construction is collective and domain handles are thread-specific, so we
don't have domains handles being passed around in shared memory.
"I also think it would be worth re-working the query to have separate types/ops
parameters for integer and floating-point types. The support matrix I
suggested doesn't work in the pair-wise matching of types and ops, if integer
and FP types are considered together."
An easy way to fix this is change it so each atomicity domain is only over a
SINGLE type. We already have to prohibit concurrent AMO access to a given
memory location using two different C-types within a synchronization phase, so
might as well fold this requirement directly into the domain abstraction. This
means every argument of (upc_type_t types) should be (upc_type_t type), and the
type argument can be dropped from the actual AMO call because it is redundant
and stored in the domain.
We haven't discussed the spelling of upc_type_t values and the exact C types
they correspond to, but we need to specify those.
Combining all of the above, we end up with a spec like:
upc_domain_t UPC_DOMAIN_INIT(upc_op_t ops, upc_type_t type, upc_amohint_t
hints);
void upc_amo_strict(upc_domain_t *domain, void *fetch_ptr, upc_op_t op,
shared void *target, void *operand1, void *operand2);
Example usage for an atomic add:
upc_domain_t mydom = UPC_DOMAIN_INIT((UPC_ADD|UPC_CSWAP), UPC_TYPE_INT32,
UPC_AMO_LATENCY);
shared int32_t val;
upc_amo_strict(&mydom, 0, UPC_ADD, &val, 42, 0);
Original comment by danbonachea
on 10 Sep 2012 at 8:48
"An active message is sent, upon arrival software grabs a lock, does the
update, then releases the lock and sends a reply. For a OP-and-fetch that's one
network round-trip of latency. The naive UPC-level implementation of
upc_lock/pointer-read/pointer-write/upc_unlock in general will induce four
network round trips of latency, for approximately a 4x slowdown on a
latency-dominated network."
This is true for most atomics. However, compare-and-swap is usually done in a
spin-loop until it succeeds. Most uses of compare-and-swap that I've seen look
something like this:
do {
tmp = val
new_val = f(tmp)
} while( cas( &val, tmp, new_val ) != tmp )
Instead of compare-and-swap, users can just use locks to prevent anyone from
touching the value until you've updated it.
lock( val_lock )
val = f( val )
unlock( val_lock )
This avoids the retry loop entirely by changing the algorithm. This is
something only the user can do. That said, some users just want to run their
lock-free code everywhere, regardless of whether or not it is a good idea on a
given platform, so let's consider the two software approaches mentioned.
With a software atomic implementation naively based on locks, the minimum
latency is dramatically increased due to the extra network round-trips. The
maximum latency is hopefully capped by a fair locking algorithm though.
With a software atomic implementation based on active messages, you avoid 3
extra network round-trips, but are dependent on someone actively processing
messages on the remote end, meaning the maximum latency is potentially
unbounded. Additionally, since we're talking about systems where the shared
pointer doesn't fit into 64 bits, there are scalability issues to consider. In
particular, full message queues become much more common. Thus, I question if
this is always better than naively using locks.
In general, the problem with software atomics is that there are many ways of
implementing them, but which implementation to choose can be extremely
problem-dependent. Users know more about the problem they are trying to solve,
and are thus better equipped to make that decision. Implementers are unlikely
to provide multiple implementations for the user to choose from (too
expensive), so the user is tough out of luck if the implementation's choice
doesn't match well with their problem. I suppose someone could write a bunch
of reference implementations using different algorithms in UPC to get around
this though...
Original comment by sdvor...@cray.com
on 10 Sep 2012 at 11:17
"This avoids the retry loop entirely by changing the algorithm. This is
something only the user can do. That said, some users just want to run their
lock-free code everywhere... Users know more about the problem they are trying
to solve, and are thus better equipped to make that decision."
My argument for inclusion of PTS CAS was to support lock-free algorithms.
Lock-free algorithms require a pointer CAS AMO. The question of whether or not
the user SHOULD be using a lock-free algorithm on a given platform for a given
situation is orthogonal - there are cases where lock-free data structures are
probably faster (possibly much faster), and also cases where upc_lock-ing with
a traditional data structure may be faster. This is part of application tuning,
and it is not our place to dictate that decision. If we fail to provide PTS CAS
then we are taking that decision away from the user and effectively prohibiting
the use of lock-free data structures.
Original comment by danbonachea
on 11 Sep 2012 at 12:21
"My argument for inclusion of PTS CAS was to support lock-free algorithms.
Lock-free algorithms require a pointer CAS AMO. The question of whether or not
the user SHOULD be using a lock-free algorithm on a given platform for a given
situation is orthogonal - there are cases where lock-free data structures are
probably faster (possibly much faster), and also cases where upc_lock-ing with
a traditional data structure may be faster. This is part of application tuning,
and it is not our place to dictate that decision. If we fail to provide PTS CAS
then we are taking that decision away from the user and effectively prohibiting
the use of lock-free data structures."
Yes, I agree that it should be included for completeness. My intent was merely
to point out that the choice of which software implementation is best is
usually more dependent on the problem at hand than on the platform details, and
implementations are not likely to provide many choices given the development
cost. Therefore, a generic software atomics library, with multiple algorithms,
that worked on a wide variety of conforming UPC implementations would be
useful--more so if such a library could continue to make use of hardware
atomics provided by an implementation.
Original comment by sdvor...@cray.com
on 11 Sep 2012 at 12:56
Some last-minute follow up...
--- Initializers and Modes/Hints ---
Dan said:
> This function should not return a value, because it should never fail (unless
the arguments
> are totally bogus values, which should be a fatal error). The user is
stating he wants this
> set of ops and types, give me a domain with the "fastest" implementation that
supports all
> the pairwise combos.
>
> The mode argument makes no sense here - the user always wants the "fastest"
implementation
> you can provide. I would change that argument to be a upc_amohint_t (or
something similar)
> where the values are something like UPC_AMO_LATENCY, UPC_AMO_THROUGHPUT and
allow for
> additional implementation-defined hints.
>
> [...]
>
> An easy way to fix this is change it so each atomicity domain is only over a
SINGLE type.
This approach makes sense to me. Turning the mode into a hint, which was the
intent all along
(I think), makes more sense to me with the per-type domain (from Dan, below).
I think it eases
the ability to "mix-and-match" types and ops and hints, knowing that you'll get
the fastest
version available as long as you don't over-specify the ops.
I think it simplifies AMO domain creation, unless you wanted AMOs on *lots* of
types. What it
complicates is whether we want to provide a default domain that supports
"everything". We'd
now have to provide "everything" domains on a per-type basis. That said,
people looking into
AMOs would likely be fine creating what they need and wouldn't necessarily need
to have an
"everything" domain.
Dan said:
> If we're going to support statically-initialized domains, I don't see a
reason to also
> provide a dynamic one.
When I've talked to my users about the domain idea, they've always said they'd
use statically-
initialized domains. I think I put the dynamic initializer in there for
completeness. Jeff
did previously say that "Requiring static allocation is really unpleasant from
a usability
perspective."
I think Dan's example at the end of Comment 77 looks pretty reasonable.
--- AMO Types ---
Dan said:
> We haven't discussed the spelling of upc_type_t values and the exact C types
they correspond to,
> but we need to specify those.
For completenss with the types supported by the UPC Collectives, I would have
upc_type_t support:
> UPC_CHAR, UPC_SHORT, UPC_INT, UPC_LONG,
> UPC_UCHAR, UPC_USHORT, UPC_UINT, UPC_ULONG,
> UPC_INT8, UPC_INT16, UPC_INT32, UPC_INT64,
> UPC_UINT8, UPC_UINT16, UPC_UINT32, UPC_UINT64,
> UPC_FLOAT, UPC_DOUBLE, UPC_LDOUBLE,
> UPC_PTS
although not all of these would be supported by AMOs or UPC Collectives. I
think it would be a
complete-enough set for use in other (future) libraries, too.
--- AMO OPs ---
Continuing along the lines of the AMO types, I would like to see upc_op_t
(7.3.2.1) pulled out of
the UPC Collectives library and made a common type for AMOs, Collectives, and
whatever else may
also want it. In addition to what is in 7.3.2.1, we would add UPC_GET,
UPC_SET, UPC_NOT, and
UPC_CSWAP. Both UPC Collectives and UPC Atomic would each have to specify
which operations are
supported.
--- Supported Types & Operations ---
Revising the table from before...
CSWAP AND OR XOR NOT GET SET ADD MIN MAX
I, I32 X X X X X X X X X X
L, I64 X X X X X X X X X X
UI, U32 X X X X X X X X X X
UL, U64 X X X X X X X X X X
F32 X X X X X
F64 X X X X X
PTS X
Original comment by nspark.w...@gmail.com
on 26 Sep 2012 at 7:51
SHORT VERSION: Static initializers cannot be supported by software-emulated AMO
domains, so this feature should be dropped.
AMO's by definition imply coordination between multiple threads. We've
previously discussed the need for domain creation to be a "collective" thing,
with participation from all the threads which intend to use the domain to
perform "conflicting" AMOs. Eventually that may be collective over a team, for
now it means all the threads. The reason for this collective requirement is to
allow the implementation of the library to setup whatever metadata it needs to
ensure atomicity.
In the case of a domain where all the requested OPs can be supported via
hardware instructions, probably nothing special needs to be done during domain
creation - except probably record the type size provided by the user, and
possibly stash away the other arguments for debug checking (all of which can be
done locally with no coordination or computation). The actual AMO ops in this
case boil down to a thin wrapper around a hardware instruction, prefaced by a
check that this domain is hardware-supported.
However in the case of a domain that needs to perform locking in software, more
probably needs to happen. Assume we are interested in a "good" implementation
where each software-emulated domain is implemented to operate independently of
other software domains (rather than a "dumb" approach of just using a single
big lock for all software-emulated domains, which would lead to artificial
serialization contention). In this implementation each domain needs its own
dedicated "lock" resource, which might be a upc_lock_t, an OS mutex, or
something similar. The important point is that domain creation involves
allocating this lock resource, and very importantly also some form of
COLLECTIVE coordination (probably a broadcast) to ensure all threads
participating in the domain creation agree upon the identity/location of the
lock resource protecting this domain.
For dynamic domain creation, this is not a problem - we can require the domain
allocation function to be called collectively in the user's program, and the
library can perform whatever allocation and coordination it requires inside
that call. However, I'm now realizing that permitting static allocation of AMO
domains would create serious problems for a software implementation. Consider
the "simplest" case of a file-scope statically-initialized domain. Our base C99
language does not support running arbitrary code to initialize static objects
(as for example in C++ or Java), the only support which is guaranteed to exist
in the compiler is static initialization to a constant value. This means the
static initializer implementation basically must be a macro, and at best it can
simply stuff the arguments into some kind of struct. This does not allow the
software implementation to dynamically allocate any lock resources it may need
(especially if those locks need to live in the shared heap), nor does it permit
collective coordination to uniquely associate the corresponding per-thread
domain objects with each other and/or that dedicated lock resource. It might be
possible to play some games with delayed initialization of lock resources to
skirt the first issue, but I believe the second is fundamental - if the
application statically creates multiple domains with similar arguments that
must be emulated in software, there is no way for the library to reliably
"match up" the domain resources across threads.
I believe this argues strongly for dropping the idea of static initialization
of AMO domains entirely. AMO domain creation should be accomplished by a call
to a collective initializer library function, which creates the domain and
passes back a pointer or handle to it. It means a line of code per domain in
the user program, but the alternative design severely curtails software
implementation, and by extension the entire AMO library (since software
implementation is our "fall back" position for anything not natively supported
in hardware).
Original comment by danbonachea
on 27 Sep 2012 at 8:24
Updated interface proposal:
--------------------------
TYPE
upc_amodomain_t
The type upc_amodomain_t is an opaque UPC type. upc_amodomain_t is a shared
datatype with incomplete type (as defined in [ISO/IEC00 Sec 6.2.5]). Objects
of type upc_amodomain_t may therefore only be manipulated through pointers.
Two pointers to upc_amodomain_t that reference the same AMO domain object will compare
as equal. The results of applying upc_phaseof(), upc_threadof(), and
upc_addrfield() to such pointers are undefined.
INITIALIZATION
upc_amodomain_t *upc_all_amodomain_alloc(upc_op_t ops, upc_type_t type, upc_amohint_t hints);
The upc_all_amodomain_alloc function dynamically allocates an AMO domain and returns
a pointer to it.
upc_all_amodomain_alloc is a collective function. The return value on every
thread points to the same AMO domain object.
The AMO domain created supports AMO calls to operate on objects of a unique type,
specified by the "type" parameter. The upc_type_t values and the corresponding type
they specify are listed in table [ref].
The "ops" parameter specifies the atomic operations to be supported by the domain.
The valid upc_op_t values and their meanings are listed in table [ref].
Multiple upc_op_t values can be combined by using the bitwise OR operator (|),
and each value has a unique bitwise representation
that can be unambiguously tested using the bitwise AND operator(&).
The "ops" parameter shall only specify operations within the set permitted for "type"
(as defined in table [ref]), otherwise behavior is undefined.
The "hints" parameter provides a performance hint to the implementation. It shall be
equal to one of the following values:
0 : default behavior
UPC_AMO_HINT_LATENCY : requests the implementation to minimize latency of AMO operations
UPC_AMO_HINT_THROUGHPUT : requests the implementation to maximize throughput of AMO operations
UPC_AMO_HINT_* : Implementation-defined additional hint values
The implementation is free to ignore the "hints" parameter.
USAGE
void upc_amo_strict(upc_amodomain_t *domain, void *fetch_ptr, upc_op_t op,
shared void *target, void *operand1, void *operand2);
Description to be added...
Example usage for an atomic add:
-------------------------------
shared int32_t val;
int main() {
upc_amodomain_t *mydom = upc_all_amodomain_alloc((UPC_ADD|UPC_GET|UPC_SET), UPC_TYPE_INT32, UPC_AMO_HINT_LATENCY);
upc_amo_strict(mydom, 0, UPC_ADD, &val, 42, 0);
}
Notes:
------
This is not the only possible interface, but I'm intentionally mimicing the
upc_lock_t interface, because it is a well-proven interface which is
established and familiar to UPC users.
The UPC I/O library has an elaborate system for providing implementation hints
(and including implementation-defined hints), but for now I'm assuming we don't
need that much complexity and only expect a few possible hints.
We should consider whether we need a domain destructor, ie: void
upc_all_amodomain_free(upc_amodomain_t *d). If we're convinced that all
possible clients will only have a small constant number of domains, then
perhaps that's unnecessary.
Original comment by danbonachea
on 27 Sep 2012 at 8:25
Nick said:
"We'd now have to provide "everything" domains on a per-type basis. That said,
people looking into AMOs would likely be fine creating what they need and
wouldn't necessarily need to have an "everything" domain."
I don't think we should provide any pre-defined "everything" domains. Such a
domain would by definition include all the ops (for a given type) and therefore
would activate the most general (read "slowest") implementation for that type.
The whole point of domains is for the user to explicitly tell the library which
ops his module needs, so that the fastest possible implementation meeting those
restricted needs can be provided. Pre-defining fully general domains would
undermine that design goal.
Nick said:
"Continuing along the lines of the AMO types, I would like to see upc_op_t
(7.3.2.1) pulled out of the UPC Collectives library and made a common type for
AMOs, Collectives, and whatever else may also want it. In addition to what is
in 7.3.2.1, we would add UPC_GET, UPC_SET, UPC_NOT, and UPC_CSWAP. Both UPC
Collectives and UPC Atomic would each have to specify which operations are
supported."
Issue 10 discusses this factorization, and I agree this should be a high
priority to complement the inclusion of AMO's in 1.3. UPC_GET and UPC_SET are
definitely required for every type, since we prohibit direct access during AMO
phases, and these are our guaranteed "tear-free" reads and writes. I assume
your "UPC_NOT" is a bitwise complement, otherwise it should be named
"UPC_LOGNOT" to match the established convention in the collectives. I haven't
heard anyone argue for atomic logic operations (UPC_LOGAND, UPC_LOGOR), so
let's leave those out of the supported AMO list for now.
Nick said:
"For completenss with the types supported by the UPC Collectives, I would have
upc_type_t support:
> UPC_CHAR, UPC_SHORT, UPC_INT, UPC_LONG,"
I don't recall any committee discussion regarding AMO's on types smaller than
32-bits. I don't personally object to that, and I agree it would be nice to
provide it for completeness and consistency with the UPC collectives and C11
AMOs. For software-based implementations it should only be a small additional
burden to support additional types, and I suspect every implementation will
need a fall-back software implementation anyhow.
Original comment by danbonachea
on 28 Sep 2012 at 11:42
Dan said:
> I don't think we should provide any pre-defined "everything" domains.
I thought there were people (not me) that wanted to provide that as a
convenience factor. Maybe I was mistaken. Either way, I don't think my users
would use an "everything" domain for the very reason you note. They want the
fastest for what they know they need. I'm happy to scratch the "everything
domain" idea.
Dan said:
> I don't recall any committee discussion regarding AMO's on types smaller than
32-bits.
I am not at all advocating for AMOs on <32-bit types. I would probably be
happy with 64-bit types only, but considering some notes about BG/P's AMO
support (I think), 32-bit AMOs make sense, too. I noted that "although not all
of these would be supported by AMOs or UPC Collectives" and I would consider
AMOs to only apply to the 32- and 64-bit types we've discussed so far, as well
as the pointer-to-shared type.
Original comment by nspark.w...@gmail.com
on 1 Oct 2012 at 9:12
Since "we" (I'm not sure how many people are really concerned about the AMO
discussion) seem to be considering AMO domains on a per-type basis (i.e., only
guaranteeing atomicity on a per-type basis), I was wondering whether it would
be worth revisiting an "old style" proposal (e.g., upc_amo_xor_U64_r()) with a
restricted operation set (defined on a per-type basis). My biggest concern
with the domain approach, especially given that domains must be dynamically
allocated, is the general user acceptance with such a "heavy" API (c.f.,
existing AMO extensions).
If we went with a subset of the operations noted in Comment 38 and Comment 81,
could we define an easier-to-use API that is supported by Cray, IBM, SGI, and
the IB-based solutions (HP, SGI), even if it isn't as robust (i.e., no MIN or
MAX)? Would fetching and non-fetching versions of {ADD, AND, OR, XOR, GET,
SET, CSWAP} for integer types be a supportable, minimal set across
implementations?
Maybe we leave operations like MIN and MAX for a future update when vendor
support is more widespread. Floating-point support wouldn't be an issue
because of the per-type atomicity; we wouldn't include bit-wise ops for FP
types. And most FP AMOs would probably be done in software anyway.
Then, we could let an implementation define macros like
UPC_AMO_[[TYPE]]_IS_FAST to provide some information to the user about the
"fastness" of the AMO.
Original comment by nspark.w...@gmail.com
on 3 Oct 2012 at 2:22
"Floating-point support wouldn't be an issue because of the per-type atomicity;
we wouldn't include bit-wise ops for FP types."
What's the hangup for bit-wise operations? These operations just work on bits
and don't need to know whether those bits happen to be an integer or a floating
point value (or anything else for that matter). Not including them will just
cause users to play dirty games with unions or bad pointer aliasing. We might
as well do that inside the library instead of forcing users to do it.
Original comment by sdvor...@cray.com
on 3 Oct 2012 at 2:34
"What's the hangup for bit-wise operations? [...]"
I just didn't think many people would care to AND, OR, or XOR two
floating-point values. I guess that I just don't see what's helpful about
that. I think that ADD, GET, SET, CSWAP all make sense; I just can't see how
you'd use the others in a meaningful way. Also, I don't necessarily feel like
the UPC AMO library needs to provide everything under the sun; I think it
should provide those things that are meaningful or helpful to users.
Original comment by nspark.w...@gmail.com
on 3 Oct 2012 at 2:39
I support Comment 86 by Nick. I like the idea of starting off from a small and
simple API and enhancing it over time as hardware evolves. Based on Comment
38, it seems int64 is the only type we need to support now if we decide to go
with this direction.
Original comment by yzh...@lbl.gov
on 3 Oct 2012 at 2:46
"I just didn't think many people would care to AND, OR, or XOR two
floating-point values."
This is very true. I'd expect bit-wise ops to take appropriately sized integer
values as masks. ;)
"I guess that I just don't see what's helpful about that. I think that ADD,
GET, SET, CSWAP all make sense; I just can't see how you'd use the others in a
meaningful way."
Just like with integers, they're extremely useful for directly manipulating
various portions of the raw representation. For instance, with an IEEE
representation, adding XOR allows users to atomically negate a floating point
value, and adding AND allows users to atomically find the absolute value.
On an unrelated note, are you including SWAP under SET? SWAP is very important
in many lock-free algorithms.
"Also, I don't necessarily feel like the UPC AMO library needs to provide
everything under the sun; I think it should provide those things that are
meaningful or helpful to users."
Agreed.
Original comment by sdvor...@cray.com
on 3 Oct 2012 at 2:52
"Based on Comment 38, it seems int64 is the only type we need to support now if
we decide to go with this direction."
I disagree. I think we should still support [u]int(32|64), float/double, and
PTS, but (according to Comment 38) Cray would only define UPC_AMO_INT64_IS_FAST
== 1. I think the big benefit of the per-type atomicity is that it lets a
vendor, like Cray, tell the user which AMO types are fast and which are not,
while still providing reasonably broad support across the library.
Original comment by nspark.w...@gmail.com
on 3 Oct 2012 at 2:52
"Just like with integers, they're extremely useful for directly manipulating
various portions of the raw representation. For instance, with an IEEE
representation, adding XOR allows users to atomically negate a floating point
value, and adding AND allows users to atomically find the absolute value."
I hadn't thought of that, but that is a neat use for bit-wise ops on FP types.
Since most NIC-based AMOs seem to be atomic across types, I would imagine that
your future FP support (noted in Comment 38) wouldn't be tied down by UPC
allowing bit-wise FP ops.
"On an unrelated note, are you including SWAP under SET? SWAP is very
important in many lock-free algorithms."
Yeah, I was considering SWAP = FETCH + SET. If we go the "explicit" route
(like the current BUPC and Cray UPC AMO extensions), then we'll probably have
two functions like: "void upc_amo_set()" and "TYPE upc_amo_swap()" (I'm
intentionally leaving out the type and relaxed/shared stuff here).
Original comment by nspark.w...@gmail.com
on 3 Oct 2012 at 2:58
For whatever it's worth to this discussion, I talked to our Fortran rep about
how their standard's committee was settling on which AMO operations to include
for coarrays in the next standard (since Fortran currently has only an atomic
read and an atomic write). He said that their approach is to require a set of
operations that all implementations will be expected to support (add, and, or,
xor, swap, compare-and-swap) and they basically assume that any "reputable
hardware" will support them all efficiently. The idea of bothering with
domains, a query mechanism for what is fast or slow, or worrying about some
poor vendor that had hardware support for only a subset that was forced into
software atomics seemed ridiculous to him. Then again, they also have a
specific implementation-defined integer data type for atomics.
Original comment by johnson....@gmail.com
on 3 Oct 2012 at 3:17
"Also, I don't necessarily feel like the UPC AMO library needs to provide
everything under the sun; I think it should provide those things that are
meaningful or helpful to users."
I think the problem is that while any specific user is likely to only consider
a small set of AMO types and ops "meaningful and helpful", that set will differ
between users/programs/modules, and the union of those requirements ends up
being large. The domain API allows users to express which subset they care
about for the current algorithm and the implementation to provide the best
possible performance within those constraints. On the flip side, it allows
implementations to expose whatever hardware AMO support they have for a
particular algorithm, without requiring the hardware to support the complete
list of ops in the UPC spec (most of which won't be used by any single
algorithm).
A secondary, more subtle feature of domains is they can improve concurrency by
directly expressing the independence of AMOs from different modules or program
entities, which need not contend for the same atomicity resources (software
locks in the case of software-implemented AMOs). Imagine a massively-parallel
application that divides its threads into a large number of teams, each of
which creates an AMO domain for access to each team's shared data. If the data
type in question happens to require a software AMO (eg atomic add on a
floating-point accumulator), then members of each team contend for the team's
lock while performing the AMO, but AMO's from different teams can always
proceed independently. Without domains, then EVERY thread in the application
potentially contends for ONE lock, even for non-conflicting updates, which
could make a massive difference in overall performance. Once UPC officially
grows teams, it also gives us a natural way to ensure the locks associated with
the domain are located "near" the team using them, which may mean the
difference between network communication and shared-memory locking, another
potentially enormous performance win. One could even imagine team domain
creation that activates a fully hardware implementation when the team members
are all local to a shared-memory node, but falls back to a software
implementation when the team includes members who are remote over a commodity
network lacking AMO features. C11 doesn't worry about such issues because its
target platform is fully shared-memory threading with relatively small-scale
concurrency, where these issues have far less impact. I suspect the same is
true for the majority of the Fortran community (those not using co-array
features). It's our job to ensure the design we specify allows good scaling on
large-scale systems (high thread concurrency), and large-scale applications
(high module concurrency), and I think domains provide that.
From a programability standpoint, I don't think the domain proposal is overly
cumbersome or "heavy", relative to a "domain-less" API. The domain is created
once at module initiation, with one line of code. Thereafter for AMO operations
the difference we're talking about is whether the user writes something like:
upc_amo_strict(int64dom, 0, UPC_ADD, &val, 42, 0);
as opposed to something like:
upc_amo_strict_int64(0, UPC_ADD, &val, 42, 0);
ie the only effective difference at the point of use is whether the operand
type is expressed using the domain argument or in the spelling of the function
name. In my code above the difference is 2 characters, and could be less if I
named my domain variable differently. How is this a major burden? The tiny
additional complexity (for the programmer) seems well worth the potential
performance gain.
"I think the big benefit of the per-type atomicity is that it lets a vendor,
like Cray, tell the user which AMO types are fast and which are not, while
still providing reasonably broad support across the library."
We seem to have consensus that per-type atomicity is a mandatory design
feature. It allows the implementation (and performance) to vary based on the
operand type, and it prohibits nasty interactions of atomics of different type
widths conflicting on the same memory locations, which seems like a Good Thing.
Also, C11 atomics take the same position, so it will be a familiar restriction.
""On an unrelated note, are you including SWAP under SET? SWAP is very
important in many lock-free algorithms."
Yeah, I was considering SWAP = FETCH + SET. "
To clarify, most lock-free algorithms I've seen require an atomic pointer
COMPARE and swap, not just an unconditional swap. This is NOT equivalent to a
fetch and set. For reference, C11 DOES provide atomic pointer CAS (via
uintptr_t) and even atomic pointer increment (which I'm not advocating for
UPC), but does NOT provide any atomic ops on floating point types (although
given that many HPC codes are so heavily dominated by FLOPs it seems reasonable
for UPC to add support for atomics on floats).
"Then, we could let an implementation define macros like
UPC_AMO_[[TYPE]]_IS_FAST to provide some information to the user about the
"fastness" of the AMO."
This seems like a nice usability feature. C11 has something similar and
slightly more sophisticated. I think we can choose to provide this
independently of any other design decisions.
Original comment by danbonachea
on 4 Oct 2012 at 4:41
Nick - I'm in the process of drafting the spec for upc_type_t, based on your
comment 81. I noticed that C11 defines the following types for AMOs that aren't
mentioned in your list, so I wanted to float these past you (no pun intended):
These I suspect are irrelevant for the majority of UPC codes:
_Bool
wchar_t
These are for issuing AMOs that perform local pointer artihmetic, which perhaps
we don't care about?:
(u)intptr_t - integer type convertable to/from (void *)
ptrdiff_t - the result of subtracting two pointers
These seem potentially more applicable to UPC codes:
(u)intmax_t - the widest available integer type
size_t - the size of objects
Finally, the C11 AMOs use the stdint minimum-width types:
(u)int_least{8,16,32,64}_t and (u)int_fast{8,16,32,64}_t
instead of the fixed-width types stated in comment 81:
(u)int_{8,16,32,64}_t
The main difference is the minimum-width types may be larger than requested and
are required for C99 compliance, whereas the fixed-width types must be
exact-sized and are technically optional by C99/C11. In practice every current
HPC system I've encountered (all the ones supporting GASNet) provide all the
exact sized types, so perhaps this is a non-issue for the platforms of
interest. However to be safe we might want to add a clause to the AMO spec
stating that support for AMOs on the fixed-width types is
implementation-specified.
Original comment by danbonachea
on 7 Oct 2012 at 8:38
I have worked on the reference implementation today, in the faint hope that I'd
flush out a few things that were not obvious in the documentation.
a) I had to redefine all types, and I had to make them bitmask-able. Something
like this:
enum {
UPC_AMO_CHAR = (1<<0),
UPC_AMO_SHORT = (1<<1),
UPC_AMO_INT = (1<<2),
UPC_AMO_LONG = (1<<3),
UPC_AMO_UCHAR = (1<<4),
UPC_AMO_USHORT = (1<<5),
... etc ...
Same goes for the operations. I don't know whether the original UPC definitions
of upc_type_t are asking for bit-maskable values. Something to consider.
b) There is no good error reporting mechanism in UPC. What if I give the domain
allocator a hint that I want the superfast HW implementation, but then ask for
support for types that are not supported by HW?
-- should the domain creation fail?
-- should the domain creation succeed, but subsequent operation fail?
-- should failures be fatal (i.e. kill the UPC program), or should we have
status codes returned by upc_amo_strict and upc_amo_relaxed?
I implemented upc_amo_query() in any case to check whether a particular domain
implements a particular combination of types and operations.
c) the domain constructors do not return shared objects, and therefore the
_all_ and _global_ protocol does not apply. This is a pity, because it breaks
consistency with the existing lock and memory allocators.
-- should we follow the consistency rules of memory and lock allocators at all
costs?
If we talk on the phone tomorrow, I will have some code. Suggestions as to
where to post the code? even if it's throwaway stuff (as it most likely is), it
might provide a further platform for discussion.
Original comment by ga10...@gmail.com
on 8 Oct 2012 at 9:26
"I have worked on the reference implementation today, in the faint hope that
I'd flush out a few things that were not obvious in the documentation."
Excellent - this exercise should definitely provide additional insights.
"I had to redefine all types, and I had to make them bitmask-able."
I don't think there is any reason for the values of upc_type_t to each have a
dedicated bit. These are never OR'd together in the proposed interfaces we've
tossed around. Note each domain should only be for a SINGLE unique type - I
believe we have consensus that we should not combine multiple types within an
atomicity domain (mostly because it sidesteps word tearing issues). The
upc_types.h header proposal I drew up for issue 10 only requires the type
macros to have distinct values (and also spells them differently than what you
pasted).
"What if I give the domain allocator a hint that I want the superfast HW
implementation, but then ask for support for types that are not supported by
HW? "
There should be no "superfast HW" hint - the hints are of the form "tune for
latency", "tune for throughput" etc, but in all cases "do the best you can for
this combination of type and ops" is implied. In any case the hint value should
not cause an error - it's a hint, not a demand. There should not be a hint
value to select a particular vendor-specific implementation, because that would
create portability problems.
If the user specifies a prohibited combination of type and op (eg UPC_PTS with
UPC_MULT), or type/op values that don't exist, then that should be a violation
of a "shall" constraint in the AMO spec - which means behavior is undefined,
but this should be a fatal error in a high-quality implementation. I see no
motivation to provide graceful failure for such programming errors.
"the domain constructors do not return shared objects, and therefore the _all_
and _global_ protocol does not apply. This is a pity, because it breaks
consistency with the existing lock and memory allocators. "
Please see comment 83 - I intentionally followed the form of the upc_lock_t
interface, which I think addresses your comment...
Original comment by danbonachea
on 8 Oct 2012 at 9:42
"If the user specifies a prohibited combination of type and op (eg UPC_PTS with
UPC_MULT), or type/op values that don't exist, then that should be a violation
of a "shall" constraint in the AMO spec - which means behavior is undefined,
but this should be a fatal error in a high-quality implementation. I see no
motivation to provide graceful failure for such programming errors."
The user specifying type/op combinations that aren't supported by the
implementation should be a fatal error (or undefined behavior), and an
implementation should support all required combinations in the spec. However,
I think we would want to explicitly permit implementations to support a
documented super-set of the required combinations so they can introduce new
(non-portable) functionality for users to experiment with. This way
implementations wouldn't need to have a duplicate AMO API and makes satisfying
the "two independent implementations" criteria for future extensions a bit
easier.
Original comment by sdvor...@cray.com
on 8 Oct 2012 at 9:56
In lieu of a better solution I'll spam all of you with the tar file of
the current implementation. Please comment. I expect most of the code
to change.
> I don't think there is any reason for the values of upc_type_t to
each have a dedicated bit. These are never OR'd together in the
proposed interfaces we've tossed around. Note each domain should
only be for a SINGLE unique type - I believe we have consensus that
we should not combine multiple types within an atomicity domain
(mostly because it sidesteps word tearing issues). The upc_types.h
header proposal I drew up for issue 10 only requires the type macros
to have distinct values (and also spells them differently than what
you pasted).
Misunderstanding on my part, then. Looking purely at the API proposed
somewhere above in this thread - I interpreted the type and op
arguments for domain creation as bitmasks of desired operation. I may
agree with Dan about the type, but it certainly makes no sense to me
to create one domain for each operation. So my argument will still
apply to the operations.
> There should be no "superfast HW" hint - the hints are of the form
"tune for latency", "tune for throughput" etc, but in all cases "do
the best you can for this combination of type and ops" is
implied. In any case the hint value should not cause an error - it's
a hint, not a demand. There should not be a hint value to select a
particular vendor-specific implementation, because that would create
portability problems.
OK. No errors upon domain creation. I can still ask to execute an AMO
on an un-implemented operation on a domain. So AMOs can still issue
"not implemented" errors. Hence my choice to return a status code from
the AMO. Needless to say I'm not very strongly attached to this code,
so I can be talked out of this choice.
> Please see comment 83 - I intentionally followed the form of the
upc_lock_t interface, which I think addresses your comment...
I guess the question in my mind is this: is an AMO
domain a shared object? A lock is obviously a shared object.
If the AMO domain is shared, a single thread can free it, and the
whole alloc/free protocol translates directly from memory allocation
and lock allocation.
If the AMO domain is local, then the protocol does not apply.
Original comment by ga10...@gmail.com
on 9 Oct 2012 at 12:53
"In lieu of a better solution I'll spam all of you with the tar file of
the current implementation."
You can click the link next to the paper clip icon with the label "Attach a
file" to add the file to a ticket.
Original comment by gary.funck
on 9 Oct 2012 at 1:10
Original issue reported on code.google.com by
nspark.w...@gmail.com
on 14 Mar 2012 at 3:40