Intrepid / upc-specification

Automatically exported from code.google.com/p/upc-specification
0 stars 1 forks source link

Library: Atomic Memory Operations (AMO) #7

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The 1.2 UPC Spec does not currently specify atomic memory operations, however, 
UPC AMO extensions have been implemented by LBL, MTU, and some vendors.  All 
the implementations vary, so I have linked some of the available documentation 
below.

The addition of UPC AMO extensions to the 1.3 Spec would be a big productivity 
enhancement and a big step toward having a standard set of AMOs that are 
supported across the reference and vendor implementations.

I think adoption of the BUPC AMO extensions would be sufficient for my 
interests.

References:
  - BUPC AMO Extensions:
    http://upc.lbl.gov/docs/user/index.shtml#atomics
  - MTU UPC AMO Extensions:
    http://www.cug.org/5-publications/proceedings_attendee_lists/2005CD/S05_Proceedings/pages/Authors/Merkey/Merkey_paper.pdf
  - Cray AMOs:
    http://docs.cray.com/cgi-bin/craydoc.cgi?mode=Show;q=upc;f=man/xt_libintm/80/cat3/amo.3i.html

Original issue reported on code.google.com by nspark.w...@gmail.com on 14 Mar 2012 at 3:40

GoogleCodeExporter commented 9 years ago
What seems to be shaping up is a situation analogous to MPI communicators and 
MPI collective implementation choices. Atomic domains are in some sense similar 
to MPI communicators - certainly from a syntax point of view. Building on the 
analogy:

* Have we decided what happens when the same shared pointer is used in two 
different atomic operations in two different domains? are the results supposed 
to be defined?

(My answer: heck no, you are on your own)

* Is there an ordering constraint between atomic operations executed on 
different domains?

(My answer: usual UPC ordering semantics regardless of domains - unlike MPI 
matching semantics w.r.t. communicators)

* Picking a domain implementation policy is akin to picking a collective 
implementation in PAMI. The QoS description of each collective is hairy, and 
the matrix of "best implementation for a particular situation" is a complex 
one. Some examples that come to mind:

"-> Implementation 1 is HW only, requires no progress on the receiver, has 
reasonable latency and BW but has nonscalable behavior beyond 1000 tasks 
exerting pressure on the same pointer.
-> Implementation 2 is SW only, completely scalable to the limit of the system 
and with reasonable latency, but uses twice the bandwidth of implementation 1 
and requires an active message handler to run on the thread affine to the 
pointer."

These kinds of implementations defy a short description. Facing a similar 
situation with collectives (accelerated vs. restricted to tori vs restricted to 
fixed point operations etc etc), the PAMI designers have opted for giving the 
users complete choice by providing a listing of names on demand, and allowing 
the user to make choices. In addition, "choice 0" is always the most general 
choice. Can we do something similar here?

Original comment by ga10...@gmail.com on 7 Sep 2012 at 2:58

GoogleCodeExporter commented 9 years ago
As mentioned in an earlier comment by Troy, the C11 spec adds a complete AMO 
library that exposes functionality that is very similar to what we're trying to 
expose with our library. I think we need some high-level community discussion 
regarding whether C11 compatibility is ever a goal for UPC; but regardless of 
how that matter is resolved (and assuming C11 gains wide acceptance and 
implementation), I think whatever we come up with for AMOs will generate a very 
obvious question from every C11-savvy user. Namely, "Why doesn't UPC just 
support C11 AMO's with a pointer-to-shared?" and "Why can't I use C11 and UPC 
AMOs together on the same locations?". To play devil's advocate, a minimalistic 
way we could define AMO's for UPC would be to simply insert (shared _Atomic 
int) and (shared _Atomic long) into the atomic types specified in C11 7.17.6, 
and import that library wholesale. I'm NOT necessarily advocating that we 
should do this, but assuming that we intentionally disregard C11's 
<stdatomic.h> and define a whole new library with a wildly different interface, 
we should at least have a convincing "story" of why we diverged from an 
already-specified library with a nearly-identical target usage in our "base 
language" and motivate that doing so is necessary for UPC.

Towards this end I've done a careful audit of the C11 <stdatomic.h> and 
compiled a summary of interesting observations relative to the UPC AMO ideas 
floating around:

- C11 AMO's operate only on types specifically designed for atomic access, eg 
atomic_int and atomic_long, which can represent the same values as their base 
type, but are NOT guaranteed to have the same size or representation as their 
base type (eg they may include a lock data structure). They include all the 
signed and unsigned integral types and boolean, no floating point types.

- The atomic operations they support:
     swap and compare-swap
     fetch-and-...  add, sub, bitwise or, bitwise xor, bitwise and
     store (aka write) and fetch (aka read)   
     test-and-set and clear (only for a special atomic_flag boolean type)
     static and dynamic init (non-atomic)
  Notably OMITS min/max.

- ALL the atomic operations are supported on ALL the atomic types (excepting 
non-meaningful combinations), although not necessarily guaranteed to be 
implemented in hardware

- Macros query whether all the AMOs on a particular TYPE are "lock free" (ie 
fast hardware implementation), where the answer can be "always", "never" or 
"location dependent"

- A runtime query function can ask whether all AMO's on a particular location 
are "lock free" (yes or no)

- The actual AMO calls are declared using C11's new type-generic facility 
(_Generic), so the function the user invokes does NOT contain the type name - 
they invoke a type-independent symbol like: 
   result = atomic_fetch_add(&my_amo, 2) 
and the implementation magically does the "right thing" based on the type of 
my_amo. If UPC ever decides to adopt any part C11, this simple feature could 
make both the AMO and collective library interfaces look much cleaner to the 
user.

- The "consistency" of the AMO's (what we would call strict and relaxed) is an 
optional final enum argument to each operation, where the default basically 
behaves like UPC strict (AMO includes acquire and release fences). This defines 
the ordering constraints of AMO's wrt other AMO's and unrelated operations. The 
enum defines six different consistency "modes" for the AMO, of which at most 
four seem to have distinct meanings for any given AMO (relaxed, acquire, 
release, acquire-and-release).

- AMO's atomicity is guaranteed at the actual memory location, and in 
particular still "works" even if you mmap the location into several places in 
virtual memory and invoke AMO's on all of them. I haven't heard anyone mention 
this issue wrt UPC AMOs, but it's one we need to somehow address.

- They include both "strong" and "weak" versions of compare-swap, where the 
latter may fail spuriously, eg on load-linked/store-conditional architectures. 
I *think* the issue here is that in cases with poor cache alignment it's 
possible to construct cases that livelock on these architectures if you blindly 
spin on LL/SC to implement a "strong" compare-swap on an unlucky location. I 
haven't heard anyone mention this issue wrt UPC AMOs, but its one we may need 
to address.

- AMO's and regular C reads and writes do not conflict in memory because this 
is prohibited by the type system. 

- Similarly, atomicity is on a typed basis - one cannot "carve up" an atomic 
long type using a cast or union and try to call AMOs on the constituent bytes. 
(We should probably prohibit that as well).

A few misc comments:
The C11 AMO library provides most/all of the functionality we're trying to 
specify, although they may have chosen slightly different points in the 
requirements/guarantees design space than we'd have tailor-made for UPC. Which 
of these do we feel are important to change for our library?

IMHO the most significant divergence is C11 defines T and atomic_T as separate, 
incompatible types. How much does this matter to UPC programmers?
C11 Implementations are "encouraged" (but not required) to define atomic_T to 
have the same size and representation as T, which would permit casting them to 
regular types for non-atomic access in a separate synchronization phase. If we 
decided this feature was sufficiently important to UPC programmers, we could 
presumably still adopt C11 AMO's in UPC and add this additional implementation 
restriction for all UPC implementations (ie change "encouraged" to "required"). 

The implementation "trick" that allows that representation restriction to work 
for types requiring software emulation is to stash the hidden lock 
datastructure(s) in the runtime system rather than in user memory space. This 
could be as stupid as a "One big lock" protecting all software-emulated AMO's 
(of a given type), or could be tuned for improved concurrency by introducing a 
(probably fixed-size) table of locks indexed by a hash on the AMO location 
address. The latter would need to hash on physical addresses to retain the mmap 
atomicity guarantee.

The other major way in which our current UPC discussion regarding AMOs differs 
is the introduction of more detailed performance queries and hints regarding 
the implementation of given AMO/type combinations. We should consider if it's 
possible to get a similar effect by adding some query/hint functions to what 
C11 specifies.

Incidentally, C11 also standardizes a thread library <threads.h> that is 
basically POSIX threads in a thin disguise. Dynamic fork/join parallelism in a 
fully-shared memory space, mutexes, condition variables, thread-local storage. 
The memory model is roughly UPC relaxed, with fences provided by the AMO's and 
sync ops in <threads.h>. Dynamic threading doesn't play particularly well with 
UPC features, but luckily C11 threading is an optional feature, so if we ever 
adopted C11 we could take the easy out and simply recommend UPC implementations 
define __STDC_NO_THREADS__, to disable it entirely.

Original comment by danbonachea on 7 Sep 2012 at 4:48

GoogleCodeExporter commented 9 years ago
George: I think AMO domains are a lot more like MPI_Win than MPI_Comm.  Your 
first example sounds the same as a memory location being in two different 
windows.

In any case, the conclusions do not change.

Regarding whether or not UPC should be like PAMI, the answer should be no for 
many reasons.  If someone wants all the dials that PAMI has, then they should 
just use PAMI.  You should be able to manually switch UPC implementation 
semantics in software any more than you should be able to switch between the 
x86 and PowerPC memory model in a C program.

Dan: I completely agree with your point that doing anything deviant from C11 
requires a very strong justification.  I certainly hope that UPC grows to be a 
superset of C11 (perhaps without optional features) some day.

Does anyone know what compilers (if any) implement C11 atomics?

Original comment by jeff.science@gmail.com on 7 Sep 2012 at 5:05

GoogleCodeExporter commented 9 years ago
"- AMO's atomicity is guaranteed at the actual memory location, and in 
particular still "works" even if you mmap the location into several places in 
virtual memory and invoke AMO's on all of them. I haven't heard anyone mention 
this issue wrt UPC AMOs, but it's one we need to somehow address."

Is mmap() added to C11?  I thought that came from POSIX.  Does plain C11 
(without any extensions such as POSIX) actually permit access to a single 
object via pointers that don't compare as equal?

"The latter would need to hash on physical addresses to retain the mmap 
atomicity guarantee."

This will be *very* difficult to implement on many common systems, as 
user-level software generally doesn't have knowledge of the physical address 
(for good reasons), and UPC implementations aren't generally done in the kernel.

Original comment by sdvor...@cray.com on 7 Sep 2012 at 5:05

GoogleCodeExporter commented 9 years ago
Correction: "You should <NOT> be able to manually switch UPC implementation 
semantics in software any more than you should be able to switch between the 
x86 and PowerPC memory model in a C program."

Original comment by jeff.science@gmail.com on 7 Sep 2012 at 5:06

GoogleCodeExporter commented 9 years ago
A part of the talk of C11 makes me uneasy. The UPC AMO started out as a 
*library only*.

My $0.02: can we define a near-term goal of just coming up with a usable 
library?  without prejudicing a larger effort that will blend atomic types 
seemlessly into the language? I have of course no objection to looking at C11 
for inspiration. I just don't feel the urgency to resolve the whole problem 
*right now*.

Done ranting. As to the substance of "being like PAMI", the choice is fairly 
obvious:

* no allowance for decision whatsoever. You get a single choice for 
implementation, and you live with it. This is the simplest, but I have the 
feeling that most of us are not in this camp.

* *some* allowance for a decision, but without clear performance guarantees. 
Thus, the choice is by design a bit nebulous "if you pick me you *may* get HW 
to execute your AMOs. Nasty.

* An raw list of capabilities without commentary on what each capability is 
actually good for. This is the "PAMI way", and I can see why Jeff likes it not: 
the capability list is by definition non-portable.

* Expert system: the system picks an implementation based on criteria you state 
when you create the domain. I'd love to have such a thing. We tried to do this 
for PAMI - there are endless published papers about it - just not a system that 
I trust to work. Yet.

* Nirvana: the system guesses/autotunes the implementation you really wanted, 
no questions asked.

Original comment by ga10...@gmail.com on 7 Sep 2012 at 5:50

GoogleCodeExporter commented 9 years ago
> Is mmap() added to C11?  I thought that came from POSIX.  Does plain C11 
(without any extensions such as POSIX) actually permit access to a single 
object via pointers that don't compare as equal?

Here is the exact C11 verbiage I'm referring to:

"NOTE Operations that are lock-free should also be address-free. That is, 
atomic operations on the same memory location via two different addresses will 
communicate atomically. The implementation should not depend on any per-process 
state. This restriction enables communication via memory mapped into a process 
more than once and memory shared between two processes."

AFAIK there is no way within strictly-conforming C11 to actually accomplish 
this mapping, so I think this is just a nod towards POSIX. We could conceivably 
"solve" this in UPC by explicitly prohibiting such user-level mapping in the 
first place.

Original comment by danbonachea on 7 Sep 2012 at 6:01

GoogleCodeExporter commented 9 years ago
"The latter would need to hash on physical addresses to retain the mmap 
atomicity guarantee."

I am not certain that we want to allow the UPC *user* to mmap() the shared heap 
multiple times, making the statement above a "read herring" from the point of 
view of a UPC-only implementation.

Multiple mappings of the UPC shared heap for use of shared memory within a 
multi-core node would be fully controlled by the UPC runtime and thus easier to 
address (no pun intended) than a fully general case.  In particular I doubt 
that arbitrary alignments are used, meaning that hashing on the offset within 
the page would be sufficient to correctly hash any/all pointers to a given 
object to the same slot in the table of locks (or equivalent).

I agree w/ George that we should NOT look to pickup C11 language features 
(_Generic in particular could be problematic), and I don't think that was Dan's 
intent.  I think we should look at C11 for a "design" that our future users may 
already know and which a larger (I hope) community of experts has already 
vetted.  However the "binding" of that design for UPC would need to be a 
LIBRARY - hopefully one that was nearly trivial to implement in terms of C11's 
stdatomic.h when available.

Original comment by phhargr...@lbl.gov on 7 Sep 2012 at 6:03

GoogleCodeExporter commented 9 years ago
I think an atomics library-only implementation is fine as long as this permits 
a typedef atomic_T-like entity rather than requiring C89 build-ins and some 
registration garbage.  It is effectively impossible to do atomics portably 
without a typedef because of alignment issues that we cannot assume from an 
arbitrary C compiler (within the src2src model of BUPC, for example).  
Sometimes one has to ram the stupid int in a struct to make atomics work in a 
reasonable way (see the BGP atomics API, for example).

Original comment by jeff.science@gmail.com on 7 Sep 2012 at 9:02

GoogleCodeExporter commented 9 years ago
"It is effectively impossible to do atomics portably without a typedef because 
of alignment issues that we cannot assume from an arbitrary C compiler "

This is an excellent point, and probably another reason why C11's library 
requires the use of typedefs like atomic_int, rather than operating on objects 
declared with raw basic types as we seem to be leaning towards. If we go with a 
raw types design, this is a problem we will need to address. For example, if 
the UPC user does something like this:

  typedef struct {
     char     c;
     int64_t  i;
  }  S_t;
  shared S_t *S = upc_alloc(sizeof(S_t));
  upc_amo_fetchincrement{&(S->i));

There are ABIs where the long field won't be 8-byte aligned, and where issuing 
a hardware atomic instruction on such a field could generate a bus error. One 
possible solution would be to place the burden upon the user and state that if 
the pointer argument to the AMO is not suitably aligned, behavior is undefined. 

This resolution dodges the issue, but doesn't provide a solution if the user 
really needs to perform AMOs on struct elements. The only option open to such a 
user would be to use non-portable (but widely available) alignment built-ins 
like __align(), which is incidentally standardized in C11-6.7.5 with the 
_Alignas() specifier. If we're concerned that this will be a frequent 
portability problem for our AMO users, we could decide to specify some typedefs 
for suitably-aligned versions of the basic types (with identical size and 
representation) and recommend their (optional) usage when declaring struct 
fields that will be passed to the AMO library. 

Alternatively, we could just abandon the effort to support AMO on raw basic 
types and take the C11 approach of only supporting AMOs on the provided 
typedefs. This also solves the alignment portability problem, but makes it more 
tedious to incrementally add AMO's to existing applications, and arguably 
burdens the common usage to prevent problems in uncommon usage.

Original comment by danbonachea on 8 Sep 2012 at 10:08

GoogleCodeExporter commented 9 years ago
As a follow-up to the discussion relating to C11 support for atomics and how/if 
this might apply to UPC AMO's, I found an interesting discussion here:

"Emulating C11 compiler features with gcc: _Atomic"
http://gustedt.wordpress.com/2012/01/17/emulating-c11-compiler-features-with-gcc
-_atomic/

In C11, _Atomic comes in two flavors, as a type qualifier (like const or 
volatile) or as type specifier (like in struct or array declarations). [...]

The author goes on to describe how he handles atomic types in his P99 macro 
pre-processor package:
http://p99.gforge.inria.fr/
"P99 - Preprocessor macros and functions for C99"

P99 is a suite of macro and function definitions that ease the programming in 
modern C, aka C99. By using new tools from C99 we implement default arguments 
for functions, scope bound resource management, transparent allocation and 
initialization, ...

By using special features of some compilers and operating systems, we also are 
able to provide an emulation of a large part of the new C standard, C11.

P99 heavily depends on a decent support for C99 of compilers. We have set up a 
test program that may be used as a first indication if a compiler is compatible 
with that. Please see the directory c99-conformance for some results of such 
tests."

Although it is unlikely that we will adopt the P99 approach re: AMO's or C11 
atomics, I am passing along the reference for consideration.  There are a 
number of other potentially useful aspects of the P99 project, unrelated to 
AMO's.
1) C99 conformance test results, for example:
http://p99.gforge.inria.fr/c99-conformance/c99-conformance-gcc-4.5.html
2) The P99 macros use C99 features in some novel ways.  Given that UPC is 
derived from C99, it is both possible and likely that current UPC programmers 
do not make use of some C99 features that might improvement the maintainability 
and reliability of their UPC programs, and that might perhaps improve the 
productivity of programming in UPC.

Original comment by gary.funck on 8 Sep 2012 at 6:38

GoogleCodeExporter commented 9 years ago
"One possible solution would be to place the burden upon the user and state 
that if the pointer argument to the AMO is not suitably aligned, behavior is 
undefined."

How does one write portable UPC that meets the alignment requirements of all 
possible ABIs?  Is this not significantly more onerous than using a new type?

What if UPC is going to run in a heterogenous environment such as may exist in 
Cray Cascade?  Do we think that alignment alone is going to solve the problems 
of atomics from distributed UPC threads that live on Intel Xeon, Intel MIC 
and/or NVIDIA processors?  I don't recall what CUDA does for atomics but I can 
imagine in some future system that the atomics live in special hardware (e.g. 
Cray Gemini atomics living in the NIC) that requires a typedef.

I really don't see the use of atomic_int, for example, as "tedious to 
incrementally add AMO's to existing applications".  Is the tedium coming from 
codes that want to replace lock/update/unlock with atomic operations and it is 
not obvious to the user which declarations correspond to variables that will 
require this treatment?

Perhaps it could be permitted - only when possible - to cast T to atomic_T.  
This allows the common case where typedef is required to support the evolution 
of existing codes, although it would not be portable.

Original comment by jeff.science@gmail.com on 9 Sep 2012 at 5:49

GoogleCodeExporter commented 9 years ago
"How does one write portable UPC that meets the alignment requirements of all 
possible ABIs?  Is this not significantly more onerous than using a new type?"

Not really. I believe the alignment issue only arises when the target of the 
AMO is located inside an aggregate type (a struct field), which may be an 
uncommon or less important case for typical HPC apps. For the simple case of 
AMO's on elements of an array of longs (the common case?), alignment should not 
be a problem on any reasonable architecture. If we believe that is the common 
case and the common case does not suffer from alignment problems, then it would 
be unfortunate to burden the common case with additional syntax that only 
affects the portability of the uncommon case.

"What if UPC is going to run in a heterogenous environment ... I can imagine in 
some future system that the atomics live in special hardware (e.g. Cray Gemini 
atomics living in the NIC) that requires a typedef"

This is a red herring. I believe hardware that *requires* AMO targets to live 
in special memory locations (where "special" means something more than UPC 
shared heap) are firmly beyond the scope of what we're trying to handle with 
the UPC AMO library. The C11 AMO library also does not handle implementations 
that require memory to be allocated "specially", although it does permit AMO's 
to be "faster" ("lock-free") on certain locations. In any case, a typedef alone 
is not enough to implement "special" memory, because it does not handle dynamic 
allocation.

"I really don't see the use of atomic_int, for example, as "tedious to 
incrementally add AMO's to existing applications".  Is the tedium coming from 
codes that want to replace lock/update/unlock with atomic operations and it is 
not obvious to the user which declarations correspond to variables that will 
require this treatment?"

It's tedious because large portions of the code may already written that 
allocate and pass around long-lived application data structures as base types 
(eg arrays of longs). Adding an AMO operation on an element in a 
performance-critical loop can be a localized and incremental change if the AMO 
supports basic types, otherwise adding the AMO may require adjusting many type 
expressions globally throughout the application to accommodate the AMO call 
being added to one corner of the program.

"Perhaps it could be permitted - only when possible - to cast T to atomic_T.  
This allows the common case where typedef is required to support the evolution 
of existing codes, although it would not be portable."

My point is that the typedef is NOT required in the common case. We've already 
decided we don't want to allow AMO library metadata inside user data space, so 
we're only talking about alignment requirements here. Many architectures never 
have alignment problems for AMOs, and the rest only have a problem for struct 
fields.

I was suggesting that the AMO's statically accept memory arguments of either 
atomic_T or T, and in the latter case be permitted to abort at runtime in the 
rare situation that alignment requirements are violated. This supports easy 
(cast-free) incremental insertion of AMO operations, but also provides the 
mechanism to write fully-portable programs with AMOs that are guaranteed to 
satisfy alignment constraints. 

If we feel supporting both atomic_T and T makes the API too "wide", we could 
just support the former and recommend the use of casts as you suggest, which is 
basically how C11 AMO's work. If we go this route then no special verbiage is 
requires for those casts, because that's already covered by C99-6.3.2.3: "A 
pointer to an object or incomplete type may be converted to a pointer to a 
different
object or incomplete type. If the resulting pointer is not correctly aligned 
for the pointed-to type, the behavior is undefined."

Original comment by danbonachea on 10 Sep 2012 at 12:05

GoogleCodeExporter commented 9 years ago
Dan wrote:
> Not really. I believe the alignment issue only arises when the target of the 
AMO is located inside an aggregate
> type (a struct field), which may be an uncommon or less important case for 
typical HPC apps. For the simple
> case of AMO's on elements of an array of longs (the common case?), alignment 
should not be a problem on
> any reasonable architecture.

AIX on PPC64 is not reasonable by Dan's definition, because the ABI ensures 
8-byte alignment for "double", but *not* for (void *) or long.   The rules for 
8-byte types inside structs on AIX is bizarre and not worth describing here.

Original comment by phhargr...@lbl.gov on 10 Sep 2012 at 12:53

GoogleCodeExporter commented 9 years ago
"AIX on PPC64 is not reasonable by Dan's definition, because the ABI ensures 
8-byte alignment for "double", but *not* for (void *) or long.   The rules for 
8-byte types inside structs on AIX is bizarre and not worth describing here."

I don't think we're proposing to allow UPC AMO's on (void *), because 
pointer-to-local living in shared space is already a "dicey" programming 
construct. I haven't heard anyone mention AMO's on pointer-to-shared, but the 
alignment of that type is already under the exclusive control of the UPC 
compiler. 

The lack of alignment guarantees for arrays of longs in AIX-PPC64 (and possibly 
for other arrays of basic types elsewhere) is not likely to be an issue for 
UPC, because the AMO library already requires the target to live in shared 
space. In the case of static allocation of a shared array of longs, the UPC 
implementation can (easily?) provide 8-byte alignment of that shared object to 
make this problem go away (but this would not be required by the AMO library 
spec). In the case of dynamic allocation, upc_alloc and friends are already 
guaranteed to provide the necessary alignment for arrays of longs due to ISO C 
7.20.3: "The pointer returned if the allocation succeeds is suitably aligned so 
that it may be assigned to a pointer to any type of object and then used to 
access such an object or an array of such objects in the space allocated", 
because "any type of object" includes atomic_long.

Now that we've broached the subject, I think we should consider eventually 
supporting compare-and-swap on a pointer-to-shared, since that would enable UPC 
programmers to portably implement lock-free data structures in shared memory. 
This usage case wasn't mentioned by our UPC users on the call, but I believe 
this is an important class of AMO clients, judging by the enormous literature 
on lock-free programming for non-UPC languages. This is something that could 
possibly be added as an enhancement in 1.4.

I'm aware that many UPC platforms use a 128-bit PTS representation and 
therefore probably cannot provide atomics on PTS exclusively in 
instruction-level hardware. However, there are also platforms using a 64-bit 
PTS representation, and also many cases where an implementation could do 
significantly better than upc_lock/pointer_swap/upc_unlock, even for 128-bit 
PTS. It's also worth noting that an implementation of the UPC AMO library 
should be permitted to use software-based atomics for one type (eg PTS) and 
fully-hardware atomics for a different, incompatible type (eg long) in the same 
"domain" without creating a problem. Hopefully the AMO spec being drafted will 
already prohibit touching the same memory location using two differently-typed 
AMO operations within a synchronization phase.

Original comment by danbonachea on 10 Sep 2012 at 2:14

GoogleCodeExporter commented 9 years ago
"I don't think we're proposing to allow UPC AMO's on (void *)..."

So no one ever does pointer arithmetic with atomics?  I don't know that "shared 
pointer to local" is the only use case here either.  I am a UPC noob but isn't 
"shared pointer to shared" legal (and potentially useful)?

Original comment by jeff.science@gmail.com on 10 Sep 2012 at 2:17

GoogleCodeExporter commented 9 years ago
"I am a UPC noob but isn't "shared pointer to shared" legal (and potentially 
useful)?"

Yes it is, which is the topic of my last two paragraphs that recommend 
including AMO library support for compare-and-swap on a pointer-to-shared 
(PTS). I wouldn't push for full-blown PTS arithmetic using AMOs, not because it 
would be impossible but because this would complicate the interface with the 
blocksize information required to perform PTS arithmetic, and lock-free 
algorithms generally only require pointer CAS, not pointer arithmetic (and the 
latter can be implemented using the former with an additional read). Adding PTS 
CAS support would subsume the need for AMOs on (void *) in shared space, which 
as I mentioned are only meaningful in very restricted cases (and I don't think 
anyone is pushing for them).

Original comment by danbonachea on 10 Sep 2012 at 4:15

GoogleCodeExporter commented 9 years ago
As promised, and hopefully without too much delay, here's is a straw proposal 
for an
"atomicity domain"-based AMO library.

-- upc_domain_t --
An object of type upc_domain_t represents an atomicity domain, which specifies 
a set of
operations and datatypes over which access to a memory location in a given 
synchronization
phase is guaranteed to be atomic if and only if no other mechanisms or 
atomicity domains
are used to access the same memory location in the same synchronization phase.

-- upc_amo_mode_t --
The following constants of type upc_amo_mode_t shall be defined to allow 
user-provided
indication of a preferred usage mode to the implementation.
  * UPC_AMO_DEFAULT, An implementation-defined default mode
  * UPC_AMO_HARDWARE, Use only hardware-supported atomic memory operations

-- Supported Operations & Types --
The atomicity mode UPC_AMO_DEFAULT will support the following combination of 
operations
and data types:

          AND   OR   XOR   NOT   SET   CSWAP   ADD   MIN   MAX
I, I32     X    X     X     X     X      X      X     X     X
L, I64     X    X     X     X     X      X      X     X     X
UI, U32    X    X     X     X     X      X      X     X     X
UL, U64    X    X     X     X     X      X      X     X     X
F32                                             X     X     X
F64                                             X     X     X

The atomicity mode UPC_AMO_HARDWARE will provide support for an 
implementation-defined
subset of the data types and operations supported by UPC_AMO_DEFAULT.

-- Domain Initialization --
    upc_domain_t UPC_DOMAIN_INIT(upc_op_t ops, upc_type_t types, upc_amo_mode_t mode);

A static initializer that creates a upc_domain_t object that supports the 
operations
and types specified by ops and types, respectively, in the AMO implementation 
mode, mode.

    int upc_all_domain_init(upc_domain_t *domain, upc_op_t ops, upc_type_t types,
                            upc_amo_mode_t mode);

A dynamic initializer for a upc_domain_t object that initializes the specified 
object
to support the operations and types specified by ops and types, respectively, 
in the AMO
implementation mode, mode.

-- Support Query --
    int upc_amo_query(upc_amo_mode_t mode, upc_op_t ops, upc_type_t types);

Returns 1 if the AMO implementation mode, mode, supports all pair-wise 
combinations of
types and operations specified by types and ops, respectively; returns 0 
otherwise.

-- Atomic Operations --
    void upc_amo_strict(upc_domain_t *domain, void *fetch_ptr,
                        upc_op_t op, upc_type_t type,
                        shared void *target, void *operand1, void *operand2);

    void upc_amo_relaxed(upc_domain_t *domain, void *fetch_ptr,
                         upc_op_t op, upc_type_t type,
                         shared void *target, void *operand1, void *operand2);

where:
  - *target = *target OP *operand1, for OP in {AND, OR, XOR, ADD, MIN, MAX}
  - *target = (*target == *operand1) ? *operand2 : *target, for CSWAP
  - *target = *operand1, for SET
  - *target = ~(*target), for NOT
and the original value of target is stored in fetch_ptr if and only if 
fetch_ptr != NULL.

Original comment by nspark.w...@gmail.com on 10 Sep 2012 at 4:35

GoogleCodeExporter commented 9 years ago
"and also many cases where an implementation could do significantly better than 
upc_lock/pointer_swap/upc_unlock, even for 128-bit PTS."

Could you be more specific?  I'm having trouble coming up with anything better, 
given that a programmer could use the lock to protect more than just the 
pointer swap.  Also, note that implementations that have 128-bit PTS 
representation probably also have the restriction that 128-bit reads and writes 
are not atomic at the hardware level, so you'll need 128-bit load and store 
atomics as well, and these wouldn't play nice with strict reads/writes...

Original comment by sdvor...@cray.com on 10 Sep 2012 at 4:56

GoogleCodeExporter commented 9 years ago
Why no atomic SET for FP types?  Don't we need to support atomic SET/GET for 
all types?  Load/store atomicity is not necessarily at word granularity.

Original comment by jeff.science@gmail.com on 10 Sep 2012 at 5:33

GoogleCodeExporter commented 9 years ago
"Why no atomic SET for FP types?  Don't we need to support atomic SET/GET for 
all types?  Load/store atomicity is not necessarily at word granularity."

Note that the memory model implies sequential consistency for strict 
references.  I'm pretty sure that requires that strict memory operations are 
atomic (or at least appear to be).  Admittedly, the spec is extremely vague on 
the mapping from source expression to memory model operations, so this may not 
be the full story.  I'll defer to others more knowledgeable on the intricacies 
of the memory model.

Original comment by sdvor...@cray.com on 10 Sep 2012 at 5:49

GoogleCodeExporter commented 9 years ago
"Why no atomic SET for FP types?  Don't we need to support atomic SET/GET for 
all types?  Load/store atomicity is not necessarily at word granularity."

It certainly wouldn't bother me to have atomic SET for floating-point types.  
In general, I think a lot less about floating-point types and what one would do 
with them.  If we also need an atomic GET, we could probably just add a NO-OP 
operation, so you'd do a fetching NO-OP and get the value without any 
modification.

Original comment by nspark.w...@gmail.com on 10 Sep 2012 at 5:53

GoogleCodeExporter commented 9 years ago
I assume that AMOs are going to work in relaxed mode, too, which is why atomic 
GET/SET are required, or should a user instead declare these references as 
strict?  It might at least be convenient to define atomic GET/SET, if only as 
another way to declare a reference to be strict.

Original comment by jeff.science@gmail.com on 10 Sep 2012 at 5:54

GoogleCodeExporter commented 9 years ago
The more I think about the straw proposal in Comment 68, the more I'd like to 
coalesce the domain and "mode" types and have an implementation provide 
"everything" (UPC_AMO_DEFAULT, likely in software) or just what the hardware 
can support (UPC_AMO_HARDWARE).  Regarding all the different ways one could 
implement atomics "in hardware", I think that would fall in as one of the 
"implementation defined" aspects of the hardware-only domain.  For Cray and 
IBM, I would /expect/ that they would let their NICs do the work, but any 
implementation could do whatever the implementation chose to do.

I also think it would be worth re-working the query to have separate types/ops 
parameters for integer and floating-point types.  The support matrix I 
suggested doesn't work in the pair-wise matching of types and ops, if integer 
and FP types are considered together.

Original comment by nspark.w...@gmail.com on 10 Sep 2012 at 5:59

GoogleCodeExporter commented 9 years ago
Comment 73 was about comment 71, btw.

Regarding comment 72, FETCH-AND-OP with NO-OP is exactly how MPI-3 does atomic 
GET, btw. 

I wonder if there is an advantage to having a separation.  Unless the compiler 
can remove both the operation and the input operand, there is going to be 
packet overhead in doing GET via FETCH-AND-OP.  Some systems have special 
packets for 8-byte communication, which could work for atomic GET (assuming 
perhaps that all GETs were atomic and thus nothing special was required to 
achieve this).  In this case we want to force to user to declare their intent 
to do GET specifically.

The other issue is that the compiler can probably do more to optimize if it 
doesn't have to inspect the input operand of FETCH-AND-OP to distinguish a 
"GET" from a GET.

Original comment by jeff.science@gmail.com on 10 Sep 2012 at 6:00

GoogleCodeExporter commented 9 years ago
"access to a memory location in a given synchronization phase is guaranteed to 
be atomic if and only if NO OTHER MECHANISMS or atomicity domains are used to 
access the same memory location in the same synchronization phase."

This part of the straw spec answers all the questions about concurrent strict 
accesses - basically, touching the same memory location with an AMO and a 
strict operation in the same synchronization phase means atomicity is no longer 
guaranteed. I would go even stronger and state that behavior is completely 
undefined if you do this. Every atomic type needs to have GET and SET via the 
AMO library, which is the "right" way to perform concurrent read/writes during 
an AMO synchronization phase.

"Note that the memory model implies sequential consistency for strict 
references.  I'm pretty sure that requires that strict memory operations are 
atomic (or at least appear to be)."

As discussed elsewhere in other threads, the memory model does NOT guarantee 
freedom from word-tearing, which is a related but orthogonal issue. It would be 
impossible to implement strict if we required strict operations to be 
tearing-free, because you can perform a "single" strict write of arbitrarily 
large size using a struct type. Programs with race conditions need to be aware 
of the word-tearing issue, even if the races use only strict operations. Most 
architectures won't word tear for 1,2,4,8 byte accesses, but this is 
platform-specific.

In any case, we are specifically prohibiting concurrent access with non-AMO 
read/writes (strict or otherwise), both to ensure atomicity can be implemented 
when hardware support is absent, and to prevent word-tearing from being an 
issue.

In resp to comment 69:
""and also many cases where an implementation could do significantly better 
than upc_lock/pointer_swap/upc_unlock, even for 128-bit PTS."

Could you be more specific?  I'm having trouble coming up with anything better, 
given that a programmer could use the lock to protect more than just the 
pointer swap."

An active message is sent, upon arrival software grabs a lock, does the update, 
then releases the lock and sends a reply. For a OP-and-fetch that's one network 
round-trip of latency. The naive UPC-level implementation of 
upc_lock/pointer-read/pointer-write/upc_unlock in general will induce four 
network round trips of latency, for approximately a 4x slowdown on a 
latency-dominated network.

Original comment by danbonachea on 10 Sep 2012 at 8:17

GoogleCodeExporter commented 9 years ago
A few comments on the actual proposal:

    int upc_all_domain_init(upc_domain_t *domain, upc_op_t ops, upc_type_t types,
                            upc_amo_mode_t mode);

This function should not return a value, because it should never fail (unless 
the arguments are totally bogus values, which should be a fatal error). The 
user is stating he wants this set of ops and types, give me a domain with the 
"fastest" implementation that supports all the pairwise combos.

The mode argument makes no sense here - the user always wants the "fastest" 
implementation you can provide. I would change that argument to be a 
upc_amohint_t (or something similar) where the values are something like 
UPC_AMO_LATENCY, UPC_AMO_THROUGHPUT and allow for additional 
implementation-defined hints.

Finally, if we're going to support statically-initialized domains, I don't see 
a reason to also provide a dynamic one. However, we DO need to think about how 
these domain_t types are handled by the user, specifically whether they can be 
passed around by value (like a upc_handle_t) or whether they must be handed by 
reference (like a upc_lock_t *). The latter allows the implementation to store 
state directly in the upc_domain_t, the former requires a level of indirection 
in the implementation and opens the door to memory leaks inside the 
implementation lacking a destructor, and potentially makes it more difficult to 
implement the static constructor. Whichever we settle on should be used 
consistently throughout - if constructors return a domain_t by value, then the 
AMO's should also take the domain by value. Otherwise drop the static 
initializer and work everything by reference. Finally, we should state that 
domain construction is collective and domain handles are thread-specific, so we 
don't have domains handles being passed around in shared memory.

"I also think it would be worth re-working the query to have separate types/ops 
parameters for integer and floating-point types.  The support matrix I 
suggested doesn't work in the pair-wise matching of types and ops, if integer 
and FP types are considered together."

An easy way to fix this is change it so each atomicity domain is only over a 
SINGLE type. We already have to prohibit concurrent AMO access to a given 
memory location using two different C-types within a synchronization phase, so 
might as well fold this requirement directly into the domain abstraction. This 
means every argument of (upc_type_t types) should be (upc_type_t type), and the 
type argument can be dropped from the actual AMO call because it is redundant 
and stored in the domain.

We haven't discussed the spelling of upc_type_t values and the exact C types 
they correspond to, but we need to specify those.

Combining all of the above, we end up with a spec like:

upc_domain_t UPC_DOMAIN_INIT(upc_op_t ops, upc_type_t type, upc_amohint_t 
hints);

void upc_amo_strict(upc_domain_t *domain, void *fetch_ptr, upc_op_t op, 
                        shared void *target, void *operand1, void *operand2);

Example usage for an atomic add:

upc_domain_t mydom = UPC_DOMAIN_INIT((UPC_ADD|UPC_CSWAP), UPC_TYPE_INT32, 
UPC_AMO_LATENCY);
shared int32_t val;
upc_amo_strict(&mydom, 0, UPC_ADD, &val, 42, 0);

Original comment by danbonachea on 10 Sep 2012 at 8:48

GoogleCodeExporter commented 9 years ago
"An active message is sent, upon arrival software grabs a lock, does the 
update, then releases the lock and sends a reply. For a OP-and-fetch that's one 
network round-trip of latency. The naive UPC-level implementation of 
upc_lock/pointer-read/pointer-write/upc_unlock in general will induce four 
network round trips of latency, for approximately a 4x slowdown on a 
latency-dominated network."

This is true for most atomics.  However, compare-and-swap is usually done in a 
spin-loop until it succeeds.  Most uses of compare-and-swap that I've seen look 
something like this:

do {
 tmp = val
 new_val = f(tmp)
} while( cas( &val, tmp, new_val ) != tmp )

Instead of compare-and-swap, users can just use locks to prevent anyone from 
touching the value until you've updated it.

lock( val_lock )
val = f( val )
unlock( val_lock )

This avoids the retry loop entirely by changing the algorithm.  This is 
something only the user can do.  That said, some users just want to run their 
lock-free code everywhere, regardless of whether or not it is a good idea on a 
given platform, so let's consider the two software approaches mentioned.

With a software atomic implementation naively based on locks, the minimum 
latency is dramatically increased due to the extra network round-trips.  The 
maximum latency is hopefully capped by a fair locking algorithm though.

With a software atomic implementation based on active messages, you avoid 3 
extra network round-trips, but are dependent on someone actively processing 
messages on the remote end, meaning the maximum latency is potentially 
unbounded.  Additionally, since we're talking about systems where the shared 
pointer doesn't fit into 64 bits, there are scalability issues to consider.  In 
particular, full message queues become much more common.  Thus, I question if 
this is always better than naively using locks.

In general, the problem with software atomics is that there are many ways of 
implementing them, but which implementation to choose can be extremely 
problem-dependent.  Users know more about the problem they are trying to solve, 
and are thus better equipped to make that decision.  Implementers are unlikely 
to provide multiple implementations for the user to choose from (too 
expensive), so the user is tough out of luck if the implementation's choice 
doesn't match well with their problem.  I suppose someone could write a bunch 
of reference implementations using different algorithms in UPC to get around 
this though...

Original comment by sdvor...@cray.com on 10 Sep 2012 at 11:17

GoogleCodeExporter commented 9 years ago
"This avoids the retry loop entirely by changing the algorithm.  This is 
something only the user can do.  That said, some users just want to run their 
lock-free code everywhere...  Users know more about the problem they are trying 
to solve, and are thus better equipped to make that decision."

My argument for inclusion of PTS CAS was to support lock-free algorithms. 
Lock-free algorithms require a pointer CAS AMO. The question of whether or not 
the user SHOULD be using a lock-free algorithm on a given platform for a given 
situation is orthogonal - there are cases where lock-free data structures are 
probably faster (possibly much faster), and also cases where upc_lock-ing with 
a traditional data structure may be faster. This is part of application tuning, 
and it is not our place to dictate that decision. If we fail to provide PTS CAS 
then we are taking that decision away from the user and effectively prohibiting 
the use of lock-free data structures.

Original comment by danbonachea on 11 Sep 2012 at 12:21

GoogleCodeExporter commented 9 years ago
"My argument for inclusion of PTS CAS was to support lock-free algorithms. 
Lock-free algorithms require a pointer CAS AMO. The question of whether or not 
the user SHOULD be using a lock-free algorithm on a given platform for a given 
situation is orthogonal - there are cases where lock-free data structures are 
probably faster (possibly much faster), and also cases where upc_lock-ing with 
a traditional data structure may be faster. This is part of application tuning, 
and it is not our place to dictate that decision. If we fail to provide PTS CAS 
then we are taking that decision away from the user and effectively prohibiting 
the use of lock-free data structures."

Yes, I agree that it should be included for completeness.  My intent was merely 
to point out that the choice of which software implementation is best is 
usually more dependent on the problem at hand than on the platform details, and 
implementations are not likely to provide many choices given the development 
cost.  Therefore, a generic software atomics library, with multiple algorithms, 
that worked on a wide variety of conforming UPC implementations would be 
useful--more so if such a library could continue to make use of hardware 
atomics provided by an implementation.

Original comment by sdvor...@cray.com on 11 Sep 2012 at 12:56

GoogleCodeExporter commented 9 years ago
Some last-minute follow up...

--- Initializers and Modes/Hints ---
Dan said:
> This function should not return a value, because it should never fail (unless 
the arguments
> are totally bogus values, which should be a fatal error).  The user is 
stating he wants this
> set of ops and types, give me a domain with the "fastest" implementation that 
supports all
> the pairwise combos.
>
> The mode argument makes no sense here - the user always wants the "fastest" 
implementation
> you can provide. I would change that argument to be a upc_amohint_t (or 
something similar)
> where the values are something like UPC_AMO_LATENCY, UPC_AMO_THROUGHPUT and 
allow for
> additional implementation-defined hints.
>
> [...]
>
> An easy way to fix this is change it so each atomicity domain is only over a 
SINGLE type.

This approach makes sense to me.  Turning the mode into a hint, which was the 
intent all along
(I think), makes more sense to me with the per-type domain (from Dan, below).  
I think it eases
the ability to "mix-and-match" types and ops and hints, knowing that you'll get 
the fastest
version available as long as you don't over-specify the ops.

I think it simplifies AMO domain creation, unless you wanted AMOs on *lots* of 
types.  What it
complicates is whether we want to provide a default domain that supports 
"everything".  We'd
now have to provide "everything" domains on a per-type basis.  That said, 
people looking into
AMOs would likely be fine creating what they need and wouldn't necessarily need 
to have an
"everything" domain.

Dan said:
> If we're going to support statically-initialized domains, I don't see a 
reason to also
> provide a dynamic one.

When I've talked to my users about the domain idea, they've always said they'd 
use statically-
initialized domains.  I think I put the dynamic initializer in there for 
completeness.  Jeff
did previously say that "Requiring static allocation is really unpleasant from 
a usability
perspective."

I think Dan's example at the end of Comment 77 looks pretty reasonable.

--- AMO Types ---
Dan said:
> We haven't discussed the spelling of upc_type_t values and the exact C types 
they correspond to,
> but we need to specify those.

For completenss with the types supported by the UPC Collectives, I would have 
upc_type_t support:
> UPC_CHAR,  UPC_SHORT,  UPC_INT,    UPC_LONG,
> UPC_UCHAR, UPC_USHORT, UPC_UINT,   UPC_ULONG,
> UPC_INT8,  UPC_INT16,  UPC_INT32,  UPC_INT64,
> UPC_UINT8, UPC_UINT16, UPC_UINT32, UPC_UINT64,
> UPC_FLOAT, UPC_DOUBLE, UPC_LDOUBLE,
> UPC_PTS
although not all of these would be supported by AMOs or UPC Collectives.  I 
think it would be a
complete-enough set for use in other (future) libraries, too.

--- AMO OPs ---
Continuing along the lines of the AMO types, I would like to see upc_op_t 
(7.3.2.1) pulled out of
the UPC Collectives library and made a common type for AMOs, Collectives, and 
whatever else may
also want it.  In addition to what is in 7.3.2.1, we would add UPC_GET, 
UPC_SET, UPC_NOT, and
UPC_CSWAP.  Both UPC Collectives and UPC Atomic would each have to specify 
which operations are
supported.

--- Supported Types & Operations ---
Revising the table from before...

          CSWAP   AND   OR   XOR   NOT   GET   SET   ADD   MIN   MAX
I, I32      X      X    X     X     X     X     X     X     X     X
L, I64      X      X    X     X     X     X     X     X     X     X
UI, U32     X      X    X     X     X     X     X     X     X     X
UL, U64     X      X    X     X     X     X     X     X     X     X
F32                                       X     X     X     X     X
F64                                       X     X     X     X     X
PTS         X

Original comment by nspark.w...@gmail.com on 26 Sep 2012 at 7:51

GoogleCodeExporter commented 9 years ago
SHORT VERSION: Static initializers cannot be supported by software-emulated AMO 
domains, so this feature should be dropped.

AMO's by definition imply coordination between multiple threads. We've 
previously discussed the need for domain creation to be a "collective" thing, 
with participation from all the threads which intend to use the domain to 
perform "conflicting" AMOs. Eventually that may be collective over a team, for 
now it means all the threads. The reason for this collective requirement is to 
allow the implementation of the library to setup whatever metadata it needs to 
ensure atomicity. 

In the case of a domain where all the requested OPs can be supported via 
hardware instructions, probably nothing special needs to be done during domain 
creation - except probably record the type size provided by the user, and 
possibly stash away the other arguments for debug checking (all of which can be 
done locally with no coordination or computation). The actual AMO ops in this 
case boil down to a thin wrapper around a hardware instruction, prefaced by a 
check that this domain is hardware-supported.

However in the case of a domain that needs to perform locking in software, more 
probably needs to happen. Assume we are interested in a "good" implementation 
where each software-emulated domain is implemented to operate independently of 
other software domains (rather than a "dumb" approach of just using a single 
big lock for all software-emulated domains, which would lead to artificial 
serialization contention). In this implementation each domain needs its own 
dedicated "lock" resource, which might be a upc_lock_t, an OS mutex, or 
something similar. The important point is that domain creation involves 
allocating this lock resource, and very importantly also some form of 
COLLECTIVE coordination (probably a broadcast) to ensure all threads 
participating in the domain creation agree upon the identity/location of the 
lock resource protecting this domain.

For dynamic domain creation, this is not a problem - we can require the domain 
allocation function to be called collectively in the user's program, and the 
library can perform whatever allocation and coordination it requires inside 
that call. However, I'm now realizing that permitting static allocation of AMO 
domains would create serious problems for a software implementation. Consider 
the "simplest" case of a file-scope statically-initialized domain. Our base C99 
language does not support running arbitrary code to initialize static objects 
(as for example in C++ or Java), the only support which is guaranteed to exist 
in the compiler is static initialization to a constant value. This means the 
static initializer implementation basically must be a macro, and at best it can 
simply stuff the arguments into some kind of struct. This does not allow the 
software implementation to dynamically allocate any lock resources it may need 
(especially if those locks need to live in the shared heap), nor does it permit 
collective coordination to uniquely associate the corresponding per-thread 
domain objects with each other and/or that dedicated lock resource. It might be 
possible to play some games with delayed initialization of lock resources to 
skirt the first issue, but I believe the second is fundamental - if the 
application statically creates multiple domains with similar arguments that 
must be emulated in software, there is no way for the library to reliably 
"match up" the domain resources across threads. 

I believe this argues strongly for dropping the idea of static initialization 
of AMO domains entirely. AMO domain creation should be accomplished by a call 
to a collective initializer library function, which creates the domain and 
passes back a pointer or handle to it. It means a line of code per domain in 
the user program, but the alternative design severely curtails software 
implementation, and by extension the entire AMO library (since software 
implementation is our "fall back" position for anything not natively supported 
in hardware).

Original comment by danbonachea on 27 Sep 2012 at 8:24

GoogleCodeExporter commented 9 years ago
Updated interface proposal:
--------------------------

TYPE 
  upc_amodomain_t

  The type upc_amodomain_t is an opaque UPC type. upc_amodomain_t is a shared
  datatype with incomplete type (as defined in [ISO/IEC00 Sec 6.2.5]). Objects
  of type upc_amodomain_t may therefore only be manipulated through pointers.

  Two pointers to upc_amodomain_t that reference the same AMO domain object will compare
  as equal. The results of applying upc_phaseof(), upc_threadof(), and
  upc_addrfield() to such pointers are undefined.

INITIALIZATION
  upc_amodomain_t *upc_all_amodomain_alloc(upc_op_t ops, upc_type_t type, upc_amohint_t hints);

  The upc_all_amodomain_alloc function dynamically allocates an AMO domain and returns
  a pointer to it. 

  upc_all_amodomain_alloc is a collective function. The return value on every
  thread points to the same AMO domain object.

  The AMO domain created supports AMO calls to operate on objects of a unique type, 
  specified by the "type" parameter. The upc_type_t values and the corresponding type 
  they specify are listed in table [ref].

  The "ops" parameter specifies the atomic operations to be supported by the domain. 
  The valid upc_op_t values and their meanings are listed in table [ref]. 
  Multiple upc_op_t values can be combined by using the bitwise OR operator (|), 
  and each value has a unique bitwise representation
  that can be unambiguously tested using the bitwise AND operator(&).

  The "ops" parameter shall only specify operations within the set permitted for "type" 
  (as defined in table [ref]), otherwise behavior is undefined.

  The "hints" parameter provides a performance hint to the implementation. It shall be 
  equal to one of the following values:
    0                       : default behavior
    UPC_AMO_HINT_LATENCY    : requests the implementation to minimize latency of AMO operations
    UPC_AMO_HINT_THROUGHPUT : requests the implementation to maximize throughput of AMO operations
    UPC_AMO_HINT_*          : Implementation-defined additional hint values
  The implementation is free to ignore the "hints" parameter.

USAGE
  void upc_amo_strict(upc_amodomain_t *domain, void *fetch_ptr, upc_op_t op, 
                        shared void *target, void *operand1, void *operand2);

   Description to be added...

Example usage for an atomic add:
-------------------------------
shared int32_t val;
int main() {
  upc_amodomain_t *mydom = upc_all_amodomain_alloc((UPC_ADD|UPC_GET|UPC_SET), UPC_TYPE_INT32, UPC_AMO_HINT_LATENCY);
  upc_amo_strict(mydom, 0, UPC_ADD, &val, 42, 0);
}

Notes:
------
This is not the only possible interface, but I'm intentionally mimicing the 
upc_lock_t interface, because it is a well-proven interface which is 
established and familiar to UPC users. 
The UPC I/O library has an elaborate system for providing implementation hints 
(and including implementation-defined hints), but for now I'm assuming we don't 
need that much complexity and only expect a few possible hints.
We should consider whether we need a domain destructor, ie: void 
upc_all_amodomain_free(upc_amodomain_t *d). If we're convinced that all 
possible clients will only have a small constant number of domains, then 
perhaps that's unnecessary.

Original comment by danbonachea on 27 Sep 2012 at 8:25

GoogleCodeExporter commented 9 years ago
Nick said:
"We'd now have to provide "everything" domains on a per-type basis.  That said, 
people looking into AMOs would likely be fine creating what they need and 
wouldn't necessarily need to have an "everything" domain."

I don't think we should provide any pre-defined "everything" domains. Such a 
domain would by definition include all the ops (for a given type) and therefore 
would activate the most general (read "slowest") implementation for that type. 
The whole point of domains is for the user to explicitly tell the library which 
ops his module needs, so that the fastest possible implementation meeting those 
restricted needs can be provided. Pre-defining fully general domains would 
undermine that design goal.

Nick said:
"Continuing along the lines of the AMO types, I would like to see upc_op_t 
(7.3.2.1) pulled out of the UPC Collectives library and made a common type for 
AMOs, Collectives, and whatever else may also want it.  In addition to what is 
in 7.3.2.1, we would add UPC_GET, UPC_SET, UPC_NOT, and UPC_CSWAP.  Both UPC 
Collectives and UPC Atomic would each have to specify which operations are 
supported."

Issue 10 discusses this factorization, and I agree this should be a high 
priority to complement the inclusion of AMO's in 1.3. UPC_GET and UPC_SET are 
definitely required for every type, since we prohibit direct access during AMO 
phases, and these are our guaranteed "tear-free" reads and writes. I assume 
your "UPC_NOT" is a bitwise complement, otherwise it should be named 
"UPC_LOGNOT" to match the established convention in the collectives. I haven't 
heard anyone argue for atomic logic operations (UPC_LOGAND, UPC_LOGOR), so 
let's leave those out of the supported AMO list for now.

Nick said:
"For completenss with the types supported by the UPC Collectives, I would have 
upc_type_t support:
> UPC_CHAR,  UPC_SHORT,  UPC_INT,    UPC_LONG,"

I don't recall any committee discussion regarding AMO's on types smaller than 
32-bits. I don't personally object to that, and I agree it would be nice to 
provide it for completeness and consistency with the UPC collectives and C11 
AMOs. For software-based implementations it should only be a small additional 
burden to support additional types, and I suspect every implementation will 
need a fall-back software implementation anyhow.

Original comment by danbonachea on 28 Sep 2012 at 11:42

GoogleCodeExporter commented 9 years ago
Dan said:
> I don't think we should provide any pre-defined "everything" domains.

I thought there were people (not me) that wanted to provide that as a 
convenience factor.  Maybe I was mistaken.  Either way, I don't think my users 
would use an "everything" domain for the very reason you note.  They want the 
fastest for what they know they need.  I'm happy to scratch the "everything 
domain" idea.

Dan said:
> I don't recall any committee discussion regarding AMO's on types smaller than 
32-bits.

I am not at all advocating for AMOs on <32-bit types.  I would probably be 
happy with 64-bit types only, but considering some notes about BG/P's AMO 
support (I think), 32-bit AMOs make sense, too.  I noted that "although not all 
of these would be supported by AMOs or UPC Collectives" and I would consider 
AMOs to only apply to the 32- and 64-bit types we've discussed so far, as well 
as the pointer-to-shared type.

Original comment by nspark.w...@gmail.com on 1 Oct 2012 at 9:12

GoogleCodeExporter commented 9 years ago
Since "we" (I'm not sure how many people are really concerned about the AMO 
discussion) seem to be considering AMO domains on a per-type basis (i.e., only 
guaranteeing atomicity on a per-type basis), I was wondering whether it would 
be worth revisiting an "old style" proposal (e.g., upc_amo_xor_U64_r()) with a 
restricted operation set (defined on a per-type basis).  My biggest concern 
with the domain approach, especially given that domains must be dynamically 
allocated, is the general user acceptance with such a "heavy" API (c.f., 
existing AMO extensions).

If we went with a subset of the operations noted in Comment 38 and Comment 81, 
could we define an easier-to-use API that is supported by Cray, IBM, SGI, and 
the IB-based solutions (HP, SGI), even if it isn't as robust (i.e., no MIN or 
MAX)?  Would fetching and non-fetching versions of {ADD, AND, OR, XOR, GET, 
SET, CSWAP} for integer types be a supportable, minimal set across 
implementations?

Maybe we leave operations like MIN and MAX for a future update when vendor 
support is more widespread.  Floating-point support wouldn't be an issue 
because of the per-type atomicity; we wouldn't include bit-wise ops for FP 
types.  And most FP AMOs would probably be done in software anyway.

Then, we could let an implementation define macros like 
UPC_AMO_[[TYPE]]_IS_FAST to provide some information to the user about the 
"fastness" of the AMO.

Original comment by nspark.w...@gmail.com on 3 Oct 2012 at 2:22

GoogleCodeExporter commented 9 years ago
"Floating-point support wouldn't be an issue because of the per-type atomicity; 
we wouldn't include bit-wise ops for FP types."

What's the hangup for bit-wise operations?  These operations just work on bits 
and don't need to know whether those bits happen to be an integer or a floating 
point value (or anything else for that matter).  Not including them will just 
cause users to play dirty games with unions or bad pointer aliasing.  We might 
as well do that inside the library instead of forcing users to do it.

Original comment by sdvor...@cray.com on 3 Oct 2012 at 2:34

GoogleCodeExporter commented 9 years ago
"What's the hangup for bit-wise operations? [...]"

I just didn't think many people would care to AND, OR, or XOR two 
floating-point values.  I guess that I just don't see what's helpful about 
that.  I think that ADD, GET, SET, CSWAP all make sense; I just can't see how 
you'd use the others in a meaningful way.  Also, I don't necessarily feel like 
the UPC AMO library needs to provide everything under the sun; I think it 
should provide those things that are meaningful or helpful to users.

Original comment by nspark.w...@gmail.com on 3 Oct 2012 at 2:39

GoogleCodeExporter commented 9 years ago
I support Comment 86 by Nick.  I like the idea of starting off from a small and 
simple API and enhancing it over time as hardware evolves.  Based on Comment 
38, it seems int64 is the only type we need to support now if we decide to go 
with this direction.  

Original comment by yzh...@lbl.gov on 3 Oct 2012 at 2:46

GoogleCodeExporter commented 9 years ago
"I just didn't think many people would care to AND, OR, or XOR two 
floating-point values."

This is very true.  I'd expect bit-wise ops to take appropriately sized integer 
values as masks. ;)

"I guess that I just don't see what's helpful about that.  I think that ADD, 
GET, SET, CSWAP all make sense; I just can't see how you'd use the others in a 
meaningful way."

Just like with integers, they're extremely useful for directly manipulating 
various portions of the raw representation.  For instance, with an IEEE 
representation, adding XOR allows users to atomically negate a floating point 
value, and adding AND allows users to atomically find the absolute value.

On an unrelated note, are you including SWAP under SET?  SWAP is very important 
in many lock-free algorithms.

"Also, I don't necessarily feel like the UPC AMO library needs to provide 
everything under the sun; I think it should provide those things that are 
meaningful or helpful to users."

Agreed.

Original comment by sdvor...@cray.com on 3 Oct 2012 at 2:52

GoogleCodeExporter commented 9 years ago
"Based on Comment 38, it seems int64 is the only type we need to support now if 
we decide to go with this direction."

I disagree.  I think we should still support [u]int(32|64), float/double, and 
PTS, but (according to Comment 38) Cray would only define UPC_AMO_INT64_IS_FAST 
== 1.  I think the big benefit of the per-type atomicity is that it lets a 
vendor, like Cray, tell the user which AMO types are fast and which are not, 
while still providing reasonably broad support across the library.

Original comment by nspark.w...@gmail.com on 3 Oct 2012 at 2:52

GoogleCodeExporter commented 9 years ago
"Just like with integers, they're extremely useful for directly manipulating 
various portions of the raw representation.  For instance, with an IEEE 
representation, adding XOR allows users to atomically negate a floating point 
value, and adding AND allows users to atomically find the absolute value."

I hadn't thought of that, but that is a neat use for bit-wise ops on FP types.  
Since most NIC-based AMOs seem to be atomic across types, I would imagine that 
your future FP support (noted in Comment 38) wouldn't be tied down by UPC 
allowing bit-wise FP ops.

"On an unrelated note, are you including SWAP under SET?  SWAP is very 
important in many lock-free algorithms."

Yeah, I was considering SWAP = FETCH + SET.  If we go the "explicit" route 
(like the current BUPC and Cray UPC AMO extensions), then we'll probably have 
two functions like: "void upc_amo_set()" and "TYPE upc_amo_swap()" (I'm 
intentionally leaving out the type and relaxed/shared stuff here).

Original comment by nspark.w...@gmail.com on 3 Oct 2012 at 2:58

GoogleCodeExporter commented 9 years ago
For whatever it's worth to this discussion, I talked to our Fortran rep about 
how their standard's committee was settling on which AMO operations to include 
for coarrays in the next standard (since Fortran currently has only an atomic 
read and an atomic write).  He said that their approach is to require a set of 
operations that all implementations will be expected to support (add, and, or, 
xor, swap, compare-and-swap) and they basically assume that any "reputable 
hardware" will support them all efficiently.  The idea of bothering with 
domains, a query mechanism for what is fast or slow, or worrying about some 
poor vendor that had hardware support for only a subset that was forced into 
software atomics seemed ridiculous to him.  Then again, they also have a 
specific implementation-defined integer data type for atomics.

Original comment by johnson....@gmail.com on 3 Oct 2012 at 3:17

GoogleCodeExporter commented 9 years ago
"Also, I don't necessarily feel like the UPC AMO library needs to provide 
everything under the sun; I think it should provide those things that are 
meaningful or helpful to users."

I think the problem is that while any specific user is likely to only consider 
a small set of AMO types and ops "meaningful and helpful", that set will differ 
between users/programs/modules, and the union of those requirements ends up 
being large. The domain API allows users to express which subset they care 
about for the current algorithm and the implementation to provide the best 
possible performance within those constraints. On the flip side, it allows 
implementations to expose whatever hardware AMO support they have for a 
particular algorithm, without requiring the hardware to support the complete 
list of ops in the UPC spec (most of which won't be used by any single 
algorithm). 

A secondary, more subtle feature of domains is they can improve concurrency by 
directly expressing the independence of AMOs from different modules or program 
entities, which need not contend for the same atomicity resources (software 
locks in the case of software-implemented AMOs). Imagine a massively-parallel 
application that divides its threads into a large number of teams, each of 
which creates an AMO domain for access to each team's shared data. If the data 
type in question happens to require a software AMO (eg atomic add on a 
floating-point accumulator), then members of each team contend for the team's 
lock while performing the AMO, but AMO's from different teams can always 
proceed independently. Without domains, then EVERY thread in the application 
potentially contends for ONE lock, even for non-conflicting updates, which 
could make a massive difference in overall performance. Once UPC officially 
grows teams, it also gives us a natural way to ensure the locks associated with 
the domain are located "near" the team using them, which may mean the 
difference between network communication and shared-memory locking, another 
potentially enormous performance win. One could even imagine team domain 
creation that activates a fully hardware implementation when the team members 
are all local to a shared-memory node, but falls back to a software 
implementation when the team includes members who are remote over a commodity 
network lacking AMO features. C11 doesn't worry about such issues because its 
target platform is fully shared-memory threading with relatively small-scale 
concurrency, where these issues have far less impact. I suspect the same is 
true for the majority of the Fortran community (those not using co-array 
features). It's our job to ensure the design we specify allows good scaling on 
large-scale systems (high thread concurrency), and large-scale applications 
(high module concurrency), and I think domains provide that.

From a programability standpoint, I don't think the domain proposal is overly 
cumbersome or "heavy", relative to a "domain-less" API. The domain is created 
once at module initiation, with one line of code. Thereafter for AMO operations 
the difference we're talking about is whether the user writes something like:

 upc_amo_strict(int64dom, 0, UPC_ADD, &val, 42, 0);

as opposed to something like:

 upc_amo_strict_int64(0, UPC_ADD, &val, 42, 0);

ie the only effective difference at the point of use is whether the operand 
type is expressed using the domain argument or in the spelling of the function 
name. In my code above the difference is 2 characters, and could be less if I 
named my domain variable differently. How is this a major burden? The tiny 
additional complexity (for the programmer) seems well worth the potential 
performance gain.

"I think the big benefit of the per-type atomicity is that it lets a vendor, 
like Cray, tell the user which AMO types are fast and which are not, while 
still providing reasonably broad support across the library."

We seem to have consensus that per-type atomicity is a mandatory design 
feature. It allows the implementation (and performance) to vary based on the 
operand type, and it prohibits nasty interactions of atomics of different type 
widths conflicting on the same memory locations, which seems like a Good Thing. 
Also, C11 atomics take the same position, so it will be a familiar restriction.

""On an unrelated note, are you including SWAP under SET?  SWAP is very 
important in many lock-free algorithms."
Yeah, I was considering SWAP = FETCH + SET. "

To clarify, most lock-free algorithms I've seen require an atomic pointer 
COMPARE and swap, not just an unconditional swap. This is NOT equivalent to a 
fetch and set. For reference, C11 DOES provide atomic pointer CAS (via 
uintptr_t) and even atomic pointer increment (which I'm not advocating for 
UPC), but does NOT provide any atomic ops on floating point types (although 
given that many HPC codes are so heavily dominated by FLOPs it seems reasonable 
for UPC to add support for atomics on floats).

"Then, we could let an implementation define macros like 
UPC_AMO_[[TYPE]]_IS_FAST to provide some information to the user about the 
"fastness" of the AMO."

This seems like a nice usability feature. C11 has something similar and 
slightly more sophisticated. I think we can choose to provide this 
independently of any other design decisions. 

Original comment by danbonachea on 4 Oct 2012 at 4:41

GoogleCodeExporter commented 9 years ago
Nick - I'm in the process of drafting the spec for upc_type_t, based on your 
comment 81. I noticed that C11 defines the following types for AMOs that aren't 
mentioned in your list, so I wanted to float these past you (no pun intended):

These I suspect are irrelevant for the majority of UPC codes:
  _Bool
  wchar_t

These are for issuing AMOs that perform local pointer artihmetic, which perhaps 
we don't care about?:
  (u)intptr_t  - integer type convertable to/from (void *)
  ptrdiff_t    - the result of subtracting two pointers

These seem potentially more applicable to UPC codes:
  (u)intmax_t  - the widest available integer type
  size_t       - the size of objects

Finally, the C11 AMOs use the stdint minimum-width types:
  (u)int_least{8,16,32,64}_t and (u)int_fast{8,16,32,64}_t
instead of the fixed-width types stated in comment 81:
  (u)int_{8,16,32,64}_t
The main difference is the minimum-width types may be larger than requested and 
are required for C99 compliance, whereas the fixed-width types must be 
exact-sized and are technically optional by C99/C11. In practice every current 
HPC system I've encountered (all the ones supporting GASNet) provide all the 
exact sized types, so perhaps this is a non-issue for the platforms of 
interest. However to be safe we might want to add a clause to the AMO spec 
stating that support for AMOs on the fixed-width types is 
implementation-specified.

Original comment by danbonachea on 7 Oct 2012 at 8:38

GoogleCodeExporter commented 9 years ago
I have worked on the reference implementation today, in the faint hope that I'd 
flush out a few things that were not obvious in the documentation.

a) I had to redefine all types, and I had to make them bitmask-able. Something 
like this:

enum {
  UPC_AMO_CHAR     = (1<<0),
  UPC_AMO_SHORT    = (1<<1),
  UPC_AMO_INT      = (1<<2),
  UPC_AMO_LONG     = (1<<3),
  UPC_AMO_UCHAR    = (1<<4),
  UPC_AMO_USHORT   = (1<<5),
  ... etc ...

Same goes for the operations. I don't know whether the original UPC definitions 
of upc_type_t are asking for bit-maskable values. Something to consider.

b) There is no good error reporting mechanism in UPC. What if I give the domain 
allocator a hint that I want the superfast HW implementation, but then ask for 
support for types that are not supported by HW? 

-- should the domain creation fail?
-- should the domain creation succeed, but subsequent operation fail?
-- should failures be fatal (i.e. kill the UPC program), or should we have 
status codes returned by upc_amo_strict and upc_amo_relaxed?

I implemented upc_amo_query() in any case to check whether a particular domain 
implements a particular combination of types and operations.

c) the domain constructors do not return shared objects, and therefore the 
_all_ and _global_ protocol does not apply. This is a pity, because it breaks 
consistency with the existing lock and memory allocators. 

-- should we follow the consistency rules of memory and lock allocators at all 
costs?

If we talk on the phone tomorrow, I will have some code. Suggestions as to 
where to post the code? even if it's throwaway stuff (as it most likely is), it 
might provide a further platform for discussion.

Original comment by ga10...@gmail.com on 8 Oct 2012 at 9:26

GoogleCodeExporter commented 9 years ago
"I have worked on the reference implementation today, in the faint hope that 
I'd flush out a few things that were not obvious in the documentation."

Excellent - this exercise should definitely provide additional insights.

"I had to redefine all types, and I had to make them bitmask-able."

I don't think there is any reason for the values of upc_type_t to each have a 
dedicated bit. These are never OR'd together in the proposed interfaces we've 
tossed around. Note each domain should only be for a SINGLE unique type - I 
believe we have consensus that we should not combine multiple types within an 
atomicity domain (mostly because it sidesteps word tearing issues). The 
upc_types.h header proposal I drew up for issue 10 only requires the type 
macros to have distinct values (and also spells them differently than what you 
pasted).

"What if I give the domain allocator a hint that I want the superfast HW 
implementation, but then ask for support for types that are not supported by 
HW? "

There should be no "superfast HW" hint - the hints are of the form "tune for 
latency", "tune for throughput" etc, but in all cases "do the best you can for 
this combination of type and ops" is implied. In any case the hint value should 
not cause an error - it's a hint, not a demand. There should not be a hint 
value to select a particular vendor-specific implementation, because that would 
create portability problems.

If the user specifies a prohibited combination of type and op (eg UPC_PTS with 
UPC_MULT), or type/op values that don't exist, then that should be a violation 
of a "shall" constraint in the AMO spec - which means behavior is undefined, 
but this should be a fatal error in a high-quality implementation. I see no 
motivation to provide graceful failure for such programming errors.

"the domain constructors do not return shared objects, and therefore the _all_ 
and _global_ protocol does not apply. This is a pity, because it breaks 
consistency with the existing lock and memory allocators. "

Please see comment 83 - I intentionally followed the form of the upc_lock_t 
interface, which I think addresses your comment...

Original comment by danbonachea on 8 Oct 2012 at 9:42

GoogleCodeExporter commented 9 years ago
"If the user specifies a prohibited combination of type and op (eg UPC_PTS with 
UPC_MULT), or type/op values that don't exist, then that should be a violation 
of a "shall" constraint in the AMO spec - which means behavior is undefined, 
but this should be a fatal error in a high-quality implementation. I see no 
motivation to provide graceful failure for such programming errors."

The user specifying type/op combinations that aren't supported by the 
implementation should be a fatal error (or undefined behavior), and an 
implementation should support all required combinations in the spec.  However, 
I think we would want to explicitly permit implementations to support a 
documented super-set of the required combinations so they can introduce new 
(non-portable) functionality for users to experiment with.  This way 
implementations wouldn't need to have a duplicate AMO API and makes satisfying 
the "two independent implementations" criteria for future extensions a bit 
easier.

Original comment by sdvor...@cray.com on 8 Oct 2012 at 9:56

GoogleCodeExporter commented 9 years ago
In lieu of a better solution I'll spam all of you with the tar file of
the current implementation. Please comment. I expect most of the code
to change.

> I don't think there is any reason for the values of upc_type_t to
  each have a dedicated bit. These are never OR'd together in the
  proposed interfaces we've tossed around. Note each domain should
  only be for a SINGLE unique type - I believe we have consensus that
  we should not combine multiple types within an atomicity domain
  (mostly because it sidesteps word tearing issues). The upc_types.h
  header proposal I drew up for issue 10 only requires the type macros
  to have distinct values (and also spells them differently than what
  you pasted).

Misunderstanding on my part, then. Looking purely at the API proposed
somewhere above in this thread - I interpreted the type and op
arguments for domain creation as bitmasks of desired operation. I may
agree with Dan about the type, but it certainly makes no sense to me
to create one domain for each operation. So my argument will still
apply to the operations.

> There should be no "superfast HW" hint - the hints are of the form
  "tune for latency", "tune for throughput" etc, but in all cases "do
  the best you can for this combination of type and ops" is
  implied. In any case the hint value should not cause an error - it's
  a hint, not a demand. There should not be a hint value to select a
  particular vendor-specific implementation, because that would create
  portability problems.

OK. No errors upon domain creation. I can still ask to execute an AMO
on an un-implemented operation on a domain. So AMOs can still issue
"not implemented" errors. Hence my choice to return a status code from
the AMO. Needless to say I'm not very strongly attached to this code,
so I can be talked out of this choice.

> Please see comment 83 - I intentionally followed the form of the
  upc_lock_t interface, which I think addresses your comment...

I guess the question in my mind is this: is an AMO
domain a shared object? A lock is obviously a shared object.

If the AMO domain is shared, a single thread can free it, and the
whole alloc/free protocol translates directly from memory allocation
and lock allocation.

If the AMO domain is local, then the protocol does not apply.

Original comment by ga10...@gmail.com on 9 Oct 2012 at 12:53

GoogleCodeExporter commented 9 years ago
"In lieu of a better solution I'll spam all of you with the tar file of
the current implementation."

You can click the link next to the paper clip icon with the label "Attach a 
file" to add the file to a ticket.

Original comment by gary.funck on 9 Oct 2012 at 1:10