Intrepid / upc-specification

Automatically exported from code.google.com/p/upc-specification

Library: non-blocking memory copy extensions #41

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This is to log the UPC non-blocking memory copy library extensions.

For more information, please see
https://sites.google.com/a/lbl.gov/upc-proposals/extending-the-upc-memory-copy-library-functions

Original issue reported on code.google.com by yzh...@lbl.gov on 22 May 2012 at 11:41

GoogleCodeExporter commented 9 years ago
I'm glad to hear we're making some progress on teasing out the "signalling-put" 
usage scenario, and perhaps agreeing that it deserves a separate, dedicated 
library call, to avoid conflating it with the desire for pure asynchronous data 
movement free from inter-thread synchronization. If we at least have consensus 
on that goal, then perhaps we can make progress on both extensions 
independently.

However, I'm very dismayed the Cray team is insisting on continuing to claim 
the Berkeley semantics somehow "violate" or "break" the memory model. This 
claim is totally and completely FALSE. Repetition of this unproven and patently 
false claim only demonstrates a lack of understanding of the Berkeley 
semantics, the memory model, or perhaps both, and is not conducive to 
productive discussion. I sincerely HOPE there is an honest lack of 
understanding here, and that this is not some kind of underhanded negotiation 
tactic attempting to discredit an opposing view with disparaging emotional 
claims. In any case I hope we can drop this false preconception and 
unsubstantiated mode of argument and have an open-minded discussion of the 
factual issues in precise technical terms, devoid of sloppy generalizations. 
This will be my last textual attempt to describe how the Berkeley 
semantics fit into the memory model. If someone is still not "getting it" then 
we should table the memory model discussion until the call. I have limited 
internet access for the next week until the call anyhow.

The UPC memory model is not some touchy-feely generalization or hand-wavy set 
of properties, nor is it anything like a virtual machine abstraction that one 
might sketch on the standard UPC shared memory diagram. It is a formal 
mathematical model that defines consistent execution traces of shared 
read/write accesses in valid programs. The conditions for validity are very 
explicitly and mathematically defined, and the guaranteed properties of strict 
accesses (and by extension fences) are corollaries that follow from the formal 
definition. The memory model is not automatically "violated" simply because it 
exhibits a property that someone may find surprising or counter-intuitive, nor 
because it may allow behaviors that have no analogue on one's favorite 
architecture under naive translation. This is not a weakness in the definition; 
it reflects an inherent complexity of relaxed memory models, which are by 
nature subtle and often surprising or counter-intuitive. 

Now all that being said, we all obviously strive to present users with language 
and library tools that neatly wrap up useful functionality with an interface 
that is as intuitive and unsurprising as possible, while still efficiently 
providing the desired functionality. However programmers that insist upon 
writing invalid or perverse programs will end up with undefined or unintuitive 
behaviors, respectively. We should not cripple the performance or utility of 
our expected target usage scenarios in order to better accommodate these 
outliers. Parallel programming is hard, and C is a low-level language - 
programmers need considerable sophistication to use UPC correctly and 
effectively, and that goes double for "advanced" features like explicitly 
asynchronous data movement within a relaxed memory model. Debugging tools and 
automated error detection are valuable tools, but ultimately we are dealing 
with a non-type-safe HPC language with numerous "pointy edges", and the 
feasibility of automated correctness checking should not take precedence over 
functionality and performance concerns in evaluating new language features.

A VERY important and deliberate property of the memory model is that it does 
not make any guarantees about the possible results of operations that did not 
occur. Stated another way, if the execution trace did not directly "observe" a 
violation of the model, then the execution is consistent, regardless of what 
tricks the implementation may be playing under the covers (whose effects were 
not observed by any application read operations). 

The Berkeley semantics for async are that it leads to UNDEFINED BEHAVIOR for 
any application thread to modify the source buffer or in any way access the 
destination buffers during the transfer interval, between the library 
initiation call and the successful return of library sync call. A program that 
"cheats" and touches these buffers in the forbidden interval is an INVALID UPC 
program, and the memory model does not provide ANY guarantees whatsoever for an 
invalid program. Valid UPC programs Shall Not touch those buffers within the 
transfer interval, and this property makes it IMPOSSIBLE for them to observe 
exactly how those buffers were accessed by the library, and how those accesses 
may or may not have been affected by other non-related fences or 
synchronization constructs. Because all executions of valid programs are 
prohibited from observing any violations, by definition the memory model is 
preserved and the executions are "UPC Consistent". This is the essence of how 
the memory model works - if VALID programs cannot tell the difference, then the 
model is not violated.

Troy wrote:
"I don't think that this approach is valid for extending UPC (at least not in 
the backwards compatible manner that we want for UPC 1.3) because it could 
break the intent of existing code by removing the only mechanism that the 
programmer has to ensure that there is no ongoing communication: upc_barrier.  
If I have a collective function in an existing library, I may have used a 
upc_barrier upon entry and exit to ensure that I can do what I want with any 
memory in between.  Currently this is a foolproof way of guarding against what 
comes before and after my collective library function and the only burden on my 
client is to call the function in a collective context.  With asyncs added, my 
barriers no longer offer complete protection and the burden shifts to the 
library client to ensure that any asyncs touching the data do not cross a call 
to this function somewhere in their call graph."

I agree the property you mention is CURRENTLY true of pure UPC 1.2 programs as 
a consequence of combining the memory model with the highly restricted set of 
communication operations available in 1.2. It's not a direct property of the 
memory model, it just happens to be true based on what you can currently 
express in the 1.2 language. The property is notably NOT preserved once you 
link in anything outside UPC, including MPI operations and even asynchronous 
file I/O, both of which are common-place in enterprise applications. It's 
debatable whether that property was originally intentional or even if it's a 
desirable high priority moving forward. In any case the fact that admitting the 
Berkeley semantics into the 1.3 toolbox can remove this guarantee does not 
constitute a violation of the memory model, or any officially stated corollary 
thereof. It's not a backwards compatibility issue because legacy programs don't 
use the new features and therefore remain unaffected. Programmers who add calls 
to the async library are burdened with ensuring the buffers remain untouched 
throughout the transfer interval - this is precisely the additional complexity 
that sophisticated users are accepting when they choose to use any async 
library. If users cannot figure out how to leave the buffers untouched during 
the transfer interval, then they have no business using an asynchronous 
library. 

However regarding this point, I think it's worth separating discussion of the 
property no-asyncs-crossing-barriers (mentioned above) from 
no-asyncs-crossing-fences (which includes barriers as a special case, but MANY 
more operations as well). We could conceivably decide to require the init and 
sync call to appear in the same barrier phase (and this could be easily checked 
in a debug implementation), independently of what we decide regarding fences. 
My inclination is that explicitly async transfer intervals should be permitted 
to span either one (for reasons of composability), but I would like some user 
feedback about how important Troy's property is to them.

ga10502 said:
"1) asynchronous memory operations have local completion semantics (i.e. 
waiting for an async memput only guarantees that the send buffer is reusable 
after a put). Fences operate on asynchronous operations just like on "normal" 
blocking ones.

2) asynchronous memory operations have global completion semantics (i.e. 
waiting for an async memput guarantees that it is now remotely complete as 
well). Fences do not operate on asynchronous operations - indeed, there is no 
point since just waiting already does the job."

We all appreciate your input, although unfortunately I don't think either of 
your "summaries" above corresponds to the possibilities we're currently 
discussing. The Berkeley semantics are very *close* to option #1, except the 
fences you mention are only relevant to the async operation when they occur 
outside the transfer interval, and fences during the transfer interval are not 
relevant to the completion of the async. Option #2 is close to the Cray 
proposal of several weeks ago, but I think consensus is moving towards 
loosening those built-in fences and instead introducing a dedicated 
signalling-put library to meet the need it was trying to address. Both sides 
should probably summarize the semantics they are currently proposing - I'll 
post a followup shortly that does this.

"I'm not sure what you will say about processor ordering of asynchronous puts 
to the same remote address. I would love it if you could make yourself say that 
the *start* of the operation is what determines order - not when you wait for 
completion. This, again, would be unsurprising. It can be implemented with some 
effort - I will claim that the effort is the same as we are already making to 
order blocking puts on an unordered network."

This "effort" you mention is one of the primary performance bottlenecks we are 
trying to alleviate by introducing this library. The programmer asserts there 
are no conflicting accesses to the source or destination buffers for the entire 
period delimited by the init/sync calls (the "transfer interval"), and the 
implementation does a better job of communication overlap as a result. If you 
need a stronger ordering guarantee, the existing upc_mem* libraries already 
provide that semantic. Burdening the async library with such a semantic would 
make it nearly indistinguishable from the existing calls and defeat the purpose 
of introducing a new library.

"You spent a lot of time talking about fences and their interaction with 
half-finished asynchronous operations. This seems like a red herring to me - if 
you are a crazy enough programmer to use non-blocking operations - need I 
elaborate on the perils of non-blocking remote operations? - well, in that case 
making sure that there are no pesky strict accesses, barriers and so forth 
between the op start and the wait should be child's play."

Your statement is a very strong argument for Berkeley's view regarding the 
behavior of fences and other strict ops that appear in the transfer interval. 
This is the largest point of remaining contention. However let me clarify that 
neither proposal is currently *prohibiting* fences in the transfer interval - 
the disagreement centers around what effect they have on the async operation in 
progress. If you never place a fence or strict op in the transfer interval, 
then the difference is irrelevant and both sides agree on behavior. However, I 
think both sides want to allow strict ops during the transfer interval for 
reasons of overlap and composability - ie so that UNRELATED computation and 
communication can continue.
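
To illustrate that last point (my own sketch, not taken from either proposal): a fence issued for unrelated synchronization can legally sit inside the transfer interval, and under the Berkeley semantics it orders the unrelated accesses without forcing completion of the in-flight copy:

```upc
upc_handle_t h = upc_memcpy_nb(dst, src, n);  /* async copy in flight */

shared_flag = 1;   /* relaxed write to an UNRELATED shared location */
upc_fence;         /* Berkeley semantics: orders the flag write, but does
                      NOT force completion of the in-flight copy */

upc_waitsync(h);   /* only here is the copy guaranteed complete */
```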

Original comment by danbonachea on 10 Aug 2012 at 4:10

GoogleCodeExporter commented 9 years ago
In preparation for the upcoming telecon discussion, I'm splitting out the key 
sections of the old Berkeley proposal, which included a number of additional 
functions that are not currently under consideration for inclusion. Both Cray 
and Berkeley proposals MOSTLY agree on the use of handle types (and the 
different flavors of handle-based synchronization), and the flavors of 
memput/memget that should be provided - so to focus on the important issues and 
avoid tedious duplication I'll show a single exemplar function. This text 
obviously still needs a lot of work in translation to formal "spec-ese", but for 
now I'm tending towards verbosity in the interests of clarity:

#include <upc_nb.h>
upc_handle_t upc_memcpy_nb(shared void * restrict dst, shared const void * restrict src, size_t n);

Semantics
----------
1 This operation has the same data movement effect as the corresponding 
upc_memcpy call defined in the UPC Language
Specification section 7.2.5, except it is split-phase. 
2 The specified operation is initiated with a call to the above function which 
returns an explicit handle representing the operation in-flight. 
3 The operation remains "in-flight" until after a successful call to 
upc_waitsync or upc_trysync on the returned handle. 
4 The contents of all affected destination memory is undefined while the 
operation is in-flight, and if
the contents of any source memory changes while the operation is in-flight, the 
result is undefined.
5 The order in which non-blocking operations complete is intentionally 
unspecified - the system is free
to coalesce and/or reorder non-blocking operations with respect to other 
blocking or non-blocking operations,
or operations initiated from a separate thread - the only ordering constraints 
that must be satisfied are those
explicitly enforced using the synchronization functions (i.e. the accesses 
comprising the non-blocking operation are only guaranteed
to be issued somewhere in the interval between initiation and successful 
synchronization on that operation).
6 The effect on conflicting accesses OUTSIDE the transfer interval is AS-IF the 
transfer were performed as a set of relaxed shared reads and relaxed shared 
writes of unspecified size and order, issued at an unspecified time within the 
transfer interval by the calling thread. Conflicting accesses INSIDE the 
transfer interval are prohibited by semantic 4. "INSIDE" and "OUTSIDE" are 
defined by the Precedes() program order for accesses from the calling thread; 
accesses from other threads are considered "INSIDE" unless every possible and 
valid <_strict relationship orders them before the init call or after the sync 
call.
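
To make the split-phase pattern concrete, here is a minimal usage sketch using the exemplar function above and the handle-based sync both proposals already agree on (upc_waitsync is assumed to be the blocking sync flavor; the helper functions are hypothetical):

```upc
#include <upc_nb.h>

/* Sketch: overlap a bulk copy with unrelated computation. */
upc_handle_t h = upc_memcpy_nb(dst, src, n); /* transfer interval begins */

compute_on_unrelated_data();  /* hypothetical work; per semantic 4 it must
                                 not touch src or dst while in-flight */

upc_waitsync(h);              /* transfer interval ends (semantic 3) */
use(dst);                     /* dst contents are now defined */
```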

Key properties of the Berkeley async semantics:
----------------------------------------------
1. The program is prohibited from changing the source buffer between the 
initiation call and the return of a successful sync call (the "transfer 
interval").
2. Similarly, the contents of the destination buffer are UNDEFINED during the 
entire transfer interval (until a successful sync call returns). 
3. Fences, barriers, and other operations (to unrelated memory locations) are 
fully permitted during the transfer interval, but the transfer interval is 
unaffected and extends until the matching sync call (stated differently, the 
operation that was explicitly initiated remains "in-flight" until the program 
explicitly calls the sync operation for it).
4. Because the contents of destination memory remain UNDEFINED until the return 
of a sync call, it is impossible for any valid program to "observe" any effect 
on the destination buffers caused by any fences or strict operations issued 
during the transfer interval.
5. After a successful sync call returns, subsequent read/writes issued by the 
thread that invoked the transfer are immediately guaranteed to observe the 
effects of the transfer without any fences. This is exactly the same semantic 
guarantee as the existing upc_mem* calls.
6. There are no strict operations, fences or "global completion" guarantees 
built-in to the library calls. The library performs pure data movement, not 
inter-thread synchronization. If such fences or inter-thread synchronization 
are required (eg to guarantee other threads see the effects of the transfer), 
they can be added around the library calls by the program using existing UPC 
mechanisms. These are exactly the same consistency guarantees as the existing 
upc_mem* calls, and the "best practice" documented in B.3.2.1.
7. The formal semantics of the library are defined in terms of 
minimally-constrained relaxed operations, exactly analogous to the existing 
upc_mem* library, as described in spec v1.2 section B.3.2.1. The size, order 
and temporal position of these operations within the transfer interval is 
deliberately unspecified. However, due to property #2 and #4, such details 
cannot be observed by any valid program anyhow.
8. Because of #6 and #7, the expression wait_sync(upc_mem*_async()) is 
semantically identical to the blocking version upc_mem*(), and is expected to 
exhibit a very similar performance profile in high-quality implementations.
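
Property #8 can be stated as a sketch (assuming upc_waitsync is the blocking sync flavor on the returned handle):

```upc
/* These two fragments are semantically identical under properties #6/#7,
   and should perform comparably in a high-quality implementation. */
upc_memcpy(dst, src, n);                   /* blocking form */

upc_waitsync(upc_memcpy_nb(dst, src, n));  /* init-then-immediately-sync */
```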

Notable differences from the Cray proposal:
----------------------------------------------
1. In Berkeley asyncs the library transfer interval does not end when a thread 
issues a fence or other strict operation (possibly from within an unrelated 
library call or other callee). This enables deeper communication overlap in 
large, multi-module/multi-author applications.
2. Berkeley asyncs do not assume the application always wants to couple thread 
synchronization with data transfer. In this way they are more primitive and 
also potentially lighter-weight, and designed to efficiently apply to a broader 
set of usage cases. They enable applications to coalesce synchronization 
overheads for multiple data transfers (eg think bulk synchronous HALO 
exchange), rather than incurring thread synchronization costs on a per-operation 
basis.
3. The Berkeley asyncs are semantically very close to the existing upc_mem* 
library. They intentionally add the minimal possible semantics to enable 
explicitly asynchronous data transfer, and remain orthogonal to issues of 
thread synchronization. The memory model formalism used to describe the 
upc_mem* library also applies to the Berkeley asyncs.
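
Difference #2 above (coalescing synchronization costs across many transfers) might look like the following sketch; NNBR, nbr_dst, nbr_src and the compute call are hypothetical names for a bulk-synchronous halo exchange:

```upc
upc_handle_t h[NNBR];              /* one handle per neighbor transfer */
for (int i = 0; i < NNBR; i++)
  h[i] = upc_memcpy_nb(nbr_dst[i], nbr_src[i], n);

local_interior_compute();          /* overlap: work not touching the buffers */

for (int i = 0; i < NNBR; i++)
  upc_waitsync(h[i]);              /* one sync phase for all transfers */
upc_barrier;                       /* inter-thread synchronization, if needed,
                                      added explicitly by the program */
```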

Original comment by danbonachea on 10 Aug 2012 at 4:13

GoogleCodeExporter commented 9 years ago
"However, I'm very dismayed the Cray team is insisting on continuing to claim 
the Berkeley semantics somehow "violate" or "break" the memory model."

Let UPC 1.2 have memory model M.  I claim that the BUPC async proposal, by 
adding things to the language that are immune to certain aspects of M, like 
fences, creates memory model M'.  The BUPC async proposal is fully compatible 
with M'.

I believe that BUPC considers M == M' so that when we say that the BUPC asyncs 
violate M, it is interpreted as nonsense, since M' was constructed for the BUPC 
asyncs and they are compatible with it.  I do not believe that M == M' because 
if I try to reason about a program using only M and that program contains BUPC 
asyncs, then I can come to incorrect conclusions about my program's behavior.

I understand your point that the memory model "is not some touchy-feely 
generalization or hand-wavy set of properties," but based on responses to other 
Issues on this site, the UPC 1.3 memory model should remain equivalent to the 
UPC 1.2 memory model for backward compatibility and to avoid confusing users.  
We might be quibbling over what "equivalent" means.  I think you are saying 
that the "formal mathematical model" in the appendix has not changed and the 
new stuff that the BUPC async proposal would add is declared to be independent, 
and therefore M == M'.  I'm saying that I can't necessarily apply M and M' to 
the same program and reach the same conclusions; therefore, to me, as someone 
trying to apply the memory model as a tool for program interpretation, M != M'.

The original Cray proposal and the newer consensus proposal stay completely 
within M.  The half-fence idea is already part of M because upc_fence already 
has two roles in blocking movement of relaxed accesses; there is simply no 
user-accessible syntax for using one of its two roles without the other.  In 
this instance, we needed only one of the two existing roles, similar to the way 
that upc_barrier can be divided into upc_notify and upc_wait.

Hopefully that clears up what I mean by "break" or "violate."

Original comment by johnson....@gmail.com on 10 Aug 2012 at 5:21

GoogleCodeExporter commented 9 years ago
" I do not believe that M == M' because if I try to reason about a program 
using only M and that program contains BUPC asyncs, then I can come to 
incorrect conclusions about my program's behavior."

The memory model has not "changed" in any way. The Berkeley asyncs do not 
propose to change it at all. What is changing is the set of OPERATIONS that you 
can express in the language, and this is true of both proposals. Furthermore, 
the "reasoning" properties you refer to remain true for all VALID UPC programs 
(ie those that do not break the library interface and access undefined values). 
The memory model only defines behaviors for VALID UPC programs and that remains 
true as well.

Pure 1.2 programs do not use the new libraries at all, so trivially meet the 
validity requirement of the library. 1.3 programs that add calls to the new 
libraries must follow the rules of the interface and not touch the buffers 
during the transfer interval, and any properties still remain true. The only 
way to observe the differences you claim would be through an INVALID program 
(ie one that touches buffers in the forbidden interval). I challenge you to 
write a VALID program using the Berkeley library semantics that breaks any 
property of interest. We're not interested in the behavior of INVALID programs, 
ie those that access undefined values or break rules which lead to undefined 
results, because therein lies the road to madness.

Where the two proposals differ is the point at which the transfer interval 
ends, ie the extent of the undefined buffers. Berkeley states it 
unconditionally extends from the explicit init to the explicit sync. Cray 
states it can conditionally end earlier if the calling thread performs a strict 
operation. This difference is a philosophical one and really has NOTHING to do 
with the formal memory model at all - it's simply the declared boundary of the 
asynchronicity of the library we're ADDING. Either option can fit perfectly 
inside the existing memory model without changes. We should focus our 
discussion on the user impact of this difference and the pros and cons of 
either way, instead of wasting time inventing falsehoods about how either way 
"breaks" the memory model.

"The half-fence idea is already part of M because upc_fence already has two 
roles in blocking movement of relaxed accesses; there is simply no 
user-accessible syntax for using one of its two roles without the other. "

I encourage you to carefully re-read the formal memory model. The terms 
"half-fence", "local completion" and "global completion" do not appear anywhere 
in the entire 1.2 spec. Any library spec that wishes to use such terms to 
specify consistency semantics would need to introduce FORMAL mathematical 
definitions that fit within the existing formalism.

I'd also remind you that consistency semantics of upc_fence are clearly defined 
in B.5.3.1:
"A upc fence statement implies a strict write followed by a strict read:
  SW(l_synch, 0) ; SR(l_synch, 0)"
The "half-fence" you're looking for is probably just a strict write: 
SW(l_synch, 0)
but since it sounds like we're probably dropping that semantic anyhow it's not 
really relevant anymore.

Original comment by danbonachea on 10 Aug 2012 at 8:42

GoogleCodeExporter commented 9 years ago
I think we should just wait for the conference call now.  :)  Neither one of us 
is convincing the other and I think that we'll need to have a majority of 
non-BUPC, non-Cray folks help solve this issue, preferably with input on what 
users want.

Original comment by johnson....@gmail.com on 10 Aug 2012 at 8:53

GoogleCodeExporter commented 9 years ago
"I think we should just wait for the conference call now.  :)  Neither one of 
us is convincing the other and I think that we'll need to have a majority of 
non-BUPC, non-Cray folks help solve this issue, preferably with input on what 
users want."

I agree - as I mentioned earlier this textual discussion doesn't seem to be 
progressing us towards resolving anything. I was just hoping to narrow the 
discussion points to the precise technical differences so we can focus on 
resolving what actually matters.

In preparation for the call, I'd like everyone to consider the following 
program:

1: shared int p = 0; // assume no other threads are touching these locations
2: shared int q = -10;
3:   ...
4: if (MYTHREAD == 2) {
5:   #ifdef __UPC_NB_MEM_
6:     upc_handle_t h = upc_memcpy_async(&p,&q,sizeof(int)); // start an async copy that eventually writes -10 to p
7:   #endif
8:   p = 20;
9:   int l_p = p;
10:  printf("%i",l_p);
11: }

Under UPC 1.2 (without __UPC_NB_MEM_), I can state the following property 
concerning lines 8-9:
"Because no other thread is touching p in this synchronization phase, and this 
thread performs a relaxed write of the value 20 to p on line 8, then the 
immediately subsequent relaxed read of p on line 9 from this thread is 
guaranteed to return 20.". More concisely and intuitively, "No other thread is 
touching p, so when I write p and then immediately read p I get back the value 
I wrote".

When you enable 1.3 __UPC_NB_MEM_ and add the EARLIER call to upc_memcpy_async 
on line 6, this property is no longer true. The spec allows implementations 
which sometimes print 20, sometimes print -10, and sometimes print something 
completely different. This is because the transfer interval for the unsynced 
library call has made the contents of p UNDEFINED, and this program is writing 
and reading that buffer while it has an undefined value, leading to undefined 
behavior. This is an incorrect use of the library, and it has invalidated a 
very simple property derivable from the 1.2 memory model regarding accesses 
elsewhere in the program.

Note moreover that everything stated about the program above is true for BOTH 
the Berkeley and Cray libraries. Under BOTH semantics this program is reading 
undefined values by the addition of the earlier library call and now has 
undefined behavior. This is an example of how the addition of the library 
features proposed by Berkeley OR Cray can invalidate simple reasoning about the 
memory model in programs that use the new libraries incorrectly. This doesn't 
mean the memory model is "broken", it just means there is a new way to write 
programs with undefined behavior that the memory model does not govern. 
Programmers who embark upon using a new library feature are burdened with using 
it CORRECTLY, otherwise they get undefined behavior and commonly derived 
properties of the memory model may not hold true.
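
For contrast (my own sketch, not part of the comment above), the VALID way to write thread 2's body is to end the transfer interval before touching p, which restores the 1.2 reasoning property:

```upc
upc_handle_t h = upc_memcpy_async(&p, &q, sizeof(int));
upc_waitsync(h);   /* transfer interval ends; p is defined again */
p = 20;            /* relaxed write AFTER the interval */
int l_p = p;       /* guaranteed 20: no other thread touches p */
printf("%i", l_p);
```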

Original comment by danbonachea on 10 Aug 2012 at 11:01

GoogleCodeExporter commented 9 years ago
The following is NOT intended to pour gasoline on any fires.

I just wanted to take a moment to restate a key point of the Berkeley proposal 
that might help reduce confusion.  In particular, we keep stating that the 
Berkeley proposal does not "break", "modify" or "circumvent" the existing 
(Appendix B) memory model.  Yet so far the explanations we have offered have 
not been effective at convincing Troy or Steven of that assertion.  So, I want 
to try again (twice in fact).  I hope this will help.

To us (at least Dan and I) it is natural to define an "asynchronous" transfer 
as taking place at an UNdefined point in time (or more than one) between the 
init and sync calls.  This is what naturally happens when the init call queues 
work to the network and the sync call blocks for completion of that work.  
Making that UNdefined nature of the transfer part of the specified semantics of 
a UPC-level async interface leads "naturally" to a situation (described two ways 
below) in which an implementation can legally perform the async transfer 
entirely without interaction with strict accesses which take place in the 
transfer interval.  So, we believe that this is in no way an "unnatural" 
behavior, even if it may be "unintuitive" to some/many.

Note that the following are just two approaches to explaining the SAME idea, 
not two alternative/competing ideas.

EXPLANATION #1:

The Berkeley async data movement functions do NOT declare themselves as having 
a special exemption from the memory model's requirement that a strict operation 
(fences included) will order all preceding relaxed operations.  However, the 
semantics are constructed such that an implementation may LEGALLY perform the 
transfers without interaction with strict references, as follows...

What they DO is define a "transfer interval" which extends from the start of the 
init call to the end of the sync call, and define the transfer as being 
performed by an UNDEFINED sequence of (one or more) relaxed operations which 
may take place at any time (and in any order) for the duration of that interval.

BECAUSE the time(s) at which the relaxed operations occur is NOT defined, there 
is no way to "prove" that any portion of the transfer is the result of a 
relaxed operation which precedes the strict operation.  Therefore, an ordering 
in which all of the relaxed operations FOLLOW the strict operation (which 
itself must PRECEDE the sync call, or we'd not be concerned about it in this 
discussion) is "admissible" (for lack of a better word) under the CURRENT 
memory model.  The constraint that the destination memory is undefined during 
the interval (just as with an MPI_Irecv, FWIW) prohibits a correct program from 
performing reads which could observe any ordering other than one in which the 
transfer occurs entirely after the strict operation(s).

EXPLANATION #2:

1) The specified semantics allow a "move-all-data-at-sync" implementation (of 
course they do not require it) in which the ENTIRE relaxed transfer takes place 
IN the sync function and thus FOLLOWS any strict operations executed WITHIN the 
transfer interval.

2) No legal UPC program (meaning it obeys the specification's prohibition 
against reading the destination or writing the source) may OBSERVE the actual 
order/timing in which the data is written to the destination, NOR may it modify 
the source data (for instance to "sneak" information about the timing of the 
transfer into the destination memory).

3) Therefore, an implementation is permitted to transfer the data without 
"forced completion" by strict operations which occur in the transfer interval 
because the only OBSERVABLE (by legal programs) execution is indistinguishable 
from move-all-data-at-sync (in which the data movement FOLLOWS the strict 
operations).
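
Explanation #2's "move-all-data-at-sync" implementation can be sketched as follows (pseudocode for ONE legal implementation, not a requirement; the record/perform helpers are hypothetical):

```upc
/* A LEGAL implementation under the Berkeley semantics:
   defer ALL data movement to the sync call. */
upc_handle_t upc_memcpy_nb(shared void * restrict dst,
                           shared const void * restrict src, size_t n) {
  return record_deferred_copy(dst, src, n);  /* no data moved yet */
}

void upc_waitsync(upc_handle_t h) {
  perform_recorded_copy(h);  /* the entire relaxed transfer happens HERE,
                                after any strict ops in the interval */
}
```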

Original comment by phhargr...@lbl.gov on 11 Aug 2012 at 1:02

GoogleCodeExporter commented 9 years ago
As another interesting data point, the UPC I/O library in spec 1.2 section 
7.4.7 already includes asynchronous modification of memory, and is the closest 
example of explicitly asynchronous library behavior in the current spec:

"For asynchronous read operations, the contents of the destination memory are
undefined until after a successful upc_all_fwait_async or
upc_all_ftest_async on the file handle. For asynchronous write operations,
the source memory may not be safely modified until after a successful
upc_all_fwait_async or upc_all_ftest_async on the file handle."

The behavior of the async I/O library is consistent with the Berkeley approach, 
in that the transfer interval and undefined buffers ends at the library sync 
call, not at earlier arbitrary fences or other strict operations. Using async 
I/O in UPC 1.2 it is ALREADY possible to construct scenarios where even a 
upc_barrier does not "quiesce" all in-flight changes to memory, and a library 
continues to modify destination buffers after a fence or barrier.
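The scenario described can be sketched as follows (arguments abbreviated; 
the function names are from UPC 1.2 section 7.4, but treat the exact 
signatures here as illustrative):

```c
upc_file_t *fd = /* ... upc_all_fopen(...) ... */;
upc_all_fread_local_async(fd, buf, /* ... */); /* buf now undefined */
upc_barrier;        /* does NOT quiesce the in-flight read; the I/O
                       library may still be writing buf here */
upc_all_fwait_async(fd);   /* only now is buf defined again */
```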

Original comment by danbonachea on 11 Aug 2012 at 2:09

GoogleCodeExporter commented 9 years ago
RE Comment 57:

Hi Paul.  I think Dan previously made the argument in your Explanation #1.  
Here's my problem with it...

The existing upc_mem* functions are already collections of relaxed accesses 
that can issue in any order between two fences; however, to obey the 
same-address ordering rule, a upc_mem* implementation generally must wait until 
the accesses fully complete before returning.  It would not be correct for a 
UPC compiler to reorder a upc_mem* call with respect to a upc_fence, either 
before the call or after the call:

                                 upc_memput( ... );
    upc_fence;           --x-->  upc_fence;
    upc_memput( ... );

    upc_memput( ... );
    upc_fence;           --x-->  upc_fence;
                                 upc_memput( ... );

The major utility of a upc_mem*_nb function is that it can start issuing 
accesses and not wait for them to complete before returning, allowing 
communication to be overlapped both with other communication and with 
computation.  
Ideally the accesses issue as soon as possible, but it would NOT be correct for 
a compiler to issue them before the previous fence:

                                      h = upc_memput_nb( ... );
    upc_fence;                --x-->  upc_fence;
    h = upc_memput_nb( ... );
    sync( h );                        sync( h );

I haven't seen anyone argue that the above is legal using the BUPC async 
functions, and that's good.

Now, going the other direction, it also should not be legal to allow the 
accesses to issue AFTER the next fence.  BUPC claims that for the async 
functions the accesses could all issue after the fence, or even all issue at 
the sync.  This situation is as if the _nb call was made after the fence:

    h = upc_memput_nb( ... );
    upc_fence;                --?-->  upc_fence;
                                      h = upc_memput_nb( ... );
    sync( h );                        sync( h );

For this transformation to be legal with an async call when it is not legal 
with the existing upc_mem* calls is truly bizarre.  If I'm an application 
programmer and I want to incrementally introduce greater overlap, I'd probably 
start by doing this:

    upc_memput( ... );  -->  h = upc_memput_nb( ... );
                             sync( h );

And then move the upc_memput_nb call upward and move the sync call downward in 
the program as far as I could.  Fences are brick walls here and that's fine -- 
that is to be expected.  (In the original Cray proposal, if the sync made it 
all the way to the fence, you could remove it.)  Then, after doing that, I 
might get a bit fancier and start software pipelining, which is where we get to 
the idea of providing an lsync as well as a gsync so that buffers can be reused 
and completion can be guaranteed for individual copies.

To summarize, an asynchronous upc_mem call plus a sync (lsync) that doesn't 
behave like its synchronous counterpart would be severely unintuitive and not 
what we want to offer our users.
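The progression described above, ending in software pipelining with an lsync 
for buffer reuse, might look like this sketch (the handle type and the 
upc_memput_nb/upc_lsync/upc_gsync spellings are illustrative, following the 
Cray-style naming; fill() and dst() stand in for application code):

```c
upc_handle_t h;
fill(buf[0]);                 /* produce the first chunk locally */
h = upc_memput_nb(dst(0), buf[0], n);
for (i = 1; i < iters; i++) {
    fill(buf[i % 2]);         /* compute while the previous put is in flight */
    upc_lsync(h);             /* local completion: previous source reusable */
    h = upc_memput_nb(dst(i), buf[i % 2], n);
}
upc_gsync(h);                 /* global completion of the final copy */
```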

Original comment by johnson....@gmail.com on 11 Aug 2012 at 4:07

GoogleCodeExporter commented 9 years ago
"The major utility of a upc_mem*_nb function is that it can start issuing 
accesses and not wait for them to complete before returning, "

No, that's only half the benefit. It also allows the implementation to delay 
issue based on network conditions. RDMA systems usually have a limited-depth 
queue for network injection, and it can be advantageous under contention to 
post operations to a software queue which is serviced later as the network 
drains. Even if the hardware accepts the operation immediately, an RMA offload 
engine may continue to read source buffers after the init returns, resulting in 
injections which are "issued" after init returns and other client code has run. 
In extreme cases of contention, injection might not even occur until within the 
sync call, although this is hopefully not a common case as it would defeat 
overlap for that operation.

I think this may be a core point of philosophical confusion - when we say the 
relaxed operations may be issued ANYWHERE in the transfer interval, we really 
mean that. This is not some semantic dodge, we really envision platforms that 
will exhibit this behavior. Cray needs to loosen its preconception that all 
operations are issued before the init call returns, as this is NOT a 
requirement we want to add - doing so would severely inhibit the overlap 
potential of this library on systems of interest.

"For this transformation to be legal with an async call when it is not legal 
with the existing upc_mem* calls is truly bizarre."

The existing library calls are monolithic and inseparable. The new library 
calls are asynchronous and require two related calls, init and sync. 
Asynchronous calls introduce the concept of a transfer interval, which has no 
analogue in the blocking calls. The behavior of this transfer interval is what
we're discussing, so the "behavior" of a non-existent transfer interval in the 
blocking calls is irrelevant.

"asynchronous upc_mem call plus a sync (lsync) that doesn't behave like its 
synchronous counterpart would be severely unintuitive and not what we want to 
offer our users."

If they "behave" identically, then there's no point in introducing a new 
library. The new library adds the ability to express important semantic 
loosening which an advanced user can exploit to achieve overlap and improved 
performance. This is not a UPC 101 feature - this is a library for advanced 
users who want to maximize the performance of their app and are willing to 
learn the slightly more complicated rules for using them correctly. Perhaps 
Cray's goal in this regard is different.

"want to incrementally introduce greater overlap..move the upc_memput_nb call 
upward and move the sync call downward in the program as far as I could.  
Fences are brick walls"

You're looking at this as a compiler writer performing an automatic 
transformation based on static analysis. I agree that a static optimization 
which attempts to move the init and sync calls would have to stop movement at 
fences, without a great deal more parallel analysis. But this interface is NOT 
a UPC compilation target - implementations have their own internal interfaces 
for that. This is a user API where the programmer has full knowledge of buffer 
access patterns, and can place his init and sync library calls correctly to 
ensure he doesn't touch buffers during the transfer interval where they are 
undefined. He can add fences and other unrelated computation and communication 
during the transfer interval which do not touch the in-flight buffers, and 
everything still works. Even if he takes the approach of starting with 
init();sync(); and slowly spreading the calls, he CAN safely move them past 
unrelated computation, communication and even syncs using his Programmer 
Knowledge to ensure he doesn't move them past the code which 
consumes/overwrites the transfer buffers or synchronization which signals other 
threads to do so. The fact this transformation cannot blindly and automatically 
be performed by an optimizer in most interesting cases is a large part of the 
motivation for exposing an explicitly asynchronous library in the first place 
whereby an advanced user can do this manually.

Original comment by danbonachea on 11 Aug 2012 at 5:12

GoogleCodeExporter commented 9 years ago
Troy wrote, in part:
> Now, going the other direction, it also should not be legal to allow the 
accesses
> to issue AFTER the next fence.  BUPC claims that for the async functions the
> accesses could all issue after the fence, or even all issue at the sync.
> This situation is as if the _nb call was made after the fence:

In that last sentence the phrase "as if the _nb call" distills the essence of 
the difference between our lines of thought.  In your line of thought the 
relaxed accesses are firmly associated with the _nb call.  In the Berkeley line 
of thought the association is weaker, with the _nb call only defining one end 
of an interval.  

So, I do NOT agree that the transformation you describe (moving the _nb call 
past the next fence) is illegal.  In fact, it is my position that it is allowed 
PRECISELY because the _nb call's semantics (as given in the Berkeley proposal) 
are designed to "give license" to perform this transformation (but ONLY with 
respect to fences within the transfer interval).  In the Berkeley proposal use 
of an async transfer is an assertion by the user that the source and 
destination memories are untouched within the transfer interval, and the 
transformation is permitted.

This *is* what we (Berkeley) want to offer to our users, and where our 
"philosophies" differ.

I don't feel this behavior is "unintuitive" when one thinks about asynchronous 
data movement by network hardware or other "asynchronous agent".  I AGREE that 
it does not follow "intuitively" from thinking about "sync(upc_memFOO_nb(...))" 
as equivalent to "upc_memFOO(...)", but that is NOT an equivalence we provide, 
and this could be made clear(er) in the specification/documentation if desired.

Let's look at this difference in another way.
Given the following:
    h = upc_memput_nb( ... );
    other_work1();
    upc_fence;
    other_work2();
    sync( h );
Cray's proposal PROHIBITS transformation into
    other_work1();
    upc_fence;
    other_work2();
    h = upc_memput_nb( ... ); // has moved DOWN 3 lines
    sync( h );
and it REQUIRES a transformation equivalent to:
    h = upc_memput_nb( ... );
    other_work1();
    sync( h ); // has moved UP 2 lines
    upc_fence;
    other_work2();

So, why is moving the sync() across the fence REQUIRED while moving 
upc_memput_nb() is PROHIBITED?  To me THAT is an unintuitive aspect of the Cray 
proposal.

In the Berkeley proposal, the init and sync are endpoints of an interval with 
neither one "anchoring" the transfer.  Thus both transformations are legal, but 
neither is required (again this only applies to movement across fences WITHIN 
the transfer interval).

Original comment by phhargr...@lbl.gov on 11 Aug 2012 at 5:43

GoogleCodeExporter commented 9 years ago
"Cray needs to loosen its preconception that all operations are issued before 
the init call returns"

We do not have a preconception that all are issued; that's not even how our own 
hardware works.  We expect that the implementation will start issuing accesses 
in the init call.  It very well may offload the whole transfer to an entirely 
separate piece of hardware and, sure, that hardware may be overwhelmed and not 
do anything for a while, but the point is that it tries to start the transfer.  
The asynchronous call is essentially the programmer saying please try to start 
this transfer right away and return as soon as you can so that I can do 
something else in the meantime.

"The behavior of this transfer interval is what we're discussing, so the 
"behavior" of a non-existent transfer interval in the blocking calls is 
irrelevant."

No, it is quite relevant because there are existing programs that use the 
blocking upc_mem* calls that will migrate to the non-blocking calls for better 
performance and the programmer needs a semantic equivalence to get started.

"If they "behave" identically, then there's no point in introducing a new 
library."

They should be identical until you separate them.  This is the equivalence that 
I'm talking about.  The fact that the transfer window vanishes to nothing for 
the blocking call does not invalidate this equivalence.  The utility comes from 
being able to separate them and that is the point of extending the library 
interface.  The separation process is familiar and similar to replacing a 
upc_barrier with a upc_notify plus upc_wait and then stretching them apart.  
Before they are separated they are by definition identical to the original 
upc_barrier.
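Troy's analogy, in code form (the upc_memput_nb/upc_lsync spellings are 
illustrative; upc_notify and upc_wait are the standard split-phase barrier 
statements):

```c
upc_barrier;                   /* equivalent, before separation, to: */
upc_notify; upc_wait;

upc_memput(dst, src, n);       /* claimed equivalent, before separation, to: */
h = upc_memput_nb(dst, src, n); upc_lsync(h);
```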

"You're looking at this as a compiler writer performing an automatic 
transformation based on static analysis. ...  The fact this transformation 
cannot blindly and automatically be performed by an optimizer in most 
interesting cases is a large part of the motivation for exposing an explicitly 
asynchronous library in the first place whereby an advanced user can do this 
manually."

The advanced user can achieve overlap while still respecting fences and 
barriers.  I don't believe that whatever additional overlap might be made 
possible by being able to suspend the normal fence and barrier rules for the 
duration of the async transfer would be worth the ensuing weirdness of not 
having a blocking/non-blocking+sync equivalence and not having a upc_barrier 
complete all accesses.

Original comment by johnson....@gmail.com on 11 Aug 2012 at 6:23

GoogleCodeExporter commented 9 years ago
"I AGREE that it does not follow "intuitively" from thinking about 
"sync(upc_memFOO_nb(...))" as equivalent to "upc_memFOO(...)", but that is NOT 
an equivalence we provide, and this could be made clear(er) in the 
specification/documentation if desired."

It's an equivalence that we intentionally provided.  Are you suggesting that we 
have the UPC 1.3 spec not guarantee the equivalence but as a vendor document 
that ours is equivalent?  That's possible to do, but less standardization is 
worse for users and I'd like to avoid that if possible.

"So, why is moving the sync() across the fence REQUIRED while moving 
upc_memput_nb() is PROHIBITED?  To me THAT is an unintuitive aspect of the Cray 
proposal."

We don't allow any access to cross any fence.  It does not matter that the 
access is being performed for the UPC thread via some asynchronous agent.  The 
accesses are still logically being performed by a particular UPC thread and its 
accesses are ordered with respect to its fences.  We follow this rule 
consistently.  As another example of an asynchronous agent, if a UPC thread has 
OpenMP threads underneath it and the OpenMP threads issue accesses, then a 
upc_fence completes all accesses issued by all OpenMP threads that are part of 
the UPC thread.  (I know this isn't possible for GASNet-based UPC 
implementations because you aren't allowed to sync a handle from a different 
pthread.)

Original comment by johnson....@gmail.com on 11 Aug 2012 at 7:08

GoogleCodeExporter commented 9 years ago
While I clearly haven't converted Troy, it feels to me like both sides now 
understand how the respective specification's goals/requirements differ.  
Hopefully things are now clear enough to an "outside observer" that others can 
make up their minds prior to the conf call.

Troy wrote:
>  Are you suggesting that we have the UPC 1.3 spec not guarantee the 
equivalence but as a vendor
>  document that ours is equivalent?

No.  What I was suggesting was that if the Berkeley design is chosen and one is 
concerned that users would expect the equivalence, then the spec could make a 
clear(er) statement about the lack of equivalence.

> We don't allow any access to cross any fence.

Berkeley instead leaves undefined which side of the fence the access(es) occur 
on, thus removing the issue of "crossing" one.  This is a fundamental 
difference and each side has presented arguments for their own approach and 
against the other.  I think these past few comments have been more "level 
headed" than some previous ones and allowed us to state our positions w/o 
name-calling or "mischaracterization" of the other.

> It does not matter that the access is being performed for the UPC thread via 
some asynchronous agent.

As Dan mentioned, the UPC I/O specification also has a model of asynchronous 
data movement similar to what the Berkeley async memcopy proposal contains.  
Does the Cray implementation also complete async I/Os at fences and other 
strict references?  If NO, what is the reasoning behind this different 
interpretation from how you now want async memcopy operations to be defined?  
If YES, does any other vendor share the interpretation that upc_fence must 
complete UPC I/O operations?  (Perhaps a tough question to answer, since I seem 
to recall that no other vendor has a UPC I/O implementation at the moment).

Original comment by phhargr...@lbl.gov on 11 Aug 2012 at 8:41

GoogleCodeExporter commented 9 years ago
"They should be identical until you separate them.  This is the equivalence 
that I'm talking about."...
""I AGREE that it does not follow "intuitively" from thinking about 
"sync(upc_memFOO_nb(...))" as equivalent to "upc_memFOO(...)", but that is NOT 
an equivalence we provide, and this could be made clear(er) in the 
specification/documentation if desired."

Let me clarify here, because the discussion has become quite arcane for the 
casual reader. Under the Berkeley proposal, the exact line:
  sync(upc_memFOO_nb(...));
IS SEMANTICALLY EQUIVALENT to:
  upc_memFOO(...)
This is the special case of an EMPTY transfer interval, so all the discussion 
about how opposing views handle that interval trivially collapse to no 
difference. Both lines of code encapsulate a synchronous transfer that consumes 
and produces the transfer buffers before returning, so surrounding code on 
this thread is guaranteed to observe updated values. 

Cray's current "half-fence" semantics for the first line additionally force 
some synchronization with other threads before sync() returns, which means the 
split-phase version generates MORE synchronization (and presumably more 
performance overhead) than the purely blocking version. This 
over-synchronization can never cause a correct program to become incorrect, but 
it IS a difference nonetheless. So unless they follow through on removing that part 
of the semantics, it is the CRAY proposal that lacks semantic equivalence to 
the blocking version.

Now when it comes to TRANSFORMING the split-phase call and separating the init 
and sync, the interval becomes non-empty and that's where more care is 
required. Note that once you start moving ANY non-comment lines of code around, 
the program is by definition no longer the same program - what we're talking 
about is transformations which are guaranteed to have no semantically 
observable effect on the BEHAVIOR of the program, relative to the one 
containing a blocking call. This is very different from saying the programs are 
EQUIVALENT, because the entire point is to overlap communication and improve 
performance (so at the very least the network traffic and performance will not 
be "equivalent", and we want that).

Neither proposal allows movement of init or sync past accesses which 
potentially conflict with the buffers in-flight - this could potentially cause 
the program to access undefined data, and thereby demonstrate a change in 
behavior and correctness. Neither proposal allows movement past a fence IN 
GENERAL, although under the Berkeley proposal that transformation is SOMETIMES 
valid, if the fence is synchronizing unrelated data and provably does NOT 
affect access to the transfer buffers by some other thread. Both types of 
transformation require information that compilers usually cannot (or 
realistically, can never) prove, so we're talking about manual transformations 
by an advanced user who understands his data dependencies. An advanced user 
could use the same knowledge to move a blocking upc_mem* call past fences in 
his program that he knows are unrelated to the data transfer.

"We don't allow any access to cross any fence.  ...  We follow this rule 
consistently.  ... a upc_fence completes all accesses issued by all OpenMP 
threads that are part of the UPC thread. "

What about with MPI non-blocking operations?: 

Thread 0                                     Thread 1
MPI_Irecv(&destbuffer,...,1,&handle);
...
upc_fence;
MPI_Send(...) /* to T1 */                    MPI_Recv(...) /* from T0 */
...                                          MPI_Send(...) /* to T0 */
MPI_Wait(&handle,...)

I hope your fence implementation doesn't stall to await completion of that 
Irecv (which is asynchronously writing to destbuffer), or one could easily 
construct many examples where that additional synchronization creates deadlock.

Original comment by danbonachea on 11 Aug 2012 at 6:13

GoogleCodeExporter commented 9 years ago
MPI functions are a red herring because they do not perform UPC shared accesses.

"As Dan mentioned, the UPC I/O specification also has a model of asynchronous 
data movement similar to what the Berkeley async memcopy proposal contains.  
Does the Cray implementation also complete async I/Os at fences and other 
strict references?"

First, I'll directly answer your question by explaining that our UPC async I/O 
isn't very asynchronous at this point, so fences and strict accesses don't 
worry about this problem, but even if it was fully asynchronous, technically, 
the fences don't have to worry about this problem...

UPC I/O asyncs are weird and different from either the Cray or the BUPC 
proposal.  A UPC thread calling a UPC I/O async function actually doesn't 
perform any UPC shared accesses as part of the call!

    "Each call to upc_all_{fread,fwrite}_shared[_async] with a common file pointer behaves as if the read/write operations were performed by a single, distinct, anonymous thread which is different from any compute thread (and different for each operation)." [UPC 1.2 A.1.1.3 Paragraph 3]

According to UPC I/O, if the calling thread executes a fence, then it doesn't 
affect the ordering of the accesses being done by this other anonymous thread 
because a fence is an operation local to one thread.  Personally I think the 
UPC I/O spec is pretty wacky and this particular paragraph raises a bunch of 
questions like:  What data has affinity to the anonymous thread?  Why isn't it 
counted by THREADS?  Why does the description of collective functions in the 
spec not have to exclude these threads -- presumably they don't have to 
participate in a barrier?  Where does this thread come from?

But I digress.  Fact is, the current I/O appendix deals with the async issues 
by introducing the concept of anonymous UPC threads and neither the BUPC 
proposal nor the Cray proposal went that route.  With the existing upc_mem* 
functions and the Cray non-blocking functions, the UPC thread calling the 
function makes relaxed accesses that respect fences.  The UPC thread calling a 
BUPC async function also makes relaxed accesses, but ignores any fences or 
barriers between the init and the sync.  The claim is that the accesses occur 
at an undefined point relative to the intervening fences so that the fence has 
no effect, but the calling UPC thread is still logically the entity performing 
the accesses.

If BUPC would like to redefine their async functions as causing accesses to be 
performed by an anonymous thread, then I will concede that a BUPC async gets to 
ignore the fences, but then BUPC needs to flesh out this concept of an 
anonymous thread a lot more.  I really don't like that the anonymous thread 
concept is mentioned in passing in the I/O appendix as if it were a trivial 
thing because it's not.

If the definition of anonymous threads can be pinned down, then we need to see 
if users expect the calling thread to be the thing logically responsible for 
issuing the accesses or if they're comfortable thinking in terms of anonymous 
threads.

Personally I don't like the anonymous thread concept and I hope we don't have 
to go there.  UPC programs (currently) are supposed to have exactly THREADS 
number of threads for their entire lifetime.  I much prefer saying that the 
accesses are performed by the calling thread and hope that we end up with that.

Original comment by johnson....@gmail.com on 12 Aug 2012 at 2:57

GoogleCodeExporter commented 9 years ago
"MPI functions are a red herring because they do not perform UPC shared 
accesses."

MPI is NOT a red herring, because MPI functions are very commonly used in 
hybrid UPC/MPI programs to modify shared memory locations. Hybrid UPC+MPI apps 
are a very important usage class, and this property of using upc_fence to 
quiesce all changes to shared memory that Cray seems so married to simply 
doesn't hold true for those apps.

"Personally I don't like the anonymous thread concept and I hope we don't have 
to go there.  UPC programs (currently) are supposed to have exactly THREADS 
number of threads for their entire lifetime.  I much prefer saying that the 
accesses are performed by the calling thread and hope that we end up with that."

I agree that we should not extend the anonymous thread concept from UPC-IO. It 
is well-motivated for the parallel I/O library for a number of different 
reasons, mostly having to do with how collective parallel I/O libraries work, 
but those aren't applicable to non-collective upc_mem*_async. My only point in 
mentioning the UPC-IO example was to demonstrate that 1.2 already includes 
operations that asynchronously modify shared memory that are not governed by 
upc_fence and upc_barrier. So the fairy-tale of using upc_barrier to quiesce 
the whole world doesn't even work in pure UPC-1.2, let alone 
UPC+MPI+POSIX-async-IO...  This derived and undocumented property only works in 
a very narrowly limited subset of UPC, so we should stop worrying about 
preserving it.

"The UPC thread calling a BUPC async function also makes relaxed accesses, but 
ignores any fences or barriers between the init and the sync. "

False - as we've said over and over the thread does not IGNORE them, in fact 
any fences or barriers continue to govern normal relaxed accesses issued by the 
application. However the library's relaxed accesses which comprise the async 
memcpy are permitted to take place anywhere inside the transfer interval, and 
the transfer buffers remain officially "off-limits" for the duration of the 
transfer interval (until the sync), so no valid UPC program can PROVE whether 
or not the fences in the interval had any effect on the accesses that comprise 
the transfer. 

Original comment by danbonachea on 12 Aug 2012 at 3:44

GoogleCodeExporter commented 9 years ago
"MPI is NOT a red herring, because MPI functions are very commonly used in 
hybrid UPC/MPI programs to modify shared memory locations. Hybrid UPC+MPI apps 
are a very important usage class, and this property of using upc_fence to 
quiesce all changes to shared memory that Cray seems so married to simply 
doesn't hold true for those apps."

Correct -- we don't expect a upc_fence to affect MPI.  I said that MPI is a red 
herring because Paul asked how our fences handled MPI_Send.  The answer is that 
UPC fences have no effect on MPI communication.  If you have a UPC program and 
it calls MPI_Send, the shared qualifiers get cast off.  From the perspective 
of the UPC program, the call is to a function that can touch only local data 
and at most the upc_fence is obligated to do an mfence() to enforce ordering, 
which generally won't do anything to an MPI transfer, but that's fine.

"So the fairy-tale of using upc_barrier to quiesce the whole world doesn't even 
work in pure UPC-1.2, let alone UPC+MPI+POSIX-async-IO"

I never said that upc_barrier covered other communication models.  It quiesces 
UPC shared accesses for the UPC threads involved in the barrier (the anonymous 
UPC I/O threads are evidently not participants in upc_barrier).  We should 
preserve this property.  MPI calls do not make UPC shared accesses.  An OpenMP 
thread or a POSIX thread could make a UPC shared access, and those have to be 
covered by fences and barriers.

From the above discussion, it has become apparent that we want the following 
things that part of the BUPC team does not want:

  - Full equivalence between existing upc_mem* functions and their upc_mem*_nb counterparts plus a local sync. (I believe Yili agreed with this point when we worked on the consensus proposal.)
  - upc_barrier should continue to quiesce all shared accesses.
  - Non-blocking shared accesses are ordered by fences, no matter if the fence is before, within, or after the init-sync interval, just like normal shared accesses.
  - The ability to fully (globally) complete an individual nb transfer without forcing completion of other communication.

Note that as part of the compromise approach, we conceded requiring the sync 
call to be made for the _nb calls instead of allowing the user to omit it 
because BUPC needed that to free resources, and BUPC conceded that the sync 
call could be optional for _nbi calls.

Original comment by johnson....@gmail.com on 12 Aug 2012 at 5:20

GoogleCodeExporter commented 9 years ago
"  - Full equivalence between existing upc_mem* functions and their upc_mem*_nb 
counterparts plus a local sync. (I believe Yili agreed with this point when we 
worked on the consensus proposal.)"

See comment 65 - the Berkeley proposal provides this exact equivalence, the 
CRAY proposal does not (because it adds extra synchronization which is not 
present in the blocking version). 

"  - upc_barrier should continue to quiesce all shared accesses."

Programmers who care about this property can achieve it by placing their 
barriers outside transfer intervals. Comment #56 demonstrates why programmers 
need to be aware of the boundaries of their transfer intervals to write correct 
programs with this library, under either proposal.

In my opinion this unmoderated textual conversation has become a waste of time 
on both sides. Neither side is really listening to the other, and we're 
certainly not making progress. Nobody aside from the two arguing parties seems 
to be reading it. I'm done with this thread.

Original comment by danbonachea on 12 Aug 2012 at 5:41

GoogleCodeExporter commented 9 years ago
"See comment 65 - the Berkeley proposal provides this exact equivalence, the 
CRAY proposal does not (because it adds extra synchronization which is not 
present in the blocking version)."

I want equivalence for the nb call plus the lsync, not the nb call plus the 
gsync.  It's the gsync call that has the half-fence, so I think you've confused 
the two proposed sync calls.

Another example to think about before the call...

// assume THREADS == 2
shared int x[THREADS];
int delivered;
...
if ( MYTHREAD == 0 ) {
    x[0] = 1;

    /* BUPC: Starts an async transfer window that is terminated by a sync call.
     *
     * Cray: Starts an async transfer window that is terminated by a sync call
     * or by any fence.
     */
    upc_memput_[async/nb] with target x[1], source x[0], for sizeof( int ) bytes

    ...
    /* Computation or other communication. */
    ...

    /* BUPC: Has no effect on the async transfer.
     *
     * Cray: Forces completion of the async transfer.
     */
    upc_barrier;
    upc_barrier;
    upc_barrier;

    /* BUPC: Required to terminate the async transfer window and to free
     * resources.
     *
     * Cray: Concedes that the sync can be required in UPC 1.3 to give other
     * implementations an opportunity to free resources, but does not currently
     * require it because any fence will terminate the window and Cray has
     * sufficient resource tracking without it.
     */
    upc_waitsync / upc_gsync
}
else {
    upc_barrier;

    /* BUPC: This access is illegal, even though a barrier separates it from
     * Thread 0's x[0] = 1 assignment because Thread 0 has not yet sync'd a
     * copy that uses x[0] as a source buffer.
     *
     * Cray: This access is legal and has no effect on Thread 0's async copy
     * because the barrier already forced completion of the copy.
     */
    x[0] = 2;

    upc_barrier;

    /* BUPC: This local access is illegal because Thread 0 has not yet sync'd
     * a copy that uses x[1] as a destination buffer.  The value of x[1] may be
     * 0, 1, 2, or even something else.
     *
     * Cray: This local access is legal and will read that x[1] == 1.
     */
    delivered = ( x[1] > 0 );

    upc_barrier;

    if ( ( delivered == 0 ) && ( x[1] == 1 ) ) {
        /* BUPC: The claim is that this program is already illegal because it
         * contains illegal accesses so this test proves nothing.  Cray
         * considers the accesses to be legal, such that if this block executes,
         * it proves that the async read the x[0] value from one side of a
         * barrier but wrote it to x[1] on the other side of the barrier,
         * thereby detecting a relaxed access that crossed a barrier.  BUPC
         * claims this situation is possible but not legally observable and
         * therefore not a memory model violation.
         *
         * Cray: This block will never execute because the x[1] value was
         * delivered back at the first barrier.
         */
    }
}

Original comment by johnson....@gmail.com on 12 Aug 2012 at 8:34

GoogleCodeExporter commented 9 years ago
A related issue is the semantics for non-blocking data movement as they will be 
applied to an eventual non-blocking collectives proposal.  This is related to 
the current topic because it has been stated (by George in comment #4) that it 
is strongly desired that the "Collectives 2.0" proposal, when revised, use the 
same model/semantics for asynchronous data movement as the NB memcpy family (to 
the extent that can be done, given that one is point-to-point while the other 
is collective).

It is my position that the Berkeley semantics (destination(s) undefined during 
a transfer interval spanning from init to sync) are equally applicable to the 
specification of semantics for non-blocking collectives.  I don't have 
similar confidence that the Cray semantics would work well for collectives, but 
feel that to be fair Cray should make that judgement, not me.  So, here are two 
"points of concern" which come to mind regarding NB collectives.

1) Unlike the point-to-point NB memcopy interfaces, there is no distinguished 
initiator.  For instance, given a Gather operation: is the root thread the 
initiator of all data movement (Pull), or are all threads initiators of the 
pieces with local affinity (Push)?  This is important to define with the Cray 
semantics because it determines which strict operations would be required to 
complete the collective operation.  "All strict operations on all participating 
threads" is one possible answer that is completely unambiguous.  Another 
unambiguous option might be "strict operations fence all data movement to and 
from memory with affinity to the issuing thread".

2) Even in the presence of a clear (and agreeable to all parties) definition of 
which strict operations complete which portions (or the entirety) of a 
collective operation, I wonder if such a semantic is practical to implement 
given that most scalable implementations involve spanning trees and thus 
multiple intermediaries along the data movement path (again envisioning a 
Gather as an example).  Thus, a strict reference may require global 
communication/coordination even when issued by a single thread (thus NOT 
collectively).

Original comment by phhargr...@lbl.gov on 12 Aug 2012 at 9:09
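
Paul's point 1 can be made concrete with a trivial sketch. The names and the two styles below are purely illustrative (a hypothetical model, not any proposed API); the point is that under fence-completes semantics, the initiator of each piece of a gather determines whose strict operations would have to complete it:

```python
THREADS = 4
ROOT = 0

def gather_initiators(style):
    """Map each piece of a THREADS-way gather to the thread that
    initiates its data movement (hypothetical model only)."""
    if style == "pull":          # root reads every thread's contribution
        return {piece: ROOT for piece in range(THREADS)}
    if style == "push":          # each thread writes its own piece to the root
        return {piece: piece for piece in range(THREADS)}
    raise ValueError(style)

print(gather_initiators("pull"))  # {0: 0, 1: 0, 2: 0, 3: 0}
print(gather_initiators("push"))  # {0: 0, 1: 1, 2: 2, 3: 3}
```

Under "Pull", only the root's strict operations are in play; under "Push", every participating thread's strict operations would matter, which is exactly why the choice must be pinned down.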

GoogleCodeExporter commented 9 years ago
In comment #68, Troy wrote:
"I said that MPI is a red herring because Paul asked how our fences handled 
MPI_Send."

No, I did not.
No reference to MPI_Send appears in any comments I have authored.

Keeping this discussion civil is hard enough without people putting words in my 
mouth.  Perhaps it was an honest mistake, since Dan (who DID post a mixed 
UPC/MPI example) and I have similar viewpoints on most things, but that mistake 
just reinforced Dan's observation that "Neither side is barely listening to the 
other".

Original comment by phhargr...@lbl.gov on 12 Aug 2012 at 9:38

GoogleCodeExporter commented 9 years ago
"No, I did not.  No reference to MPI_Send appears in any comments I have 
authored."

Oops, yep, that was a Dan comment in #65.  My apologies.  72 comments is a lot 
to keep track of!

I'm sorry if you and Dan don't feel that I'm listening.  I've tried to respond 
to all comments as quickly as possible (it's the weekend, even...) because I 
had hopes that we could resolve more issues prior to the call.  This online 
discussion is a wonderful tool and has shed a lot of light on how different 
people interpret the UPC 1.2 spec as well as plans for changes.  The fact that 
a discussion occurred that did not cause any of the three of us to switch sides 
is unfortunate, but it happens, and believe it or not there are other issues on 
this site that I'm MUCH more interested in resolving than this one.  Hopefully 
there is enough info here now that others who will be on the call can come in 
knowing which design they support, or even come in with a new third idea.

Original comment by johnson....@gmail.com on 12 Aug 2012 at 11:02

GoogleCodeExporter commented 9 years ago
Dan wrote: "In my opinion this unmoderated textual conversation has become a 
waste of time on both sides. Neither side is barely listening to the other, and 
we're certainly not making progress. Nobody aside from the two arguing parties 
seems to be reading it. I'm done with this thread."

Lest my lack of interaction on this topic be interpreted as implicit 
agreement/disagreement or apathy -- I have followed the exchanges with interest 
and formed my own opinion.  Although the debate has been sometimes heated, I 
agree with Troy that many good points have been made and issues raised on both 
sides.

For the record, as I've stated before: I'm not generally in favor of adding any 
async. operations to UPC, because to me they feel too "MPI like" and because 
they introduce a degree of anachronism (by design) that make parallel programs 
written in UPC even more difficult to understand and analyze.  Separately, I 
questioned the suitability of some enhancements made in the NPB benchmark suite 
that used Berkeley's async. memory copy operations.  At the time, it was a 
fairly easy call that those non-standard vendor-only extensions should be 
excluded from the NPB suite given its role as a reference benchmark.  Even when 
an async. memory op. library is accepted as a required library, I'd probably 
question its use in the default implementations of the NPB benchmarks, but I'd 
have less ground to stand on.

That said, given that the decision has been made to add an async. memory op. 
library to the required UPC library, I support Berkeley's position that the 
transmission interval between an init and sync. call should be defined in a way 
that it is unaffected by strict accesses or upc_fence statements that occur 
within the transmission interval, and that, conversely, those strict accesses and 
upc_fence statements do not affect the semantics of the async. memory op. 
library calls.

My main reason for supporting Berkeley's suggested definition is based on the 
assumption that a programmer who uses these async. memory op. library calls is 
likely a more expert/advanced UPC programmer, and that this programmer has 
decided to use the async. memory operation calls to improve program 
performance.  To gain the most performance improvement that the async. mem. op. 
library has to offer, I think that the programmer will willingly accept the 
responsibility to ensure that the source data buffers remain untouched during 
the transfer interval and that the destination buffer is not valid until the 
relevant sync. call has returned.

I also support the addition of a signalling put capability, perhaps as a 
separate library.  I would in fact like to see it added as a required library 
in the 1.3 spec. if it has been determined to be both suitably well-defined and 
understood.

If the decision is made to maintain the "half fence" (a term coined in comment 
#24), and then Berkeley holds to their position of maintaining their bupc_* 
calls as a response, I would view this as a failing of the UPC language design 
and specification process.  In fact, as a UPC compiler and runtime implementer, 
my reaction is: "why bother?" when considering the implementation of the new 
required library.  Keeping separate bupc_ and cray_ forms of the library call 
seems superior to me, because at least it emphasizes the differences and 
doesn't offer the illusion of standardization.

Given that the current Cray definition of the async. mem. ops. is more 
restrictive than Berkeley's, there is no backwards compatibility issue for 
their users if the Berkeley semantics are accepted.  That is: programs written 
to the Cray definition of the library will continue to work as they did before. 
Thus it seems that the decision of imposing additional restrictions on the use of 
the async. mem. ops. reduces to a philosophical difference, as noted in one of 
the preceding comments.

In forming my opinion, I found the following points to be most compelling:

1. Comment #23: "One of my many criticisms of the current async proposal is 
that it cannot achieve this level of overlap on the important case of 
cache-coherent shared-memory systems, because the semantics require the 
insertion of "half-fences" and guarantees of local/global visibility, which are 
significantly more heavyweight and would inhibit the hardware-provided 
reordering optimizations.  As such, the simple case of upc_mem*_async();gsync() 
would be expected to perform SLOWER and provide LESS overlap than the existing 
analogous upc_mem* call for those systems. It makes no sense to introduce an 
"async" library whose semantics for memory overlap are often MORE restrictive 
than the existing synchronous counterparts."

2. comment #43: "Our goal with the Berkeley async transfer library was to 
enable far more aggressive overlap of communication with unrelated computation 
and other communication. We are trying to overlap the entire cost of a 
communication, and allow it to asynchronously continue undisturbed without 
interference from unrelated operations."

3. comment #47: "I think the Berkeley semantics are better described as an 
overhead expressway - it bypasses all the city traffic below and is unaffected 
by city traffic lights, because the laws of the road guarantee the cars below 
cannot even SEE the highway traffic, let alone interact with it. The on-ramps 
and off-ramps are clearly defined by the library calls which define where cars 
enter and exit the normal flow of relaxed operations on the city streets, but 
while they're "in-flight" on the expressway they operate completely 
independently of everything else."

4. comment #56: "This doesn't mean the memory model is "broken", it just means 
there is a new way to write programs with undefined behavior that the memory 
model does not govern. Programmers who embark upon using a new library feature 
are burdened with using it CORRECTLY, otherwise they get undefined behavior and 
commonly derived properties of the memory model may not hold true."

Although I appreciate that some users have indicated their preference for the 
"half-fence" semantics proposed by Cray, I wonder if they would still support 
that preference if they were told that the performance of the library call 
might be impacted by this restriction?

Original comment by gary.funck on 13 Aug 2012 at 5:14

GoogleCodeExporter commented 9 years ago
Hi Gary.  Thanks for offering another perspective.  I would like to point out 
that the async plus the gsync is not equivalent to the existing upc_mem* call 
because the existing call guarantees only local completion whereas the gsync is 
global completion (and the gsync is where the half fence is).  So by converting 
an existing upc_mem* call to an async plus an lsync, which is equivalent, 
there is no new performance hit.

Original comment by johnson....@gmail.com on 13 Aug 2012 at 5:52
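
The local- vs global-completion distinction Troy describes can be sketched as a toy two-stage model. The following Python is purely illustrative (the class and event names are invented, not the proposed UPC API): an lsync-style wait only guarantees the source buffer has been consumed, while a gsync-style wait guarantees the data is visible at the target.

```python
import threading
import time

class ToyAsyncPut:
    """Toy two-stage model of an asynchronous put (NOT the proposed API)."""
    def __init__(self, target, key, value):
        self.local_done = threading.Event()    # source buffer reusable
        self.global_done = threading.Event()   # data visible at the target
        threading.Thread(target=self._run, args=(target, key, value)).start()

    def _run(self, target, key, value):
        captured = value          # stage 1: snapshot the source buffer
        self.local_done.set()     # an lsync-style call could return here
        time.sleep(0.01)          # simulated network latency
        target[key] = captured    # stage 2: delivery at the target
        self.global_done.set()    # a gsync-style call must wait until here

shared = {}
put = ToyAsyncPut(shared, "x", 42)
put.local_done.wait()    # "lsync": matches the blocking call's local completion;
                         # delivery may still be in flight at this point
put.global_done.wait()   # "gsync": global completion (where the half-fence lives)
print(shared["x"])       # 42
```

In this model, converting a blocking upc_mem* call to an init plus an lsync-style wait preserves the original local-completion guarantee, which is the point about there being no new performance hit.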

GoogleCodeExporter commented 9 years ago
Correction: "introduce a degree of anachronism" should read "introduce a degree 
of asynchronism".  (My mis-use of the spell checker.)

Original comment by gary.funck on 13 Aug 2012 at 5:58

GoogleCodeExporter commented 9 years ago
In Comment 75, Troy wrote (in part): "So by converting an existing upc_mem* 
call to an async plus an lsync, which is equivalent, there is no new 
performance hit."

Understood.  Thanks for the clarification.

Original comment by gary.funck on 13 Aug 2012 at 6:00

GoogleCodeExporter commented 9 years ago
Gary Funck wrote (in part):
> If the decision is made to maintain the "half fence" (a term coined in 
comment #24), and then Berkeley
> holds to their position of maintaining their bupc_* calls as a response, I 
would view this as a failing of
> the UPC language design and specification process.  In fact, as a UPC 
compiler and runtime implementer,
> my reaction is: "why bother?" when considering the implementation of the new 
required library.

Gary,

Berkeley UPC has an obligation to backwards compatibility for our existing 
users' codes which demands that we retain these entry points at least for some 
reasonable deprecation period during which users can convert their codes to the 
standardized interfaces.  This would be true regardless of what semantics are 
chosen for the standard.  How close or far the standard's semantics are from 
those of our current interfaces will determine if there are two distinct 
implementations within our runtime, or just a layer of #defines.

I fully expect that Cray has a similar commitment to keeping their users' codes 
operating unchanged, at least in the near term.  I don't think any of us are 
retaining legacy interfaces for the purpose of undermining the standardization 
process.

> Although I appreciate that some users have indicated their preference for the 
"half-fence" semantics
> proposed by Cray, I wonder if they would still support that preference if 
they were told that the
> performance of the library call might be impacted by this restriction?

It is our hope that the signaling-put (which will have both blocking and async 
flavors) will prove applicable to the same usage case (producer-consumer 
notification) that the "half-fence" or "gsync" are intended to address.  
Quoting now from Steven in comment #40:
> The "half-fence" that we proposed on the global sync formally provides acquire semantics on relaxed
> accesses.  This is necessary to permit pairwise synchronization with a remote thread via relaxed
> operations to notify that thread that the non-blocking operation is complete.
While the signaling-put is intended to "package" the relaxed (blocking or 
non-blocking) operation and the pairwise synchronization into a single 
operation.

Original comment by phhargr...@lbl.gov on 13 Aug 2012 at 6:10

GoogleCodeExporter commented 9 years ago
An addition to my comment 74: If a consensus can't be reached where both Cray 
and Berkeley agree to deprecate their flavors of the async. mem. op libraries 
in favor of the new standard required library, I would move to defer inclusion 
of the async. mem. op. library in version 1.3 of the specification.

A separate editorial note: I understand the advantage of vendors offering new 
libraries prefixed with something that identifies the vendor, to avoid 
confusion that the new library appears to be part of the UPC standard, for 
example.  But from my perspective this has created a Tower of Babel where for 
example, we have separately named libraries for returning cpu timer ticks, or 
we have async. mem. op. libraries with differing semantics (and differing 
names).  What I would much prefer is that vendors conferred and then drafted a 
single proposed library that appears in the Proposed library section of the UPC 
specification.  It is this form of the library that users might experiment with 
so that ultimately the definition is finalized and moved to "required" or 
"optional".  Users who use the proposed library accept the responsibility to 
change their code as the library definition evolves before it is finalized.

Original comment by gary.funck on 13 Aug 2012 at 6:12

GoogleCodeExporter commented 9 years ago
Paul wrote (in part): "Berkeley UPC has an obligation to backwards 
compatibility for our existing users' codes which demands that we retain these 
entry points at least for some reasonable deprecation period during which users 
can convert their codes to the standardized interfaces.  This would be true 
regardless of what semantics are chosen for the standard.  How close or far the 
standard's semantics are from those of our current interfaces will determine if 
there are two distinct implementations within our runtime, or just a layer of 
#defines."

I understand that those vendor-specific interfaces will have to be preserved -- 
for a while.  The question is: when/if those old vendor-specific interfaces are 
deprecated and then removed.  My guess is that if they offer distinct 
capabilities separate from the new standard library, they will never be 
deprecated.  It is this potential for divergence that bothers me and in my view 
significantly lessens the value of a standard definition.

Original comment by gary.funck on 13 Aug 2012 at 6:20

GoogleCodeExporter commented 9 years ago
Gary wrote (in part):
> My guess is that if they offer distinct capabilities separate from the new 
standard library that they will
> never be deprecated.  It is this potential for divergence that bothers me and 
in my view significantly
> lessens the value of a standard definition.

There must be a balance struck between the individual implementer's freedom to 
"innovate" and the "good of the community".  I suspect that the economic 
realities faced by commercial vendors will make them less quick to remove the 
implementation-specific interfaces than the not-for-profits.  So, there may be 
no "right" answer.  However, that won't prevent me from putting forth a set of 
SUGGESTIONS:

+ As long as a new library is Optional there should be no "pressure" for 
removal of a vendor-specific (near) equivalent.  This is intended to encourage 
head-to-head comparison which will help form a community consensus as to 
whether the Optional library should be promoted to Required status or 
potentially dropped if a better solution is to be found by embracing an 
alternative in a future Spec revision.

+ Once a library has reached Required status, the first compiler release by a 
given vendor to claim to support the new spec version should document any 
"competing" vendor-specific library/feature as deprecated.

+ Once a library has reached Required status, a vendor should REMOVE 
"competing" extensions with their first complier to claim to support the NEXT 
spec, or after 2 calendar years (whichever is sooner).  Example: a compiler for 
UPC spec v1.4 drops vendor-specific libs which compete with Required libs 
introduced in the 1.3 spec.

Since there is no UPC "branding" there is no manner by which we can "enforce" 
these suggestions, but I'd like to think we can all agree to behave as members 
of a cooperative community.

Thoughts?

Original comment by phhargr...@lbl.gov on 13 Aug 2012 at 6:48

GoogleCodeExporter commented 9 years ago
I spent a lot of time thinking about this over the weekend (in between moving 
stuff into the new apartment, so I missed all the new comments...).  I did end 
up finding a significant use case for allowing non-blocking calls to continue 
beyond a fence that I could not come up with a good way of restructuring to 
avoid the fence.  In light of that, I sat down with Troy this morning and 
discussed what we could do to make this use-case work.  We still have some 
reservations about the Berkeley proposal, but I think Cray can get a lot closer 
to it now that we have a good concrete use-case to work with.  We haven't fully 
fleshed things out yet, but here are the major changes that we're considering at 
the moment:

1. Remove the *_gsync() and "half-fence" from the non-blocking proposal.  There 
would only be a single sync call mostly matching the semantics of the original 
Berkeley proposal.  As Dan pointed out, most of the desire for this call on 
Cray's part is subsumed by the signalling put proposal from Berkeley (Could 
this be added to the issue tracker?  Also, I'd also like to see a signalling 
get in there...).  I still think that the functionality provided by the 
"half-fence" could be generally useful in the future, so I'll make a separate 
proposal for language level support for the addition of less restrictive 
fences.  And yes, that proposal will include the proposed formal changes to the 
memory model.

2. Permit non-blocking operations to continue beyond a fence with some 
restrictions.

3. Only the calling/initiating thread would be explicitly restricted from 
accessing memory locations involved in the call until the sync.  I'm still 
working out the formal modifications to the memory model, but the gist of it is 
that the read and write operations involved in a non-blocking call occur at 
some unspecified time between the initiation and the sync, and are treated as 
relaxed (or local for the non-shared arguments to memget/memput) operations, 
thus they can be re-ordered with respect to other relaxed operations, including 
those occurring "after" the sync in the source.  However, all threads must 
agree on their order with respect to strict operations issued by the calling 
thread, and thus must also agree on the set of relaxed operations with which 
they can be re-ordered.

I think these changes would suffice to enable the functionality and performance 
that we're looking for in non-blocking routines, hopefully without 
significantly burdening implementations or adding undue complexity to memory 
access rules.  Importantly, this restricts the possible results of overlapping 
accesses to involved memory locations rather than making such programs illegal, 
which should make debugging and formal analysis of programs using the 
non-blocking operations simpler.

Programs using the existing Berkeley routines could switch to using the 
standard ones with no changes.  Programs using the existing Cray routines might 
require changes, though I think we could safely make this change as the 
routines were mostly used to avoid the overheads caused by the same-address 
restriction.

This is still a rough idea, so comments are welcome.

Original comment by sdvor...@cray.com on 13 Aug 2012 at 4:44

GoogleCodeExporter commented 9 years ago
To help clarify comment 82, consider the following short UPC program:

#include <upc_relaxed.h>
relaxed shared [2] int src[2*THREADS];
relaxed shared [2] int dst[2*THREADS];
strict shared int phase = 0;
int main()
{
  int i;
  upc_handle_t handle;
  if ( THREADS < 2 ) return -1;
  upc_forall(i=0;i<2*THREADS;++i;&src[i]) src[i] = dst[i] = MYTHREAD;
  upc_barrier;
  if ( MYTHREAD == 0 ) {
    phase  = 1;
    handle = upc_memcpy_nb( &dst[2], &src[0], 2 * sizeof(int) );
    phase  = 2;
    upc_sync( handle );
    phase  = 3;
  }
  else {
    int my_phase[2], my_dst[2];
    my_phase[0] = phase;
    my_dst[0]   = dst[2];
    my_dst[1]   = dst[3];
    my_phase[1] = phase;
  }
  return 0;
}

Obviously, if a thread's first read of phase (into my_phase[0]) observes the 
value '3', then that thread is guaranteed to observe the value '0' for both 
locations of dst, as the sync call occurs before thread 0's strict write of the 
value '3' into phase.  Likewise, if a thread's final read of phase (into 
my_phase[1]) observes the value '0', then that thread is guaranteed to observe 
the value '1' for both locations of dst, as the initiation call occurs after 
thread 0's strict write of the value '1' into phase.

If my_phase[0] reads any of '0', '1', or '2' from phase and the value read into 
my_phase[1] doesn't prevent it, or my_phase[1] reads any of '1', '2', or '3' 
and the value read into my_phase[0] doesn't prevent it, then the values read 
into my_dst[0] and my_dst[1] could be either '0' or '1' (or something else 
entirely, as noted in UPC 1.2 B.3.2.1 paragraph 2).  This is due to the relaxed 
operations associated with the upc_memcpy_nb() occurring at an unspecified time 
between the initiation and the sync, and thus the values aren't guaranteed 
except outside of that region.

However, the values must be consistent across threads.  Therefore, the 
following behavior is disallowed:

Thread 1 observes
my_dst[0]   is 0
my_phase[1] is 1

Thread 2 observes
my_phase[0] is 2
my_dst[0]   is 1

Because all threads must observe the operations occur in the same order with 
respect to strict operations on the calling thread, it is not permitted for 
thread 2 to observe the original value of dst[2] after observing the value '2' 
for phase if thread 1 has observed the updated value of dst[2] before observing 
the strict write of the value '2' into phase.

Formally, thread 1 and thread 2's orderings include (with the first argument 
representing the thread issuing the operation)

T1: RW(T1,dst[2],1) SW(T0,phase,1) RW(T0,dst[2],0) RR(T1,dst[2],0) 
SR(T1,phase,1) SW(T0,phase,2)
T2: RW(T1,dst[2],1) SW(T0,phase,1) SW(T0,phase,2) SR(T2,phase,2) 
RR(T2,dst[2],1) RW(T0,dst[2],0)

Since thread 1's and thread 2's orderings disagree about when RW(T0,dst[2],0) 
occurs relative to SW(T0,phase,2), the orderings are incompatible, and thus 
disallowed.

The following similar behavior is allowed though:

Thread 1
my_dst[0]   is 0
my_phase[1] is 2

Thread 2 observes
my_phase[0] is 2
my_dst[0]   is 1

There are now at least two potential orderings for thread 1, one matching the 
scenario above, where the write to phase happens after the write to dst,

T1: RW(T1,dst[2],1) SW(T0,phase,1) RW(T0,dst[2],0) RR(T1,dst[2],0) 
SW(T0,phase,2) SR(T1,phase,2)

and one where the write to phase happens before the write to dst

T1: RW(T1,dst[2],1) SW(T0,phase,1) SW(T0,phase,2) RW(T0,dst[2],0) 
RR(T1,dst[2],0) SR(T1,phase,2)

Notice that both generate the observed behavior, as thread 1 doesn't have 
enough information to show where RW(T0,dst[2],0) occurred relative to 
SW(T0,phase,2).  Thread 2 still has

T2: RW(T1,dst[2],1) SW(T0,phase,1) SW(T0,phase,2) SR(T2,phase,2) 
RR(T2,dst[2],1) RW(T0,dst[2],0)

which is compatible with the second option for thread 1's ordering.  Thus, 
there is a valid ordering on all threads and the behavior is therefore allowed.

Original comment by sdvor...@cray.com on 13 Aug 2012 at 7:59
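
The compatibility argument above can be checked mechanically. Below is a small Python sketch (the function name and the simplified pairwise-agreement rule are my own; the real memory-model condition is phrased in terms of strict operations by the calling thread, but for these particular orderings, pairwise agreement on the operations both threads observe captures it):

```python
from itertools import combinations

def agree_on_common_ops(order_a, order_b):
    """True iff every operation appearing in both orderings occurs in the
    same relative order in each (a simplified compatibility check)."""
    pos_a = {op: i for i, op in enumerate(order_a)}
    pos_b = {op: i for i, op in enumerate(order_b)}
    common = pos_a.keys() & pos_b.keys()
    return all((pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y])
               for x, y in combinations(common, 2))

# Labels follow the comment's notation: kind(issuer, location, value).
t1_bad  = ["RW(T1,dst[2],1)", "SW(T0,phase,1)", "RW(T0,dst[2],0)",
           "RR(T1,dst[2],0)", "SR(T1,phase,1)", "SW(T0,phase,2)"]
t1_good = ["RW(T1,dst[2],1)", "SW(T0,phase,1)", "SW(T0,phase,2)",
           "RW(T0,dst[2],0)", "RR(T1,dst[2],0)", "SR(T1,phase,2)"]
t2      = ["RW(T1,dst[2],1)", "SW(T0,phase,1)", "SW(T0,phase,2)",
           "SR(T2,phase,2)", "RR(T2,dst[2],1)", "RW(T0,dst[2],0)"]

print(agree_on_common_ops(t1_bad, t2))   # False: disagree on RW(T0,dst[2],0) vs SW(T0,phase,2)
print(agree_on_common_ops(t1_good, t2))  # True: this T1 ordering is compatible with T2
```

The disallowed scenario fails because the two threads place RW(T0,dst[2],0) on opposite sides of SW(T0,phase,2); the allowed scenario passes because a compatible ordering exists for thread 1.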

GoogleCodeExporter commented 9 years ago
"1. Remove the *_gsync() and "half-fence" from the non-blocking proposal.  
There would only be a single sync call mostly matching the semantics of the 
original Berkeley proposal."

I'm glad to hear we may be making some progress here.

"As Dan pointed out, most of the desire for this call on Cray's part is 
subsumed by the signalling put proposal from Berkeley (Could this be added to 
the issue tracker?  Also, I'd also like to see a signalling get in there...)."

See issue 80: http://code.google.com/p/upc-specification/issues/detail?id=80

"However, all threads must agree on their order with respect to strict 
operations issued by the calling thread,"

I still strongly believe it's a mistake to go down that road. It's simpler and 
less confusing to state the contents of the destination buffer are undefined 
for the entire transfer interval (until after the sync), which also gives 
implementations more flexibility wrt in-flight operations. I don't see a 
motivation for accommodating programs that want to touch the transfer buffers 
(from ANY thread) before the explicit sync.

Regarding your example in comment 83:

If non-zero threads read my_phase[1]==0, then the reads of dest are guaranteed 
to have seen the initial values. We agree there.

However in all other cases, the Berkeley semantics state that if non-zero 
threads read anything except '3' into my_phase[0], then the subsequent reads of 
dst are reading undefined values (because the program failed to ensure they 
occurred after the successful upc_sync). The contents of the destination buffer 
remain undefined for ALL threads until after the upc_sync, which in practice 
means other threads cannot safely read the contents until the initiating thread 
has used a strict operation or other synchronization method to signal that the 
upc_sync has successfully completed.

Furthermore, "undefined" should not be read as "the prior or subsequent value", 
or "something I can explain via the memory model", it really is COMPLETELY 
undefined (ie garbage). In particular, there is no guarantee that two different 
threads should read the same undefined values, or that anything read should 
look anything like the original or final values. Once a program has strayed 
into reading undefined values, the memory model provides no guarantees 
whatsoever, and there is no need for the undefined values to be "explainable" 
in terms of the memory model. As a consequence of this, VALID programs can 
never even observe "where" in the transfer interval the "unspecified" data 
transfer operations occurred, let alone agree on where they occurred. 

This philosophy gives implementations the maximal freedom within the transfer 
interval to perform the transfer via the most efficient means possible, with 
guaranteed freedom from conflicting accesses by perverse programs. We want the 
destination buffer to remain completely "off-limits" until the explicit sync 
succeeds. This flexibility enables a number of optimizations that would not 
otherwise be possible. We expect anything less than this to have a measurable 
performance impact on the common case for platforms of interest, and therefore 
to be unacceptable.

"restricts the possible results of overlapping accesses to involved memory 
locations rather than making such programs illegal, which should make debugging 
and formal analysis of programs using the non-blocking operations simpler."

This is a valid consideration, but in my opinion should be completely secondary 
to any performance-related concerns. The advanced user elected to use this 
advanced interface solely to gain performance - that's the entire motivation 
for the added complexity of this interface. That being said, there are various 
potential approaches to debugging the use of this library - the simplest being 
to replace non-blocking calls back to their blocking equivalents and see what 
changes. Another approach, taken by dynamic race condition detectors, would be 
to explicitly write "trap" values into the destination buffers during the 
transfer interval and signal when threads touch them.

Original comment by danbonachea on 16 Aug 2012 at 12:57
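
The second debugging approach mentioned above (trap values plus access checking) can be sketched in a few lines of Python. This is a hypothetical debug wrapper, not part of any proposal; the trap value and class name are invented:

```python
TRAP = 0xDEADBEEF  # hypothetical poison value written during the transfer interval

class TrapBuffer:
    """Debug-mode destination buffer: reads inside a transfer interval
    are flagged as races instead of silently returning undefined data."""
    def __init__(self, size):
        self.data = [0] * size
        self.trapped = set()       # indices currently inside a transfer interval

    def begin_transfer(self, lo, hi):
        for i in range(lo, hi):    # poison the destination at initiation
            self.trapped.add(i)
            self.data[i] = TRAP

    def end_transfer(self, lo, hi, payload):
        for i in range(lo, hi):    # deliver the real data at the sync
            self.data[i] = payload[i - lo]
            self.trapped.discard(i)

    def read(self, i):
        if i in self.trapped:
            raise RuntimeError(f"race: index {i} read during a transfer interval")
        return self.data[i]

buf = TrapBuffer(4)
buf.begin_transfer(0, 2)          # transfer interval opens (init call)
try:
    buf.read(0)                   # touching the destination too early...
except RuntimeError as e:
    print("detected:", e)         # ...is caught instead of returning garbage
buf.end_transfer(0, 2, [7, 8])    # transfer interval closes (sync call)
print(buf.read(0))                # 7: safe to read after the sync
```

A real race detector would also have to intercept remote threads' accesses, but the principle is the same: accesses inside the transfer interval are errors to be reported, not values to be explained.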

GoogleCodeExporter commented 9 years ago
"However in all other cases, the Berkeley semantics state that if non-zero 
threads read anything except '3' into my_phase[0], then the subsequent reads of 
dst are reading undefined values (because the program failed to ensure they 
occurred after the successful upc_sync). The contents of the destination buffer 
remain undefined for ALL threads until after the upc_sync, which in practice 
means other threads cannot safely read the contents until the initiating thread 
has used a strict operation or other synchronization method to signal that the 
upc_sync has successfully completed."

This implies that the non-blocking forms, if synced before any fences, are NOT 
equivalent to the "blocking" forms.  Threads can touch memory involved in 
upc_mem{cpy|get|put}() during the operation, and there are well-defined, though 
very loose, rules regarding the results.  This should also be true of the 
non-blocking variants.

"In particular, there is no guarantee that two different threads should read 
the same undefined values, or that anything read should look anything like the 
original or final values."

The memory model already explicitly permits this: 'Furthermore, because the 
shared writes implied by the library call have unspecified size, thread 1 may 
even read intermediate values into local_z0 and local_z1 which correspond to 
neither the initial nor the final values for those shared objects.'

"This philosophy gives implementations the maximal freedom within the transfer 
interval to perform the transfer via the most efficient means possible, with 
guaranteed freedom from conflicting accesses by perverse programs. We want the 
destination buffer to remain completely "off-limits" until the explicit sync 
succeeds. This flexibility enables a number of optimizations that would not 
otherwise be possible. We expect anything less than this to have a measurable 
performance impact on the common case for platforms of interest, and therefore 
unacceptable."

As defined, it permits using a (non-UPC) thread to do the transfer 
asynchronously.  It permits use of RDMA hardware.  It permits memory reordering 
on SMPs.  It permits caching/aggregation, assuming flushing occurs on fences, 
as is required for regular relaxed accesses.  What non-blocking optimizations 
does it prohibit?

"That being said, there are various potential approaches to debugging the use 
of this library - the simplest being to replace non-blocking calls back to 
their blocking equivalents and see what changes. Another approach, taken by 
dynamic race condition detectors, would be to explicitly write "trap" values 
into the destination buffers during the transfer interval and signal when 
threads touch them."

Neither of which can conclusively answer the question "Is my program a valid 
UPC program?" in all cases.  No other memory operation in UPC places USER 
restrictions on other threads' memory operations.  There are restrictions 
placed on the implementation, but never the user.  The code may not behave the 
way the user expects or intends, but it is still a valid UPC program.  
Prohibiting the calling thread from accessing the memory is ok, as it is 
relatively easy to prove that the user's code is valid or invalid with that 
restriction.

Original comment by sdvor...@cray.com on 16 Aug 2012 at 3:23

GoogleCodeExporter commented 9 years ago
"This implies that the non-blocking forms, if synced before any fences, are NOT 
equivalent to the "blocking" forms.  Threads can touch memory involved in 
upc_mem{cpy|get|put}() during the operation, and there are well-defined, though 
very loose, rules regarding the results.  This should also be true of the 
non-blocking variants."

I disagree. B.3.2.1 specifies that blocking upc_mem* updates the destination 
buffer with "relaxed shared writes of unspecified size and ordering", which 
implies that threads concurrently reading the destination buffer without 
synchronization are effectively reading garbage (it may not look anything like 
the initial or final values). This is exactly the same as a Berkeley 
upc_sync(memput_async()), ie an empty transfer interval. In both cases the 
destination values read by other unsynchronized threads are completely 
indeterminate. Equivalence of undefined values is meaningless.
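Put differently, under the Berkeley semantics an immediately-synced non-blocking put (sketched here with the proposed upc_memput_nb/upc_sync names) has an empty transfer interval, so unsynchronized readers of the destination are in exactly the same position for both forms:

```c
/* Both calls below expose racing readers of dst to the same "relaxed
   shared writes of unspecified size and ordering" (B.3.2.1). */
upc_memput(dst, src, n);                /* blocking */
upc_sync(upc_memput_nb(dst, src, n));   /* non-blocking, empty interval */
```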

"As defined, it permits using a (non-UPC) thread to do the transfer 
asynchronously.  It permits use of RDMA hardware.  It permits memory reordering 
on SMPs.  It permits caching/aggregation, assuming flushing occurs on fences, 
as is required for regular relaxed accesses.  What non-blocking optimizations 
does it prohibit?"

It suffers from the same performance pitfall as the original proposal - namely 
that random strict operations issued by the caller for unrelated computations 
during the transfer interval globally destroy the overlap potential. This is 
especially a problem in multi-module/multi-author/multi-domain applications 
where one module wants to initiate some communication in preparation for the 
next time step, then call a different, independent module to compute on its own 
independent data in the overlap interval. If the programmer has explicitly 
asserted a given transfer can proceed independently of everything until the 
explicit sync and has promised not to touch the transfer buffers until then, we 
should not stall on the first random strict operation in the transfer interval 
to await completion (thereby silently defeating the programmer's attempt at 
overlap). 

This pitfall creates the possibility of "performance bugs" whenever the 
programmer calls a function they did not write during the transfer interval, if 
that function happens to include a strict operation. The user has taken pains 
to arrange an overlapped transfer and everything in his code is kosher, but the 
unrelated library call is stalling to complete his overlapped transfer for no 
reason. This is a "global" kind of bug that crosses all module boundaries, and 
may be impossible for the programmer to diagnose in the case of separate 
compilation. This seems far more surprising and problematic to me than programs 
which "break the contract" of the non-blocking transfer interval and issue 
unsynchronized reads to the destination that result in undefined values; that 
is usually a "local" kind of bug because escape analysis can at least tell you 
which modules may possibly be touching the data buffers, and in common cases it 
should only be the current module.
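A sketch of the modularity pitfall described above, using the proposed upc_memput_nb/upc_sync names; other_module_compute and the buffer names are hypothetical placeholders for separately-compiled code:

```c
/* Initiate communication for the next time step. The programmer has
   promised not to touch src or dst until the explicit sync. */
upc_handle_t h = upc_memput_nb(dst, src, nbytes);

/* Call an independent, separately-compiled module to compute on its
   own data during the overlap interval. The caller cannot know that
   this function happens to contain a strict operation internally. */
other_module_compute();

/* Under strict-terminates-interval semantics, the hidden strict op
   above silently stalls to complete the transfer, defeating the
   overlap. Under the Berkeley semantics, overlap extends to here: */
upc_sync(h);
```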

"Neither of which can conclusively answer the question "Is my program a valid 
UPC program?" in all cases."

UPC is not a type-safe language. That property is impossible to prove for most 
interesting programs, even without this library. That's not even a goal of 
C-based languages. Programmers who want/need that type of guarantee should be 
using a type-safe, high-level language. Otherwise, they need to fully 
understand the semantics of their language/library and use their brains to 
avoid straying into undefined behavior.

"No other memory operation in UPC places USER restrictions on other threads' 
memory operations.  There are restrictions placed on the implementation, but 
never the user.  The code may not behave the way the user expects or intends, 
but it is still a valid UPC program.  Prohibiting the calling thread from 
accessing the memory is ok, as it is relatively easy to prove that the user's 
code is valid or invalid with that restriction."

Disagree again. Even in plain 1.2, any program which contains data races can 
result in reading indeterminate values. Every read-after-write data race 
between threads (ie without intervening synchronization) results in 
indeterminate values being read. 

It's important to note the bulk of the formal memory model is concerned with 
individual reads and writes of MATCHING scalar size, where word-tearing is 
implicitly assumed to not occur (this is a weakness of the current model which 
remains to be solved - issue 61). Under those conditions and assumptions the 
"indeterminate" values returned by those racing reads are guaranteed to return 
either the initial or final value. All of the memory model examples in the 
appendix work under that regime. However once you throw byte-oriented bulk 
transfers into a race, that small assurance also goes out the window and racing 
reads may observe values which are neither the initial nor final value, and are 
just plain undefined. This is why B.3.2.1 specifies that byte-oriented library 
calls make relaxed read/writes of "unspecified SIZE and order", to make it 
clear that conflicting reads (of even blocking upc_mem* calls) will get 
undefined values (this should perhaps be made more explicit). This is ugly but 
it reflects the reality of implementation - it may be possible (in most cases) 
to prevent word-tearing for scalar read/writes, but it's unavoidable for 
untyped bulk transfers.

It's impossible in general to prove a lack of data races, even using runtime 
techniques. This is just another form of data race - the calling thread is 
performing writes of undefined size and order throughout the entire transfer 
interval, and if another thread reads the destination during that interval (ie 
without synchronization protecting it from doing so) then it gets undefined 
garbage values. Even if we adopted the Cray proposal and forced strict 
operations to terminate the transfer interval, you can still easily construct 
data races where unsynchronized threads read garbage from the destination 
before the caller issues a strict op. This potential for data races is a 
fundamental property of UPC 1.2 (and most shared-memory programming), and 
something the programmer must already keenly avoid. The only difference with 
the new library is the destination buffer contains undefined values for the 
several lines of code until the sync, rather than the duration of the single 
line of code of the blocking upc_mem*.

I think we're just avoiding the core difference here, which is when the 
transfer interval ends. Arguments about races and consistency apply equally to 
conflicting unsynchronized accesses in the transfer interval regardless of 
where it ends. Berkeley wants the transfer interval to terminate only at the 
explicit sync. Cray wants it to additionally terminate at any earlier random 
strict operation issued by the calling thread during the transfer interval. 
Performance and modularity concerns both argue for the Berkeley version. The 
Cray version seems to primarily be motivated by perverse programs that issue an 
explicitly async transfer and then want to read guaranteed results before it 
has been explicitly synced. I have yet to see a realistic example of a program 
that needs this additional guarantee, that cannot be rewritten to explicitly 
sync the transfer before the synchronizing strict op (or use a signalling 
memput instead, for that usage case). The secondary Cray motivation seems to be 
programs that "accidentally" violate the contract of the library interface and 
want stronger semantics to aid in debugging, but Berkeley is not willing to 
sacrifice the performance of correctly-synchronized production codes to 
accommodate the behavior of insufficiently-synchronized buggy programs.

Original comment by danbonachea on 16 Aug 2012 at 10:29

GoogleCodeExporter commented 9 years ago
"It suffers from the same performance pitfall as the original proposal - namely 
that random strict operations issued by the caller for unrelated computations 
during the transfer interval globally destroy the overlap potential."

I think you're misunderstanding what I was proposing, which is probably my 
fault ;).  I'm proposing that the relaxed writes involved in a non-blocking 
transfer may occur at any time between the initialization and sync, but that 
all threads must agree on when any given individual relaxed write occurs 
relative to fences on the calling thread.  The writes are still completely 
independent and of unspecified size, so all this really says is that if you 
observe the final result at a particular location before a fence, then observe 
the fence, then any relaxed writes you issue TO THAT LOCATION after observing 
the fence are guaranteed to be "correctly" ordered on 
all other threads, even if the sync hasn't actually occurred yet.  It doesn't 
tell you anything about any other memory locations involved, as they are 
independent of one another. 

"Disagree again. Even in plain 1.2, any program which contains data races can 
result in reading indeterminate values. Every read-after-write data race 
between threads (ie without intervening synchronization) results in 
indeterminate values being read."

UPC still permits valid programs to contain data races though.  Our point of 
contention was that we understood your proposal to say that programs with data 
races on memory involved in non-blocking transfers are not valid UPC programs.  
We have no problems with the results of the race being indeterminate, but we do 
not want to say that such programs are no longer UPC programs merely because of 
the race condition.

"Even if we adopted the Cray proposal and forced strict operations to terminate 
the transfer interval"

This is completely misunderstanding what I was proposing.  All I intended was 
that once a thread (other than the calling thread) observes the final result at 
a particular memory location AND a fence on the calling thread that provably 
occurred after it, then all threads are guaranteed to observe the final result at 
that memory location IF they have also observed that fence.

Original comment by sdvor...@cray.com on 17 Aug 2012 at 3:38

GoogleCodeExporter commented 9 years ago
"I'm proposing that the relaxed writes involved in a non-blocking transfer may 
occur at any time between the initialization and sync, but that all threads 
must agree on when any given individual relaxed write occurs relative to fences 
on the calling thread.... if you observe the final result at a particular 
location before a fence, then observe the fence, you are guaranteed that that 
any relaxed writes you issue TO THAT LOCATION after observing the fence are 
guaranteed to be "correctly" ordered on all other threads"

There are several implicit assumptions in your statements above that I think 
are problematic:

1. It assumes the initial and final values of the given memory location differ, 
so there's a change to notice at all. I think you'd agree that when they are 
the same value, no conclusion can be drawn.

2. It assumes lack of word-tearing, which can easily occur between the 
byte-oriented library writes and any reads larger than a byte. Consider this 
case of a 4-byte memory "location" that appears somewhere in the middle of the 
destination buffer:
          Byte address:   0    1    2   3
   initial value (hex):   00   00   00  00
   final value   (hex):   00   77   00  00
If a conflicting 4-byte read from a remote thread gets back "0x00770000", that 
LOOKS like the final value has been written, but the library may in fact still 
be writing the last two bytes (as a separate "relaxed write of unspecified 
size"). If the thread mistakenly assumes the transfer is "done" and issues a 
concurrent write to the word, then the value it writes may later get clobbered 
when the library writes the second half of the bytes. 

3. It implicitly assumes that any given byte in the destination buffer is only 
written once, and that there aren't multiple "relaxed writes of unspecified 
size and order" that affect the same byte.

4. Finally, it assumes the destination bytes are only ever written with the 
"new" data and never temporarily hold "something else entirely"

I see no reason to restrict implementations to require 3 and 4, simply to 
accommodate "perverse" programs that want to try and infer transfer completion 
using data races, rather than using the explicit sync call and proper, 
guaranteed-correct synchronization. We have a real example of at least one 
network that can accelerate transfers by using the destination buffer as 
"scratch space" for storing temporary metadata while the transfer is in-flight. 

Looking forward to the non-blocking collectives, it will likely be standard 
practice to utilize destination buffers as scratch space during the transfer 
interval, and programs attempting the type of inference you describe by peeking 
at the destination will do so at their own peril. As mentioned in an earlier 
comment, we should at least attempt to devise a general semantic philosophy 
that will also make sense for the closely-related non-blocking collectives.

For all these reasons I continue to argue that the contents of the destination 
buffer should remain fully and truly undefined until after the successful 
explicit upc_sync (and for other threads, until they've used strict operations 
or other provided synchronization features to guarantee upc_sync has returned 
on the caller). The same goes for the restriction that the source buffer remain 
unmodified until the transfer is upc_sync'd. Bulk synchronous programs will 
probably accomplish such synchronization trivially at the next barrier. 
Programs that need more fine-grained point-to-point data delivery with 
synchronization should be using the signalling put library. We should stop 
worrying about the behavior of programs that read and act upon undefined values.

"We have no problems with the results of the race being indeterminate, but we 
do not want to say that such programs are no longer UPC programs merely because 
of the race condition."

Programs that read undefined values and then act upon them in any non-trivial 
way will exhibit undefined behavior (garbage in, garbage out). When we're 
dealing with untyped bulk library transfers racing with conflicting 
unsynchronized scalar accesses, the indeterminacy goes way beyond 
"initial-value or final-value", the contents of the destination buffer are 
completely undefined.

For those who like analogies, my best "simple analogy" is a program which does 
something like:

char *garbage = malloc(20);
printf("%s",garbage);

The contents of that uninitialized memory are technically "indeterminate" (ISO 
C99 7.20.3.3), but many/most interesting actions consuming those values will 
lead to undefined behavior (in this case, anything from zero output to a 
segmentation fault).

Original comment by danbonachea on 17 Aug 2012 at 6:15

GoogleCodeExporter commented 9 years ago
You're still misunderstanding my intention.  When I said observed the final 
value, I meant the final value, not some intermediate value that happens to be 
the same.  More formally, all I'm proposing is that the set StrictOnThreads(M), 
which enforces the ordering of relaxed operations issued by a thread relative 
to strict operations on that thread, must be globally consistent and complete, 
just like it currently is, rather than allowing threads to define it 
differently or excluding some pairs from it.  Thus, from the perspective of the 
memory model, for each relaxed operation involved in the transfer Rop_nb, for 
every strict operation Sop_ct on the calling thread between the initiation and 
the sync, either (Rop_nb, Sop_ct) is a member of StrictOnThreads(M) OR (Sop_ct, 
Rop_nb) is a member of StrictOnThreads(M) and it is undefined which is (as long 
as it is consistent with the other pairs in StrictOnThreads(M)).

"Programs that read undefined values and then act upon them in any non-trivial 
way will exhibit undefined behavior (garbage in, garbage out). When we're 
dealing with untyped bulk library transfers racing with conflicting 
unsynchronized scalar accesses, the indeterminacy goes way beyond 
"initial-value or final-value", the contents of the destination buffer are 
completely undefined."

Yes, we agree on this completely.  All we're asking is that it not be ERRONEOUS 
for UPC programs to access the memory before the sync as you said in comment 
30, but rather that it results in undefined values read/written (note, not 
completely undefined behavior).  If those undefined values are used later in 
the program, as in your printf example, it may or may not lead to undefined 
behavior, depending on how they are later used.

Original comment by sdvor...@cray.com on 17 Aug 2012 at 1:41

GoogleCodeExporter commented 9 years ago
"from the perspective of the memory model, for each relaxed operation involved 
in the transfer Rop_nb, for every strict operation Sop_ct on the calling thread 
between the initiation and the sync, either (Rop_nb, Sop_ct) is a member of 
StrictOnThreads(M) OR (Sop_ct, Rop_nb) is a member of StrictOnThreads(M) and it 
is undefined which is (as long as it is consistent with the other pairs in 
StrictOnThreads(M))."

I agree with the statement above. 

However, because the relaxed operations that comprise Rop_nb are of unspecified 
size (leading to word tearing), can occur anywhere in the transfer interval, 
might overlap or conflict with each other (writing the same location multiple 
times), or even temporarily write a completely unrelated value to the 
destination, there is effectively no useful property that can be inferred by 
reading the undefined values in the destination buffer during the transfer 
interval. 

I want to avoid giving programmers the impression they can accomplish anything 
useful by reading the undefined values during the transfer interval. I'll admit 
my language in comment 30 is a bit too strong - the implementation shouldn't 
crash the program for simply issuing these "sketchy" reads (or otherwise 
invalidate the unrelated parts of the program), but the observed values are 
effectively meaningless. That being said, I'd like to specify that conflicting 
writes in the transfer interval that overwrite the destination buffer lead to 
"undefined behavior" - not only "undefined results" in the destination, because 
implementations that use it as scratch space might easily crash or do other 
nasty things if their metadata is suddenly overwritten by random code while 
the library "owns" the buffer.

The only point in specifying the transfer in terms of relaxed ops is to provide 
well-defined semantics to code OUTSIDE the transfer interval; so the calling 
thread is guaranteed to observe updated values via relaxed reads immediately 
after returning from upc_sync, and if it signals a second thread using a strict 
write issued AFTER the upc_sync, that thread can observe the fence using a 
strict read and subsequently also see guaranteed updated values in the 
destination via relaxed reads. Converse statements can also be made wrt relaxed 
writes of the source before the initiation.
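That handoff can be sketched as follows, using the proposed upc_memput_nb/upc_sync names; done is an assumed strict flag dedicated to this signalling:

```c
strict shared int done = 0;   /* assumed signalling flag */

/* Initiating thread: */
upc_handle_t h = upc_memput_nb(dst, src, nbytes);
/* ... overlapped work that touches neither src nor dst ... */
upc_sync(h);                  /* transfer interval ends here */
done = 1;                     /* strict write issued AFTER the sync */

/* A second thread: */
while (!done) ;               /* strict reads observe the fence */
/* relaxed reads of dst are now guaranteed to see the new values */
```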

Original comment by danbonachea on 17 Aug 2012 at 3:08

GoogleCodeExporter commented 9 years ago
"However, because the relaxed operations that comprise Rop_nb are of 
unspecified size (leading to word tearing), can occur anywhere in the transfer 
interval, might overlap or conflict with each other (writing the same location 
multiple times), or even temporarily write a completely unrelated value to the 
destination, there is effectively no useful property that can be inferred by 
reading the undefined values in the destination buffer during the transfer 
interval."

Exactly.  All I was trying to get at in comment 82 was that StrictOnThreads(M) 
should still be globally consistent and complete (and that we'll probably have 
to patch up the definition of Precedes() to work this in).  I admittedly wasn't 
very clear about that, but at the time, I was still going through the memory 
model trying to figure out exactly how to say what I meant.  B.3.2.1 would 
still prevent the programmer from making use of this fact in any meaningful 
way, since as you pointed out in comment 88, they could observe an intermediate 
value that happens to match the final value.

Original comment by sdvor...@cray.com on 17 Aug 2012 at 3:24

GoogleCodeExporter commented 9 years ago
"That being said, I'd like to specify that conflicting writes in the transfer 
interval that overwrite the destination buffer lead to "undefined behavior" - 
not only "undefined results" in the destination, because implementations who 
use it as scratch space might easily crash or do other nasty things if their 
meta data is suddenly overwritten by random code while the library "owns" the 
buffer."

Consider the following program:

#include <upc.h>
#define ELEMS_PER_THREAD 8 /* example value */
relaxed shared int array[ELEMS_PER_THREAD*THREADS];
strict shared int phase = 0;

int main()
{
    upc_forall( int i=0; i<ELEMS_PER_THREAD*THREADS; i++ ; &array[i] ) array[i] = MYTHREAD;
    upc_barrier;
    upc_memcpy( &array[1], &array[MYTHREAD], ELEMS_PER_THREAD * sizeof(int) );
    upc_barrier;
    return 0;
}

This is a legal UPC program, even if it is completely useless, and has no 
undefined behavior.  However, the final values of the elements of array with 
affinity to thread 1 are undefined because the relaxed operations involved in 
the upc_memcpy() are of indeterminate size and could be arbitrarily reordered.

If I change the upc_memcpy() to a non-blocking form with an immediate sync:

#include <upc.h>
#define ELEMS_PER_THREAD 8 /* example value */
relaxed shared int array[ELEMS_PER_THREAD*THREADS];
strict shared int phase = 0;

int main()
{
    upc_forall( int i=0; i<ELEMS_PER_THREAD*THREADS; i++ ; &array[i] ) array[i] = MYTHREAD;
    upc_barrier;
    upc_handle_t handle = upc_memcpy_nb( &array[1], &array[MYTHREAD], ELEMS_PER_THREAD * sizeof(int) );
    upc_sync( handle );
    upc_barrier;
    return 0;
}

It should be equivalent to the first program.  With your undefined behavior 
clause though, this program is not equivalent to the first, because it now has 
undefined behavior and could crash or initiate global thermonuclear war(*).  
That is unacceptable.

* http://www.imdb.com/title/tt0086567/

Original comment by sdvor...@cray.com on 17 Aug 2012 at 5:37

GoogleCodeExporter commented 9 years ago
"It should be equivalent to the first program.  With your undefined behavior 
clause though, this program is not equivalent to the first, because it now has 
undefined behavior and could crash"

I agree this would be an additional small usage restriction, and figured you 
probably wouldn't agree (which is why I prefaced it with "I'd LIKE to 
specify"). If we were to add this restriction on conflicting, unsynchronized 
writes to the destination it would make sense to also add a similar restriction 
to the blocking variants for the same reason, but I'm guessing this will be a 
hard sell.

This corner case unfortunately probably prohibits implementations that use the 
destination buffer for scratch space, unless they are Very Careful to prevent 
corrupted metadata from spreading garbage beyond the destination buffer. But 
this is not game-breaking for us, I think we can agree to work without that 
restriction. The main reason I brought it up was to expose another way in which 
the destination buffer contents are completely undefined during the transfer 
interval.

Original comment by danbonachea on 17 Aug 2012 at 5:47

GoogleCodeExporter commented 9 years ago
All "brand new" library proposals are targeted for starting in the "Optional" 
library document. Promotion to the "Required" document comes later after at 
least 6 months residence in the ratified Optional document, and other 
conditions described in the Appendix A spec process.

Original comment by danbonachea on 17 Aug 2012 at 5:53

GoogleCodeExporter commented 9 years ago
Set default Consensus to "Low".

Original comment by gary.funck on 19 Aug 2012 at 11:26

GoogleCodeExporter commented 9 years ago
Official change proposal mailed to the lists on 10/4/2012.

Description:
------------------
Attached is an updated proposal for the non-blocking library spec, also 
committed in the proposal area of SVN. I believe it represents our current 
semantic consensus, based on the issue thread and my telecon notes. I've spent 
many hours polishing the verbiage and I believe it is complete, but it's 
entirely possible I've missed something or left an ambiguity - constructive 
comments are welcome. I fully expect it will undergo minor revisions before 
being merged into the optional library document, but in the interests of 
expediency and in accordance with our official procedures I'm starting the 
mandatory 4 week comment period now.

Some highlights:

* The bulk of the semantics are specified under the common requirement section, 
to avoid tedious duplication. It defines several important terms used 
throughout. Each function section merely includes semantics specific to that 
function and a reference to the common requirements.
* I've used the term "transfer" throughout to denote the non-blocking 
operations, since "copy" could be misread to exclude put and get, and 
upc_memset is in no way a "copy".
* There are currently two code examples - providing an example for every 
function seemed like overkill and didn't add anything useful. This is 
consistent with library examples in the C99 spec (which are somewhat rare), eg 
see fprintf and fscanf.
* Tweaked the spelling of the header file (upc_nb.h) and the feature macro 
(__UPC_NB__)
* Section organization and synopses updated to match the core lib spec.

Original comment by danbonachea on 4 Oct 2012 at 10:43

Attachments:

GoogleCodeExporter commented 9 years ago
Thanks for all of the hard work that went into this proposal.  It looks great!

Here's some feedback:

Pg. 5, #10: The semantics with respect to source memory should also be 
specified.  It seems like the intent is to allow concurrent reads from source 
memory -- for comparison concurrent reads from source memory are erroneous in 
MPI (e.g. for Isend/Irecv or RMA operations).

Pg. 5, #12: Still trying to process this paragraph.

Pg. 6, #4: Suggest s/shall/must/ to strengthen this statement.

Pg. 6, #4: What happens if a UPC thread exits (e.g. returns from main) without 
completing explicit or implicit transfers?  Given the current text, this is an 
error and anything can happen (hang, crash, catch fire, etc).

Pg. 11, 7.4.5.3, #2: Args are missing in upc_memput_nbi in this paragraph.  
Should be upc_memput_nbi(dst, src, n).

Sec. 7.4.6: The handle is an IN parameter.  It might be helpful if this handle 
were INOUT and were reset to UPC_COMPLETE_HANDLE once it has been synchronized 
(since the old value is invalid).  This is helpful if the user is trying to 
complete an array of handles or is polling on a handle location that gets 
reused.  The user can, of course, implement this behavior with the current 
interface, so this is not a strong suggestion.

Sec. 7.4.6: Error handling -- what happens if the user passes an invalid 
handle?  If we return a complete/incomplete flag, there's no opportunity to 
return an error.  Have we adequately defined what happens if the handle is 
invalid?

Sec. 7.4.7.1/2 #4: This statement suggests that implicit handle synchronization 
functions are forbidden from progressing explicit handle operations; I don't 
think this is the intended meaning.  A clearer statement might be:

The upc_synci(_attempt) function does not complete explicit-handle transfers.

Original comment by james.di...@gmail.com on 5 Oct 2012 at 9:18

GoogleCodeExporter commented 9 years ago
I'm leery of 7.4.2.10 and 7.4.2.11.  These to me seem to conflict with the 
relationship of the non-blocking routines to the existing routines in the event 
that a remote thread attempts to access memory involved in the transfer during 
the transfer interval and there are no fences in the transfer interval.  I 
thought we'd agreed that we would simply note that the relaxed operations 
making up the transfer (as defined in B.3.2.1 of the 1.2 spec) occur at an 
unspecified time in the transfer interval specifically so that the non-blocking 
routines would match the existing routines' behavior in this case.

Original comment by sdvor...@cray.com on 5 Oct 2012 at 9:36

GoogleCodeExporter commented 9 years ago
Responses to Jim and Steve:

"Pg. 5, #10: The semantics with respect to source memory should also be 
specified.  It seems like the intent is to allow concurrent reads from source 
memory -- for comparison concurrent reads from source memory are erroneous in 
MPI (e.g. for Isend/Irecv or RMA operations)."

Regarding MPI, their stated rationale for prohibiting concurrent reads of 
source memory is to support RDMA on non cache-coherent systems, which I believe 
are firmly outside the scope of UPC.

The source buffer is const-qualified, and concurrent reads of it are implicitly 
permitted. I'm adding the following paragraph to explicitly state this for 
clarity, and also clarify what happens when multiple concurrent transfers 
overlap in memory:

 \np The source memory specified in a transfer is not modified by the transfer.
 Concurrent reads of source memory areas by any thread are permitted and behave as usual.
 Multiple concurrent transfers initiated by any thread are permitted to specify overlapping source memory areas.
 If a transfer specifies destination memory which overlaps its own source, or the source or destination of a
 concurrent transfer initiated by any thread, the resulting values in all destination memory specified 
 by the affected transfers are indeterminate.

"Pg. 6, #4: Suggest s/shall/must/ to strengthen this statement."

This sentence appears very early in the semantic descriptions while definitions 
are still being established.
I intentionally prefaced the sentence with "Generally" and did not use "shall", 
because it's not a binding restriction - specifically in the case when the 
explicit-handle initiation returns UPC_COMPLETE_HANDLE, the operation is 
already complete and no sync call is required. However this is an unusual 
corner case and I wanted to provide a conceptual overview paragraph to 
familiarize the reader with the broad form of the interface, unclouded by such 
corner-cases, before getting into the actual nitty-gritty of requirements.

"Pg. 6, #4: What happens if a UPC thread exits (e.g. returns from main) without 
completing explicit or implicit transfers?  Given the current text, this is an 
error and anything can happen (hang, crash, catch fire, etc)."

This is deliberate. As an implementor I don't have a strong feeling about this, 
but I wanted to allow other implementations freedom on this point. I can 
imagine implementations on loosely-coupled systems that might produce an ugly 
error if an RDMA packet arrives for a non-existent process. This is analogous 
to the corresponding requirement on MPI_Finalize, which probably exists for 
similar reasons. I think the alternative would be 
to augment exit() and friends with clauses similar to the current ones about 
flushing all open file streams, but that seems like it might be nasty to 
implement and could impact the performance of the common case (which I would 
consider unacceptable for this library).

"Pg. 11, 7.4.5.3, #2: Args are missing in upc_memput_nbi in this paragraph.  
Should be upc_memput_nbi(dst, src, n)."

Fixed - thanks for catching that!

"The handle is an IN parameter.  It might be helpful if this handle were INOUT 
and were reset to UPC_COMPLETE_HANDLE once it has been synchronized (since the 
old value is invalid).  This is helpful if the user is trying to complete an 
array of handles or is polling on a handle location that gets reused.  The user 
can, of course, implement this behavior with the current interface, so this is 
not a strong suggestion."

We've gone back and forth on this detail in various drafts. As you say the user 
can easily write a wrapper to provide one interface in terms of the other, so 
it's mostly a matter of taste. I prefer this interface for reasons of 
uniformity - the init functions return the handle by value, so the sync 
function also takes it by value. This is directly analogous to the by-value 
interface of malloc() and free() (and other UPC interfaces that mimic those). 
The C99 libraries seem to generally favor by-value calling conventions for 
types that are logically values, and there are very few examples of 
by-reference interfaces for logical values (excluding types like FILE and 
strings that are naturally pointers). If we ever add functions for syncing an 
entire array of handles as a group, we would use by-reference calling 
conventions - that's one of the enhancements we're considering for a future 
version of this interface.

"what happens if the user passes an invalid handle?  If we return a 
complete/incomplete flag, there's no opportunity to return an error.  Have we 
adequately defined what happens if the handle is invalid?"

This is a violation of a "shall" requirement, which by C99 Sec. 4 means 
behavior is undefined. That in turn implies by C99 3.4.3 that the 
implementation is permitted to do anything from ignoring it, to crashing the 
program and launching the nukes.

"The upc_synci(_attempt) function does not complete explicit-handle transfers."

I like this suggested wording and will add it.

Steve wrote:
"I'm leery of 7.4.2.10 and 7.4.2.11.  These seem to me to conflict with the 
relationship of the non-blocking routines to the existing routines in the event 
that a remote thread attempts to access memory involved in the transfer during 
the transfer interval and there are no fences in the transfer interval.  I 
thought we'd agreed that we would simply note that the relaxed operations 
making up the transfer (as defined in B.3.2.1 of the 1.2 spec) occur at an 
unspecified time in the transfer interval specifically so that the non-blocking 
routines would match the existing routines' behavior in this case."

I don't understand your objection. If the source or destination is modified by 
any thread during the non-blocking transfer interval, the resulting value in 
the destination is indeterminate. Similarly, reads of the destination memory 
during the transfer interval will get indeterminate results until the operation 
is synchronized. Note this is stronger than "undefined behavior" - the 
implementation is not permitted to crash if the program violates these rules, 
but the result is garbage. In comments 88-90 I believe we agreed upon these 
properties, which are also true of every non-blocking interface I've ever read 
and should be familiar to advanced users. 7.4.2.10-11 state these properties 
directly and unambiguously, rather than leaving them to be inferred by the 
reader. Unlike the blocking interface, it's very easy to write a program with 
the nb library that violates these rules, and I want to make it very clear that 
such a program gets indeterminate results. The reader does not need a deep 
understanding of the memory model and word tearing to understand the 
straightforward and important restrictions stated in 7.4.2.10-11. These 
paragraphs make it unambiguously clear that programs like the one in comment 83 
read undefined values when they violate the rules. 

Paragraph 12 defines the effects of the transfer interval as a whole with 
respect to surrounding operations that are guaranteed to be ordered outside the 
interval. Specifically, it explains that on-thread data dependencies are 
preserved for conflicting relaxed accesses issued by the initiating thread 
before the init or after the sync. Also, it specifies that the effects of the 
transfer interval as a whole behave like relaxed accesses when there are strict 
operations outside the transfer interval that synchronize with other threads 
that then touch the buffers. Realistically this paragraph is mostly targeted at 
compiler/optimizer writers, showing how the transfers fit into the memory model 
to prevent ambiguities. Footnote 2 summarizes the most important implications 
of this paragraph for end users.

Original comment by danbonachea on 6 Oct 2012 at 7:11

GoogleCodeExporter commented 9 years ago
"I don't understand your objection."

My objection is merely that there are no equivalent statements to 7.4.2.10 and 
7.4.2.11 for the "blocking" library routines.  Instead, this behavior is 
described in the memory model.  I therefore believe these statements are 
unnecessary for the non-blocking routines, and that the memory model 
(specifically B.3.2.1) should simply be updated to note that relaxed operations 
that make up a non-blocking routine occur at an unspecified time during the 
transfer interval.  Otherwise, it may appear that the non-blocking routines 
have stronger restrictions than the "blocking" routines do.

Original comment by sdvor...@cray.com on 8 Oct 2012 at 10:23