Intrepid / upc-specification

Automatically exported from code.google.com/p/upc-specification

Library: non-blocking memory copy extensions #41

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This is to log the UPC non-blocking memory copy library extensions.

For more information, please see
https://sites.google.com/a/lbl.gov/upc-proposals/extending-the-upc-memory-copy-library-functions

Original issue reported on code.google.com by yzh...@lbl.gov on 22 May 2012 at 11:41

GoogleCodeExporter commented 9 years ago

Original comment by phhargr...@lbl.gov on 1 Jun 2012 at 3:43

GoogleCodeExporter commented 9 years ago

Original comment by phhargr...@lbl.gov on 1 Jun 2012 at 6:08

GoogleCodeExporter commented 9 years ago
Given the difference between Cray's and Berkeley's positions on the 
non-blocking memory copy proposal, I was hoping to restart the discussion in 
the hopes of having some consensus.

Generally speaking, my (and my users') position between the two is that we 
prefer non-blocking memory copy functions to *NOT* be independent/agnostic of a 
upc_fence or strict synchronization.  That is, we essentially support the Cray 
position.

My understanding is that the biggest motivation for the fence/strict 
independence is that some users may start a non-blocking copy and then call a 
function or library that calls a fence internally.  While we recognize that 
this may happen and it would eliminate much of the benefit of the non-blocking 
copy, we feel that using a fence is inherently an expensive operation that 
should be used judiciously, but should (from 6.6.1.5) apply to "all shared 
accesses."

I think it is somewhat philosophically orthogonal to provide an independent 
communication channel within UPC that is still essentially called UPC, but has 
to be managed separately from the "traditional" shared UPC accesses.

As a far less important issue, I prefer the "_nb/_nbi" suffix to the 
"_async/_asynci" suffix.

Original comment by nspark.w...@gmail.com on 11 Jun 2012 at 3:44

GoogleCodeExporter commented 9 years ago
Let me attempt to connect this issue with the UPC collectives 2.0 issue 
(appropriately numbered Issue 42). There, too, we have a problem of not being 
able to use upc_fence to guarantee completion of operations.

If we can formulate upc_fence and handles in a way that allows libraries to use 
it as an extension tool, we could deal with the (very valid) Berkeley 
objections and make Troy happy too. 

Of course Bill will be upset. So that's the price to pay.

:)

Original comment by ga10...@gmail.com on 15 Jun 2012 at 3:12

GoogleCodeExporter commented 9 years ago
First, I attach the latest documents from Berkeley and Cray that may facilitate 
discussion and help clarify confusions.

I think it's logical that "upc_fence" would sync all outstanding implicit 
non-blocking operations.  But how about explicit handle operations?

For example,

/* foo may be only in some binary format developed by a third party */
void foo() { upc_fence; }

h = upc_memcpy_nb(...);
foo();
sync(h);

Cray's position: It's a user error to call upc_fence (and thus foo()) before 
sync(h).

Berkeley's position: upc_fence has no effect on h.

Neither seems to be perfect.  Any suggestions or comments?
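
For illustration, one usage pattern that is safe under either position is to 
retire the handle before calling any code that might fence (a sketch only, 
using the placeholder names from the example above):

h = upc_memcpy_nb(...);   /* initiate the transfer                          */
/* ... overlap only with work known not to fence ... */
sync(h);                  /* complete the transfer first                    */
foo();                    /* the upc_fence inside foo() is now harmless     */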

In addition, we should carefully consider and define the differences between 
local completion and global completion as stated in Cray's document.

Original comment by yzh...@lbl.gov on 15 Jun 2012 at 4:58

Attachments:

GoogleCodeExporter commented 9 years ago
I understand that the community Nick represents is in favor of something more 
like Cray's version than Berkeley's version.  While that is not my personal 
preference, I am willing to accept the input of the USERS as more relevant than 
the distastes of a lone implementer.  So, let's see where this leads us...

I don't have a problem with implementation of syncing implicit NB ops in 
upc_fence().  I never did except that doing so w/o the matching behavior for 
explicit handles seemed a stupid half-way commitment.

It has been the interaction of strict accesses (and therefore upc_fence()) with 
explicit-handle NB ops that has been my main concern (the implementation 
costs).  In the interest of reaching consensus I will concede that strict 
accesses must complete NB ops.  Specifically, I concede that placing 
non-blocking memcpy functions outside of the UPC memory model is unacceptable 
to the community.

As Yili mentions in the previous comment, we are still in some disagreement on 
explicit-handle operations.  My main concern is the one Yili expresses: Cray's 
current proposal that a upc_fence() is illegal between init and sync makes it 
"difficult" to call external code (to achieve communication-computation 
overlap).  In fact, the current Cray proposal would require different code 
depending on whether the code in a called function includes ANY strict accesses.

My hope is to "meet half-way" with something that has the most desirable 
properties.
I think that permitting strict accesses and upc_fence(), while keeping the 
handle "live", permits the user to write code without the need to know if any 
external functions (or even their own in a large project) contain fences or 
strict accesses.  
The Cray-proposed behavior of PROHIBITING the sync after a fence or strict 
access seems sufficiently distasteful to me that I am willing to drop my 
objections to handle-tracking overheads to avoid it (lesser of two evils in my 
mind).

Would the following be acceptable to Cray (a short code sketch follows the list):
 + "strict access" in what follows implicitly includes calls to upc_fence()
 + a strict access between init and sync of an explicit-handle NB op is permitted.
 + such a strict access causes completion of all outstanding NB transfers (both implicit and explicit-handle), EXACTLY as it completes any normal relaxed accesses (no special-case spec language required)
 + However, any handle from an operation initiated, but not yet synced, before the strict access is still "live" and therefore an explicit sync call is REQUIRED to free any associated resources
 + Note: "complete" is with respect to memory-model while "synced" is with respect to retiring the handle.  The "completion" occurs no later than the "sync", but can be forced earlier with a strict access.

One additional thought that occurs:
If the user uses "#include <upc_strict.h>" or "#pragma upc strict" then ALL 
shared accesses between the init and sync of an NB call would become strict.  
This feels to me like another reason to keep handles "live" and allow the same 
code to work in either strict or relaxed mode.

Also, I endorse the inclusion of "restrict" in the prototypes, which appears to 
have been unintentionally omitted from the Berkeley proposal.  It was not our 
intent to support overlapping src/dest.
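
For illustration only, a prototype with restrict might look like the following 
(the function name and handle type are placeholders; no final signature is 
implied):

upc_handle_t upc_memcpy_nb(shared void * restrict dst,
                           shared const void * restrict src,
                           size_t n);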

NOTE:
In Berkeley UPC we introduce our extensions with a bupc_* prefix rather than 
in the upc_* namespace.  This means that if the eventual specification differs 
from our initial versions, user codes can continue to use the bupc_* prefixed 
"legacy" versions rather than seeing their code break when they update to a 
compiler that implements the new spec and therefore changes the semantics of 
the upc_* version.
So, I would recommend that to save Cray's users from some pain we adopt the 
"_async" family of names to NOT collide with Cray's current implementations 
(which may differ from the final spec semantics).

Original comment by phhargr...@lbl.gov on 15 Jun 2012 at 11:16

GoogleCodeExporter commented 9 years ago
All our (Cray's) concerns were with regard to the memory model--specifically 
that NB operations be treated as relaxed operations and thus are ordered by 
fences.  We proposed that having the sync come after a fence be "undefined" 
behavior (note, undefined does not mean illegal) to ensure that no strictly 
conformant program did this.  An implementation would then be free to break the 
memory model in a non-conformant program, and thus could permit outstanding NBs 
past a fence like Berkeley currently permits.  We didn't mean to make it an 
error to call the sync routine after a fence, merely to make it so that users 
couldn't rely on any particular behavior in that case and subtly discourage its 
use in portable applications.

Of course, there's no need for this if fences/strict accesses are explicitly 
defined to complete outstanding NBs as far as the memory model is concerned.  
In that case we have no problems requiring the handle be explicitly sync'd even 
after a fence.

Original comment by sdvor...@cray.com on 16 Jun 2012 at 12:44

GoogleCodeExporter commented 9 years ago
Excellent!  It sounds like Cray and Berkeley have converged on their most 
significant differences.  Now we need a volunteer to draft a new proposal that 
we can look over to make sure we agree on the little stuff too.

I am sorry if Yili or I mis-characterized Cray's intended semantic for 
sync-after-fence.

Berkeley will continue to offer bupc_* async-memcpy extensions which operate 
outside of the memory model, and will add upc_* async-memcpy functions which 
behave just as other relaxed accesses with respect to the memory model.

Original comment by phhargr...@lbl.gov on 16 Jun 2012 at 1:28

GoogleCodeExporter commented 9 years ago
Just a note that there was an email exchange external to this issue forum that 
resulted in the following conclusion:

The LaTeX for the Cray proposal will be uploaded to the SVN repository (my 
task, once I figure out how).  Everyone then will be able to edit that version 
until we have something that we can recommend as a unified non-blocking 
proposal.

I also wanted to note the relationship between this issue and Issue 7 Comment 
#30:  http://code.google.com/p/upc-specification/issues/detail?id=7#c30   
Answering that question would help us to include clearer language describing 
the semantics of the sync functions.

Original comment by johnson....@gmail.com on 19 Jun 2012 at 6:49

GoogleCodeExporter commented 9 years ago

Original comment by gary.funck on 3 Jul 2012 at 6:07

GoogleCodeExporter commented 9 years ago

Original comment by gary.funck on 3 Jul 2012 at 6:08

GoogleCodeExporter commented 9 years ago

Original comment by gary.funck on 3 Jul 2012 at 6:09

GoogleCodeExporter commented 9 years ago
Troy has checked in the draft of the NB memory operation extension.  Attached 
is an extended version of the document that includes more background 
information.

Original comment by yzh...@lbl.gov on 1 Aug 2012 at 4:37

Attachments:

GoogleCodeExporter commented 9 years ago
I've steered clear of this discussion until now, but as the original author of 
both the Berkeley async proposal and much of the memory model, I'd like to 
provide some perspective that I believe is missing from the current discussion. 
I don't claim to state an official position for the Berkeley group, this is my 
own expert opinion.

*** Executive summary *** 
I'm arguing that the non-blocking memcpy functions should NOT be synchronized 
in any way by upc_fence or other strict accesses issued between init and sync. 
The non-blocking library includes an explicit and required synchronization 
call, and the transfer should be permitted to continue during the entire 
"transfer interval" between the initiation and successful sync, regardless of 
the actions of the calling thread (or libraries it invokes) in that interval. 
As far as the memory model is concerned, the data transfer behaves as a set of 
relaxed read/write operations of unspecified size and order, which are "issued" 
by the calling thread at an unspecified time during the transfer interval. The 
ops are not affected by fences or other strict operations issued in the 
transfer interval, because it is explicitly unspecified whether they were 
issued "before" or "after" any such fences. I believe this semantic "dodge" 
makes it totally compatible with the memory model.
Furthermore, the operations specified by the library should not imply any 
fencing semantics, which needlessly complicate the interface and may impose 
performance degradation through unwanted semantics. The transfers are relaxed 
operations and any required synchronization should be explicitly added by the 
user as strict operations or other fences around the transfer interval.

Point 1: ***The memory model already allows optimization of "blocking" 
memcopies***

By B.3.2.1: "For non-collective functions in the UPC standard library (e.g. upc 
mem{put, get, cpy}), any implied data accesses to shared objects behave as a 
set of relaxed shared reads and relaxed shared writes of unspecified size and 
ordering, issued by the calling thread."
In English, this means as far as the memory model and compiler are concerned, 
the accesses implied by the existing "blocking" memcopy functions are handled 
as an unspecified set of relaxed accesses. Specifically, the compiler/runtime is 
already free to reorder them with respect to any surrounding relaxed access, 
subject only to the limitations of static analysis (or runtime checking) to 
avoid reordering conflicting writes or passing a strict operation. High-quality 
implementations will thus already achieve some communication overlap from calls 
to the existing memcopy libraries.
In proposing an extension to these existing libraries, we must specify 
semantics that allow significantly MORE aggressive overlap to occur, leading to 
measurable performance gain - otherwise the complexity of the proposed 
extension is not justified.
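
As a small illustration of the latitude B.3.2.1 already grants (hedged; whether 
a particular compiler/runtime exploits it is implementation-specific, and the 
variable names are placeholders):

upc_memput(dst1, buf1, n);   /* behaves as a set of relaxed writes            */
a[i] = x;                    /* unrelated relaxed write                       */
upc_memput(dst2, buf2, n);   /* may be overlapped or reordered with the       */
                             /* accesses above, absent conflicts or an        */
                             /* intervening strict access                     */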

Point 2: ***The primary goal of the async feature is performance***

Async memcopies do not add any expressiveness to the language - ie any async 
data movement operations can already be expressed with their fully blocking 
counterparts. I doubt anyone would argue that explicitly async memcopies are 
more concise or elegant than a blocking call, nor do they improve the 
readability or debuggability of the UPC program. On the contrary, the 
programmer has chosen to sacrifice all of these features to some extent, all in 
order to (hopefully) reap an improvement in performance by explicitly 
requesting a communication overlap optimization which is either too hard (due 
to limitation of static analysis) or too costly (overhead of dynamic 
optimization) for the compiler to perform automatically. As performance is the 
primary and overriding goal of this library feature, great care should be taken 
to avoid any semantic roadblocks with the potential to artificially hinder 
performance of the library under realistic usage scenarios.

Point 3: ***Async calls are an annotation that explicitly suppresses the memory 
model***

The whole point of the explicitly asynchronous memcopies is to provide an 
annotation from the user to the compiler/runtime asserting that the accesses 
performed by the copy do not conflict with any accesses or other operations 
that occur between the initiation and the sync call. The user is asking the 
compiler to "trust" this assertion and maximize the performance of the transfer 
while ignoring any potential conflicts. This obviously includes conflicting 
read/write accesses to the memory in question (otherwise the assertion is 
meaningless). I believe it ALSO should apply to any strict operations or fences 
that may occur (possibly in hidden callees or external libraries that defeat 
static analysis). It makes no sense to "suppress" the memory model for one type 
of conflict but preserve it for another.

Yes, this finessing of the memory model makes these async calls harder to 
write, understand and debug than their blocking counterparts, but that's 
precisely the price the programmer is paying for a chance at improved 
performance. The C language has a long history of features with pointy edges 
that give you enough rope to hang yourself, in exchange for allowing the 
programmer to get "closer to the machine" and micromanage behavior where it 
matters. The async extensions are just the latest example of such an "advanced 
feature", and we should not saddle them with semantic half-measures that try to 
make them slightly more user-friendly at the expense of potentially sacrificing 
any amount of performance (which is their primary reason for existence).

Point 4: ***Async calls should not include additional fencing semantics***

The current proposal is deeply mired in providing fencing semantics, ensuring 
operations are locally or globally visible in the synchronization calls. This 
approach couples inter-thread synchronization with data transfer, making the 
operations more heavyweight and simultaneously imposing MORE synchronization 
semantics than their blocking counterparts. For example, a upc_memput_nb which 
returns UPC_COMPLETE_HANDLE or is immediately followed by a gsync currently 
implies "global visibility", which is MORE than the guarantees on blocking 
upc_memput and may be considerably more costly. The proposal also introduces 
several concepts (eg. "half-fences" and "local visibility") which are not 
defined by the current memory model, and may be tricky to formally define. It 
appears to advocate a usage case where async transfers behave sufficiently 
"strict-like" that after a sync the user can issue a RELAXED flag write to 
synchronize other threads. This is completely backward from the UPC philosophy 
and best practice, which is to use relaxed operations for data movement, and 
strict operations to update flag variables and perform synchronization.

In my opinion this piggybacking of fencing semantics on the library calls 
should be removed entirely. An important usage class of applications want to 
perform a large number of non-blocking operations in phases separated by 
user-provided barriers (3D FFT being one obvious example), and any fencing 
semantics on the individual operations is undesirable and only has the 
potential to degrade performance. These applications don't care at all about 
local or global completion of anything before the next barrier, and don't want 
to pay for the library computing it under the covers or imposing hidden fences.
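
The phased usage described above might be written as follows (a hedged sketch; 
upc_memput_nbi and upc_synci stand in for whatever implicit-handle names the 
final proposal adopts, and the loop bounds and arrays are placeholders):

for (int p = 0; p < PHASES; p++) {
    for (int k = 0; k < nmsgs; k++)
        upc_memput_nbi(dst[k], src[k], len[k]);  /* many NB puts, no handles   */
    upc_synci();      /* complete this thread's outstanding implicit puts      */
    upc_barrier;      /* the user-provided barrier does the inter-thread       */
                      /* synchronization; no per-operation fencing is needed   */
}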

The transfers performed by the library should abstractly behave as a set of 
relaxed ops with respect to the memory model. There is no difference between 
local and global completion, because the accesses in question are entirely 
relaxed. They behave exactly as relaxed operations issued at an unspecified 
time during the transfer interval. They are not affected by fences or other 
strict operations issued in the transfer interval, because it is explicitly 
unspecified whether they were issued "before" or "after" any such fences. The 
same logic implies that conflicting accesses in the transfer interval also 
return undefined results. A successful sync call indicates the relaxed 
operations have all been "performed", thus ensuring any subsequent conflicting 
operations issued by the calling thread see the updated values. Programs that 
wish to enforce global visibility of a transfer should explicitly issue a fence 
or other strict operation after the sync call.

The approach I'm describing significantly simplifies the current proposal 
(removing many unnecessary functions), makes the semantics easier to understand 
(by removing all the fence-related goop) and at the same time removes semantics 
which have the potential to reduce performance.
It also brings it more in line with the memory model and the semantics of the 
existing blocking operations. I believe more high-level discussion of this 
nature is prudent before accepting the current semantics, which seem 
problematic in many ways.

Original comment by danbonachea on 3 Aug 2012 at 12:29

GoogleCodeExporter commented 9 years ago
Regarding Comment #14, I very strongly disagree on all points and most of these 
issues have been considered prior to forming the consensus proposal.  See 
comments below.

"The transfers are relaxed operations and any required synchronization should 
be explicitly added by the user as strict operations or other fences around the 
transfer interval."

    No.  A fence or strict access MUST complete ALL prior non-blocking operations to be compatible with the existing UPC memory model.  Therefore, if you want to have multiple non-blocking operations in flight and then use a fence or strict access to complete ONE of them, you end up forcing completion of ALL of them.  This is not acceptable for software pipelining of large copies.

"Point 1: ***The memory model already allows optimization of "blocking" 
memcopies***"

    No, existing UPC implementations implement upc_mem* as blocking copies.  I believe both BUPC and Cray concluded that the functions had to be blocking via different reasoning and there's a comment somewhere (not in this Issue) that explains both lines of reasoning.  The goal of this proposal is to provide non-blocking copy functions.

"Point 2: ***The primary goal of the async feature is performance***
Async memcopies do not add any expressiveness to the language - ie any async 
data movement operations can already be expressed with their fully blocking 
counterparts."

    Given that upc_mem* functions already exist in UPC 1.2, this is not a very compelling argument.  I equally could argue that the existing upc_mem* functions are unnecessary because the language already provides expressiveness in terms of relaxed-access loops that a compiler should be able to automatically convert to the equivalent of a upc_mem* call.  I could use the Cray compiler as a proof-of-concept that this optimization is possible and argue that the existing upc_mem* functions should never have existed, but I think there is a benefit to programmers in having these functions to make their intention explicit instead of relying on optimization.
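
    For example, the kind of relaxed-access loop referred to here, which a compiler could in principle recognize and convert into the equivalent bulk copy (an illustrative sketch only; N, src, and local are placeholder names):

    shared [] int *src;        /* source data, all with affinity to one thread */
    int local[N];
    for (int i = 0; i < N; i++)
        local[i] = src[i];     /* element-wise relaxed reads                   */
    /* ...equivalent in effect to: upc_memget(local, src, N * sizeof(int));    */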

"Point 3: ***Async calls are an annotation that explicitly suppress the memory 
model**

    Programming language constructs that sit outside the memory model are dangerous and confusing.  A memory model provides a way of reasoning about a program.  No matter what part of my program I'm currently looking at, I have confidence that the rest of the program and any libraries that I've linked to written in that same language are following the same rules.  If I write "upc_fence" then I know that nothing is in flight immediately after that statement executes -- I don't have to wonder if there's something elsewhere in the code that has a memory-model-exempt copy ongoing on which my upc_fence has no effect.

"Point 4: ***Async calls should not include additional fencing semantics***"

    Ideally, no, but we need a way to manage individual non-blocking transfers.  We want to keep the existing meaning of upc_fence and explain semantics relative to that as much as possible.  The half-fence is essentially how the Cray implementation works now.

"The approach I'm describing significantly simplies the current proposal 
(removing many unnecessary functions), makes the semantics easier to understand 
(by removing all the fence-related goop) and at the same time removes semantics 
which have the potential to reduce performance."

    See above response regarding ignoring the memory model.

"It also brings it more in line with the memory model and the semantics of the 
existing blocking operations. I believe more high-level discussion of this 
nature is prudent before accepting the current semantics, which seem 
problematic in many ways."

    No, it does not bring it in line with the memory model because it explicitly denies being part of the memory model.

Original comment by johnson....@gmail.com on 3 Aug 2012 at 2:53

GoogleCodeExporter commented 9 years ago
"A fence or strict access MUST complete ALL prior non-blocking operations to be 
compatible with the existing UPC memory model.  "

I think you missed an important point - I'm basically arguing the definition of 
"prior". I'm proposing that the accesses implied by the library are issued at 
an UNSPECIFIED time between the initiation and successful sync call, so any 
intervening fences need not synchronize them because it cannot be proven that 
those anonymous accesses were issued "prior" to that fence - therefore no 
violation of the memory model can be observed. Explicitly async semantics 
already introduce the user to the concept of an asynchronous data transfer 
agent, and I'm arguing that agent is issuing the abstract memory operations at 
an intentionally unspecified time within the interval between init and sync.

Whatever semantics we come up with will need to be explained within the context 
of the formal memory model, and the easiest way to do this is to define the 
library's effects as a set of abstract operations. I propose to allow these 
operations to be abstractly issued anywhere within the transfer window, whereas 
you seem to be arguing they should be nailed down to all be issued at the 
initiation. I believe my approach is cleaner and allows for higher performance. 
I am NOT ignoring the memory model, I'm just defining the library semantics in 
such a way that fences don't interfere with its operation.

"existing UPC implementations implement upc_mem* as blocking copies"

This is an implementation decision, and is not required by the currently 
(looser) specification. There have been prototype UPC implementations that 
perform software caching at runtime that relax this decision to provide some 
overlap for "blocking" operations.
In any case, the upc_mem* operations do NOT imply any strict accesses or 
fences, so the current async proposal is definitely adding additional 
synchronization where none exists in the blocking version.

Original comment by danbonachea on 3 Aug 2012 at 3:27

GoogleCodeExporter commented 9 years ago
"I think you missed an important point - I'm basically arguing the definition 
of "prior". I'm proposing that the accesses implied by the library are issued 
at an UNSPECIFIED time between the initiation and successful sync call, so any 
intervening fences need not synchronize them because it cannot be proven that 
those anonymous accesses were issued "prior" to that fence - therefore no 
violation of the memory model can be observed. Explicitly async semantics 
already introduce the user to the concept of an asynchronous data transfer 
agent, and I'm arguing that agent is issuing the abstract memory operations at 
an intentionally unspecified time within the interval between init and sync."

No, the memory model becomes completely broken (or alternatively, the async 
updates are useless) if we do this.  Fences prevent backwards movement of 
relaxed accesses as well, so either all threads observe the results before the 
fence or all threads observe the results after.  Relaxed accesses cannot be 
reordered with respect to fence, so regardless of when the "access" occurs for 
an async call, all threads still have to agree on it.  In order to guarantee 
that all threads agree, most implementations are going to have to either sync 
at the fence, or delay starting the operation until after the last fence prior 
to the user's sync, thus defeating the purpose of allowing an async to bypass a 
fence in the first place.

"I am NOT ignoring the memory model, I'm just defining the library semantics in 
such a way that fences don't interfere with its operation."

And thus ignoring the rules surrounding reordering relaxed accesses in the 
presence of fences.

"There have been prototype UPC implementations that perform software caching at 
runtime that relax this decision to provide some overlap for "blocking" 
operations."

Remind me again what the trend is for available memory per-thread?  I don't 
think that this can be considered a useful solution given that we barely have 
enough space to track the address ranges that are outstanding, let alone all 
the data.

Original comment by sdvor...@cray.com on 3 Aug 2012 at 5:17

GoogleCodeExporter commented 9 years ago
"No, the memory model becomes completely broken (or alternatively, the async 
updates are useless) if we do this.  Fences prevent backwards movement of 
relaxed accesses as well, so either all threads observe the results before the 
fence or all threads observe the results after.  Relaxed accesses cannot be 
reordered with respect to fence, so regardless of when the "access" occurs for 
an async call, all threads still have to agree on it.  In order to guarantee 
that all threads agree, most implementations are going to have to either sync 
at the fence, or delay starting the operation until after the last fence prior 
to the user's sync, thus defeating the purpose of allowing an async to bypass a 
fence in the first place."

The abstract relaxed accesses which comprise the transfer need not be issued as 
a group all at once - they are a set of relaxed read/write operations that can 
be issued any time after the initiation call and before a successful 
synchronization returns. As such any fences in the transfer interval may occur 
before a subset of them have been issued. Understand I'm not suggesting this as 
an IMPLEMENTATION, this is merely the formalism I propose to define the 
semantics of the call within the existing formalism of the memory model, and 
has the side-effect that intervening operations do not interfere with the 
asynchronous transfer.

More importantly, I'm proposing the source memory is required to be constant 
and the contents of the destination memory are explicitly undefined between the 
init and sync, so threads are not permitted to be "peeking" before sync anyhow, 
which is another reason that reordering with respect to intervening operations 
cannot be observed.

Original comment by danbonachea on 3 Aug 2012 at 5:30

GoogleCodeExporter commented 9 years ago
I'm mostly with Troy on this one. If an async memory operation has been waited 
upon already, it should be synchronized by fence - or we are in semantic hell. 

By my lights, just because an async memput has been waited on does not mean that 
the data has been deposited remotely. It only means that the send buffer can be 
reused. In this situation, if I go with Dan I will *never* know whether the 
transfer has finally completed. If I go with Troy then the fence will guarantee 
completion. Thus, Troy :)

I have not thought through all other possible orderings of events (e.g. put, 
fence, wait). 

Original comment by ga10...@gmail.com on 3 Aug 2012 at 5:54

GoogleCodeExporter commented 9 years ago
"If an async memory operation has been waited upon already, it should be 
synchronized by fence - or we are in semantic hell. "

I don't believe there's any disagreement about that - once a sync operation has 
returned successfully (ie "waited upon") and you've subsequently issued a 
fence, the operation is guaranteed to be "complete" as far as all threads is 
concerned. Both approaches ensure this. The argument centers around fences that 
are issued BEFORE the sync, while the operation is still "in-flight" and what 
they mean.

"if I go with Dan I will *never* know whether the transfer has finally 
completed"

That's not the case :)  In my interpretation, the transfer has "completed" with 
respect to the calling thread once the sync returns successfully (ie subsequent 
conflicting data accesses are preserved). The next strict operation ensures 
they are complete with respect to all threads.

Original comment by danbonachea on 3 Aug 2012 at 6:01

GoogleCodeExporter commented 9 years ago
"existing UPC implementations implement upc_mem* as blocking copies"

As an additional counter-example, consider the important case of a system with 
full hardware shared memory support, like a large SMP. The Berkeley 
implementation of upc_memput/upc_memget on such a system boils down to a C99 
memcpy(), which at ABI level results in a series of load/store instructions in 
a loop. Other UPC implementations on such hardware probably look similar. There 
are no architectural memory fences or compiler reordering fences inserted 
before or after the operation, because none are dictated by the UPC memory 
model for upc_memput/upc_memget. As a result, on a modern memory hierarchy these 
load/stores can and will be aggressively reordered with respect to surrounding, 
non-conflicting memory load/stores, which may correspond to surrounding relaxed 
operations that access memory with affinity to different language-level 
threads. The necessary cache coherency, write buffering, conflict checking and 
load/store reordering is performed entirely in hardware by all modern 
processors with shared-memory support. The current memory model specification 
for upc_memput/upc_memget is intentionally permissive of this implementation.
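
Schematically, the lowering being described here is (a hedged sketch; 
local_addr_of() stands in for however the runtime maps a shared address to a 
local pointer, and is not a real UPC function):

#include <string.h>

void smp_memput(shared void *dst, const void *src, size_t n) {
    /* plain loads/stores: the memory model requires no architectural or
       compiler fences here, so the hardware may reorder them with
       surrounding non-conflicting accesses */
    memcpy(local_addr_of(dst), src, n);
}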

Now it's true that an async memory copy facility will probably enable the 
largest performance benefit on loosely-coupled, distributed-memory hardware. 
However the semantics should be designed to still allow reasonable performance 
when run on the simple case of cache coherent shared memory hardware. The 
semantic insertion of memory fences around the proposed async operations to 
enforce the suggested completion guarantees has the potential to make async 
operations significantly MORE expensive and expose LESS overlap on shared 
memory hardware than the equivalent upc_mem{put,get} call which includes no 
extraneous fencing semantics. This seems fundamentally broken.

Original comment by danbonachea on 3 Aug 2012 at 11:32

GoogleCodeExporter commented 9 years ago
You're misinterpreting "blocking" as inserting a full fence, which is a much 
stronger statement.  The basic problem is that two relaxed operations issued by 
the same thread must be observed in program order.  Therefore, the 
implementation must guarantee the ordering between upc_mem* copies and relaxed 
operations to those memory addresses by the calling thread.  Because it is 
difficult in general to prove there are no other accesses to those memory 
addresses, implementations must do something to prevent incorrect orderings.  
In the case of your "large SMP", the hardware takes care of it.  On distributed 
systems that lack such hardware support, one simple solution is to simply block 
until the upc_mem* operation is globally visible before issuing any further 
operations.

Original comment by sdvor...@cray.com on 4 Aug 2012 at 3:57

GoogleCodeExporter commented 9 years ago
Sorry, the second sentence in comment 22 should have been "two relaxed 
operations to the same memory location issued by the same thread", not just 
"two relaxed operations issued by the same thread".

Original comment by sdvor...@cray.com on 4 Aug 2012 at 4:04

GoogleCodeExporter commented 9 years ago
Troy said: "existing UPC implementations implement upc_mem* as blocking copies"
sdvormwa said: "the implementation must guarantee the ordering between upc_mem* 
copies and relaxed operations to those memory addresses by the calling thread."

Correct - I'm very aware of the memory consistency requirements involved for 
upc_mem*, having written the memory model, the relevant spec language, and 
several implementations myself :). 

I was responding to Troy and illustrating that there are platforms of interest 
which can satisfy those requirements without any sort of "blocking" in the 
implementation whatsoever. Cache coherent shared memory platforms can implement 
upc_mem* as a simple set of load/stores, which the hardware is then free to 
aggressively reorder  in both directions with respect to surrounding 
non-conflicting load/stores that correspond to other relaxed operations. The 
1.2 spec for those operations is deliberately loose enough to allow this 
important optimization in hardware.

One of my many criticisms of the current async proposal is that it cannot 
achieve this level of overlap on the important case of cache-coherent 
shared-memory systems, because the semantics require the insertion of 
"half-fences" and guarantees of local/global visibility, which are 
significantly more heavyweight and would inhibit the hardware-provided 
reordering optimizations.  As such, the simple case of upc_mem*_async();gsync() 
would be expected to perform SLOWER and provide LESS overlap than the existing 
analogous upc_mem* call for those systems. It makes no sense to introduce an 
"async" library whose semantics for memory overlap are often MORE restrictive 
than the existing synchronous counterparts.

The larger point I'm trying to make here is that in introducing an async 
interface, we should be careful to specify semantics that are uniformly a 
RELAXATION relative to the existing synchronous library. As performance is the 
primary and solitary justification for the user to call the more complicated 
interface, any semantics with the potential to hinder performance on platforms 
of interest should be rejected.  The sole source of this semantic relaxation is 
the introduction of a "transfer interval", between the init and sync call, 
where the application promises "no threads are looking at my src/dst buffers", 
and therefore the implementation is free to perform the data transfer 
operations inside that interval without worrying about any of the activities of 
the ongoing computation. I contend this "promise" from the application should 
extend all the way until the library sync call, and consequently it is 
impossible for any thread to observe a violation of the memory model by 
inserting fences before the library operation has been synced (because no 
thread is allowed to be looking at the memory in question during that time). 
Stated another way, the application makes a promise not to look at the transfer 
memory until after a successful sync, and is not permitted to dynamically 
"change its mind" by issuing a fence in the middle of that interval - the 
assertion is made at the init call and remains in force until a successful sync.

Original comment by danbonachea on 4 Aug 2012 at 9:21

GoogleCodeExporter commented 9 years ago
Without the half-fence though, you have no way of implementing the very 
important use case of notifying another thread that the async transfer has 
completed without doing a MUCH more expensive full fence, as any relaxed 
operation may be reordered before the gsync.  That is the prime motivator of 
our insistence on including the half-fence on the gsync, as this use case is 
one of the most important to our customers.  Such a half fence may seem useless 
(or even too expensive) on systems where its performance is roughly in line 
with that of a full fence, but there are systems where the full fence is 
significantly more expensive.

Original comment by sdvor...@cray.com on 4 Aug 2012 at 4:47

GoogleCodeExporter commented 9 years ago
"The sole source of this semantic relaxation is the introduction of a "transfer 
interval", between the init and sync call, where the application promises "no 
threads are looking at my src/dst buffers", and therefore the implementation is 
free to perform the data transfer operations inside that interval without 
worrying about any of the activities of the ongoing computation."

So in other words, use of the async operations tells the implementation to 
ignore the memory model for a given range of memory locations until the gsync 
is reached.  The memory model is confusing enough already without the 
additional headache of language constructs that ignore it.

Original comment by sdvor...@cray.com on 4 Aug 2012 at 5:09

GoogleCodeExporter commented 9 years ago
"Without the half-fence though, you have no way of implementing the very 
important use case of notifying another thread that the async transfer has 
completed without doing a MUCH more expensive full fence, as any relaxed 
operation may be reordered before the gsync."

I acknowledge this is an important usage case, one that we refer to as a 
"signalling put", ie performing a memput and notifying the target of arrival. 
We've seen this application desire in other places as well. I agree that a user 
who is given ONLY the async library and no additional tools would need to 
implement this using a flag write - under the Berkeley semantics, he would need 
to use a strict write after sync. The proposed Cray semantics include extra 
semantics intended to allow him to use a relaxed write for signalling, however 
as I previously expressed, using relaxed operations for synchronization seems 
highly "sketchy" and completely contrary to the UPC memory model philosophy and 
officially stated "best practice". I'm not even convinced this is guaranteed to 
be correct, lacking a formal proof (and a formal definition of "half-fence"). 
Even if it works for this very specific case, encouraging the use of relaxed 
writes for synchronization as a UPC programming practice seems like a very Bad 
Idea and likely to lead to nightmarish race condition bugs for users.
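
Under the Berkeley semantics just described, the signalling pattern would be 
written with a strict flag update after the sync, roughly (placeholder names; 
done_flag is a flag the target thread polls):

strict shared int done_flag;       /* signal variable, polled by the target      */

h = upc_memput_nb(dst, src, n);    /* initiate the put                           */
/* ... overlapped computation ... */
sync(h);                           /* transfer complete w.r.t. the calling thread */
done_flag = 1;                     /* strict write: once the target observes 1,  */
                                   /* the put data is guaranteed to be visible   */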

In any case, I would argue the correct solution to this application requirement 
is NOT to saddle the async primitives with additional semantics that allow the 
user to "roll his own" questionable synchronization. A far better solution to 
that usage case is to introduce a DIFFERENT library function that encapsulates 
exactly the semantics required - ie perform a non-blocking memput and update a 
flag at the target when it arrives. Such an interface is more user-friendly, 
less error-prone, and ADDITIONALLY has the potential to eliminate an entire 
network round-trip for the signal write on a loosely-coupled system. We've been 
talking about introducing such an interface for a long time, and Berkeley UPC 
includes a working prototype implementation, described here:
   http://upc.lbl.gov/publications/upc_sem.pdf
If this is truly an important usage case for your customers, then I suggest we 
split that discussion into a separate issue and consider a library call to meet 
that need.

Let's set aside that usage case for the moment and assume we independently 
arrive at a library solution that provides an encapsulated solution for that 
application requirement. With that in place, can we agree to remove these 
potentially costly half-fence semantics from the proposed interface? The fact 
that Cray can implement them efficiently on one platform of interest does not 
justify their serious potential performance impact on other hardware.

Original comment by danbonachea on 4 Aug 2012 at 5:20

GoogleCodeExporter commented 9 years ago
"I'm not even convinced this is guaranteed to be correct, lacking a formal 
proof (and a formal definition of "half-fence")."

We already have an implicit partial definition, as it is what prevents updates 
to the same memory location from being reordered.  A gsync simply becomes a 
single relaxed operation that covers all memory locations.

Original comment by sdvor...@cray.com on 4 Aug 2012 at 5:30

GoogleCodeExporter commented 9 years ago
"The fact that Cray can implement them efficiently on one platform of interest 
does not justify their serious potential performance impact on other hardware."

The existing memory model already imposes serious performance impacts on large 
distributed memory systems (including at least Cray and IBM systems), as they 
have to jump through enormous hoops to prevent the reordering of operations to 
the same memory location.  These large distributed memory systems are also 
those that benefit the most from async operations, as the remote memory latency 
is so large, and the bandwidth relatively low compared to local accesses.  Are 
you seriously saying that we should constrain those systems even more because 
of concerns with the relatively small impact on smp systems that don't have 
much to gain from using the asyncs in the first place?

Original comment by sdvor...@cray.com on 4 Aug 2012 at 5:49

GoogleCodeExporter commented 9 years ago
"So in other words, use of the async operations tells the implementation to 
ignore the memory model for a given range of memory locations until the gsync 
is reached.  The memory model is confusing enough already without the 
additional headache of language constructs that ignore it."

I'm sorry but you're still not getting it. I'm not proposing to ignore the 
memory model. I think what's lacking is an understanding of what the memory 
model actually guarantees - I highly recommend you go re-read the formal 
semantics in Appendix B (the actual model, not the Cliff notes in 5.1). 

The memory model is NOT an operational description of a virtual machine, nor 
does it prescribe the contents of memory, even in the abstract. It is sometimes 
convenient to think about and discuss it in an operational sense, but that is 
NOT the basis of the formalism, and ultimately that mode of reasoning may be 
misleading and diverge from the true guarantees.

The memory model is defined entirely in terms of relaxed and strict reads and 
write operations, and for a given execution trace of a VALID UPC program it 
determines whether the execution was "UPC Consistent", in that one can 
construct the appropriate partial orders <_t and total order <_strict that 
satisfy the guarantees it provides. I'm not going to paste in the entire 
formalism here - it's all in appendix B. However, a VERY important and 
deliberate property of the model is that it does not make any guarantees about 
the possible results of operations that did not occur. Stated another way, if 
the execution trace did not directly "observe" a violation of the model, then 
the execution is consistent, regardless of what tricks the implementation may 
be playing under the covers (whose effects were not observed by any application 
read operations).

The Berkeley semantics for async are that it is ERRONEOUS for any application 
thread to modify the source buffer or in any way access the destination buffers 
during the transfer interval, between the library initiation call and the 
successful return of library sync call. A program that "cheats" and touches 
these buffers in the forbidden interval is an INVALID UPC program, and the 
memory model does not provide ANY guarantees whatsoever for an invalid program. 
Valid UPC programs Shall Not touch those buffers within the transfer interval, 
and this property makes it IMPOSSIBLE for them to observe exactly how those 
buffers were accessed by the library, and how those accesses may or may not 
have been affected by other non-related fences or synchronization constructs. 
Because all executions of valid programs are prohibited from observing any 
violations, by definition the memory model is preserved and the executions are 
"UPC Consistent". This is the essence of how the memory model works - if VALID 
programs cannot tell the difference, then the model is not violated.
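
As a concrete illustration of the distinction (a sketch; placeholder names as 
in the earlier examples in this issue):

h = upc_memcpy_nb(dst, src, n);
x = dst[0];    /* ERRONEOUS: accesses the destination inside the transfer    */
               /* interval, so the program is INVALID and the memory model   */
               /* guarantees nothing about the value observed                */
sync(h);
x = dst[0];    /* valid: after a successful sync the calling thread is       */
               /* guaranteed to see the transferred data                     */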

Original comment by danbonachea on 4 Aug 2012 at 5:50

GoogleCodeExporter commented 9 years ago
I understand the memory model argument completely.  My qualm is with the 
ERRONEOUS part, as I think it is both confusing to programmers and difficult to 
detect.  That combination will lead to programming mistakes that are extremely 
hard to debug.  Simply saying "this program is invalid, so the behavior is 
undefined" is a nice cop-out for the language designer, but it's not so nice 
for the programmers.

Original comment by sdvor...@cray.com on 4 Aug 2012 at 6:03

GoogleCodeExporter commented 9 years ago
" My qualm is with the ERRONEOUS part, as I think it is both confusing to 
programmers and difficult to detect.  That combination will lead to programming 
mistakes that are extremely hard to debug."

I'm sorry but that's the very semantic basis of what's involved in any 
explicitly asynchronous transfer library. This is precisely the additional 
complexity that sophisticated users are accepting when they choose to use any 
async library. If users cannot figure out how to leave the buffers untouched 
during the transfer interval, then they have no business using an asynchronous 
library. 

Are you seriously arguing that we should care about the observed behavior of 
ERRONEOUS programs? I can easily devise many erroneous programs that lead to 
very bizarre and inexplicable behaviors on any system of your choice, without 
even touching the UPC libraries. Our task as specification writers is to 
clearly define the contract between the user (who writes programs which the 
spec judges to be VALID) and the implementation (which generates executions 
with the guaranteed behavior for those valid programs).

Original comment by danbonachea on 4 Aug 2012 at 6:11

GoogleCodeExporter commented 9 years ago
"Are you seriously arguing that we should care about the observed behavior of 
ERRONEOUS programs?"

No, I'm arguing that async transfers should be defined as relaxed operations, 
and not excuse them from the rules regarding the ordering of relaxed operations 
in the presence of a fence.  Then we don't need to bother with your proposed 
cop-out in the first place.  Does this argument mean that some systems won't 
benefit as much from the asyncs?  Yes.  But those same systems get more benefit 
from the existing routines, so it balances out nicely.

Original comment by sdvor...@cray.com on 4 Aug 2012 at 6:22

GoogleCodeExporter commented 9 years ago
I don't think it's productive to continue this discussion in the present mode. 
It really feels like this is devolving into a textual shouting match, which is 
not a useful form of idea-sharing, collaboration or consensus building. I 
believe both sides have stated their positions, but the discussion has drifted 
from impartial analysis of the core issues to "I like my way, I hate your way, 
lets see how I can make the opposite side look ridiculous".

This is obviously a highly contentious issue, involving significant semantic 
subtlety and non-trivial implications for existing implementations. I believe 
some impartial moderation is called for, and some face-to-face (or at least 
voice-to-voice) interaction. I believe one of the action items from Friday's 
telecon was to setup a telecon devoted to this issue amongst interested parties.

Can we try that as a next step for making progress on some of these issues?

Original comment by danbonachea on 4 Aug 2012 at 6:25

GoogleCodeExporter commented 9 years ago
That's probably a good idea.

Original comment by sdvor...@cray.com on 4 Aug 2012 at 6:28

GoogleCodeExporter commented 9 years ago
Taking a step back and looking at the currently archived discussions, it seems 
to me that at the core of the disagreement is that the Cray and Berkeley 
proposals are trying to meet different needs.  The analogy that comes to mind 
is automotive: Berkeley has designed a "manual transmission" API and Cray has 
designed "automatic transmission".  Each has its place in the world, but we are 
now trying to pick exactly one to go into the UPC spec.

I have always believed that the Berkeley design is correct for the goals it is 
meant to address, and I suspect that if I understood the background of Cray 
proposal better I would also find it equally well suited to its design goals.  
So, I'd like to suggest that on the conference call we might start by trying to 
discuss/understand the GOALS rather than the various issues regarding the 
designs that have evolved to meet those goals. 

Since the Berkeley semaphore/signaling-put extension (see 
http://upc.lbl.gov/publications/upc_sem.pdf)* which Dan recently mentioned is 
(I believe) intended to address synchronization goals vaguely similar to Cray's 
async memcpy proposal, it may be helpful to at least skim that document before 
the call.

* NOTE: upc.lbl.gov is down right now, but expected to come back online about 
noon Monday, Pacific time.
In the meantime I've made the proposal available at 
https://upc-bugs.lbl.gov/~phargrov/upc_sem.pdf

Original comment by phhargr...@lbl.gov on 5 Aug 2012 at 8:19

GoogleCodeExporter commented 9 years ago
"I don't think it's productive to continue this discussion in the present mode. 
It really feels like this is devolving..."

    No, you started a useful discussion.  If it has felt as if it is devolving, please be aware that this discussion comes at a relatively late phase, after a subcommittee was formed to develop a consensus proposal.  I'm not sure why you weren't involved on the BUPC side of things.  I agree that we need to discuss this issue at length on the recently scheduled telecon, but I don't think that precludes discussion online because the telecon isn't for two more weeks.

"the core of the disagreement is that the Cray and Berkeley proposals are 
trying to meet different needs.  The analogy that comes to mind is automotive: 
Berkeley has designed a "manual transmission" API and Cray has designed 
"automatic transmission".  Each has its place in the world, but we are now 
trying to pick exactly one to go into the UPC spec."

    When working on the consensus proposal, I viewed the difference as BUPC and Cray starting from different origins, but both wanting a solution that is consistent with the memory model and useful to users.  I saw BUPC as starting with the memory model, fitting in _async extensions, and then attempting to make them useful to users by exempting the extensions from normal fence semantics.  I saw Cray as starting with a description of what the users wanted to do, writing _nb extensions to let them do it, then making them fit into the memory model without losing their utility by introducing the half-fence concept.

    While we're talking philosophy here, I think it's very important in this discussion that we not lose sight of the UPC spec as being the primary mechanism whereby users can find out how the language -- and presumably their compiler -- works.  UPC isn't like C or C++ where users can find zillions of books and online resources to help them out.  We should try to minimize putting things in the spec where the spec says one thing but 99% of implementations will do something that is apparently completely different but compliant.  For example, the reason that we're even discussing this problem is that the existing upc_mem* functions are blocking on many implementations.  The spec doesn't make them blocking, and any users reading the spec will see that they just wrap up a bunch of relaxed accesses into a convenient function call, but the functions generally are blocking and performance-conscious users must think of them that way.  To continue that example, I believe that users will view a non-blocking call as initiating the copy before it returns because most implementations will do that.  If the spec does not require that behavior, then we're again in the same confusing situation where there are basically two standards: (1) the UPC spec, and (2) how most UPC implementations work.

Original comment by johnson....@gmail.com on 6 Aug 2012 at 4:45

GoogleCodeExporter commented 9 years ago
After a chat about the NB memcpy proposals with one of my users, I
thought I should pass along one thing he said:

> I do strongly feel that the semantics for nonblocking reads/writes
> should be the same as for the nonblocking collectives (if and when
> they get implemented).  So any discussion of this should take that
> into account, even though the collectives are in a different
> proposal.  (I don't see really needing the extra flexibility of
> the Berkeley proposal for reads and writes, but I'm less sure
> about collectives.)

Original comment by nspark.w...@gmail.com on 6 Aug 2012 at 9:49

GoogleCodeExporter commented 9 years ago
"While we're talking philosophy here, I think it's very important in this 
discussion that we not lose sight of the UPC spec as being the primary 
mechanism whereby users can find out how the language -- and presumably their 
compiler -- works."

Philosophically, I strongly disagree that the spec should be geared as a 
training tool for users, or as a substitute for vendor-provided documentation 
of implementation-specific behaviors. Behavioral descriptions of particular 
implementations or even expected implementations have no place in a formal 
language spec.  The specification is a contract between all users and all 
implementations, and historically the UPC spec always strived to specify 
necessary and sufficient semantics - ie the minimally necessary restrictions on 
the implementation to provide sufficient functionality for the user. As the 
spec gains implementation restrictions and operational codification of 
behavior, you reduce the space of legal implementations and optimizations, 
potentially leading to performance degradation. Programming languages have a 
much longer life cycle than hardware systems, so as language writers we must be 
sensitive not only to current implementation strategies and platforms, but must 
also do our best to allow for improvement via future strategies and hardware. 
It's difficult to accurately predict where hardware will be in 5 or 10 years, 
but minimizing spec requirements to necessary and sufficient conditions gives 
us the most "wiggle room" to accomodate a changing hardware landscape in the 
future.

"the reason that we're even discussing this problem is that the existing 
upc_mem* functions are blocking on many implementations.  The spec doesn't make 
them blocking, and any users reading the spec will see that they just wrap up a 
bunch of relaxed accesses into a convenient function call, but the functions 
generally are blocking and performance-conscious users must think of them that 
way. "

To address your specific point about existing upc_mem* behavior, there is a 
very important semantic difference between "blocking" (ie synchronous) and 
strict (ie surrounded by fences that prevent access movement). These may happen 
to have similar performance characteristics under naive translation on a 
current distributed system, but are quite different on current systems with 
hardware shared memory support. One could imagine future systems with better 
hardware support for UPC where the difference could be even more significant. 
The difference is also quite important as far as the compiler is concerned - 
the relaxed semantics of upc_mem* allows for a good optimizer and/or a smart 
runtime system to intelligently reorganize and schedule the data transfer, 
using only serial/local data analysis. The appearance of any fences severely 
limits what an optimizer can do, because full parallel analysis with complete 
program information is usually required for provably correct transformations 
around fences. The fact that some implementations make no effort to exploit 
this semantic does not mean that the spec should be written to preclude such 
optimizations, which is why upc_mem* has the semantic specification that it 
does.
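
As a concrete illustration of that semantic difference, here is a sketch (not 
taken from either proposal; the array names and sizes are arbitrary):

#include <upc_relaxed.h>

#define BLK 100
shared [BLK] double A[THREADS][BLK], B[THREADS][BLK];
double la[BLK], lb[BLK];

void exchange(int peer)
{
    /* Each upc_memput is specified as a collection of relaxed writes, and the
       two calls touch distinct objects, so an optimizer or runtime is free to
       overlap, reorder, or otherwise reschedule the two transfers. */
    upc_memput(&A[peer][0], la, sizeof(la));
    upc_memput(&B[peer][0], lb, sizeof(lb));

    /* A strict access or fence ends that freedom: everything above must be
       complete, in the memory-model sense, before anything after it. */
    upc_fence;
}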

" To continue that example, I believe that users will view a non-blocking call 
as initiating the copy before it returns because most implementations will do 
that.  If the spec does not require that behavior, then we're again in the same 
confusing situation where there are basically two standards: (1) the UPC spec, 
and (2) how most UPC implementations work."

I see no compelling reason to require implementations to issue all accesses 
before returning from initiation, even in software. I can easily imagine 
implementations that could improve throughput under high load by delaying 
initiation based on network status. At the hardware level, we WANT the 
operations to be "in-flight" for as much of the transfer interval as required 
(that's the entire point of using an async library), and the asynchronous agent 
(eg the RDMA engine) should have the freedom to initiate accesses that perform 
the transfer when appropriate based on network resource status. 

Original comment by danbonachea on 6 Aug 2012 at 10:54

GoogleCodeExporter commented 9 years ago
Cray's proposal is trying to solve the problem that the same-address 
restriction prevents the compiler/run-time library from making the existing 
upc_mem* routines non-blocking on machines where the hardware provides multiple 
paths to (remote) memory, and thus must rely on software to make ordering 
guarantees.  Most high-performance scalable networks (including both Cray's and 
IBM's current offerings) are designed in this way, as it provides greater 
bandwidth and improved resilience against hardware failures.  Looking ~10 years 
out, we don't see this situation changing significantly, as most networks are 
moving more and more in this direction.  We therefore don't believe it is 
reasonable to expect hardware support on large distributed memory systems for 
the foreseeable future.

To enforce the ordering in software, an implementation must track operations 
that are "in-flight" and resolve conflicts in some way.  One proposed approach 
to this is software caching of relaxed accesses.  However, we do not believe 
this is a viable approach (in the context of this discussion) for large systems 
for the same reason it's not done in hardware: lack of memory.  The size of 
your cache determines the upper limit on the amount of data you can have 
in-flight.  Non-blocking operations are most useful when you have a lot of data 
to move, and the cache must be relatively small so there's still enough room 
for user data.  It is also complex to implement and can easily hurt performance 
more than it helps without per-application tuning.

Another approach is to track which shared memory locations have operations that 
are in-flight, and insert syncs of some kind when a conflict is detected.  
There's still a memory problem, but instead of large contiguous transfers being 
the problem, smaller "random-access" transfers kill the scalability of this 
approach, as the implementation can't efficiently store lots of "random" 
scattered memory addresses, and must therefore rely on much more coarse 
tracking.  I believe this is what IBM claimed to be doing (with a bit-vector 
permitting a single "in-flight" operation per remote thread/node if I recall 
correctly?) on one of the earlier phone conferences.  Cray does something 
similar to this, with the caveat that upc_mem* routines are always synced 
before returning for various reasons.  However, there is a noticeable overhead 
to this tracking, particularly on some important (to our customers) access 
patterns.

Other approaches either can't handle relatively common corner cases (static 
compiler analysis) or don't take advantage of available hardware offload 
mechanisms and have other scalability issues (active messages/RPC).  We 
therefore need some help from the user to get around this.

The "half-fence" that we proposed on the global sync formally provides acquire 
semantics on relaxed accesses.  This is necessary to permit pairwise 
synchronization with a remote thread via relaxed operations to notify that 
thread that the non-blocking operation is complete.  It is important that this 
be done with relaxed operations, as using strict operations would unnecessarily 
sync other non-blocking operations (which may include much more than simply the 
user's explicit use of the proposed routines!).  If another method of providing 
this functionality is made available, either via a new type of fence 
(upc_fence_acquire/upc_fence_release?) or Berkeley's semaphore proposal (which 
I haven't read yet), then I don't think we'd have a problem dropping this part 
of our proposal.
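
For concreteness, a minimal sketch of the pairwise-notification pattern 
described above.  The handle-based _nb names and the <upc_nb.h> header follow 
the proposal and are illustrative; the flag protocol itself is not part of any 
proposal:

#include <upc_relaxed.h>
#include <upc_nb.h>                 /* proposed header name; illustrative */

#define N 1024
shared [N] double data[THREADS][N]; /* destination buffers, one per thread */
shared int ready[THREADS];          /* per-thread notification flags */

void send_and_notify(int peer, const double *src)
{
    upc_handle_t h = upc_memput_nb(&data[peer][0], src, N * sizeof(double));
    /* ... unrelated work may overlap the transfer here ... */
    upc_sync(h);      /* global completion: the data is visible at 'peer' */
    /* With the proposed acquire ("half") fence on the sync, this relaxed
       write cannot be observed at 'peer' before the transferred data. */
    ready[peer] = 1;
}

void wait_and_consume(void)
{
    /* Spin on a strict view of the flag so the read is not optimized away
       and the later data reads are ordered after it. */
    strict shared int *flag = (strict shared int *)&ready[MYTHREAD];
    while (*flag == 0)
        ;
    /* data[MYTHREAD][0..N-1] may now be read */
}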

In terms of spec changes, I believe our proposal is much more conservative than 
Berkeley's.  Importantly, the new restrictions on accessing memory locations 
involved in a call to one of the proposed routines apply ONLY to the calling 
thread in our proposal.  As far as all the other threads are concerned, the 
proposed routines behave just like the existing upc_mem* routines, and thus no 
changes to the memory model are required--minus the "half-fence", which I think 
Dan has convinced me could be better provided in a different manner.  The 
proposed routines are simply another way to perform relaxed shared memory 
accesses, with the benefit/caveat that the same-address restriction is lifted 
between the initiation of the operation and the sync.  We believe this behavior 
is sufficient to provide the amount of communication/computation overlap users 
desire without adding significant additional complexity to the memory model.

We DO NOT believe permitting non-blocking operations to continue beyond a fence 
provides any useful additional functionality (perhaps you could provide an 
example where this is necessary?).  We DO believe that allowing it will confuse 
users who expect upc_fence (or worse, a UPC barrier!) to be a full memory 
barrier.  Additionally, it is a non-trivial task for the implementation to 
detect and warn users when they've (hopefully accidentally) written "illegal" 
code that accesses memory locations involved in a call to one of the proposed 
routines on a thread other than the calling thread before the sync, and will 
therefore be hard-pressed to aid the user in debugging the problems that this 
will cause.  We previously proposed adding a class of "super-relaxed" 
operations, which were relaxed operations that didn't have the same-address 
restriction.  It was rejected because of concerns it'd be too confusing to 
users, and added too much complexity to the memory model.  I can't imagine this 
is any less confusing, given that the legality of a user's code won't be 
immediately obvious nor easily provable in all cases.

"Taking a step back and looking at the currently archived discussions, it seems 
to me that at the core of the disagreement is that the Cray and Berkeley 
proposals are trying to meet different needs.  The analogy that comes to mind 
is automotive: Berkeley has designed a "manual transmission" API and Cray has 
designed "automatic transmission".  Each has its place in the world, but we are 
now trying to pick exactly one to go into the UPC spec."

I think this is exactly the case, though I don't quite understand your 
automatic versus manual transmission analogy.  To my mind, a better analogy 
would be traffic at a street light.  Cray proposed a system that allows the 
user to say "trust me, I'll make it through before it turns red" to allow 
vehicles to continue when the light turns yellow, but doesn't allow anyone 
through a red light.  Berkeley proposed letting some vehicles go right through 
a red light, and denying insurance claims if an accident occurs due to the 
"illegal driving" of a vehicle with a green light hitting them.

Original comment by sdvor...@cray.com on 7 Aug 2012 at 2:13

GoogleCodeExporter commented 9 years ago
Because our common goal is to develop a consensus proposal, may I propose the 
following: let's discuss the disagreement points one by one instead of 
referring to the whole proposal.  I think there are good points on both sides 
so why not combine and agree on the best.

Here is my attempt to summarize the current disagreements:

1) Should upc_fence (strict memory ops in general) guarantee the completion of 
outstanding non-blocking memory operations?
A subcommittee of 5 people (including myself) had agreed to "Yes".
But since there are some different opinions now, let's revisit this issue.

2) Should the "sync" calls have fence/half-fence semantics?

3) Should there be both local and global sync functions?

4) Function naming (minor)

Please add and/or change the discussion points if you have any others.  I hope 
the list of disagreements will converge to zero as our discussion goes along.

Original comment by yzh...@lbl.gov on 7 Aug 2012 at 4:27

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
"let's discuss the disagreement points one by one instead of referring to the 
whole proposal.  I think there are good points on both sides so why not combine 
and agree on the best."

Yili- You've summarized the low-level technical differences between the two 
approaches, but I don't think that's the correct level of discussion at this 
time. I think what these discussions have revealed is the reason the two 
proposals differ in the details is because they were designed with a different 
set of high-level goals and to satisfy a different set of user needs. The 
technical details mostly follow logically from those differing goals. We cannot 
arrive at a consistent and well-designed interface by resolving technical 
points in a vacuum, without first straightening out the high-level goals of the 
interface.

sdvormwa@cray.com:
"The proposed routines are simply another way to perform relaxed shared memory 
accesses, with the benefit/caveat that the same-address restriction is lifted 
between the initiation of the operation and the sync.  We believe this behavior 
is sufficient to provide the amount of communication/computation overlap users 
desire without adding significant additional complexity to the memory model."

I think Paul is correct that we need a high-level discussion about goals of the 
interface. Alleviating the same-address restriction is nice, but is NOT the 
major goal the Berkeley proposal was trying to accomplish. Conflicting writes 
to the same address from a single thread with no intervening fence is not a 
pattern we expect in well-tuned applications, because it represents a neglected 
opportunity for communication coalescing. That being said, it may occasionally 
happen and still needs to be handled correctly, but it's not the case we're 
most interested in tuning for. Neither are we designing the async memcpy 
library to specifically serve as a "signalling put" - this is an important 
usage case that we feel deserves its own separate library interface and should 
not be conflated with pure asynchronous data movement.

Our goal with the Berkeley async transfer library was to enable far more 
aggressive overlap of communication with unrelated computation and other 
communication. We are trying to overlap the entire cost of a communication, and 
allow it to asynchronously continue undisturbed without interference from 
unrelated operations. The boundaries of the asynchronicity are defined by the 
init and sync library calls (as part of the "contract" between the app and 
library), not by random fences that may happen to occur in the unrelated code. 
The need we are trying to meet is the user explicitly asserts "perform this 
transfer in the background, and I will explicitly call you again when I need to 
ensure it has completed" - this is a familiar paradigm in other parallel 
libraries. I think it would be more surprising to a user who has invoked the 
async library to find that, when he calls an unrelated application module 
written in UPC, the async transfers from his module suddenly stop achieving 
overlap because the callee module uses a fence somewhere to synchronize some 
completely unrelated data.

Original comment by danbonachea on 7 Aug 2012 at 6:21

GoogleCodeExporter commented 9 years ago
"Conflicting writes to the same address from a single thread with no 
intervening fence is not a pattern we expect in well-tuned applications, 
because it represents a neglected opportunity for communication coalescing."

That is not the problem though.  The issue is that unless the implementation 
can PROVE there are no conflicting writes, it must conservatively assume there 
are, which impacts just about all codes.  Good compiler analysis can help in 
some cases, but there are important cases that it can't help with, usually due 
to other language design decisions--separate compilation probably being the 
most obvious.  Runtime caching / tracking / coalescing can all help sometimes 
as well, but the memory overhead limits their usefulness, and they tend to not 
scale well beyond a certain number of threads.
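
A small example of the kind of code that forces that conservatism (illustrative 
only; p and q are arbitrary pointers-to-shared coming from a caller the 
compiler never sees):

#include <stddef.h>
#include <upc_relaxed.h>

/* Separately compiled: the implementation sees only this translation unit. */
void update(shared double *p, shared double *q,
            const double *src1, const double *src2, size_t n)
{
    upc_memput(p, src1, n * sizeof(double));
    /* p and q may refer to overlapping shared memory.  Without proof to the
       contrary, the same-address rule obliges the implementation to order the
       two sets of relaxed writes - in practice, usually by completing the
       first transfer before issuing the second. */
    upc_memput(q, src2, n * sizeof(double));
}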

Original comment by sdvor...@cray.com on 7 Aug 2012 at 8:07

GoogleCodeExporter commented 9 years ago
I'm not sure if there is any substantial difference in the high-level goals of 
this extension -- skipping the adjectives, isn't the high-level goal the same 
on both sides: enable communication/computation and communication/communication 
overlaps?
(Note: I would like to save the discussion about half-fence-at-sync in a 
different post.)

Actually, for many common cases where no fence is used between nb init and nb 
sync, both the original Berkeley and Cray proposals behave similarly, if not 
the same.  The main disagreement is on how to handle the special case when a 
fence is used between an init and the corresponding sync.
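
For reference, a sketch of that common pattern, which both proposals handle 
essentially the same way (the _nb/handle names follow the proposals; the 
computation is just filler):

#include <upc_relaxed.h>
#include <upc_nb.h>                 /* proposed header name; illustrative */

#define N 4096
shared [N] double remote[THREADS][N];
double incoming[N], unrelated[N];

void overlap_example(int peer)
{
    /* initiate ... */
    upc_handle_t h = upc_memget_nb(incoming, &remote[peer][0], sizeof(incoming));

    /* ... overlap with work that touches neither 'incoming' nor
       remote[peer][...], and contains no fence ... */
    for (int i = 0; i < N; i++)
        unrelated[i] = 2.0 * unrelated[i] + 1.0;

    /* ... and complete before touching the destination. */
    upc_sync(h);
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += incoming[i];
    (void)sum;
}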

danbonachea:
"Alleviating the same-address restriction is nice, but is NOT the major goal 
the Berkeley proposal was trying to accomplish. Conflicting writes to the same 
address from a single thread with no intervening fence is not a pattern we 
expect in well-tuned applications, because it represents a neglected 
opportunity for communication coalescing. That being said, it may occasionally 
happen and still needs to be handled correctly, but it's not the case we're 
most interested in tuning for. "

I think "alleviating the same-address restriction" is NOT a goal but a 
Mechanism to achieve the goal of overlapping.  Because of the same-address 
restriction, the UPC compiler/runtime cannot perform reordering optimization 
for 99% of common cases where there are actually no same-address accesses but 
the compiler/runtime just cannot prove its absence.  Another way to view the nb 
memcpy functions is that they provide a library approach for users to express 
"super relaxed" data accesses.   

I like Steve's analogy of "allowing outstanding non-blocking memory operations 
to pass a fence is like allowing cars to pass a red light".  While there could 
be special situations to justify such violations, I generally prefer to obey 
the traffic laws.   

Original comment by yzh...@lbl.gov on 7 Aug 2012 at 8:26

GoogleCodeExporter commented 9 years ago
"The issue is that unless the implementation can PROVE there are no conflicting 
writes, it must conservatively assume there are, which impacts just about all 
codes."

I completely agree - this is ONE of the main motivations for an explicitly 
asynchronous library. My point is that it's not the ONLY reason for using 
such a library and not the sole design goal, as your text I quoted in comment 
#43 seems to indicate. Specifically, it is not "sufficient" for the library to 
provide a tool to suppress the "same-address" restriction, we also want the 
semantics to enable full overlap of the communication with other, fully-general 
and unrelated activity (which the user asserts does not touch the transfer 
buffers).

Original comment by danbonachea on 7 Aug 2012 at 8:38

GoogleCodeExporter commented 9 years ago
" isn't the high-level goal the same on both sides: enable 
communication/computation and communication/communication overlaps?"

Both sides probably agree to that broad statement, but we need a more detailed 
and concrete description of the types of usage cases we wish to support, and 
how the library fits into those cases.

"I like Steve's analogy of "allowing outstanding non-blocking memory operations 
to pass a fence is like allowing cars to pass a red light".  While there could 
be special situations to justify such violations, I generally prefer to obey 
the traffic laws. "

I don't think we should be debating formal semantics by analogy. 

However since people seem seduced by the analogy, I think Steve's 
characterization is flawed. I think the Berkeley semantics are better described 
as an overhead expressway - it bypasses all the city traffic below and is 
unaffected by city traffic lights, because the laws of the road guarantee the 
cars below cannot even SEE the highway traffic, let alone interact with it. The 
on-ramps and off-ramps are clearly defined by the library calls which define 
where cars enter and exit the normal flow of relaxed operations on the city 
streets, but while they're "in-flight" on the expressway they operate 
completely independently of everything else. 

Original comment by danbonachea on 7 Aug 2012 at 9:17

GoogleCodeExporter commented 9 years ago
"Conflicting writes to the same address from a single thread...not the case 
we're most interested in tuning for"

    Same for us.  It's a rare case that has unfortunate performance consequences for the common case in at least two vendor implementations.  We don't optimize for it happening; we try to deal with it in a way that minimizes the impact that its very existence has on the common case.

"Neither are we designing the async memcpy library to specifically serve as a 
signalling put"

    Cray calls this a put-with-notify and we're interested in that functionality becoming part of UPC.  If it is separate from the _async/_nb functions, then so be it, but it does mean introducing more library functions than if _async/_nb could be used instead.

"The boundaries of the asynchronicity are defined by the init and sync library 
calls...not by random fences"

"Because all executions of valid programs are prohibited from observing any 
violations, by definition the memory model is preserved and the executions are 
"UPC Consistent". This is the essence of how the memory model works - if VALID 
programs cannot tell the difference, then the model is not violated." [Comment 
#30]

    Let me paraphrase that to make sure that I've got it and then come at this from a slightly different angle than I have before.  I still have my previous objections about the async fence behavior, but I want to look at upc_barrier because I think users will find that more surprising...

    The BUPC async proposal adds something to UPC that violates the memory model and then hides the fact that the memory model is being violated by declaring that otherwise legal programs that could observe the violation are now illegal.  For example, normally it is legal for two threads to modify the same data from opposite sides of a barrier and I could use this legal behavior to detect the async memory model violation, but instead it is declared that if there is an unsynchronized async to this data, then my program is illegal; i.e., even if I can run my program and demonstrate the memory model violation, the evidence is inadmissible.

    I don't think that this approach is valid for extending UPC (at least not in the backwards compatible manner that we want for UPC 1.3) because it could break the intent of existing code by removing the only mechanism that the programmer has to ensure that there is no ongoing communication: upc_barrier.  If I have a collective function in an existing library, I may have used a upc_barrier upon entry and exit to ensure that I can do what I want with any memory in between.  Currently this is a foolproof way of guarding against what comes before and after my collective library function and the only burden on my client is to call the function in a collective context.  With asyncs added, my barriers no longer offer complete protection and the burden shifts to the library client to ensure that any asyncs touching the data do not cross a call to this function somewhere in their call graph.

    I can see an argument that the library code is still legal and client code just needs to be more careful with the new language feature, but I don't think it's a very nice thing to do to people in a 1.2 -> 1.3 change because it essentially changes the contracts of existing functions.  The contract here changes from "I promise to call this function in a collective context" to "I promise to call this function in a collective context and further promise not to be asynchronously touching any memory that the function may touch."  This change is particularly awkward if the client doesn't have complete knowledge of all memory locations that the function may touch.
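
    A sketch of that situation (names illustrative): under UPC 1.2 the barriers inside the library make the caller's code below safe; with the Berkeley semantics the caller's code becomes erroneous unless the sync is moved before the call, while under fence/barrier-synced semantics the first barrier would simply complete the put.

#include <upc_relaxed.h>
#include <upc_nb.h>                      /* proposed header; illustrative */

#define N 1024
shared [N] double table[THREADS][N];

/* Existing collective library routine, written against UPC 1.2: the entry
   and exit barriers were intended to guarantee that no communication to
   'table' is still in flight while the body runs. */
void lib_collective_update(void)
{
    upc_barrier;
    for (int i = 0; i < N; i++)
        table[MYTHREAD][i] += 1.0;
    upc_barrier;
}

/* Caller, once the async extension exists. */
void caller(int peer, const double *src)
{
    upc_handle_t h = upc_memput_nb(&table[peer][0], src, N * sizeof(double));
    lib_collective_update();   /* the barriers no longer imply the put is done */
    upc_sync(h);
}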

Original comment by johnson....@gmail.com on 8 Aug 2012 at 4:42

GoogleCodeExporter commented 9 years ago
I have four major concerns with allowing the routines to continue past fences.  
The first two are philosophical, while the final two are potential future 
problems I see as an implementer.

1. Allowing it adds restrictions on threads other than the calling thread.  
This is counter-intuitive, at least to me, as the one-sided model implies 
threads are independent outside of explicit inter-thread synchronization.  If 
the routines are synced by fences, other threads are not impacted by a thread's 
use of these routines at all.

2. The existing memory model is difficult to understand, but complete.  With 
this change, the memory model is no longer complete, as we've introduced a 
relaxed access with special rules that aren't reflected by the memory model.  
We can (and did) go back and forth all day about whether or not this breaks the 
memory model, but it certainly complicates the task of trying to understand it.

3. Violations of the access rules are relatively easy to detect on the calling 
thread, either through static analysis or fairly cheap run-time checks.  
Detecting violations on other threads is a much more difficult problem, as 
every thread must be aware of every other thread's non-blocking operations.  
This will make debugging extremely difficult.

4. I think this will eventually create a de facto memory model for the 
"illegal" codes, which like it or not, users will end up writing.  They'll find 
that the undefined results are acceptable on one implementation, and then other 
implementations will have to provide the same behavior for compatibility when 
the users port their code.  Since this could have very significant performance 
(not to mention implementation design) implications, I'd much prefer to hammer 
this out ahead of time rather than be stuck with a de facto definition that 
hamstrings us later.

Additionally, I still don't see a motivating need for allowing these to pass 
fences.  While Dan's vague "what-if" scenario could indeed cause problems, I'm 
having trouble coming up with a specific situation that it would apply to 
(ignoring signalling puts/gets, which we've agreed to handle separately).  
Could someone give a more concrete example where this functionality would be 
required?  Without some way of addressing the concerns I listed above, I don't 
think we should be adding this to the spec unless we have a specific use-case 
in mind--one that can't be done any other way.  Undefined behavior should be a 
last resort for specification writers, particularly when the trigger is so hard 
to detect.

"I don't think we should be debating formal semantics by analogy."

Agreed.  I just put it in there to lighten up the conversation after I didn't 
understand Paul's analogy.  That said, yours was pretty good, though highways 
generally have actual physical barriers preventing city traffic from 
interacting with highway traffic.

Original comment by sdvor...@cray.com on 8 Aug 2012 at 4:51

GoogleCodeExporter commented 9 years ago
I have been asked to contribute an opinion here. It is a long and passionate 
thread. Of the several possibilities discussed, I extracted two that seemed 
reasonable.

1) asynchronous memory operations have local completion semantics (i.e. waiting 
for an async memput only guarantees that the send buffer is reusable after a 
put). Fences operate on asynchronous operations just like on "normal" blocking 
ones.

2) asynchronous memory operations have global completion semantics (i.e. 
waiting for an async memput guarantees that it is now remotely complete as 
well). Fences do not operate on asynchronous operations - indeed, there is no 
point since just waiting already does the job.
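
To make the two options concrete (the _nb/sync spellings below are placeholders 
for whatever the final names turn out to be):

#include <upc_relaxed.h>
#include <upc_nb.h>                    /* proposed header; illustrative */

#define N 512
shared [N] double dst[THREADS][N];

void contrast(int peer, const double *src)
{
    upc_handle_t h = upc_memput_nb(&dst[peer][0], src, N * sizeof(double));
    upc_sync(h);
    /* Option (1), local completion: 'src' may now be reused, but the data
       is not guaranteed to be visible at 'peer'; a later upc_fence (or
       other strict access) provides remote completion, exactly as it does
       for ordinary relaxed writes. */
    /* Option (2), global completion: upc_sync itself guarantees the data
       is visible at 'peer'; upc_fence neither waits for nor affects
       handles that have not yet been synced. */
}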

There were others half-mentioned (or maybe I misunderstood the heated dialogue) 
- like remote memory ops that don't fence at all - that is, we *never* know 
whether they have ever remotely completed. I will not consider such scenarios.

I prefer (1) over (2) (which puts me in cahoots w/ the Cray guys rather than 
Dan, I think). Here is why: because the (1) semantics is unsurprising. It is in 
line with what I already know about UPC - that relaxed writes have local 
completion semantics - I only know that send buffers can be reused when the 
write returns. (1) is *also* in line with MPI and shmem, to the best of my 
understanding - this may not be an argument for you, but sure is for me.

I'm not sure what you will say about processor ordering of asynchronous puts to 
the same remote address. I would love it if you could make yourself say that 
the *start* of the operation is what determines order - not when you wait for 
completion. This, again, would be unsurprising. It can be implemented with some 
effort - I will claim that the effort is the same as we are already making to 
order blocking puts on an unordered network.

You spent a lot of time talking about fences and their interaction with 
half-finished asynchronous operations. This seems like a red herring to me - if 
you are a crazy enough programmer to use non-blocking operations - need I 
elaborate on the perils of non-blocking remote operations? - well, in that case 
making sure that there are no pesky strict accesses, barriers and so forth 
between the op start and the wait should be child's play.

If you end up going for (2), it's still kind of OK ... it's different, but 
still has a kind of internal consistency. Fences would simply ignore 
non-blocking operations. You would order remote puts w.r.t each other based on 
when you wait for them - not when you start them. You could order remote puts 
w.r.t. normal blocking puts by employing strategic fences (although you'd be 
kissing goodbye to performance if you did that). It's serviceable ... but 
personally I don't really like it; it's a much larger change relative to what 
UPC users are used to in terms of ordering and fences.

My $0.02 in 1966 issue pennies ... if you have to flame me, do it gently.

Original comment by ga10...@gmail.com on 10 Aug 2012 at 2:44