Original comment by phhargr...@lbl.gov
on 1 Jun 2012 at 3:43
Original comment by phhargr...@lbl.gov
on 1 Jun 2012 at 6:08
Given the difference between Cray's and Berkeley's positions on the
non-blocking memory copy proposal, I was hoping to restart the discussion in
the hopes of having some consensus.
Generally speaking, I think that my (and my users') position between the two is
that we prefer non-blocking memory copy functions to *NOT* be
independent/agnostic of a upc_fence or strict synchronization. That is, we
essentially support the Cray position.
My understanding is that the biggest motivation for the fence/strict
independence is that some users may start a non-blocking copy and then call a
function or library that calls a fence internally. While we recognize that
this may happen and it would eliminate much of the benefit of the non-blocking
copy, we feel that a fence is an inherently expensive operation that should be
used judiciously, but one that should (per 6.6.1.5) apply to "all shared
accesses."
I think it is somewhat philosophically orthogonal to provide an independent
communication channel within UPC that is still essentially called UPC, but has
to be managed separately from the "traditional" shared UPC accesses.
As a far less important issue, I prefer the "_nb/_nbi" suffix to the
"_async/_asynci" suffix.
Original comment by nspark.w...@gmail.com
on 11 Jun 2012 at 3:44
Let me attempt to connect this issue with the UPC collectives 2.0 issue
(appropriately numbered Issue 42). There, too, we have a problem of not being
able to use upc_fence to guarantee completion of operations.
If we can formulate upc_fence and handles in a way that allows libraries to use
it as an extension tool, we could deal with the (very valid) Berkeley
objections and make Troy happy too.
Of course Bill will be upset. So that's the price to pay.
:)
Original comment by ga10...@gmail.com
on 15 Jun 2012 at 3:12
First, I attach the latest documents from Berkeley and Cray, which may
facilitate discussion and help clarify any confusion.
I think it's logical that "upc_fence" would sync all outstanding implicit
non-blocking operations. But how about explicit handle operations?
For example,
/* foo may be available only in binary form from a third party */
void foo() { upc_fence; }
h = upc_memcpy_nb(...);
foo();
sync(h);
Cray's position: It's a user error to call upc_fence (and thus foo()) before
sync(h).
Berkeley's position: upc_fence has no effect on h.
Neither seems to be perfect. Any suggestions or comments?
In addition, we should carefully consider and define the differences between
local completion and global completion as stated in Cray's document.
Original comment by yzh...@lbl.gov
on 15 Jun 2012 at 4:58
Attachments:
I understand that the community Nick represents is in favor of something more
like Cray's version than Berkeley's version. While that is not my personal
preference, I am willing to accept the input of the USERS as more relevant than
the distastes of a lone implementer. So, let's see where this leads us...
I don't have a problem with implementation of syncing implicit NB ops in
upc_fence(). I never did except that doing so w/o the matching behavior for
explicit handles seemed a stupid half-way commitment.
It has been the interaction of strict accesses (and therefore upc_fence()) with
explicit-handle NB ops that has been my main concern (the implementation
costs). In the interest of reaching consensus I will concede that strict
accesses must complete NB ops. Specifically, I concede that placing
non-blocking memcpy functions outside of the UPC memory model is unacceptable
to the community.
As Yili mentions in the previous comment, we are still in some disagreement on
explicit-handle operations. My main concern is the one Yili expresses: Cray's
current proposal that a upc_fence() is illegal between init and sync makes it
"difficult" to call external code (to achieve communication-computation
overlap). In fact, the current Cray proposal would require different code
depending on whether the code in a called function includes ANY strict accesses.
My hope is to "meet half-way" with something that has the most desirable
properties.
I think that permitting strict accesses and upc_fence(), while keeping the
handle "live", permits the user to write code without the need to know if any
external functions (or even their own in a large project) contain fences or
strict accesses.
The Cray-proposed behavior of PROHIBITING the sync after a fence or strict
access seems sufficiently distasteful to me that I am willing to drop my
objections to handle-tracking overheads to avoid it (the lesser of two evils in
my mind).
Would the following be acceptable to Cray:
+ "strict access" in what follows implicitly includes calls to upc_fence()
+ a strict access between init and sync of an explicit-handle NB op is permitted.
+ such a strict access causes completion of all outstanding NB transfers (both implicit and explicit-handle) EXACTLY as it completes any normal relaxed access (no special-case spec language required)
+ However, any handle from an operation initiated, but not yet synced, before the strict access is still "live" and therefore an explicit sync call is REQUIRED to free any associated resources
+ Note: "complete" is with respect to memory-model while "synced" is with respect to retiring the handle. The "completion" occurs no later than the "sync", but can be forced earlier with a strict access.
One additional thought that occurs:
If the user uses "#include <upc_strict.h>" or "#pragma upc strict" then ALL
shared accesses between the init and sync of an NB call would become strict.
This feels to me like another reason to keep handles "live" and allow the same
code to work in either strict or relaxed mode.
Also, I endorse the inclusion of "restrict" in the prototypes, which appears to
have been unintentionally omitted from the Berkeley proposal. It was not our
intent to support overlapping src/dest.
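For concreteness, a possible prototype with restrict included, mirroring the
existing upc_memcpy prototype (the _nb name and the handle type are
placeholders pending the naming decision):

upc_handle_t upc_memcpy_nb(shared void * restrict dst,
                           shared const void * restrict src,
                           size_t n);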
NOTE:
In Berkeley UPC we introduce our extensions with a bupc_* prefix rather than
using the upc_* namespace. This means that if the eventual specification
differs from our initial versions, user codes can continue to use the
bupc_*-prefixed "legacy" versions rather than seeing their code break when they
update to a compiler that implements the new spec and therefore changes the
semantics of the upc_* version.
So, I would recommend that to save Cray's users from some pain we adopt the
"_async" family of names to NOT collide with Cray's current implementations
(which may differ from the final spec semantics).
Original comment by phhargr...@lbl.gov
on 15 Jun 2012 at 11:16
All our (Cray's) concerns were with regard to the memory model--specifically
that NB operations be treated as relaxed operations and thus are ordered by
fences. We proposed that having the sync come after a fence be "undefined"
behavior (note, undefined does not mean illegal) to ensure that no strictly
conformant program did this. An implementation would then be free to break the
memory model in a non-conformant program, and thus could permit outstanding NBs
past a fence like Berkeley currently permits. We didn't mean to make it an
error to call the sync routine after a fence, merely to make it so that users
couldn't rely on any particular behavior in that case and subtly discourage its
use in portable applications.
Of course, there's no need for this if fences/strict accesses are explicitly
defined to complete outstanding NBs as far as the memory model is concerned.
In that case we have no problems requiring the handle be explicitly sync'd even
after a fence.
Original comment by sdvor...@cray.com
on 16 Jun 2012 at 12:44
Excellent! It sounds like Cray and Berkeley have converged on their most
significant differences. Now we need a volunteer to draft a new proposal that
we can look over to make sure we agree on the little stuff too.
I am sorry if Yili or I mis-characterized Cray's intended semantic for
sync-after-fence.
Berkeley will continue to offer the bupc_* async-memcpy extensions, which
operate outside of the memory model, and will add upc_* async-memcpy functions
that behave just like other relaxed accesses with respect to the memory model.
Original comment by phhargr...@lbl.gov
on 16 Jun 2012 at 1:28
Just a note that there was an email exchange external to this issue forum that
resulted in the following conclusion:
The LaTeX for the Cray proposal will be uploaded to the SVN repository (my
task, once I figure out how). Everyone then will be able to edit that version
until we have something that we can recommend as a unified non-blocking
proposal.
I also wanted to note the relationship between this issue and Issue 7 Comment
#30: http://code.google.com/p/upc-specification/issues/detail?id=7#c30
Answering that question would help us to include clearer language describing
the semantics of the sync functions.
Original comment by johnson....@gmail.com
on 19 Jun 2012 at 6:49
Original comment by gary.funck
on 3 Jul 2012 at 6:07
Original comment by gary.funck
on 3 Jul 2012 at 6:08
Original comment by gary.funck
on 3 Jul 2012 at 6:09
Troy has checked in the draft of the NB memory operation extension. Attached
is an extended version of the document that includes more background
information.
Original comment by yzh...@lbl.gov
on 1 Aug 2012 at 4:37
Attachments:
I've steered clear of this discussion until now, but as the original author of
both the Berkeley async proposal and much of the memory model, I'd like to
provide some perspective that I believe is missing from the current discussion.
I don't claim to state an official position for the Berkeley group, this is my
own expert opinion.
*** Executive summary ***
I'm arguing that the non-blocking memcpy functions should NOT be synchronized
in any way by upc_fence or other strict accesses issued between init and sync.
The non-blocking library includes an explicit and required synchronization
call, and the transfer should be permitted to continue during the entire
"transfer interval" between the initiation and successful sync, regardless of
the actions of the calling thread (or libraries it invokes) in that interval.
As far as the memory model is concerned, the data transfer behaves as a set of
relaxed read/write operations of unspecified size and order, which are "issued"
by the calling thread at an unspecified time during the transfer interval. The
ops are not affected by fences or other strict operations issued in the
transfer interval, because it is explicitly unspecified whether they were
issued "before" or "after" any such fences. I believe this semantic "dodge"
makes it totally compatible with the memory model.
Furthermore, the operations specified by the library should not imply any
fencing semantics, which needlessly complicate the interface and may impose
performance degradation through unwanted semantics. The transfers are relaxed
operations and any required synchronization should be explicitly added by the
user as strict operations or other fences around the transfer interval.
Point 1: ***The memory model already allows optimization of "blocking"
memcopies***
By B.3.2.1: "For non-collective functions in the UPC standard library (e.g.
upc_mem{put, get, cpy}), any implied data accesses to shared objects behave as a
set of relaxed shared reads and relaxed shared writes of unspecified size and
ordering, issued by the calling thread."
In English, this means as far as the memory model and compiler are concerned,
the accesses implied by the existing "blocking" memcopy functions are handled
as an unspecified set of relaxed accesses. Specifically, the compiler/runtime is
already free to reorder them with respect to any surrounding relaxed access,
subject only to the limitations of static analysis (or runtime checking) to
avoid reordering conflicting writes or passing a strict operation. High-quality
implementations will thus already achieve some communication overlap from calls
to the existing memcopy libraries.
In proposing an extension to these existing libraries, we must specify
semantics that allow significantly MORE aggressive overlap to occur, leading to
measurable performance gain - otherwise the complexity of the proposed
extension is not justified.
Point 2: ***The primary goal of the async feature is performance***
Async memcopies do not add any expressiveness to the language - ie any async
data movement operation can already be expressed with its fully blocking
counterpart. I doubt anyone would argue that explicitly async memcopies are
more concise or elegant than a blocking call, nor do they improve the
readability or debuggability of the UPC program. On the contrary, the
programmer has chosen to sacrifice all of these features to some extent, all in
order to (hopefully) reap an improvement in performance by explicitly
requesting a communication overlap optimization which is either too hard (due
to limitation of static analysis) or too costly (overhead of dynamic
optimization) for the compiler to perform automatically. As performance is the
primary and overriding goal of this library feature, great care should be taken
to avoid any semantic roadblocks with the potential to artificially hinder
performance of the library under realistic usage scenarios.
Point 3: ***Async calls are an annotation that explicitly suppresses the memory
model***
The whole point of the explicitly asynchronous memcopies is to provide an
annotation from the user to the compiler/runtime asserting that the accesses
performed by the copy do not conflict with any accesses or other operations
that occur between the initiation and the sync call. The user is asking the
compiler to "trust" this assertion and maximize the performance of the transfer
while ignoring any potential conflicts. This obviously includes conflicting
read/write accesses to the memory in question (otherwise the assertion is
meaningless). I believe it ALSO should apply to any strict operations or fences
that may occur (possibly in hidden callees or external libraries that defeat
static analysis). It makes no sense to "suppress" the memory model for one type
of conflict but preserve it for another.
Yes, this finessing of the memory model makes these async calls harder to
write, understand and debug than their blocking counterparts, but that's
precisely the price the programmer is paying for a chance at improved
performance. The C language has a long history of features with pointy edges
that give you enough rope to hang yourself, in exchange for allowing the
programmer to get "closer to the machine" and micromanage behavior where it
matters. The async extensions are just the latest example of such an "advanced
feature", and we should not saddle them with semantic half-measures that try to
make them slightly more user-friendly at the expense of potentially sacrificing
any amount of performance (which is their primary reason for existence).
Point 4: ***Async calls should not include additional fencing semantics***
The current proposal is deeply mired in providing fencing semantics, ensuring
operations are locally or globally visible in the synchronization calls. This
approach couples inter-thread synchronization with data transfer, making the
operations more heavyweight and simultaneously imposing MORE synchronization
semantics than their blocking counterparts. For example, a upc_memput_nb which
returns UPC_COMPLETE_HANDLE or is immediately followed by a gsync currently
implies "global visibility", which is MORE than the guarantees on blocking
upc_memput and may be considerably more costly. The proposal also introduces
several concepts (eg. "half-fences" and "local visibility") which are not
defined by the current memory model, and may be tricky to formally define. It
appears to advocate a usage case where async transfers behave sufficiently
"strict-like" that after a sync the user can issue a RELAXED flag write to
synchronize other threads. This is completely backward from the UPC philosophy
and best practice, which is to use relaxed operations for data movement, and
strict operations to update flag variables and perform synchronization.
In my opinion this piggybacking of fencing semantics on the library calls
should be removed entirely. An important usage class of applications want to
perform a large number of non-blocking operations in phases separated by
user-provided barriers (3D FFT being one obvious example), and any fencing
semantics on the individual operations is undesirable and only has the
potential to degrade performance. These applications don't care at all about
local or global completion of anything before the next barrier, and don't want
to pay for the library computing it under the covers or imposing hidden fences.
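To make that usage class concrete, here is a sketch of such a phase
(upc_memput_nb/upc_sync/upc_handle_t and <upc_nb.h> are placeholder names, and
remote_dst()/local_src() stand in for whatever addressing the application uses):

#include <upc.h>
#include <upc_nb.h>   /* placeholder header for the NB extension */

#define PEERS 8
#define BYTES (1<<20)

shared void *remote_dst(int peer);   /* placeholder addressing helpers */
const void  *local_src(int peer);

void exchange_phase(void)
{
  upc_handle_t h[PEERS];
  for (int i = 0; i < PEERS; i++)
    h[i] = upc_memput_nb(remote_dst(i), local_src(i), BYTES);
  /* ... overlap the transfers with local computation ... */
  for (int i = 0; i < PEERS; i++)
    upc_sync(h[i]);   /* only local completion is needed here */
  upc_barrier;        /* the barrier supplies all the inter-thread ordering
                         the next phase requires */
}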
The transfers performed by the library should abstractly behave as a set of
relaxed ops with respect to the memory model. There is no difference between
local and global completion, because the accesses in question are entirely
relaxed. They behave exactly as relaxed operations issued at an unspecified
time during the transfer interval. They are not affected by fences or other
strict operations issued in the transfer interval, because it is explicitly
unspecified whether they were issued "before" or "after" any such fences. The
same logic implies that conflicting accesses in the transfer interval also
return undefined results. A successful sync call indicates the relaxed
operations have all been "performed", thus ensuring any subsequent conflicting
operations issued by the calling thread see the updated values. Programs that
wish to enforce global visibility of a transfer should explicitly issue a fence
or other strict operation after the sync call.
The approach I'm describing significantly simplifies the current proposal
(removing many unnecessary functions), makes the semantics easier to understand
(by removing all the fence-related goop) and at the same time removes semantics
which have the potential to reduce performance.
It also brings it more in line with the memory model and the semantics of the
existing blocking operations. I believe more high-level discussion of this
nature is prudent before accepting the current semantics, which seem
problematic in many ways.
Original comment by danbonachea
on 3 Aug 2012 at 12:29
Regarding Comment #14, I very strongly disagree on all points and most of these
issues have been considered prior to forming the consensus proposal. See
comments below.
"The transfers are relaxed operations and any required synchronization should
be explicitly added by the user as strict operations or other fences around the
transfer interval."
No. A fence or strict access MUST complete ALL prior non-blocking operations to be compatible with the existing UPC memory model. Therefore, if you want to have multiple non-blocking operations in flight and then use a fence or strict access to complete ONE of them, you end up forcing completion of ALL of them. This is not acceptable for software pipelining of large copies.
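To illustrate the pipelining concern, a sketch (all names are placeholders;
dst_chunk()/src_chunk() are assumed helpers):

#include <upc.h>
#include <upc_nb.h>   /* placeholder header for the NB extension */

#define DEPTH       4
#define NCHUNKS     64
#define CHUNK_BYTES (1<<20)

shared void *dst_chunk(int i);   /* placeholder addressing helpers */
const void  *src_chunk(int i);

void pipelined_copy(void)
{
  upc_handle_t h[DEPTH];
  for (int i = 0; i < NCHUNKS; i++) {
    if (i >= DEPTH)
      upc_sync(h[i % DEPTH]);    /* retire ONLY the oldest in-flight chunk */
    h[i % DEPTH] = upc_memput_nb(dst_chunk(i), src_chunk(i), CHUNK_BYTES);
  }
  for (int i = 0; i < DEPTH; i++)
    upc_sync(h[i]);              /* drain the remaining chunks */
  /* If a fence or strict access were the only way to complete an individual
     transfer, the per-chunk upc_sync above would have to become a upc_fence,
     which drains every other in-flight chunk and collapses the pipeline. */
}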
"Point 1: ***The memory model already allows optimization of "blocking"
memcopies***"
No, existing UPC implementations implement upc_mem* as blocking copies. I believe both BUPC and Cray concluded that the functions had to be blocking via different reasoning and there's a comment somewhere (not in this Issue) that explains both lines of reasoning. The goal of this proposal is to provide non-blocking copy functions.
"Point 2: ***The primary goal of the async feature is performance***
Async memcopies do not add any expressiveness to the language - ie any async
data movement operation can already be expressed with its fully blocking
counterpart."
Given that upc_mem* functions already exist in UPC 1.2, this is not a very compelling argument. I equally could argue that the existing upc_mem* functions are unnecessary because the language already provides expressiveness in terms of relaxed-access loops that a compiler should be able to automatically convert to the equivalent of a upc_mem* call. I could use the Cray compiler as a proof-of-concept that this optimization is possible and argue that the existing upc_mem* functions should never have existed, but I think there is a benefit to programmers in having these functions to make their intention explicit instead of relying on optimization.
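For example, the relaxed-access loop below already expresses the same data
movement as the existing bulk call in UPC 1.2 terms; the argument is that a
compiler could, in principle, recognize it:

#include <stddef.h>
#include <upc_relaxed.h>

void copy_out(shared [] double *remote, const double *local, size_t n)
{
  /* each iteration is an ordinary relaxed shared write */
  for (size_t i = 0; i < n; i++)
    remote[i] = local[i];
  /* semantically equivalent bulk form already in UPC 1.2:
     upc_memput(remote, local, n * sizeof(double)); */
}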
"Point 3: ***Async calls are an annotation that explicitly suppress the memory
model**
Programming language constructs that sit outside the memory model are dangerous and confusing. A memory model provides a way of reasoning about a program. No matter what part of my program I'm currently looking at, I have confidence that the rest of the program and any libraries that I've linked to written in that same language are following the same rules. If I write "upc_fence" then I know that nothing is in flight immediately after that statement executes -- I don't have to wonder if there's something elsewhere in the code that has a memory-model-exempt copy ongoing on which my upc_fence has no effect.
"Point 4: ***Async calls should not include additional fencing semantics***"
Ideally, no, but we need a way to manage individual non-blocking transfers. We want to keep the existing meaning of upc_fence and explain semantics relative to that as much as possible. The half-fence is essentially how the Cray implementation works now.
"The approach I'm describing significantly simplies the current proposal
(removing many unnecessary functions), makes the semantics easier to understand
(by removing all the fence-related goop) and at the same time removes semantics
which have the potential to reduce performance."
See above response regarding ignoring the memory model.
"It also brings it more in line with the memory model and the semantics of the
existing blocking operations. I believe more high-level discussion of this
nature is prudent before accepting the current semantics, which seem
problematic in many ways."
No, it does not bring it in line with the memory model because it explicitly denies being part of the memory model.
Original comment by johnson....@gmail.com
on 3 Aug 2012 at 2:53
"A fence or strict access MUST complete ALL prior non-blocking operations to be
compatible with the existing UPC memory model. "
I think you missed an important point - I'm basically arguing the definition of
"prior". I'm proposing that the accesses implied by the library are issued at
an UNSPECIFIED time between the initiation and successful sync call, so any
intervening fences need not synchronize them because it cannot be proven that
those anonymous accesses were issued "prior" to that fence - therefore no
violation of the memory model can be observed. Explicitly async semantics
already introduce the user to the concept of an asynchronous data transfer
agent, and I'm arguing that agent is issuing the abstract memory operations at
an intentionally unspecified time within the interval between init and sync.
Whatever semantics we come up with will need to be explained within the context
of the formal memory model, and the easiest way to do this is to define the
library's effects as a set of abstract operations. I propose to allow these
operations to be abstractly issued anywhere within the transfer window, whereas
you seem to be arguing they should be nailed down to all be issued at the
initiation. I believe my approach is cleaner and allows for higher performance.
I am NOT ignoring the memory model, I'm just defining the library semantics in
such a way that fences don't interfere with its operation.
"existing UPC implementations implement upc_mem* as blocking copies"
This is an implementation decision, and is not required by the currently
(looser) specification. There have been prototype UPC implementations that
perform software caching at runtime that relax this decision to provide some
overlap for "blocking" operations.
In any case, the upc_mem* operations do NOT imply any strict accesses or
fences, so the current async proposal is definitely adding additional
synchronization where none exists in the blocking version.
Original comment by danbonachea
on 3 Aug 2012 at 3:27
"I think you missed an important point - I'm basically arguing the definition
of "prior". I'm proposing that the accesses implied by the library are issued
at an UNSPECIFIED time between the initiation and successful sync call, so any
intervening fences need not synchronize them because it cannot be proven that
those anonymous accesses were issued "prior" to that fence - therefore no
violation of the memory model can be observed. Explicitly async semantics
already introduce the user to the concept of an asynchronous data transfer
agent, and I'm arguing that agent is issuing the abstract memory operations at
an intentionally unspecified time within the interval between init and sync."
No, the memory model becomes completely broken (or alternatively, the async
updates are useless) if we do this. Fences prevent backwards movement of
relaxed accesses as well, so either all threads observe the results before the
fence or all threads observe the results after. Relaxed accesses cannot be
reordered with respect to fence, so regardless of when the "access" occurs for
an async call, all threads still have to agree on it. In order to guarantee
that all threads agree, most implementations are going to have to either sync
at the fence, or delay starting the operation until after the last fence prior
to the user's sync, thus defeating the purpose of allowing an async to bypass a
fence in the first place.
"I am NOT ignoring the memory model, I'm just defining the library semantics in
such a way that fences don't interfere with its operation."
And thus ignoring the rules surrounding reordering relaxed accesses in the
presence of fences.
"There have been prototype UPC implementations that perform software caching at
runtime that relax this decision to provide some overlap for "blocking"
operations."
Remind me again what the trend is for available memory per-thread? I don't
think that this can be considered a useful solution given that we barely have
enough space to track the address ranges that are outstanding, let alone all
the data.
Original comment by sdvor...@cray.com
on 3 Aug 2012 at 5:17
"No, the memory model becomes completely broken (or alternatively, the async
updates are useless) if we do this. Fences prevent backwards movement of
relaxed accesses as well, so either all threads observe the results before the
fence or all threads observe the results after. Relaxed accesses cannot be
reordered with respect to fence, so regardless of when the "access" occurs for
an async call, all threads still have to agree on it. In order to guarantee
that all threads agree, most implementations are going to have to either sync
at the fence, or delay starting the operation until after the last fence prior
to the user's sync, thus defeating the purpose of allowing an async to bypass a
fence in the first place."
The abstract relaxed accesses which comprise the transfer need not be issued as
a group all at once - they are a set of relaxed read/write operations that can
be issued any time after the initiation call and before a successful
synchronization returns. As such any fences in the transfer interval may occur
before a subset of them have been issued. Understand I'm not suggesting this as
an IMPLEMENTATION, this is merely the formalism I propose to define the
semantics of the call within the existing formalism of the memory model, and
has the side-effect that intervening operations do not interfere with the
asynchronous transfer.
More importantly, I'm proposing the source memory is required to be constant
and the contents of the destination memory are explicitly undefined between the
init and sync, so threads are not permitted to be "peeking" before sync anyhow,
which is another reason that reordering with respect to intervening operations
cannot be observed.
Original comment by danbonachea
on 3 Aug 2012 at 5:30
I'm mostly with Troy on this one. If an async memory operation has been waited
upon already, it should be synchronized by fence - or we are in semantic hell.
By my lights, just because an async memput has been waited on does not mean that
the data has been deposited remotely. It only means that the send buffer can be
reused. In this situation, if I go with Dan I will *never* know whether the
transfer has finally completed. If I go with Troy then the fence will guarantee
completion. Thus, Troy :)
I have not thought through all other possible orderings of events (e.g. put,
fence, wait).
Original comment by ga10...@gmail.com
on 3 Aug 2012 at 5:54
"If an async memory operation has been waited upon already, it should be
synchronized by fence - or we are in semantic hell. "
I don't believe there's any disagreement about that - once a sync operation has
returned successfully (ie "waited upon") and you've subsequently issued a
fence, the operation is guaranteed to be "complete" as far as all threads is
concerned. Both approaches ensure this. The argument centers around fences that
are issued BEFORE the sync, while the operation is still "in-flight" and what
they mean.
"if I go with Dan I will *never* know whether the transfer has finally
completed"
That's not the case :) In my interpretation, the transfer has "completed" with
respect to the calling thread once the sync returns successfully (ie subsequent
conflicting data accesses are preserved). The next strict operation ensures
they are complete with respect to all threads.
Original comment by danbonachea
on 3 Aug 2012 at 6:01
"existing UPC implementations implement upc_mem* as blocking copies"
As an additional counter-example, consider the important case of a system with
full hardware shared memory support, like a large SMP. The Berkeley
implementation of upc_memput/upc_memget on such a system boils down to a C99
memcpy(), which at ABI level results in a series of load/store instructions in
a loop. Other UPC implementations on such hardware probably look similar. There
are no architectural memory fences or compiler reordering fences inserted
before or after the operation, because none are dictated by the UPC memory
model for upc_memput/upc_memget. As a result, on a modern memory hierarchy these
load/stores can and will be aggressively reordered with respect to surrounding,
non-conflicting memory load/stores, which may correspond to surrounding relaxed
operations that access memory with affinity to different language-level
threads. The necessary cache coherency, write buffering, conflict checking and
load/store reordering is performed entirely in hardware by all modern
processors with shared-memory support. The current memory model specification
for upc_memput/upc_memget is intentionally permissive of this implementation.
Now it's true that an async memory copy facility will probably enable the
largest performance benefit on loosely-coupled, distributed-memory hardware.
However the semantics should be designed to still allow reasonable performance
when run on the simple case of cache coherent shared memory hardware. The
semantic insertion of memory fences around the proposed async operations to
enforce the suggested completion guarantees has the potential to make async
operations significantly MORE expensive and expose LESS overlap on shared
memory hardware than the equivalent upc_mem{put,get} call which includes no
extraneous fencing semantics. This seems fundamentally broken.
Original comment by danbonachea
on 3 Aug 2012 at 11:32
You're misinterpreting "blocking" as inserting a full fence, which is a much
stronger statement. The basic problem is that two relaxed operations issued by
the same thread must be observed in program order. Therefore, the
implementation must guarantee the ordering between upc_mem* copies and relaxed
operations to those memory addresses by the calling thread. Because it is
difficult in general to prove there are no other accesses to those memory
addresses, implementations must do something to prevent incorrect orderings.
In the case of your "large SMP", the hardware takes care of it. On distributed
systems that lack such hardware support, one simple solution is to simply block
until the upc_mem* operation is globally visible before issuing any further
operations.
Original comment by sdvor...@cray.com
on 4 Aug 2012 at 3:57
Sorry, the second sentence in comment 22 should have been "two relaxed
operations to the same memory location issued by the same thread", not just
"two relaxed operations issued by the same thread".
Original comment by sdvor...@cray.com
on 4 Aug 2012 at 4:04
Troy said: "existing UPC implementations implement upc_mem* as blocking copies"
sdvormwa said: "the implementation must guarantee the ordering between upc_mem*
copies and relaxed operations to those memory addresses by the calling thread."
Correct - I'm very aware of the memory consistency requirements involved for
upc_mem*, having written the memory model, the relevant spec language, and
several implementations myself :).
I was responding to Troy and illustrating that there are platforms of interest
which can satisfy those requirements without any sort of "blocking" in the
implementation whatsoever. Cache coherent shared memory platforms can implement
upc_mem* as a simple set of load/stores, which the hardware is then free to
aggressively reorder in both directions with respect to surrounding
non-conflicting load/stores that correspond to other relaxed operations. The
1.2 spec for those operations is deliberately loose enough to allow this
important optimization in hardware.
One of my many criticisms of the current async proposal is that it cannot
achieve this level of overlap on the important case of cache-coherent
shared-memory systems, because the semantics require the insertion of
"half-fences" and guarantees of local/global visibility, which are
significantly more heavyweight and would inhibit the hardware-provided
reordering optimizations. As such, the simple case of upc_mem*_async();gsync()
would be expected to perform SLOWER and provide LESS overlap than the existing
analogous upc_mem* call for those systems. It makes no sense to introduce an
"async" library whose semantics for memory overlap are often MORE restrictive
than the existing synchronous counterparts.
The larger point I'm trying to make here is that in introducing an async
interface, we should be careful to specify semantics that are uniformly a
RELAXATION relative to the existing synchronous library. As performance is the
primary and solitary justification for the user to call the more complicated
interface, any semantics with the potential to hinder performance on platforms
of interest should be rejected. The sole source of this semantic relaxation is
the introduction of a "transfer interval", between the init and sync call,
where the application promises "no threads are looking at my src/dst buffers",
and therefore the implementation is free to perform the data transfer
operations inside that interval without worrying about any of the activities of
the ongoing computation. I contend this "promise" from the application should
extend all the way until the library sync call, and consequently it is
impossible for any thread to observe a violation of the memory model by
inserting fences before the library operation has been synced (because no
thread is allowed to be looking at the memory in question during that time).
Stated another way, the application makes a promise not to look at the transfer
memory until after a successful sync, and is not permitted to dynamically
"change its mind" by issuing a fence in the middle of that interval - the
assertion is made at the init call and remains in force until a successful sync.
Original comment by danbonachea
on 4 Aug 2012 at 9:21
Without the half-fence though, you have no way of implementing the very
important use case of notifying another thread that the async transfer has
completed without doing a MUCH more expensive full fence, as any relaxed
operation may be reordered before the gsync. That is the prime motivator of
our insistence on including the half-fence on the gsync, as this use case is
one of the most important to our customers. Such a half fence may seem useless
(or even too expensive) on systems where its performance is roughly in line
with that of a full fence, but there are systems where the full fence is
significantly more expensive.
Original comment by sdvor...@cray.com
on 4 Aug 2012 at 4:47
"The sole source of this semantic relaxation is the introduction of a "transfer
interval", between the init and sync call, where the application promises "no
threads are looking at my src/dst buffers", and therefore the implementation is
free to perform the data transfer operations inside that interval without
worrying about any of the activities of the ongoing computation."
So in other words, use of the async operations tells the implementation to
ignore the memory model for a given range of memory locations until the gsync
is reached. The memory model is confusing enough already without the
additional headache of language constructs that ignore it.
Original comment by sdvor...@cray.com
on 4 Aug 2012 at 5:09
"Without the half-fence though, you have no way of implementing the very
important use case of notifying another thread that the async transfer has
completed without doing a MUCH more expensive full fence, as any relaxed
operation may be reordered before the gsync."
I acknowledge this is an important usage case, one that we refer to as a
"signalling put", ie performing a memput and notifying the target of arrival.
We've seen this application desire in other places as well. I agree that a user
who is given ONLY the async library and no additional tools would need to
implement this using a flag write - under the Berkeley semantics, he would need
to use a strict write after sync. The proposed Cray semantics include extra
semantics intended to allow him to use a relaxed write for signalling, however
as I previously expressed, using relaxed operations for synchronization seems
highly "sketchy" and completely contrary to the UPC memory model philosophy and
officially stated "best practice". I'm not even convinced this is guaranteed to
be correct, lacking a formal proof (and a formal definition of "half-fence").
Even if it works for this very specific case, encouraging the use of relaxed
writes for synchronization as a UPC programming practice seems like a very Bad
Idea and likely to lead to nightmarish race condition bugs for users.
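For reference, a rough sketch of that flag pattern (upc_memput_nb/upc_sync/
upc_handle_t and <upc_nb.h> are placeholder names); the strict write and strict
read of the flag supply the inter-thread ordering:

#include <stddef.h>
#include <upc.h>
#include <upc_nb.h>    /* placeholder header for the NB extension */

strict shared int done;   /* flag; statically zero-initialized */

/* producer thread */
void producer(shared void *dst, const void *src, size_t nbytes)
{
  upc_handle_t h = upc_memput_nb(dst, src, nbytes);
  /* ... computation overlapped with the transfer ... */
  upc_sync(h);   /* transfer complete with respect to this thread */
  done = 1;      /* strict write: the transfer is made visible to all threads
                    before the flag update can be observed */
}

/* consumer thread (the one that will read dst) */
void consumer(void)
{
  while (!done)  /* strict read: spin until the flag is set */
    ;
  /* the transferred data is now guaranteed to be visible here */
}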
In any case, I would argue the correct solution to this application requirement
is NOT to saddle the async primitives with additional semantics that allow the
user to "roll his own" questionable synchronization. A far better solution to
that usage case is to introduce a DIFFERENT library function that encapsulates
exactly the semantics required - ie perform a non-blocking memput and update a
flag at the target when it arrives. Such an interface is more user-friendly,
less error-prone, and ADDITIONALLY has the potential to eliminate an entire
network round-trip for the signal write on a loosely-coupled system. We've been
talking about introducing such an interface for a long time, and Berkeley UPC
includes a working prototype implementation, described here:
http://upc.lbl.gov/publications/upc_sem.pdf
If this is truly an important usage case for your customers, then I suggest we
split that discussion into a separate issue and consider a library call to meet
that need.
Let's set aside that usage case for the moment and assume we independently
arrive at a library solution that provides an encapsulated solution for that
application requirement. With that in place, can we agree to remove these
potentially costly half-fence semantics from the proposed interface? The fact
that Cray can implement them efficiently on one platform of interest does not
justify their serious potential performance impact on other hardware.
Original comment by danbonachea
on 4 Aug 2012 at 5:20
"I'm not even convinced this is guaranteed to be correct, lacking a formal
proof (and a formal definition of "half-fence")."
We already have an implicit partial definition, as it is what prevents updates
to the same memory location from being reordered. A gsync simply becomes a
single relaxed operation that covers all memory locations.
Original comment by sdvor...@cray.com
on 4 Aug 2012 at 5:30
"The fact that Cray can implement them efficiently on one platform of interest
does not justify their serious potential performance impact on other hardware."
The existing memory model already imposes serious performance impacts on large
distributed memory systems (including at least Cray and IBM systems), as they
have to jump through enormous hoops to prevent the reordering of operations to
the same memory location. These large distributed memory systems are also
those that benefit the most from async operations, as the remote memory latency
is so large, and the bandwidth relatively low compared to local accesses. Are
you seriously saying that we should constrain those systems even more because
of concerns with the relatively small impact on smp systems that don't have
much to gain from using the asyncs in the first place?
Original comment by sdvor...@cray.com
on 4 Aug 2012 at 5:49
"So in other words, use of the async operations tells the implementation to
ignore the memory model for a given range of memory locations until the gsync
is reached. The memory model is confusing enough already without the
additional headache of language constructs that ignore it."
I'm sorry but you're still not getting it. I'm not proposing to ignore the
memory model. I think what's lacking is an understanding of what the memory
model actually guarantees - I highly recommend you go re-read the formal
semantics in Appendix B (the actual model, not the Cliff notes in 5.1).
The memory model is NOT an operational description of a virtual machine, nor
does it prescribe the contents of memory, even in the abstract. It is sometimes
convenient to think about and discuss it in an operational sense, but that is
NOT the basis of the formalism, and ultimately that mode of reasoning may be
misleading and diverge from the true guarantees.
The memory model is defined entirely in terms of relaxed and strict reads and
write operations, and for a given execution trace of a VALID UPC program it
determines whether the execution was "UPC Consistent", in that one can
construct the appropriate partial orders <_t and total order <_strict that
satisfy the guarantees it provides. I'm not going to paste in the entire
formalism here - it's all in appendix B. However, a VERY important and
deliberate property of the model is that it does not make any guarantees about
the possible results of operations that did not occur. Stated another way, if
the execution trace did not directly "observe" a violation of the model, then
the execution is consistent, regardless of what tricks the implementation may
be playing under the covers (whose effects were not observed by any application
read operations).
The Berkeley semantics for async are that it is ERRONEOUS for any application
thread to modify the source buffer or in any way access the destination buffers
during the transfer interval, between the library initiation call and the
successful return of library sync call. A program that "cheats" and touches
these buffers in the forbidden interval is an INVALID UPC program, and the
memory model does not provide ANY guarantees whatsoever for an invalid program.
Valid UPC programs Shall Not touch those buffers within the transfer interval,
and this property makes it IMPOSSIBLE for them to observe exactly how those
buffers were accessed by the library, and how those accesses may or may not
have been affected by other non-related fences or synchronization constructs.
Because all executions of valid programs are prohibited from observing any
violations, by definition the memory model is preserved and the executions are
"UPC Consistent". This is the essence of how the memory model works - if VALID
programs cannot tell the difference, then the model is not violated.
Original comment by danbonachea
on 4 Aug 2012 at 5:50
I understand the memory model argument completely. My qualm is with the
ERRONEOUS part, as I think it is both confusing to programmers and difficult to
detect. That combination will lead to programming mistakes that are extremely
hard to debug. Simply saying "this program is invalid, so the behavior is
undefined" is a nice cop-out for the language designer, but it's not so nice
for the programmers.
Original comment by sdvor...@cray.com
on 4 Aug 2012 at 6:03
" My qualm is with the ERRONEOUS part, as I think it is both confusing to
programmers and difficult to detect. That combination will lead to programming
mistakes that are extremely hard to debug."
I'm sorry but that's the very semantic basis of what's involved in any
explicitly asynchronous transfer library. This is precisely the additional
complexity that sophisticated users are accepting when they choose to use any
async library. If users cannot figure out how to leave the buffers untouched
during the transfer interval, then they have no business using an asynchronous
library.
Are you seriously arguing that we should care about the observed behavior of
ERRONEOUS programs? I can easily devise many erroneous programs that lead to
very bizarre and inexplicable behaviors on any system of your choice, without
even touching the UPC libraries. Our task as specification writers is to
clearly define the contract between the user (who writes programs which the
spec judges to be VALID) and the implementation (which generates executions
with the guaranteed behavior for those valid programs).
Original comment by danbonachea
on 4 Aug 2012 at 6:11
"Are you seriously arguing that we should care about the observed behavior of
ERRONEOUS programs?"
No, I'm arguing that async transfers should be defined as relaxed operations,
and not excuse them from the rules regarding the ordering of relaxed operations
in the presence of a fence. Then we don't need to bother with your proposed
cop-out in the first place. Does this argument mean that some systems won't
benefit as much from the asyncs? Yes. But those same systems get more benefit
from the existing routines, so it balances out nicely.
Original comment by sdvor...@cray.com
on 4 Aug 2012 at 6:22
I don't think it's productive to continue this discussion in the present mode.
It really feels like this is devolving into a textual shouting match, which is
not a useful form of idea-sharing, collaboration or consensus building. I
believe both sides have stated their positions, but the discussion has drifted
from impartial analysis of the core issues to "I like my way, I hate your way,
lets see how I can make the opposite side look ridiculous".
This is obviously a highly contentious issue, involving significant semantic
subtlety and non-trivial implications for existing implementations. I believe
some impartial moderation is called for, and some face-to-face (or at least
voice-to-voice) interaction. I believe one of the action items from Friday's
telecon was to setup a telecon devoted to this issue amongst interested parties.
Can we try that as a next step for making progress on some of these issues?
Original comment by danbonachea
on 4 Aug 2012 at 6:25
That's probably a good idea.
Original comment by sdvor...@cray.com
on 4 Aug 2012 at 6:28
Taking a step back and looking at the currently archived discussions, it seems
to me that at the core of the disagreement is that the Cray and Berkeley
proposals are trying to meet different needs. The analogy that comes to mind
is automotive: Berkeley has designed a "manual transmission" API and Cray has
designed "automatic transmission". Each has its place in the world, but we are
now trying to pick exactly one to go into the UPC spec.
I have always believed that the Berkeley design is correct for the goals it is
meant to address, and I suspect that if I understood the background of Cray
proposal better I would also find it equally well suited to its design goals.
So, I'd like to suggest that on the conference call we might start by trying to
discuss/understand the GOALS rather than the various issues regarding the
designs that have evolved to meet those goals.
Since the Berkeley semaphore/signaling-put extension (see
http://upc.lbl.gov/publications/upc_sem.pdf)* which Dan recently mentioned is
(I believe) intended to address synchronization goals vaguely similar to Cray's
async memcpy proposal, it may be helpful to at least skim that document before
the call.
* NOTE: upc.lbl.gov is down right now, but expected to come back online about
noon Monday, Pacific time.
In the meantime I've made the proposal available at
https://upc-bugs.lbl.gov/~phargrov/upc_sem.pdf
Original comment by phhargr...@lbl.gov
on 5 Aug 2012 at 8:19
"I don't think it's productive to continue this discussion in the present mode.
It really feels like this is devolving..."
No, you started a useful discussion. If it has felt as if it is devolving, please be aware that this discussion comes at a relatively late phase, after a subcommittee was formed to develop a consensus proposal. I'm not sure why you weren't involved on the BUPC side of things. I agree that we need to discuss this issue at length on the recently scheduled telecon, but I don't think that precludes discussion online because the telecon isn't for two more weeks.
"the core of the disagreement is that the Cray and Berkeley proposals are
trying to meet different needs. The analogy that comes to mind is automotive:
Berkeley has designed a "manual transmission" API and Cray has designed
"automatic transmission". Each has its place in the world, but we are now
trying to pick exactly one to go into the UPC spec."
When working on the consensus proposal, I viewed the difference as BUPC and Cray starting from different origins, but both wanting a solution that is consistent with the memory model and useful to users. I saw BUPC as starting with the memory model, fitting in _async extensions, and then attempting to make them useful to users by exempting the extensions from normal fence semantics. I saw Cray as starting with a description of what the users wanted to do, writing _nb extensions to let them do it, then making them fit into the memory model without losing their utility by introducing the half-fence concept.
While we're talking philosophy here, I think it's very important in this discussion that we not lose sight of the UPC spec as being the primary mechanism whereby users can find out how the language -- and presumably their compiler -- works. UPC isn't like C or C++ where users can find zillions of books and online resources to help them out. We should try to minimize putting things in the spec where the spec says one thing but 99% of implementations will do something that is apparently completely different but compliant. For example, the reason that we're even discussing this problem is that the existing upc_mem* functions are blocking on many implementations. The spec doesn't make them blocking, and any users reading the spec will see that they just wrap up a bunch of relaxed accesses into a convenient function call, but the functions generally are blocking and performance-conscious users must think of them that way. To continue that example, I believe that users will view a non-blocking call as initiating the copy before it returns because most implementations will do that. If the spec does not require that behavior, then we're again in the same confusing situation where there are basically two standards: (1) the UPC spec, and (2) how most UPC implementations work.
Original comment by johnson....@gmail.com
on 6 Aug 2012 at 4:45
After a chat about the NB memcpy proposals with one of my users, I
thought I should pass one thing he said:
> I do strongly feel that the semantics for nonblocking reads/writes
> should be the same as for the nonblocking collectives (if and when
> they get implemented). So any discussion of this should take that
> into account, even though the collectives are in a different
> proposal. (I don't see really needing the extra flexibility of
> the Berkeley proposal for reads and writes, but I'm less sure
> about collectives.)
Original comment by nspark.w...@gmail.com
on 6 Aug 2012 at 9:49
"While we're talking philosophy here, I think it's very important in this
discussion that we not lose sight of the UPC spec as being the primary
mechanism whereby users can find out how the language -- and presumably their
compiler -- works."
Philosophically, I strongly disagree that the spec should be geared as a
training tool for users, or as a substitute for vendor-provided documentation
of implementation-specific behaviors. Behavioral descriptions of particular
implementations or even expected implementations have no place in a formal
language spec. The specification is a contract between all users and all
implementations, and historically the UPC spec always strived to specify
necessary and sufficient semantics - ie the minimally necessary restrictions on
the implementation to provide sufficient functionality for the user. As the
spec gains implementation restrictions and operational codification of
behavior, you reduce the space of legal implementations and optimizations,
potentially leading to performance degradation. Programming languages have a
much longer life cycle than hardware systems, so as language writers we must be
sensitive not only to current implementation strategies and platforms, but must
also do our best to allow for improvement via future strategies and hardware.
It's difficult to accurately predict where hardware will be in 5 or 10 years,
but minimizing spec requirements to necessary and sufficient conditions gives
us the most "wiggle room" to accomodate a changing hardware landscape in the
future.
"the reason that we're even discussing this problem is that the existing
upc_mem* functions are blocking on many implementations. The spec doesn't make
them blocking, and any users reading the spec will see that they just wrap up a
bunch of relaxed accesses into a convenient function call, but the functions
generally are blocking and performance-conscious users must think of them that
way. "
To address your specific point about existing upc_mem* behavior, there is a
very important semantic difference between "blocking" (ie synchronous) and
strict (ie surrounded by fences that prevent access movement). These may happen
to have similar performance characteristics under naive translation on a
current distributed system, but are quite different on current systems with
hardware shared memory support. One could imagine future systems with better
hardware support for UPC where the difference could be even more significant.
The difference is also quite important as far as the compiler is concerned -
the relaxed semantics of upc_mem* allows for a good optimizer and/or a smart
runtime system to intelligently reorganize and schedule the data transfer,
using only serial/local data analysis. The appearance of any fences severely
limits what an optimizer can do, because full parallel analysis with complete
program information is usually required for provably correct transformations
around fences. The fact that some implementations make no effort to exploit
this semantic does not mean that the spec should be written to preclude such
optimizations, which is why upc_mem* has the semantic specification that it
does.
" To continue that example, I believe that users will view a non-blocking call
as initiating the copy before it returns because most implementations will do
that. If the spec does not require that behavior, then we're again in the same
confusing situation where there are basically two standards: (1) the UPC spec,
and (2) how most UPC implementations work."
I see no compelling reason to require implementations to issue all accesses
before returning from initiation, even in software. I can easily imagine
implementations that could improve throughput under high load by delaying
initiation based on network status. At the hardware level, we WANT the
operations to be "in-flight" for as much of the transfer interval as required
(that's the entire point of using an async library), and the asynchronous agent
(eg the RDMA engine) should have the freedom to initiate accesses that perform
the transfer when appropriate based on network resource status.
Original comment by danbonachea
on 6 Aug 2012 at 10:54
Cray's proposal is trying to solve the problem that the same-address
restriction prevents the compiler/run-time library from making the existing
upc_mem* routines non-blocking on machines where the hardware provides multiple
paths to (remote) memory and ordering must therefore be guaranteed in
software. Most high-performance scalable networks (including both Cray's and
IBM's current offerings) are designed in this way, as it provides greater
bandwidth and improved resilience against hardware failures. Looking ~10 years
out, we don't see this situation changing significantly, as most networks are
moving more and more in this direction. We therefore don't believe it is
reasonable to expect hardware support on large distributed memory systems for
the foreseeable future.
To enforce the ordering in software, an implementation must track operations
that are "in-flight" and resolve conflicts in some way. One proposed approach
to this is software caching of relaxed accesses. However, we do not believe
this is a viable approach (in the context of this discussion) for large systems
for the same reason it's not done it hardware: lack of memory. The size of
your cache determines the upper limit on the amount of data you can have
in-flight. Non-blocking operations are most useful when you have a lot of data
to move, and the cache must be relatively small so there's still enough room
for user data. It is also complex to implement and can easily hurt performance
more than it helps without per-application tuning.
Another approach is to track which shared memory locations have operations that
are in-flight, and insert syncs of some kind when a conflict is detected.
There's still a memory problem, but instead of large contiguous transfers being
the problem, smaller "random-access" transfers kill the scalability of this
approach, as the implementation can't efficiently store lots of "random"
scattered memory addresses, and must therefore rely on much more coarse
tracking. I believe this is what IBM, on one of the earlier phone conferences,
claimed to be doing (with a bit-vector permitting a single "in-flight"
operation per remote thread/node, if I recall correctly). Cray does something
similar to this, with the caveat that upc_mem* routines are always synced
before returning for various reasons. However, there is a noticeable overhead
to this tracking, particularly on some important (to our customers) access
patterns.
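For concreteness, a rough sketch of what this kind of tracking looks like
(this is only an illustration at page granularity, not Cray's or IBM's actual
mechanism; all names are made up):

#include <stddef.h>
#include <stdint.h>

#define TRACK_SLOTS 1024            /* fixed memory budget for tracking */
#define PAGE_SHIFT  12

static uint64_t inflight[TRACK_SLOTS];   /* pages with pending nb operations */
static size_t   num_inflight;

static void wait_all_inflight(void)
{
    /* Drain every pending network operation (details omitted). */
    num_inflight = 0;
}

/* Runtime hook called before issuing any relaxed access to [addr, addr+len). */
static void check_conflict(uintptr_t addr, size_t len)
{
    uint64_t first = addr >> PAGE_SHIFT;
    uint64_t last  = (addr + len - 1) >> PAGE_SHIFT;
    for (size_t i = 0; i < num_inflight; i++)
        if (inflight[i] >= first && inflight[i] <= last) {
            /* Possible same-address conflict: be conservative and sync. */
            wait_all_inflight();
            return;
        }
}

/* Runtime hook called when a non-blocking transfer is issued.  "Random"
   scattered addresses fill the table quickly, forcing early syncs -- the
   scalability problem described above. */
static void record_inflight(uintptr_t addr, size_t len)
{
    for (uint64_t pg = addr >> PAGE_SHIFT;
         pg <= (addr + len - 1) >> PAGE_SHIFT; pg++) {
        if (num_inflight == TRACK_SLOTS)
            wait_all_inflight();
        inflight[num_inflight++] = pg;
    }
}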
Other approaches either can't handle relatively common corner cases (static
compiler analysis) or don't take advantage of available hardware offload
mechanisms and have other scalability issues (active messages/RPC). We
therefore need some help from the user to get around this.
The "half-fence" that we proposed on the global sync formally provides acquire
semantics on relaxed accesses. This is necessary to permit pairwise
synchronization with a remote thread via relaxed operations to notify that
thread that the non-blocking operation is complete. It is important that this
be done with relaxed operations, as using strict operations would unnecessarily
sync other non-blocking operations (which may include much more than simply the
user's explicit use of the proposed routines!). If another method of providing
this functionality is made available, either via a new type of fence
(upc_fence_acquire/upc_fence_release?) or Berkeley's semaphore proposal (which
I haven't read yet), then I don't think we'd have a problem dropping this part
of our proposal.
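A sketch of the pairwise-notification pattern the half-fence is meant to
enable (the _nb/handle spellings follow the attached proposals and are not
final; the flag is an ordinary relaxed shared variable):

#include <upc_relaxed.h>   /* plus whatever header declares the proposed _nb routines */

#define N 1024
shared [N] double buf[THREADS][N];
shared int        done[THREADS];   /* default layout: done[i] lives on thread i */

void producer(const double *src, int peer)
{
    upc_handle_t h = upc_memput_nb(&buf[peer][0], src, N * sizeof(double));
    /* ... unrelated work overlapping the transfer ... */
    upc_sync(h);        /* global sync; the proposed half-fence guarantees the
                           relaxed write below cannot be observed before the
                           transferred data                                    */
    done[peer] = 1;     /* relaxed notification -- no strict access, so other
                           outstanding nb operations are left undisturbed      */
}

void consumer(void)
{
    while (done[MYTHREAD] == 0)   /* relaxed poll for the producer's flag */
        ;
    /* Under the proposed semantics, buf[MYTHREAD] is now complete. */
}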
In terms of spec changes, I believe our proposal is much more conservative than
Berkeley's. Importantly, the new restrictions on accessing memory locations
involved in a call to one of the proposed routines apply ONLY to the calling
thread in our proposal. As far as all the other threads are concerned, the
proposed routines behave just like the existing upc_mem* routines, and thus no
changes to the memory model are required--minus the "half-fence", which I think
Dan has convinced me could be better provided in a different manner. The
proposed routines are simply another way to perform relaxed shared memory
accesses, with the benefit/caveat that the same-address restriction is lifted
between the initiation of the operation and the sync. We believe this behavior
is sufficient to provide the amount of communication/computation overlap users
desire without adding significant additional complexity to the memory model.
We DO NOT believe permitting non-blocking operations to continue beyond a fence
provides any useful additional functionality (perhaps you could provide an
example where this is necessary?). We DO believe that allowing it will confuse
users who expect upc_fence (or worse, a UPC barrier!) to be a full memory
barrier. Additionally, it is a non-trivial task for the implementation to
detect and warn users when they've (hopefully accidentally) written "illegal"
code that accesses memory locations involved in a call to one of the proposed
routines on a thread other than the calling thread before the sync, so
implementations will be hard-pressed to aid the user in debugging the problems
that this will cause. We previously proposed adding a class of "super-relaxed"
operations, which were relaxed operations that didn't have the same-address
restriction. It was rejected because of concerns that it would be too confusing
to users and add too much complexity to the memory model. I can't imagine this
is any less confusing, given that the legality of a user's code won't be
immediately obvious nor easily provable in all cases.
"Taking a step back and looking at the currently archived discussions, it seems
to me that at the core of the disagreement is that the Cray and Berkeley
proposals are trying to meet different needs. The analogy that comes to mind
is automotive: Berkeley has designed a "manual transmission" API and Cray has
designed "automatic transmission". Each has its place in the world, but we are
now trying to pick exactly one to go into the UPC spec."
I think this is exactly the case, though I don't quite understand your
automatic versus manual transmission analogy. To my mind, a better analogy
would be traffic at a street light. Cray proposed a system that allows the
user to say "trust me, I'll make it through before it turns red" to allow
vehicles to continue when the light turns yellow, but doesn't allow anyone
through a red light. Berkeley proposed letting some vehicles go right through
a red light, and then, if an accident occurs, denying the insurance claim of
the vehicle that had the green light by declaring its driving "illegal".
Original comment by sdvor...@cray.com
on 7 Aug 2012 at 2:13
Because our common goal is to develop a consensus proposal, may I propose the
following: let's discuss the disagreement points one by one instead of
referring to the whole proposal. I think there are good points on both sides
so why not combine and agree on the best.
Here is my attempt to summarize the current disagreements:
1) Should upc_fence (strict memory ops in general) guarantee the completion of
outstanding non-blocking memory operations?
A subcommittee of 5 people (including myself) had agreed to "Yes".
But since there are some different opinions now, let's revisit this issue.
2) Should the "sync" calls have fence/half-fence semantics?
3) Should there be both local and global sync functions?
4) Function naming (minor)
Please add and/or change the discussion points if you have any others. I hope
the list of disagreements will converge to zero as our discussion goes along.
Original comment by yzh...@lbl.gov
on 7 Aug 2012 at 4:27
[deleted comment]
"let's discuss the disagreement points one by one instead of referring to the
whole proposal. I think there are good points on both sides so why not combine
and agree on the best."
Yili- You've summarized the low-level technical differences between the two
approaches, but I don't think that's the correct level of discussion at this
time. I think what these discussions have revealed is that the two proposals
differ in the details because they were designed with a different set of
high-level goals and to satisfy a different set of user needs. The
technical details mostly follow logically from those differing goals. We cannot
arrive at a consistent and well-designed interface by resolving technical
points in a vacuum, without first straightening out the high-level goals of the
interface.
sdvormwa@cray.com:
"The proposed routines are simply another way to perform relaxed shared memory
accesses, with the benefit/caveat that the same-address restriction is lifted
between the initiation of the operation and the sync. We believe this behavior
is sufficient to provide the amount of communication/computation overlap users
desire without adding significant additional complexity to the memory model."
I think Paul is correct that we need a high-level discussion about goals of the
interface. Alleviating the same-address restriction is nice, but is NOT the
major goal the Berkeley proposal was trying to accomplish. Conflicting writes
to the same address from a single thread with no intervening fence is not a
pattern we expect in well-tuned applications, because it represents a neglected
opportunity for communication coalescing. That being said, it may occasionally
happen and still needs to be handled correctly, but it's not the case we're
most interested in tuning for. Neither are we designing the async memcpy
library to specifically serve as a "signalling put" - this is an important
usage case that we feel deserves its own separate library interface and should
not be conflated with pure asynchronous data movement.
Our goal with the Berkeley async transfer library was to enable far more
aggressive overlap of communication with unrelated computation and other
communication. We are trying to overlap the entire cost of a communication, and
allow it to asynchronously continue undisturbed without interference from
unrelated operations. The boundaries of the asynchronicity are defined by the
init and sync library calls (as part of the "contract" between the app and
library), not by random fences that may happen to occur in the unrelated code.
The need we are trying to meet is the user explicitly asserts "perform this
transfer in the background, and I will explicitly call you again when I need to
ensure it has completed" - this is a familiar paradigm in other parallel
libraries. I think it would be more surprising to the user who has invoked the
async library to find that, when he calls an unrelated application module
written in UPC, the async transfers for his module suddenly stop achieving
overlap because the callee module uses a fence somewhere to synchronize some
completely unrelated data.
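For concreteness, a sketch of that scenario (all function names are
placeholders; the _nb spellings are the proposed ones):

/* Separately compiled module the caller knows nothing about. */
void third_party_log(const char *msg)
{
    (void)msg;
    /* ... updates some unrelated shared bookkeeping ... */
    upc_fence;          /* synchronizes data that has nothing to do with the caller */
}

void compute_step(shared void *dst, const void *src, size_t n)
{
    upc_handle_t h = upc_memput_nb(dst, src, n);

    third_party_log("step started");   /* under fence-completes-everything
                                          semantics, the hidden fence silently
                                          blocks here and the overlap is lost  */
    /* ... the computation that was meant to overlap the transfer ... */

    upc_sync(h);                       /* the completion point the user
                                          actually asked for                   */
}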
Original comment by danbonachea
on 7 Aug 2012 at 6:21
"Conflicting writes to the same address from a single thread with no
intervening fence is not a pattern we expect in well-tuned applications,
because it represents a neglected opportunity for communication coalescing."
That is not the problem though. The issue is that unless the implementation
can PROVE there are no conflicting writes, it must conservatively assume there
are, which impacts just about all codes. Good compiler analysis can help in
some cases, but there are important cases that it can't help with, usually due
to other language design decisions--separate compilation probably being the
most obvious. Runtime caching / tracking / coalescing can all help sometimes
as well, but the memory overhead limits their usefulness, and they tend to not
scale well beyond a certain number of threads.
Original comment by sdvor...@cray.com
on 7 Aug 2012 at 8:07
I'm not sure if there is any substantial difference in the high-level goals of
this extension -- skipping the adjectives, isn't the high-level goal the same
on both sides: enable communication/computation and communication/communication
overlaps?
(Note: I would like to save the discussion about the half-fence-at-sync for a
different post.)
Actually, for many common cases where no fence is used between nb init and nb
sync, both the original Berkeley and Cray proposals behave similarly, if not
the same. The main disagreement is on how to handle the special case when a
fence is used between an init and the corresponding sync.
danbonachea:
"Alleviating the same-address restriction is nice, but is NOT the major goal
the Berkeley proposal was trying to accomplish. Conflicting writes to the same
address from a single thread with no intervening fence is not a pattern we
expect in well-tuned applications, because it represents a neglected
opportunity for communication coalescing. That being said, it may occasionally
happen and still needs to be handled correctly, but it's not the case we're
most interested in tuning for. "
I think "alleviating the same-address restriction" is NOT a goal but a
Mechanism to achieve the goal of overlapping. Because of the same-address
restriction, the UPC compiler/runtime cannot perform reordering optimization
for 99% of common cases where there are actually no same-address accesses but
the compiler/runtime just cannot prove its absence. Another way to view the nb
memcpy functions is that they provide a library approach for users to express
"super relaxed" data accesses.
I like Steve's analogy of "allowing outstanding non-blocking memory operations
to pass a fence is like allowing cars to pass a red light". While there could
be special situations to justify such violations, I generally prefer to obey
the traffic laws.
Original comment by yzh...@lbl.gov
on 7 Aug 2012 at 8:26
"The issue is that unless the implementation can PROVE there are no conflicting
writes, it must conservatively assume there are, which impacts just about all
codes."
I completely agree - this is ONE of the main motivations for an explicitly
asynchronous library. My point is that it's not the ONLY reason for using
such a library and not the sole design goal, as your text I quoted in comment
#43 seems to indicate. Specifically, it is not "sufficient" for the library to
provide a tool to suppress the "same-address" restriction, we also want the
semantics to enable full overlap of the communication with other, fully-general
and unrelated activity (which the user asserts does not touch the transfer
buffers).
Original comment by danbonachea
on 7 Aug 2012 at 8:38
" isn't the high-level goal the same on both sides: enable
communication/computation and communication/communication overlaps?"
Both sides probably agree to that broad statement, but we need a more detailed
and concrete description of the types of usage cases we wish to support, and
how the library fits into those cases.
"I like Steve's analogy of "allowing outstanding non-blocking memory operations
to pass a fence is like allowing cars to pass a red light". While there could
be special situations to justify such violations, I generally prefer to obey
the traffic laws. "
I don't think we should be debating formal semantics by analogy.
However, since people seem seduced by the analogy, I think Steve's
characterization is flawed. I think the Berkeley semantics are better described
as an overhead expressway - it bypasses all the city traffic below and is
unaffected by city traffic lights, because the laws of the road guarantee the
cars below cannot even SEE the highway traffic, let alone interact with it. The
on-ramps and off-ramps are clearly defined by the library calls which define
where cars enter and exit the normal flow of relaxed operations on the city
streets, but while they're "in-flight" on the expressway they operate
completely independently of everything else.
Original comment by danbonachea
on 7 Aug 2012 at 9:17
"Conflicting writes to the same address from a single thread...not the case
we're most interested in tuning for"
Same for us. It's a rare case that has unfortunate performance consequences for the common case in at least two vendor implementations. We don't optimize for it happening; we try to deal with it in a way that minimizes the impact that its very existence has on the common case.
"Neither are we designing the async memcpy library to specifically serve as a
signalling put"
Cray calls this a put-with-notify and we're interested in that functionality becoming part of UPC. If it is separate from the _async/_nb functions, then so be it, but it does mean introducing more library functions than if _async/_nb could be used instead.
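For readers unfamiliar with the term, a put-with-notify couples the data
transfer with a flag update that becomes visible at the target only after the
payload has landed. A purely hypothetical prototype (no agreed spelling
exists; this is only to show how it differs from a plain nb put):

/* Hypothetical interface, for illustration only: deliver n bytes to dst,
   then make *flag == value visible at the target, with the guarantee that
   the flag cannot be observed before the payload. */
upc_handle_t upc_memput_notify(shared void *restrict dst,
                               const void *restrict src, size_t n,
                               shared int *flag, int value);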
"The boundaries of the asynchronicity are defined by the init and sync library
calls...not by random fences"
"Because all executions of valid programs are prohibited from observing any
violations, by definition the memory model is preserved and the executions are
"UPC Consistent". This is the essence of how the memory model works - if VALID
programs cannot tell the difference, then the model is not violated." [Comment
#30]
Let me paraphrase that to make sure that I've got it, and then come at this from a slightly different angle than I have before. I still have my previous objections about the async fence behavior, but I want to look at upc_barrier because I think users will find that more surprising...
The BUPC async proposal adds something to UPC that violates the memory model and then hides the fact that the memory model is being violated by declaring that otherwise legal programs that could observe the violation are now illegal. For example, normally it is legal for two threads to modify the same data from opposite sides of a barrier and I could use this legal behavior to detect the async memory model violation, but instead it is declared that if there is an unsynchronized async to this data, then my program is illegal; i.e., even if I can run my program and demonstrate the memory model violation, the evidence is inadmissible.
I don't think that this approach is valid for extending UPC (at least not in the backwards compatible manner that we want for UPC 1.3) because it could break the intent of existing code by removing the only mechanism that the programmer has to ensure that there is no ongoing communication: upc_barrier. If I have a collective function in an existing library, I may have used a upc_barrier upon entry and exit to ensure that I can do what I want with any memory in between. Currently this is a foolproof way of guarding against what comes before and after my collective library function and the only burden on my client is to call the function in a collective context. With asyncs added, my barriers no longer offer complete protection and the burden shifts to the library client to ensure that any asyncs touching the data do not cross a call to this function somewhere in their call graph.
I can see an argument that the library code is still legal and client code just needs to be more careful with the new language feature, but I don't think it's a very nice thing to do to people in a 1.2 -> 1.3 change because it essentially changes the contracts of existing functions. The contract here changes from "I promise to call this function in a collective context" to "I promise to call this function in a collective context and further promise not to be asynchronously touching any memory that the function may touch." This change is particularly awkward if the client doesn't have complete knowledge of all memory locations that the function may touch.
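A sketch of that library scenario (collective_scale is a made-up pre-1.3
routine; the _nb spellings are the proposed ones):

#include <upc_relaxed.h>   /* plus the header for the proposed _nb routines */

#define N 1024
shared [N] double data[THREADS][N];

void collective_scale(double factor)      /* contract: call in a collective context */
{
    upc_barrier;                           /* today: nothing can still be in flight here */
    for (size_t i = 0; i < N; i++)
        data[MYTHREAD][i] *= factor;
    upc_barrier;
}

void client(void)
{
    static double src[N];
    upc_handle_t h = upc_memput_nb(&data[(MYTHREAD + 1) % THREADS][0],
                                   src, N * sizeof(double));

    collective_scale(2.0);   /* if the async may cross the barriers inside, the
                                library's protection is gone and the burden of
                                avoiding the conflict shifts to this caller     */
    upc_sync(h);
}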
Original comment by johnson....@gmail.com
on 8 Aug 2012 at 4:42
I have four major concerns with allowing the routines to continue past fences.
The first two are philosophical, while the final two are potential future
problems I see as an implementer.
1. Allowing it adds restrictions on threads other than the calling thread.
This is counter-intuitive, at least to me, as the one-sided model implies
threads are independent outside of explicit inter-thread synchronization. If
the routines are synced by fences, other threads are not impacted by a thread's
use of these routines at all.
2. The existing memory model is difficult to understand, but complete. With
this change, the memory model is no longer complete, as we've introduced a
relaxed access with special rules that aren't reflected by the memory model.
We can (and did) go back and forth all day about whether or not this breaks the
memory model, but it certainly complicates the task of trying to understand it.
3. Violations of the access rules are relatively easy to detect on the calling
thread, either through static analysis or fairly cheap run-time checks.
Detecting violations on other threads is a much more difficult problem, as
every thread must be aware of every other thread's non-blocking operations.
This will make debugging extremely difficult.
4. I think this will eventually create a de facto memory model for the
"illegal" codes, which like it or not, users will end up writing. They'll find
that the undefined results are acceptable on one implementation, and then other
implementations will have to provide the same behavior for compatibility when
the users port their code. Since this could have very significant performance
(not to mention implementation design) implications, I'd much prefer to hammer
this out ahead of time rather than be stuck with a de facto definition that
hamstrings us later.
Additionally, I still don't see a motivating need for allowing these to pass
fences. While Dan's vague "what-if" scenario could indeed cause problems, I'm
having trouble coming up with a specific situation that it would apply to
(ignoring signalling puts/gets, which we've agreed to handle separately).
Could someone give a more concrete example where this functionality would be
required? Without some way of addressing the concerns I listed above, I don't
think we should be adding this to the spec unless we have a specific use-case
in mind--one that can't be done any other way. Undefined behavior should be a
last resort for specification writers, particularly when the trigger is so hard
to detect.
"I don't think we should be debating formal semantics by analogy."
Agreed. I just put it in there to lighten up the conversation after I didn't
understand Paul's analogy. That said, yours was pretty good, though highways
generally have actual physical barriers preventing city traffic from
interacting with the highway traffic.
Original comment by sdvor...@cray.com
on 8 Aug 2012 at 4:51
I have been asked to contribute an opinion here. It is a long thread, and a
passionate one. Of the several possibilities discussed, I extracted two that seemed
reasonable.
1) asynchronous memory operations have local completion semantics (i.e. waiting
for an async memput only guarantees that the send buffer is reusable after a
put). Fences operate on asynchronous operations just like on "normal" blocking
ones.
2) asynchronous memory operations have global completion semantics (i.e.
waiting for an async memput guarantees that it is now remotely complete as
well). Fences do not operate on asynchronous operations - indeed, there is no
point since just waiting already does the job.
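A small example of what the two options mean in practice (spellings as in the
proposals; the function and parameter names are placeholders):

#include <string.h>

void illustrate(shared void *remote_dst, void *send_buf, size_t nbytes)
{
    upc_handle_t h = upc_memput_nb(remote_dst, send_buf, nbytes);
    /* ... overlapped work ... */
    upc_sync(h);

    memset(send_buf, 0, nbytes);   /* legal under either option: after the
                                      wait the send buffer is reusable     */

    /* Option (1), local completion: the data is not yet guaranteed visible
       at the target; the next upc_fence or other strict access is what
       forces remote completion, exactly as for ordinary relaxed writes.
       Option (2), global completion: upc_sync(h) already guaranteed remote
       visibility, and this fence has nothing left to do for the transfer. */
    upc_fence;
}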
There were others half-mentioned (or maybe I misunderstood the heated dialogue)
- like remote memory ops that don't fence at all - that is, we *never* know
whether they have ever remotely completed. I will not consider such scenarios.
I prefer (1) over (2) (which puts me in cahoots w/ the Cray guys rather than
Dan, I think). Here is why: the (1) semantics is unsurprising. It is in
line with what I already know about UPC - that relaxed writes have local
completion semantics - I only know that send buffers can be reused when the
write returns. (1) is *also* in line with MPI and shmem, to the best of my
understanding - this may not be an argument for you, but sure is for me.
I'm not sure what you will say about processor ordering of asynchronous puts to
the same remote address. I would love it if you could make yourself say that
the *start* of the operation is what determines order - not when you wait for
completion. This, again, would be unsurprising. It can be implemented with some
effort - I will claim that the effort is the same as we are already making to
order blocking puts on an unordered network.
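For example, under initiation ordering (again using the proposed spellings,
with a placeholder function name):

void example(shared int *p)
{
    int a = 1, b = 2;
    upc_handle_t h1 = upc_memput_nb(p, &a, sizeof a);   /* started first  */
    upc_handle_t h2 = upc_memput_nb(p, &b, sizeof b);   /* started second */
    upc_sync(h2);
    upc_sync(h1);
    /* Ordering by initiation: *p ends up holding b, regardless of the order
       in which the waits were issued.  Ordering by when the waits occur
       would instead let the sync order determine the final value.          */
}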
You spent a lot of time talking about fences and their interaction with
half-finished asynchronous operations. This seems like a red herring to me - if
you are a crazy enough programmer to use non-blocking operations - need I
elaborate on the perils of non-blocking remote operations? - well, in that case
making sure that there are no pesky strict accesses, barriers and so forth
between the op start and the wait should be child's play.
If you end up going for (2), it's still kind of OK ... it's different, but
still has a kind of internal consistency. Fences would simply ignore
non-blocking operations. You would order remote puts w.r.t each other based on
when you wait for them - not when you start them. You could order remote puts
w.r.t. normal blocking puts by employing strategic fences (although you'd be
kissing goodbye to performance if you did that). It's serviceable ... but
personally I don't really like it; it's a much larger change relative to what
UPC users are used to in terms of ordering and fences.
My $0.02 in 1966 issue pennies ... if you have to flame me, do it gently.
Original comment by ga10...@gmail.com
on 10 Aug 2012 at 2:44
Original issue reported on code.google.com by
yzh...@lbl.gov
on 22 May 2012 at 11:41