I'm glad to hear we're making some progress on teasing out the "signalling-put"
usage scenario, and perhaps agreeing that it deserves a separate, dedicated
library call, to avoid conflating it with the desire for pure asynchronous data
movement free from inter-thread synchronization. If we at least have consensus
on that goal, then perhaps we can make progress on both extensions
independently.
However, I'm very dismayed the Cray team is insisting on continuing to claim
the Berkeley semantics somehow "violate" or "break" the memory model. This
claim is totally and completely FALSE. Repetition of this unproven and patently
false claim only demonstrates a lack of understanding of the Berkeley
semantics, the memory model, or perhaps both, and is not conducive to
productive discussion. I sincerely HOPE there is an honest lack of
understanding here, and that this is not some kind of underhanded negotiation
tactic attempting to discredit an opposing view with disparaging emotional
claims. In any case I hope we can drop this false preconception and
unsubstantiated mode of argument and have an open-minded discussion of the
factual issues in precise technical terms, devoid of sloppy generalizations.
This will be my last textual attempt to try and describe how the Berkeley
semantics fit into the memory model. If someone is still not "getting it" then
we should table the memory model discussion until the call. I have limited
internet access for the next week until the call anyhow.
The UPC memory model is not some touchy-feely generalization or hand-wavy set
of properties, nor is it anything like a virtual machine abstraction that one
might sketch on the standard UPC shared memory diagram. It is a formal
mathematical model that defines consistent execution traces of shared
read/write accesses in valid programs. The conditions for validity are very
explicitly and mathematically defined, and the guaranteed properties of strict
accesses (and by extension fences) are corollaries that follow from the formal
definition. The memory model is not automatically "violated" simply because it
exhibits a property that someone may find surprising or counter-intuitive, nor
because it may allow behaviors that have no analogue on one's favorite
architecture under naive translation. This is not a weakness in the definition;
it reflects an inherent complexity of relaxed memory models, which are by
nature subtle and often surprising or counter-intuitive.
Now all that being said, we all obviously strive to present users with language
and library tools that neatly wrap up useful functionality with an interface
that is as intuitive and unsurprising as possible, while still efficiently
providing the desired functionality. However, programmers who insist upon
writing invalid or perverse programs will end up with undefined or unintuitive
behaviors, respectively. We should not cripple the performance or utility of
our expected target usage scenarios in order to better accommodate these
outliers. Parallel programming is hard, and C is a low-level language -
programmers need considerable sophistication to use UPC correctly and
effectively, and that goes double for "advanced" features like explicitly
asynchronous data movement within a relaxed memory model. Debugging tools and
automated error detection are valuable tools, but ultimately we are dealing
with a non-type-safe HPC language with numerous "pointy edges", and the
feasibility of automated correctness checking should not take precedence over
functionality and performance concerns in evaluating new language features.
A VERY important and deliberate property of the memory model is that it does
not make any guarantees about the possible results of operations that did not
occur. Stated another way, if the execution trace did not directly "observe" a
violation of the model, then the execution is consistent, regardless of what
tricks the implementation may be playing under the covers (whose effects were
not observed by any application read operations).
The Berkeley semantics for async are that it is UNDEFINED BEHAVIOR for
any application thread to modify the source buffer or in any way access the
destination buffers during the transfer interval, between the library
initiation call and the successful return of the library sync call. A program that
"cheats" and touches these buffers in the forbidden interval is an INVALID UPC
program, and the memory model does not provide ANY guarantees whatsoever for an
invalid program. Valid UPC programs Shall Not touch those buffers within the
transfer interval, and this property makes it IMPOSSIBLE for them to observe
exactly how those buffers were accessed by the library, and how those accesses
may or may not have been affected by other non-related fences or
synchronization constructs. Because all executions of valid programs are
prohibited from observing any violations, by definition the memory model is
preserved and the executions are "UPC Consistent". This is the essence of how
the memory model works - if VALID programs cannot tell the difference, then the
model is not violated.
Troy wrote:
"I don't think that this approach is valid for extending UPC (at least not in
the backwards compatible manner that we want for UPC 1.3) because it could
break the intent of existing code by removing the only mechanism that the
programmer has to ensure that there is no ongoing communication: upc_barrier.
If I have a collective function in an existing library, I may have used a
upc_barrier upon entry and exit to ensure that I can do what I want with any
memory in between. Currently this is a foolproof way of guarding against what
comes before and after my collective library function and the only burden on my
client is to call the function in a collective context. With asyncs added, my
barriers no longer offer complete protection and the burden shifts to the
library client to ensure that any asyncs touching the data do not cross a call
to this function somewhere in their call graph."
I agree the property you mention is CURRENTLY true of pure UPC 1.2 programs as
a consequence of combining the memory model with the highly restricted set of
communication operations available in 1.2. It's not a direct property of the
memory model, it just happens to be true based on what you can currently
express in the 1.2 language. The property is notably NOT preserved once you
link in anything outside UPC, including MPI operations and even asynchronous
file I/O, both of which are common-place in enterprise applications. It's
debatable whether that property was originally intentional or even if it's a
desirable high priority moving forward. In any case the fact that admitting the
Berkeley semantics into the 1.3 toolbox can remove this guarantee does not
constitute a violation of the memory model, or any officially stated corollary
thereof. It's not a backwards compatibility issue because legacy programs don't
use the new features and therefore remain unaffected. Programmers who add calls
to the async library are burdened with ensuring the buffers remain untouched
throughout the transfer interval - this is precisely the additional complexity
that sophisticated users are accepting when they choose to use any async
library. If users cannot figure out how to leave the buffers untouched during
the transfer interval, then they have no business using an asynchronous
library.
However regarding this point, I think it's worth separating discussion of the
property no-asyncs-crossing-barriers (mentioned above) from
no-asyncs-crossing-fences (which includes barriers as a special case, but MANY
more operations as well). We could conceivably decide to require the init and
sync call to appear in the same barrier phase (and this could be easily checked
in a debug implementation), independently of what we decide regarding fences.
My inclination is that explicitly async transfer intervals should be permitted
to span either one (for reasons of composability), but I would like some user
feedback about how important Troy's property is to them.
ga10502 said:
"1) asynchronous memory operations have local completion semantics (i.e.
waiting for an async memput only guarantees that the send buffer is reusable
after a put). Fences operate on asynchronous operations just like on "normal"
blocking ones.
2) asynchronous memory operations have global completion semantics (i.e.
waiting for an async memput guarantees that it is now remotely complete as
well). Fences do not operate on asynchronous operations - indeed, there is no
point since just waiting already does the job."
We all appreciate your input, although unfortunately I don't think either of
your "summaries" above corresponds to the possibilities we're currently
discussing. The Berkeley semantics are very *close* to option #1, except the
fences you mention are only relevant to the async operation when they occur
outside the transfer interval, and fences during the transfer interval are not
relevant to the completion of the async. Option #2 is close to the Cray
proposal of several weeks ago, but I think consensus is moving towards
loosening those built-in fences and instead introducing a dedicated
signalling-put library to meet the need it was trying to address. Both sides
should probably summarize the semantics they are currently proposing - I'll
post a followup shortly that does this.
"I'm not sure what you will say about processor ordering of asynchronous puts
to the same remote address. I would love it if you could make yourself say that
the *start* of the operation is what determines order - not when you wait for
completion. This, again, would be unsurprising. It can be implemented with some
effort - I will claim that the effort is the same as we are already making to
order blocking puts on an unordered network."
This "effort" you mention is one of the primary performance bottlenecks we are
trying to alleviate by introducing this library. The programmer asserts there
are no conflicting accesses to the source or destination buffers for the entire
period delimited by the init/sync calls (the "transfer interval"), and the
implementation does a better job of communication overlap as a result. If you
need a stronger ordering guarantee, the existing upc_mem* libraries already
provide that semantic. Burdening the async library with such a semantic would
make it nearly indistinguishable from the existing calls and defeat the purpose
of introducing a new library.
"You spent a lot of time talking about fences and their interaction with
half-finished asynchronous operations. This seems like a red herring to me - if
you are a crazy enough programmer to use non-blocking operations - need I
elaborate on the perils of non-blocking remote operations? - well, in that case
making sure that there are no pesky strict accesses, barriers and so forth
between the op start and the wait should be child's play."
Your statement is a very strong argument for Berkeley's view regarding the
behavior of fences and other strict ops that appear in the transfer interval.
This is the largest point of remaining contention. However let me clarify that
neither proposal is currently *prohibiting* fences in the transfer interval -
the disagreement centers around what effect they have on the async operation in
progress. If you never place a fence or strict op in the transfer interval,
then the difference is irrelevant and both sides agree on behavior. However, I
think both sides want to allow strict ops during the transfer interval for
reasons of overlap and composability - ie so that UNRELATED computation and
communication can continue.
Original comment by danbonachea
on 10 Aug 2012 at 4:10
In preparation for the upcoming telecon discussion, I'm splitting out the key
sections of the old Berkeley proposal, which included a number of additional
functions that are not currently under consideration for inclusion. Both Cray
and Berkeley proposals MOSTLY agree on the use of handle types (and the
different flavors of handle-based synchronization), and the flavors of
memput/memget that should be provided - so to focus on the important issues and
avoid tedious duplication I'll show a single exemplar function. This text
obviously still needs a lot of work in translation to formal "spec-ese", but for
now I'm tending towards verbosity in the interests of clarity:
#include <upc_nb.h>
upc_handle_t upc_memcpy_nb(shared void * restrict dst, shared const void *
restrict src, size_t n);
Semantics
----------
1 This operation has the same data movement effect as the corresponding
upc_memcpy call defined in the UPC Language
Specification section 7.2.5, except it is split-phase.
2 The specified operation is initiated with a call to the above function which
returns an explicit handle representing the operation in-flight.
3 The operation remains "in-flight" until after a successful call to
upc_waitsync or upc_trysync on the returned handle.
4 The contents of all affected destination memory are undefined while the
operation is in-flight, and if
the contents of any source memory changes while the operation is in-flight, the
result is undefined.
5 The order in which non-blocking operations complete is intentionally
unspecified - the system is free
to coalesce and/or reorder non-blocking operations with respect to other
blocking or non-blocking operations,
or operations initiated from a separate thread - the only ordering constraints
that must be satisfied are those
explicitly enforced using the synchronization functions (i.e. the accesses
comprising the non-blocking operation are only guaranteed
to be issued somewhere in the interval between initiation and successful
synchronization on that operation).
6 The effect on conflicting accesses OUTSIDE the transfer interval is AS-IF the
transfer were performed as a set of relaxed shared reads and relaxed shared
writes of unspecified size and order, issued at an unspecified time within the
transfer interval by the calling thread. Conflicting accesses INSIDE the
transfer interval are prohibited by semantic 4. "INSIDE" and "OUTSIDE" are
defined by the Precedes() program order for accesses from the calling thread;
accesses from other threads are considered "INSIDE" unless every possible and
valid <_strict relationship orders them before the init call or after the sync
call.
Key properties of the Berkeley async semantics:
----------------------------------------------
1. The program is prohibited from changing the source buffer between the
initiation call and the return of a successful sync call (the "transfer
interval").
2. Similarly, the contents of the destination buffer are UNDEFINED during the
entire transfer interval (until a successful sync call returns).
3. Fences, barriers, and other operations (to unrelated memory locations) are
fully permitted during the transfer interval, but the transfer interval is
unaffected and extends until the matching sync call (stated differently, the
operation that was explicitly initiated remains "in-flight" until the program
explicitly calls the sync operation for it).
4. Because the contents of destination memory remain UNDEFINED until the return
of a sync call, it is impossible for any valid program to "observe" any effect
on the destination buffers caused by any fences or strict operations issued
during the transfer interval.
5. After a successful sync call returns, subsequent read/writes issued by the
thread that invoked the transfer are immediately guaranteed to observe the
effects of the transfer without any fences. This is exactly the same semantic
guarantee as the existing upc_mem* calls.
6. There are no strict operations, fences or "global completion" guarantees
built-in to the library calls. The library performs pure data movement, not
inter-thread synchronization. If such fences or inter-thread synchronization
are required (eg to guarantee other threads see the effects of the transfer),
they can be added around the library calls by the program using existing UPC
mechanisms. These are exactly the same consistency guarantees as the existing
upc_mem* calls, and the "best practice" documented in B.3.2.1.
7. The formal semantics of the library are defined in terms of
minimally-constrained relaxed operations, exactly analogous to the existing
upc_mem* library, as described in spec v1.2 section B.3.2.1. The size, order
and temporal position of these operations within the transfer interval is
deliberately unspecified. However, due to property #2 and #4, such details
cannot be observed by any valid program anyhow.
8. Because of #6 and #7, the expression wait_sync(upc_mem*_async()) is
semantically identical to the blocking version upc_mem*(), and is expected to
exhibit a very similar performance profile in high-quality implementations.
Notable differences from the Cray proposal:
----------------------------------------------
1. In Berkeley asyncs the library transfer interval does not end when a thread
issues a fence or other strict operation (possibly from within an unrelated
library call or other callee). This enables deeper communication overlap in
large, multi-module/multi-author applications.
2. Berkeley asyncs do not assume the application always wants to couple thread
synchronization with data transfer. In this way they are more primitive and
also potentially lighter-weight, and designed to efficiently apply to a broader
set of use cases. They enable applications to coalesce synchronization
overheads for multiple data transfers (eg think bulk synchronous HALO
exchange), rather than incurring thread synchronization costs on a per-operation
basis.
3. The Berkeley asyncs are semantically very close to the existing upc_mem*
library. They intentionally add the minimal possible semantics to enable
explicitly asynchronous data transfer, and remain orthogonal to issues of
thread synchronization. The memory model formalism used to describe the
upc_mem* library also applies to the Berkeley asyncs.
Original comment by danbonachea
on 10 Aug 2012 at 4:13
"However, I'm very dismayed the Cray team is insisting on continuing to claim
the Berkeley semantics somehow "violate" or "break" the memory model."
Let UPC 1.2 have memory model M. I claim that the BUPC async proposal, by
adding things to the language that are immune to certain aspects of M, like
fences, creates memory model M'. The BUPC async proposal is fully compatible
with M'.
I believe that BUPC considers M == M' so that when we say that the BUPC asyncs
violate M, it is interpreted as nonsense, since M' was constructed for the BUPC
asyncs and they are compatible with it. I do not believe that M == M' because
if I try to reason about a program using only M and that program contains BUPC
asyncs, then I can come to incorrect conclusions about my program's behavior.
I understand your point that the memory model "is not some touchy-feely
generalization or hand-wavy set of properties," but based on responses to other
Issues on this site, the UPC 1.3 memory model should remain equivalent to the
UPC 1.2 memory model for backward compatibility and to avoid confusing users.
We might be quibbling over what "equivalent" means. I think you are saying
that the "formal mathematical model" in the appendix has not changed and the
new stuff that the BUPC async proposal would add is declared to be independent,
and therefore M == M'. I'm saying that I can't necessarily apply M and M' to
the same program and reach the same conclusions; therefore, to me, as someone
trying to apply the memory model as a tool for program interpretation, M != M'.
The original Cray proposal and the newer consensus proposal stay completely
within M. The half-fence idea is already part of M because upc_fence already
has two roles in blocking movement of relaxed accesses; there is simply no
user-accessible syntax for using one of its two roles without the other. In
this instance, we needed only one of the two existing roles, similar to the way
that upc_barrier can be divided into upc_notify and upc_wait.
Hopefully that clears up what I mean by "break" or "violate."
Original comment by johnson....@gmail.com
on 10 Aug 2012 at 5:21
" I do not believe that M == M' because if I try to reason about a program
using only M and that program contains BUPC asyncs, then I can come to
incorrect conclusions about my program's behavior."
The memory model has not "changed" in any way. The Berkeley asyncs do not
propose to change it at all. What is changing is the set of OPERATIONS that you
can express in the language, and this is true of both proposals. Furthermore,
the "reasoning" properties you refer to remain true for all VALID UPC programs
(ie those that do not break the library interface and access undefined values).
The memory model only defines behaviors for VALID UPC programs and that remains
true as well.
Pure 1.2 programs do not use the new libraries at all, so trivially meet the
validity requirement of the library. 1.3 programs that add calls to the new
libraries must follow the rules of the interface and not touch the buffers
during the transfer interval, and any properties still remain true. The only
way to observe the differences you claim would be through an INVALID program
(ie one that touches buffers in the forbidden interval). I challenge you to
write a VALID program using the Berkeley library semantics that breaks any
property of interest. We're not interested in the behavior of INVALID programs,
ie those that access undefined values or break rules which lead to undefined
results, because therein lies the road to madness.
Where the two proposals differ is the point at which the transfer interval
ends, ie the extent of the undefined buffers. Berkeley states it
unconditionally extends from the explicit init to the explicit sync. Cray
states it can conditionally end earlier if the calling thread performs a strict
operation. This difference is a philosophical one and really has NOTHING to do
with the formal memory model at all - it's simply the declared boundary of the
asynchronicity of the library we're ADDING. Either option can fit perfectly
inside the existing memory model without changes. We should focus our
discussion on the user impact of this difference and the pros and cons of
either way, instead of wasting time inventing falsehoods about how either way
"breaks" the memory model.
"The half-fence idea is already part of M because upc_fence already has two
roles in blocking movement of relaxed accesses; there is simply no
user-accessible syntax for using one of its two roles without the other. "
I encourage you to carefully re-read the formal memory model. The terms
"half-fence", "local completion" and "global completion" do not appear anywhere
in the entire 1.2 spec. Any library spec that wishes to use such terms to
specify consistency semantics would need to introduce FORMAL mathematical
definitions that fit within the existing formalism.
I'd also remind you that consistency semantics of upc_fence are clearly defined
in B.5.3.1:
"A upc fence statement implies a strict write followed by a strict read:
SW(l_synch, 0) ; SR(l_synch, 0)"
The "half-fence" you're looking for is probably just a strict write:
SW(l_synch, 0)
but since it sounds like we're probably dropping that semantic anyhow it's not
really relevant anymore.
Original comment by danbonachea
on 10 Aug 2012 at 8:42
I think we should just wait for the conference call now. :) Neither one of us
is convincing the other and I think that we'll need to have a majority of
non-BUPC, non-Cray folks help solve this issue, preferably with input on what
users want.
Original comment by johnson....@gmail.com
on 10 Aug 2012 at 8:53
"I think we should just wait for the conference call now. :) Neither one of
us is convincing the other and I think that we'll need to have a majority of
non-BUPC, non-Cray folks help solve this issue, preferably with input on what
users want."
I agree - as I mentioned earlier this textual discussion doesn't seem to be
progressing us towards resolving anything. I was just hoping to narrow the
discussion points to the precise technical differences so we can focus on
resolving what actually matters.
In preparation for the call, I'd like everyone to consider the following
program:
1: shared int p = 0; // assume no other threads are touching these locations
2: shared int q = -10;
3: ...
4: if (MYTHREAD == 2) {
5: #ifdef __UPC_NB_MEM_
6: upc_handle_t h = upc_memcpy_async(&p,&q,sizeof(int)); // start an async
copy that eventually writes -10 to p
7: #endif
8: p = 20;
9: int l_p = p;
10: printf("%i",l_p);
11: }
Under UPC 1.2 (without __UPC_NB_MEM_), I can state the following property
concerning lines 8-9:
"Because no other thread is touching p in this synchronization phase, and this
thread performs a relaxed write of the value 20 to p on line 8, then the
immediately subsequent relaxed read of p on line 9 from this thread is
guaranteed to return 20." More concisely and intuitively, "No other thread is
touching p, so when I write p and then immediately read p I get back the value
I wrote".
When you enable 1.3 __UPC_NB_MEM_ and add the EARLIER call to upc_memcpy_async
on line 6, this property is no longer true. The spec allows implementations
which sometimes print 20, sometimes print -10, and sometimes print something
completely different. This is because the transfer interval for the unsynced
library call has made the contents of p UNDEFINED, and this program is writing
and reading that buffer while it has an undefined value, leading to undefined
behavior. This is an incorrect use of the library, and it has invalidated a
very simple property derivable from the 1.2 memory model regarding accesses
elsewhere in the program.
Note moreover that everything stated about the program above is true for BOTH
the Berkeley and Cray libraries. Under BOTH semantics this program is reading
undefined values due to the addition of the earlier library call and now has
undefined behavior. This is an example of how the addition of the library
features proposed by Berkeley OR Cray can invalidate simple reasoning about the
memory model in programs that use the new libraries incorrectly. This doesn't
mean the memory model is "broken", it just means there is a new way to write
programs with undefined behavior that the memory model does not govern.
Programmers who embark upon using a new library feature are burdened with using
it CORRECTLY, otherwise they get undefined behavior and commonly derived
properties of the memory model may not hold true.
Original comment by danbonachea
on 10 Aug 2012 at 11:01
The following is NOT intended to pour gasoline on any fires.
I just wanted to take a moment to restate a key point of the Berkeley proposal
that might help reduce confusion. In particular, we keep stating that the
Berkeley proposal does not "break", "modify" or "circumvent" the existing
(Appendix B) memory model. Yet so far the explanations we have offered have
not been effective at convincing Troy or Steven of that assertion. So, I want
to try again (twice in fact). I hope this will help.
To us (at least Dan and I) it is natural to define an "asynchronous" transfer
as taking place at an UNdefined point in time (or more than one) between the
init and sync calls. This is what naturally happens when the init call queues
work to the network and the sync call blocks for completion of that work.
Making that UNdefined nature of the transfer part of the specified semantics of
a UPC-level async interface leads "naturally" to a situation (described two ways
below) in which an implementation can legally perform the async transfer
entirely without interaction with strict accesses which take place in the
transfer interval. So, we believe that this is in no way an "unnatural"
behavior, even if it may be "unintuitive" to some/many.
Note that the following are just two approaches to explaining the SAME idea,
not two alternative/competing ideas.
EXPLANATION #1:
The Berkeley async data movement functions do NOT declare themselves as having
a special exemption from the memory model's requirement that a strict operation
(fences included) will order all preceding relaxed operations. However, the
semantics are constructed such that an implementation may LEGALLY perform the
transfers without interaction with strict references, as follows...
What they DO is define a "transfer interval" which extends from the start of
the init call to the end of the sync call, and then define the transfer as being
performed by an UNDEFINED sequence of (one or more) relaxed operations which
may take place at any time (and in any order) for the duration of that interval.
BECAUSE the time(s) at which the relaxed operations occur is NOT defined, there
is no way to "prove" that any portion of the transfer is the result of a
relaxed operation which precedes the strict operation. Therefore, an ordering
in which all of the relaxed operations FOLLOW the strict operation (which
itself must PRECEDE the sync call, or we'd not be concerned about it in this
discussion) is "admissible" (for lack of a better word) under the CURRENT
memory model. The constraint that the destination memory is undefined during
the interval (just as with an MPI_Irecv, FWIW) prohibits a correct program from
performing reads which could observe any ordering other than one in which the
transfer occurs entirely after the strict operation(s).
EXPLANATION #2:
1) The specified semantics allow a "move-all-data-at-sync" implementation (of
course they do not require it) in which the ENTIRE relaxed transfer takes place
IN the sync function and thus FOLLOWS any strict operations executed WITHIN the
transfer interval.
2) No legal UPC program (meaning it obeys the specification's prohibition
against reading the destination or writing the source) may OBSERVE the actual
order/timing in which the data is written to the destination, NOR may it modify
the source data (for instance to "sneak" information about the timing of the
transfer into the destination memory).
3) Therefore, an implementation is permitted to transfer the data without
"forced completion" by strict operations which occur in the transfer interval
because the only OBSERVABLE (by legal programs) execution is indistinguishable
from move-all-data-at-sync (in which the data movement FOLLOWS the strict
operations).
Original comment by phhargr...@lbl.gov
on 11 Aug 2012 at 1:02
As another interesting data point, the UPC I/O library in spec 1.2 section
7.4.7 already includes asynchronous modification of memory, and is the closest
example of explicitly asynchronous library behavior in the current spec:
"For asynchronous read operations, the contents of the destination memory are
undefined until after a successful upc_all_fwait_async or
upc_all_ftest_async on the file handle. For asynchronous write operations,
the source memory may not be safely modified until after a successful
upc_all_fwait_async or upc_all_ftest_async on the file handle."
The behavior of the async I/O library is consistent with the Berkeley approach,
in that the transfer interval and undefined buffers end at the library sync
call, not at earlier arbitrary fences or other strict operations. Using async
I/O in UPC 1.2 it is ALREADY possible to construct scenarios where even a
upc_barrier does not "quiesce" all in-flight changes to memory, and a library
continues to modify destination buffers after a fence or barrier.
Original comment by danbonachea
on 11 Aug 2012 at 2:09
RE Comment 57:
Hi Paul. I think Dan previously made the argument in your Explanation #1.
Here's my problem with it...
The existing upc_mem* functions are already collections of relaxed accesses
that can issue in any order between two fences; however, to obey the
same-address ordering rule, a upc_mem* implementation generally must wait until
the accesses fully complete before returning. It would not be correct for a
UPC compiler to reorder a upc_mem* call with respect to a upc_fence, either
before the call or after the call:
    upc_memput( ... );               upc_fence;
    upc_fence;             --x-->    upc_memput( ... );

    upc_fence;                       upc_memput( ... );
    upc_memput( ... );     --x-->    upc_fence;
The major utility of a upc_mem*_nb function is that it can start issuing
accesses and not wait for them to complete before returning, allowing both
communication/communication overlap and communication/computation overlap.
Ideally the accesses issue as soon as possible, but it would NOT be correct for
a compiler to issue them before the previous fence:
    h = upc_memput_nb( ... );               upc_fence;
    upc_fence;                   --x-->     h = upc_memput_nb( ... );
    sync( h );                              sync( h );
I haven't seen anyone argue that the above is legal using the BUPC async
functions, and that's good.
Now, going the other direction, it also should not be legal to allow the
accesses to issue AFTER the next fence. BUPC claims that for the async
functions the accesses could all issue after the fence, or even all issue at
the sync. This situation is as if the _nb call was made after the fence:
    h = upc_memput_nb( ... );               upc_fence;
    upc_fence;                   --?-->     h = upc_memput_nb( ... );
    sync( h );                              sync( h );
For this transformation to be legal with an async call when it is not legal
with the existing upc_mem* calls is truly bizarre. If I'm an application
programmer and I want to incrementally introduce greater overlap, I'd probably
start by doing this:
    upc_memput( ... );    -->    h = upc_memput_nb( ... );
                                 sync( h );
And then move the upc_memput_nb call upward and move the sync call downward in
the program as far as I could. Fences are brick walls here and that's fine --
that is to be expected. (In the original Cray proposal, if the sync made it
all the way to the fence, you could remove it.) Then, after doing that, I
might get a bit fancier and start software pipelining, which is where we get to
the idea of providing an lsync as well as a gsync so that buffers can be reused
and completion can be guaranteed for individual copies.
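As a rough illustration of that software-pipelined pattern, a double-buffered loop might look like the following sketch. The _nb and sync names follow the proposals under discussion; upc_handle_t, UPC_COMPLETE_HANDLE, CHUNK, NCHUNKS, and compute_chunk are placeholders, not spec names:

    upc_handle_t h[2];
    /* Prime the pipeline: no outstanding transfer for either buffer yet.  */
    h[0] = h[1] = UPC_COMPLETE_HANDLE;  /* placeholder "already done" value */
    for ( int i = 0; i < NCHUNKS; i++ ) {
        int b = i % 2;
        sync( h[b] );        /* lsync: local buffer b is now safe to reuse */
        compute_chunk( buf[b], i );         /* refill the local buffer     */
        h[b] = upc_memput_nb( &dst[i*CHUNK], buf[b], CHUNK );
    }
    sync( h[0] );            /* drain the pipeline */
    sync( h[1] );

The lsync here completes only the transfer that last used buffer b, so the other buffer's transfer can remain in flight across the compute step.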
To summarize, an asynchronous upc_mem call plus a sync (lsync) that doesn't
behave like its synchronous counterpart would be severely unintuitive and not
what we want to offer our users.
Original comment by johnson....@gmail.com
on 11 Aug 2012 at 4:07
"The major utility of a upc_mem*_nb function is that it can start issuing
accesses and not wait for them to complete before returning, "
No, that's only half the benefit. It also allows the implementation to delay
issue based on network conditions. RDMA systems usually have a limited-depth
queue for network injection, and it can be advantageous under contention to
post operations to a software queue which is serviced later as the network
drains. Even if the hardware accepts the operation immediately, an RMA offload
engine may continue to read source buffers after the init returns, resulting in
injections which are "issued" after init returns and other client code has run.
In extreme cases of contention, injection might not even occur until within the
sync call, although this is hopefully not a common case as it would defeat
overlap for that operation.
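The "post to a software queue, service it later" idea can be illustrated with a plain C toy model. All names here are hypothetical; this is a sketch of the mechanism, not any vendor's implementation:

```c
#include <string.h>
#include <stddef.h>

/* Toy model of deferred injection: init_put_nb() merely records the
 * operation, and the data movement itself may happen at any later point
 * in the transfer interval -- here, not until sync_put() drains it. */

#define QDEPTH 16

struct put_op { void *dst; const void *src; size_t n; int pending; };
static struct put_op q[QDEPTH];

/* Start a non-blocking put: enqueue a descriptor and return a handle. */
static int init_put_nb(void *dst, const void *src, size_t n) {
    for (int h = 0; h < QDEPTH; h++) {
        if (!q[h].pending) {
            q[h] = (struct put_op){ dst, src, n, 1 };
            return h;               /* the data has NOT moved yet */
        }
    }
    memcpy(dst, src, n);            /* queue full: fall back to blocking */
    return -1;
}

/* Complete one transfer: perform any data movement still outstanding. */
static void sync_put(int h) {
    if (h >= 0 && q[h].pending) {
        memcpy(q[h].dst, q[h].src, q[h].n);
        q[h].pending = 0;
    }
}
```

Between init_put_nb() and sync_put(), the destination holds its old contents and the source must not be modified, which is exactly the "buffers undefined during the transfer interval" contract.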
I think this may be a core point of philosophical confusion - when we say the
relaxed operations may be issued ANYWHERE in the transfer interval, we really
mean that. This is not some semantic dodge, we really envision platforms that
will exhibit this behavior. Cray needs to loosen its preconception that all
operations are issued before the init call returns, as this is NOT a
requirement we want to add - doing so would severely inhibit the overlap
potential of this library on systems of interest.
"For this transformation to be legal with an async call when it is not legal
with the existing upc_mem* calls is truly bizarre."
The existing library calls are monolithic and inseparable. The new library
calls are asynchronous and require two related calls, init and sync.
Asynchronous calls introduce the concept of a transfer interval, which has no
analogue in the blocking call. The behavior of this transfer interval is what
we're discussing, so the "behavior" of a non-existent transfer interval in the
blocking calls is irrelevant.
"asynchronous upc_mem call plus a sync (lsync) that doesn't behave like its
synchronous counterpart would be severely unintuitive and not what we want to
offer our users."
If they "behave" identically, then there's no point in introducing a new
library. The new library adds the ability to express important semantic
loosening which an advanced user can exploit to achieve overlap and improved
performance. This is not a UPC 101 feature - this is a library for advanced
users who want to maximize the performance of their app and are willing to
learn the slightly more complicated rules for using them correctly. Perhaps
Cray's goal in this regard is different.
"want to incrementally introduce greater overlap..move the upc_memput_nb call
upward and move the sync call downward in the program as far as I could.
Fences are brick walls"
You're looking at this as a compiler writer performing an automatic
transformation based on static analysis. I agree that a static optimization
which attempts to move the init and sync calls would have to stop movement at
fences, without a great deal more parallel analysis. But this interface is NOT
a UPC compilation target - implementations have their own internal interfaces
for that. This is a user API where the programmer has full knowledge of buffer
access patterns, and can place his init and sync library calls correctly to
ensure he doesn't touch buffers during the transfer interval where they are
undefined. He can add fences and other unrelated computation and communication
during the transfer interval which do not touch the in-flight buffers, and
everything still works. Even if he takes the approach of starting with
init();sync(); and slowly spreading the calls, he CAN safely move them past
unrelated computation, communication and even syncs using his Programmer
Knowledge to ensure he doesn't move them past the code which
consumes/overwrites the transfer buffers or synchronization which signals other
threads to do so. The fact this transformation cannot blindly and automatically
be performed by an optimizer in most interesting cases is a large part of the
motivation for exposing an explicitly asynchronous library in the first place
whereby an advanced user can do this manually.
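Concretely, the kind of manual placement described above might look like this sketch under the Berkeley semantics (the handle type, sync name, and other_work functions are placeholders from the discussion, not spec names):

    h = upc_memput_nb( dst, src, n );  /* dst and src off-limits from here  */

    other_work1();       /* touches neither src nor dst                     */
    upc_fence;           /* orders this thread's unrelated relaxed accesses */
    other_work2();       /* still does not touch the in-flight buffers      */

    sync( h );           /* interval ends: dst defined, src reusable        */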
Original comment by danbonachea
on 11 Aug 2012 at 5:12
Troy wrote, in part:
> Now, going the other direction, it also should not be legal to allow the
accesses
> to issue AFTER the next fence. BUPC claims that for the async functions the
> accesses could all issue after the fence, or even all issue at the sync.
> This situation is as if the _nb call was made after the fence:
In that last sentence the phrase "as if the _nb call" distills the essence of
the difference between our lines of thought. In your line of thought the
relaxed accesses are firmly associated with the _nb call. In the Berkeley line
of thought the association is weaker, with the _nb call only defining one end
of an interval.
So, I do NOT agree that the transformation you describe (moving the _nb call
past the next fence) is illegal. In fact, it is my position that it is allowed
PRECISELY because the _nb call's semantics (as given in the Berkeley proposal)
are designed to "give license" to perform this transformation (but ONLY with
respect to fences within the transfer interval). In the Berkeley proposal use
of an async transfer is an assertion by the user that the source and
destination memories are untouched within the transfer interval, and the
transformation is permitted.
This *is* what we (Berkeley) want to offer to our users, and where our
"philosophies" differ.
I don't feel this behavior is "unintuitive" when one thinks about asynchronous
data movement by network hardware or other "asynchronous agent". I AGREE that
it does not follow "intuitively" from thinking about "sync(upc_memFOO_nb(...))"
as equivalent to "upc_memFOO(...)", but that is NOT an equivalence we provide,
and this could be made clear(er) in the specification/documentation if desired.
Let's look at this difference in another way.
Given the following:
h = upc_memput_nb( ... );
other_work1();
upc_fence;
other_work2();
sync( h );
Cray's proposal PROHIBITS transformation into
other_work1();
upc_fence;
other_work2();
h = upc_memput_nb( ... ); // has moved DOWN 3 lines
sync( h );
and it REQUIRES a transformation equivalent to:
h = upc_memput_nb( ... );
other_work1();
sync( h ); // has moved UP 2 lines
upc_fence;
other_work2();
So, why is moving the sync() across the fence REQUIRED while moving
upc_memput_nb() is PROHIBITED? To me THAT is an unintuitive aspect of the Cray
proposal.
In the Berkeley proposal, the init and sync are endpoints of an interval with
neither one "anchoring" the transfer. Thus both transformations are legal, but
neither is required (again this only applies to movement across fences WITHIN
the transfer interval).
Original comment by phhargr...@lbl.gov
on 11 Aug 2012 at 5:43
"Cray needs to loosen its preconception that all operations are issued before
the init call returns"
We do not have a preconception that all are issued; that's not even how our own
hardware works. We expect that the implementation will start issuing accesses
in the init call. It very well may offload the whole transfer to an entirely
separate piece of hardware and, sure, that hardware may be overwhelmed and not
do anything for a while, but the point is that it tries to start the transfer.
The asynchronous call is essentially the programmer saying please try to start
this transfer right away and return as soon as you can so that I can do
something else in the meantime.
"The behavior of this transfer interval is what we're discussing, so the
"behavior" of a non-existent transfer interval in the blocking calls is
irrelevant."
No, it is quite relevant because there are existing programs that use the
blocking upc_mem* calls that will migrate to the non-blocking calls for better
performance and the programmer needs a semantic equivalence to get started.
"If they "behave" identically, then there's no point in introducing a new
library."
They should be identical until you separate them. This is the equivalence that
I'm talking about. The fact that the transfer window vanishes to nothing for
the blocking call does not invalidate this equivalence. The utility comes from
being able to separate them and that is the point of extending the library
interface. The separation process is familiar and similar to replacing a
upc_barrier with a upc_notify plus upc_wait and then stretching them apart.
Before they are separated they are by definition identical to the original
upc_barrier.
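The upc_notify/upc_wait analogy can be made concrete: all three forms below are legal UPC 1.2, and the third merely stretches the split-phase pair apart (local_work is a placeholder for computation needing no cross-thread ordering):

    upc_barrier;          /* fused form                                     */

    upc_notify;           /* split form: identical until separated...       */
    upc_wait;

    upc_notify;           /* ...and then stretched apart for overlap        */
    local_work();
    upc_wait;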
"You're looking at this as a compiler writer performing an automatic
transformation based on static analysis. ... The fact this transformation
cannot blindly and automatically be performed by an optimizer in most
interesting cases is a large part of the motivation for exposing an explicitly
asynchronous library in the first place whereby an advanced user can do this
manually."
The advanced user can achieve overlap while still respecting fences and
barriers. I don't believe that whatever additional overlap might be made
possible by being able to suspend the normal fence and barrier rules for the
duration of the async transfer would be worth the ensuing weirdness of not
having a blocking/non-blocking+sync equivalence and not having a upc_barrier
complete all accesses.
Original comment by johnson....@gmail.com
on 11 Aug 2012 at 6:23
"I AGREE that it does not follow "intuitively" from thinking about
"sync(upc_memFOO_nb(...))" as equivalent to "upc_memFOO(...)", but that is NOT
an equivalence we provide, and this could be made clear(er) in the
specification/documentation if desired."
It's an equivalence that we intentionally provided. Are you suggesting that we
have the UPC 1.3 spec not guarantee the equivalence but as a vendor document
that ours is equivalent? That's possible to do, but less standardization is
worse for users and I'd like to avoid that if possible.
"So, why is moving the sync() across the fence REQUIRED while moving
upc_memput_nb() is PROHIBITED? To me THAT is an unintuitive aspect of the Cray
proposal."
We don't allow any access to cross any fence. It does not matter that the
access is being performed for the UPC thread via some asynchronous agent. The
accesses are still logically being performed by a particular UPC thread and its
accesses are ordered with respect to its fences. We follow this rule
consistently. As another example of an asynchronous agent, if a UPC thread has
OpenMP threads underneath it and the OpenMP threads issue accesses, then a
upc_fence completes all accesses issued by all OpenMP threads that are part of
the UPC thread. (I know this isn't possible for GASNet-based UPC
implementations because you aren't allowed to sync a handle from a different
pthread.)
Original comment by johnson....@gmail.com
on 11 Aug 2012 at 7:08
While I clearly haven't converted Troy, it feels to me like both sides now
understand how the respective specifications' goals/requirements differ.
Hopefully things are now clear enough to an "outside observer" that others can
make up their minds prior to the conf call.
Troy wrote:
> Are you suggesting that we have the UPC 1.3 spec not guarantee the
equivalence but as a vendor
> document that ours is equivalent?
No. What I was suggesting was that if the Berkeley design is chosen and one is
concerned that users would expect the equivalence, then the spec could make a
clear(er) statement about the lack of equivalence.
> We don't allow any access to cross any fence.
Berkeley instead leaves undefined which side of the fence the access(es) occur
on, thus removing the issue of "crossing" one. This is a fundamental
difference and each side has presented arguments for their own approach and
against the other. I think these past few comments have been more "level
headed" than some previous ones and allowed us to state our positions w/o
name-calling or "mischaracterization" of the other.
> It does not matter that the access is being performed for the UPC thread via
some asynchronous agent.
As Dan mentioned, the UPC I/O specification also has a model of asynchronous
data movement similar to what the Berkeley async memcopy proposal contains.
Does the Cray implementation also complete async I/Os at fences and other
strict references? If NO, what is the reasoning behind this different
interpretation from how you now want async memcopy operations to be defined?
If YES, does any other vendor share the interpretation that upc_fence must
complete UPC I/O operations? (Perhaps a tough question to answer, since I seem
to recall that no other vendor has a UPC I/O implementation at the moment).
Original comment by phhargr...@lbl.gov
on 11 Aug 2012 at 8:41
"They should be identical until you separate them. This is the equivalence
that I'm talking about."...
""I AGREE that it does not follow "intuitively" from thinking about
"sync(upc_memFOO_nb(...))" as equivalent to "upc_memFOO(...)", but that is NOT
an equivalence we provide, and this could be made clear(er) in the
specification/documentation if desired."
Let me clarify here, because the discussion has become quite arcane for the
casual reader. Under the Berkeley proposal, the exact line:
sync(upc_memFOO_nb(...));
IS SEMANTICALLY EQUIVALENT to:
upc_memFOO(...)
This is the special case of an EMPTY transfer interval, so all the discussion
about how opposing views handle that interval trivially collapse to no
difference. Both lines of code encapsulate a synchronous transfer that consumes
and produces the transfer buffers before returning, so surrounding code on
this thread is guaranteed to observe the updated values.
Cray's current "half-fence" semantics for the first line additionally force
some synchronization with other threads before sync() returns, which means the
split-phase version generates MORE synchronization (and presumably more
performance overhead) than the purely blocking version. This
over-synchronization can never cause a correct program to become incorrect, but
it IS a difference nonetheless. So unless they follow through on removing that part
of the semantics, it is the CRAY proposal that lacks semantic equivalence to
the blocking version.
Now when it comes to TRANSFORMING the split-phase call and separating the init
and sync, the interval becomes non-empty and that's where more care is
required. Note that once you start moving ANY non-comment lines of code around,
the program is by definition no longer the same program - what we're talking
about is transformations which are guaranteed to have no semantically
observable effect on the BEHAVIOR of the program, relative to the one
containing a blocking call. This is very different from saying the programs are
EQUIVALENT, because the entire point is to overlap communication and improve
performance (so at the very least the network traffic and performance will not
be "equivalent", and we want that).
Neither proposal allows movement of init or sync past accesses which
potentially conflict with the buffers in-flight - this could potentially cause
the program to access undefined data, and thereby demonstrate a change in
behavior and correctness. Neither proposal allows movement past a fence IN
GENERAL, although under the Berkeley proposal that transformation is SOMETIMES
valid, if the fence is synchronizing unrelated data and provably does NOT
affect access to the transfer buffers by some other thread. Both types of
transformation require information that compilers usually cannot (or
realistically, can never) prove, so we're talking about manual transformations
by an advanced user who understands his data dependencies. An advanced user
could use the same knowledge to move a blocking upc_mem* call past fences in
his program that he knows are unrelated to the data transfer.
"We don't allow any access to cross any fence. ... We follow this rule
consistently. ... a upc_fence completes all accesses issued by all OpenMP
threads that are part of the UPC thread. "
What about with MPI non-blocking operations?:
    Thread 0                                    Thread 1
    MPI_Irecv(&destbuffer,...,1,&handle);
    ...
    upc_fence;
    MPI_Send(...)   /* to T1 */                 MPI_Recv(...)  /* from T0 */
    ...                                         MPI_Send(...)  /* to T0 */
    MPI_Wait(&handle,...)
I hope your fence implementation doesn't stall to await completion of that
Irecv (which is asynchronously writing to destbuffer), or one could easily
construct many examples where that additional synchronization creates deadlock.
Original comment by danbonachea
on 11 Aug 2012 at 6:13
MPI functions are a red herring because they do not perform UPC shared accesses.
"As Dan mentioned, the UPC I/O specification also has a model of asynchronous
data movement similar to what the Berkeley async memcopy proposal contains.
Does the Cray implementation also complete async I/Os at fences and other
strict references?"
First, I'll directly answer your question by explaining that our UPC async I/O
isn't very asynchronous at this point, so fences and strict accesses don't
worry about this problem, but even if it was fully asynchronous, technically,
the fences don't have to worry about this problem...
UPC I/O asyncs are weird and different from either the Cray or the BUPC
proposal. A UPC thread calling a UPC I/O async function actually doesn't
perform any UPC shared accesses as part of the call!
"Each call to upc_all_{fread,fwrite}_shared[_async] with a common file pointer behaves as if the read/write operations were performed by a single, distinct, anonymous thread which is different from any compute thread (and different for each operation)." [UPC 1.2 A.1.1.3 Paragraph 3]
According to UPC I/O, if the calling thread executes a fence, then it doesn't
affect the ordering of the accesses being done by this other anonymous thread
because a fence is an operation local to one thread. Personally I think the
UPC I/O spec is pretty wacky and this particular paragraph raises a bunch of
questions like: What data has affinity to the anonymous thread? Why isn't it
counted by THREADS? Why does the description of collective functions in the
spec not have to exclude these threads -- presumably they don't have to
participate in a barrier? Where does this thread come from?
But I digress. Fact is, the current I/O appendix deals with the async issues
by introducing the concept of anonymous UPC threads and neither the BUPC
proposal nor the Cray proposal went that route. With the existing upc_mem*
functions and the Cray non-blocking functions, the UPC thread calling the
function makes relaxed accesses that respect fences. The UPC thread calling a
BUPC async function also makes relaxed accesses, but ignores any fences or
barriers between the init and the sync. The claim is that the accesses occur
at an undefined point relative to the intervening fences so that the fence has
no effect, but the calling UPC thread is still logically the entity performing
the accesses.
If BUPC would like to redefine their async functions as causing accesses to be
performed by an anonymous thread, then I will concede that a BUPC async gets to
ignore the fences, but then BUPC needs to flesh out this concept of an
anonymous thread a lot more. I really don't like that the anonymous thread
concept is mentioned in passing in the I/O appendix as if it were a trivial
thing because it's not.
If the definition of anonymous threads can be pinned down, then we need to see
if users expect the calling thread to be the thing logically responsible for
issuing the accesses or if they're comfortable thinking in terms of anonymous
threads.
Personally I don't like the anonymous thread concept and I hope we don't have
to go there. UPC programs (currently) are supposed to have exactly THREADS
number of threads for their entire lifetime. I much prefer saying that the
accesses are performed by the calling thread and hope that we end up with that.
Original comment by johnson....@gmail.com
on 12 Aug 2012 at 2:57
"MPI functions are a red herring because they do not perform UPC shared
accesses."
MPI is NOT a red herring, because MPI functions are very commonly used in
hybrid UPC/MPI programs to modify shared memory locations. Hybrid UPC+MPI apps
are a very important usage class, and this property of using upc_fence to
quiesce all changes to shared memory that Cray seems so married to simply
doesn't hold true for those apps.
"Personally I don't like the anonymous thread concept and I hope we don't have
to go there. UPC programs (currently) are supposed to have exactly THREADS
number of threads for their entire lifetime. I much prefer saying that the
accesses are performed by the calling thread and hope that we end up with that."
I agree that we should not extend the anonymous thread concept from UPC-IO. It
is well-motivated for the parallel I/O library for a number of different
reasons, mostly having to do with how collective parallel I/O libraries work,
but those aren't applicable to non-collective upc_mem*_async. My only point in
mentioning the UPC-IO example was to demonstrate that 1.2 already includes
operations that asynchronously modify shared memory that are not governed by
upc_fence and upc_barrier. So the fairy-tale of using upc_barrier to quiesce
the whole world doesn't even work in pure UPC-1.2, let alone
UPC+MPI+POSIX-async-IO... This derived and undocumented property only works in
a very narrowly limited subset of UPC, so we should stop worrying about
preserving it.
"The UPC thread calling a BUPC async function also makes relaxed accesses, but
ignores any fences or barriers between the init and the sync. "
False - as we've said over and over the thread does not IGNORE them, in fact
any fences or barriers continue to govern normal relaxed accesses issued by the
application. However the library's relaxed accesses which comprise the async
memcpy are permitted to take place anywhere inside the transfer interval, and
the transfer buffers remain officially "off-limits" for the duration of the
transfer interval (until the sync), so no valid UPC program can PROVE whether
or not the fences in the interval had any effect on the accesses that comprise
the transfer.
Original comment by danbonachea
on 12 Aug 2012 at 3:44
"MPI is NOT a red herring, because MPI functions are very commonly used in
hybrid UPC/MPI programs to modify shared memory locations. Hybrid UPC+MPI apps
are a very important usage class, and this property of using upc_fence to
quiesce all changes to shared memory that Cray seems so married to simply
doesn't hold true for those apps."
Correct -- we don't expect a upc_fence to affect MPI. I said that MPI is a red
herring because Paul asked how our fences handled MPI_Send. The answer is that
UPC fences have no effect on MPI communication. If you have a UPC program and
it calls MPI_Send, the shared qualifiers get cast off. From the perspective
of the UPC program, the call is to a function that can touch only local data
and at most the upc_fence is obligated to do an mfence() to enforce ordering,
which generally won't do anything to an MPI transfer, but that's fine.
"So the fairy-tale of using upc_barrier to quiesce the whole world doesn't even
work in pure UPC-1.2, let alone UPC+MPI+POSIX-async-IO"
I never said that upc_barrier covered other communication models. It quiesces
UPC shared accesses for the UPC threads involved in the barrier (the anonymous
UPC I/O threads are evidently not participants in upc_barrier). We should
preserve this property. MPI calls do not make UPC shared accesses. An OpenMP
thread or a POSIX thread could make a UPC shared access, and those have to be
covered by fences and barriers.
From the above discussion, it has become apparent that we want the following
things that part of the BUPC team does not want:
- Full equivalence between existing upc_mem* functions and their upc_mem*_nb counterparts plus a local sync. (I believe Yili agreed with this point when we worked on the consensus proposal.)
- upc_barrier should continue to quiesce all shared accesses.
- Non-blocking shared accesses are ordered by fences, no matter if the fence is before, within, or after the init-sync interval, just like normal shared accesses.
- The ability to fully (globally) complete an individual nb transfer without forcing completion of other communication.
Note that as part of the compromise approach, we conceded requiring the sync
call to be made for the _nb calls instead of allowing the user to omit it
because BUPC needed that to free resources, and BUPC conceded that the sync
call could be optional for _nbi calls.
Original comment by johnson....@gmail.com
on 12 Aug 2012 at 5:20
" - Full equivalence between existing upc_mem* functions and their upc_mem*_nb
counterparts plus a local sync. (I believe Yili agreed with this point when we
worked on the consensus proposal.)"
See comment 65 - the Berkeley proposal provides this exact equivalence, the
CRAY proposal does not (because it adds extra synchronization which is not
present in the blocking version).
" - upc_barrier should continue to quiesce all shared accesses."
Programmers who care about this property can achieve it by placing their
barriers outside transfer intervals. Comment #56 demonstrates why programmers
need to be aware of the boundaries of their transfer intervals to write correct
programs with this library, under either proposal.
In my opinion this unmoderated textual conversation has become a waste of time
on both sides. Neither side is barely listening to the other, and we're
certainly not making progress. Nobody aside from the two arguing parties seems
to be reading it. I'm done with this thread.
Original comment by danbonachea
on 12 Aug 2012 at 5:41
"See comment 65 - the Berkeley proposal provides this exact equivalence, the
CRAY proposal does not (because it adds extra synchronization which is not
present in the blocking version)."
I want equivalence for the nb call plus the lsync, not the nb call plus the
gsync. It's the gsync call that has the half-fence, so I think you've confused
the two proposed sync calls.
Another example to think about before the call...
// assume THREADS == 2
shared int x[THREADS];
int delivered;
...
if ( MYTHREAD == 0 ) {
    x[0] = 1;

    /* BUPC: Starts an async transfer window that is terminated by a sync call.
     *
     * Cray: Starts an async transfer window that is terminated by a sync call
     * or by any fence.
     */
    upc_memput_[async/nb] with target x[1], source x[0], for sizeof( int ) bytes

    ...
    /* Computation or other communication. */
    ...

    /* BUPC: Has no effect on the async transfer.
     *
     * Cray: Forces completion of the async transfer.
     */
    upc_barrier;

    upc_barrier;

    upc_barrier;

    /* BUPC: Required to terminate the async transfer window and to free
     * resources.
     *
     * Cray: Concedes that the sync can be required in UPC 1.3 to give other
     * implementations an opportunity to free resources, but does not currently
     * require it because any fence will terminate the window and Cray has
     * sufficient resource tracking without it.
     */
    upc_waitsync / upc_gsync
}
else {
    upc_barrier;

    /* BUPC: This access is illegal, even though a barrier separates it from
     * Thread 0's x[0] = 1 assignment, because Thread 0 has not yet sync'd a
     * copy that uses x[0] as a source buffer.
     *
     * Cray: This access is legal and has no effect on Thread 0's async copy
     * because the barrier already forced completion of the copy.
     */
    x[0] = 2;

    upc_barrier;

    /* BUPC: This local access is illegal because Thread 0 has not yet sync'd
     * a copy that uses x[1] as a destination buffer. The value of x[1] may be
     * 0, 1, 2, or even something else.
     *
     * Cray: This local access is legal and will read that x[1] == 1.
     */
    delivered = ( x[1] > 0 );

    upc_barrier;

    if ( ( delivered == 0 ) && ( x[1] == 1 ) ) {
        /* BUPC: The claim is that this program is already illegal because it
         * contains illegal accesses, so this test proves nothing. Cray
         * considers the accesses to be legal, such that if this block executes,
         * it proves that the async read the x[0] value from one side of a
         * barrier but wrote it to x[1] on the other side of the barrier,
         * thereby detecting a relaxed access that crossed a barrier. BUPC
         * claims this situation is possible but not legally observable and
         * therefore not a memory model violation.
         *
         * Cray: This block will never execute because the x[1] value was
         * delivered back at the first barrier.
         */
    }
}
Original comment by johnson....@gmail.com
on 12 Aug 2012 at 8:34
A related issue is the semantics for non-blocking data movement as they will be
applied to an eventual non-blocking collectives proposal. This is related to
the current topic because it has been stated (By George in comment #4) that it
is strongly desired that the "Collectives 2.0" proposal, when revised, use the
same model/semantics for asynchronous data movement as the NB memcpy family (to
the extent that can be done, given that one is point-to-point while the other
is collective).
It is my position that the Berkeley semantics (destination(s) undefined during
a transfer interval spanning from init to sync) is equally/easily applicable to
specification of semantics for non-blocking collectives. I don't have a
similar confidence that the Cray semantics would work well for collectives, but
feel that to be fair Cray should make that judgement, not me. So, here are two
"points of concern" which come to mind regarding NB collectives.
1) Unlike the point-to-point NB memcopy interfaces, there is no distinguished
initiator. For instance, given a Gather operation: is the root thread the
initiator of all data movement (Pull), or are all threads initiators of the
pieces with local affinity (Push). This is important to define with the Cray
semantics because it determines which strict operations would be required to
complete the collective operation. "All strict operations on all participating
threads" is one possible answer that is completely unambiguous. Another
unambiguous option might be "strict operations fence all data movement to and
from memory with affinity to the issuing thread".
2) Even in the presence of a clear (and agreeable to all parties) definition of
which strict operations completed which portions (or the entirety) of a
collective operation, I wonder if such a semantic is practical to implement
given that most scalable implementation involve spanning trees and thus
multiple intermediaries along the data movement path (again envisioning a
Gather as an example). Thus, a strict reference may require global
communication/coordination even when issued by a single thread (thus NOT
collectively).
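To make point 1 concrete, here is a UPC-style sketch of a hypothetical non-blocking gather; every name in it (upc_coll_handle_t, upc_all_gather_nb, upc_coll_sync_nb) is illustrative only and not part of any existing proposal:

```c
#include <upc_relaxed.h>
#define NELEMS 4

typedef int upc_coll_handle_t;                        /* hypothetical */
extern upc_coll_handle_t upc_all_gather_nb(shared void *dst,
                                           shared const void *src,
                                           size_t nbytes);      /* hypothetical */
extern void upc_coll_sync_nb(upc_coll_handle_t h);              /* hypothetical */

shared [NELEMS] int contrib[NELEMS*THREADS];  /* per-thread contributions */
shared [] int *root_buf;                      /* gather destination at the root */

void do_gather(void)
{
    /* Collective initiation: every thread calls this.  "Pull": the root
       thread initiates all data movement.  "Push": each thread initiates
       the piece with local affinity.  Under the Cray semantics that choice
       determines whose strict operations must complete the collective. */
    upc_coll_handle_t h = upc_all_gather_nb((shared void *)root_buf, contrib,
                                            NELEMS * sizeof(int));

    upc_fence;              /* ambiguous: completes my piece? all pieces? */

    upc_coll_sync_nb(h);    /* explicit completion is unambiguous either way */
}
```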
Original comment by phhargr...@lbl.gov
on 12 Aug 2012 at 9:09
In comment #68, Troy wrote:
"I said that MPI is a red herring because Paul asked how our fences handled
MPI_Send."
No, I did not.
No reference to MPI_Send appears in any comments I have authored.
Keeping this discussion civil is hard enough without people putting words in my
mouth. Perhaps it was an honest mistake, since Dan (who DID post a mixed
UPC/MPI example) and I have similar viewpoints on most things, but that mistake
just reinforced Dan's observation that "Neither side is barely listening to the
other".
Original comment by phhargr...@lbl.gov
on 12 Aug 2012 at 9:38
"No, I did not. No reference to MPI_Send appears in any comments I have
authored."
Oops, yep, that was a Dan comment in #65. My apologies. 72 comments is a lot
to keep track of!
I'm sorry if you and Dan don't feel that I'm listening. I've tried to respond
to all comments as quickly as possible (it's the weekend, even...) because I
had hopes that we could resolve more issues prior to the call. This online
discussion is a wonderful tool and has shed a lot of light on how different
people interpret the UPC 1.2 spec as well as plans for changes. The fact that
a discussion occurred that did not cause any of the three of us to switch sides
is unfortunate, but it happens, and believe it or not there are other issues on
this site that I'm MUCH more interested in resolving than this one. Hopefully
there is enough info here now that others who will be on the call can come in
knowing which design they support, or even come in with a new third idea.
Original comment by johnson....@gmail.com
on 12 Aug 2012 at 11:02
Dan wrote: "In my opinion this unmoderated textual conversation has become a
waste of time on both sides. Neither side is barely listening to the other, and
we're certainly not making progress. Nobody aside from the two arguing parties
seems to be reading it. I'm done with this thread."
Lest my lack of interaction on this topic be interpreted as implicit
agreement/disagreement or apathy -- I have followed the exchanges with interest
and formed my own opinion. Although the debate has been sometimes heated, I
agree with Troy that many good points have been made and issues raised on both
sides.
For the record, as I've stated before: I'm not generally in favor of adding any
async. operations to UPC, because to me they feel too "MPI like" and because
they introduce a degree of anachronism (by design) that makes parallel programs
written in UPC even more difficult to understand and analyze. Separately, I
questioned the suitability of some enhancements made in the NPB benchmark suite
that used Berkeley's async. memory copy operations. At the time, it was a
fairly easy call that those non-standard vendor-only extensions should be
excluded from the NPB suite given its role as a reference benchmark. Even when
an async. memory op. library is accepted as a required library, I'd probably
question its use in the default implementations of the NPB benchmarks, but I'd
have less ground to stand on.
That said, given that the decision has been made to add an async. memory op.
library to the required UPC library, I support Berkeley's position that the
transmission interval between an init and sync. call should be defined in a way
that it is unaffected by strict accesses or upc_fence statements that occur
within the transmission interval and conversely those strict accesses and
upc_fence statements do not affect the semantics of the async. memory op.
library calls.
My main reason for supporting Berkeley's suggested definition is based on the
assumption that a programmer who uses these async. memory op. library calls is
likely a more expert/advanced UPC programmer, and that this programmer has
decided to use the async. memory operation calls to improve program
performance. To gain the most performance improvement that the async. mem. op.
library has to offer, I think that the programmer will willingly accept the
responsibility to ensure that the source data buffers remain untouched during
the transfer interval and that the destination buffer is not valid until the
relevant sync. call has returned.
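The contract described above can be sketched in UPC; upc_memput_nb and upc_sync here follow the init/sync shape discussed in this thread, but the exact names and signatures are assumptions:

```c
#include <upc_relaxed.h>
#define N 128

extern void compute_on_private_data(void);   /* unrelated overlap work */

shared [N] double dst[N*THREADS];
double src[N];

void overlapped_put(void)
{
    upc_handle_t h = upc_memput_nb(&dst[N*MYTHREAD], src,
                                   N * sizeof(double));

    /* Transfer interval: src must stay untouched and dst is undefined.
       The programmer accepts responsibility for both in exchange for
       maximal overlap of communication and computation. */
    compute_on_private_data();

    upc_sync(h);        /* after this returns, dst is defined again */
    src[0] = 0.0;       /* and src may be reused */
}
```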
I also support the addition of a signalling put capability, perhaps as a
separate library. I would in fact like to see it added as a required library
in the 1.3 spec. if it has been determined to be both suitably well-defined and
understood.
If the decision is made to maintain the "half fence" (a term coined in comment
#24), and then Berkeley holds to their position of maintaining their bupc_*
calls as a response, I would view this as a failing of the UPC language design
and specification process. In fact, as a UPC compiler and runtime implementer,
my reaction is: "why bother?" when considering the implementation of the new
required library. Keeping separate bupc_ and cray_ forms of the library call
seems superior to me, because at least it emphasizes the differences and
doesn't offer the illusion of standardization.
Given that the current Cray definition of the async. mem. ops. is more
restrictive than Berkeley's, there is no backwards compatibility issue for
their users if the Berkeley semantics are accepted. That is: programs written
to the Cray definition of the library will continue to work as they did before.
Thus it seems that the decision of whether to impose additional restrictions on
the use of the async. mem. ops. reduces to a philosophical difference, as noted
in one of
the preceding comments.
In forming my opinion, I found the following points to be most compelling:
1. Comment #23: "One of my many criticisms of the current async proposal is
that it cannot achieve this level of overlap on the important case of
cache-coherent shared-memory systems, because the semantics require the
insertion of "half-fences" and guarantees of local/global visibility, which are
significantly more heavyweight and would inhibit the hardware-provided
reordering optimizations. As such, the simple case of upc_mem*_async();gsync()
would be expected to perform SLOWER and provide LESS overlap than the existing
analogous upc_mem* call for those systems. It makes no sense to introduce an
"async" library whose semantics for memory overlap are often MORE restrictive
than the existing synchronous counterparts."
2. comment #43: "Our goal with the Berkeley async transfer library was to
enable far more aggressive overlap of communication with unrelated computation
and other communication. We are trying to overlap the entire cost of a
communication, and allow it to asynchronously continue undisturbed without
interference from unrelated operations."
3. comment #47: "I think the Berkeley semantics are better described as an
overhead expressway - it bypasses all the city traffic below and is unaffected
by city traffic lights, because the laws of the road guarantee the cars below
cannot even SEE the highway traffic, let alone interact with it. The on-ramps
and off-ramps are clearly defined by the library calls which define where cars
enter and exit the normal flow of relaxed operations on the city streets, but
while they're "in-flight" on the expressway they operate completely
independently of everything else."
4. comment #56: "This doesn't mean the memory model is "broken", it just means
there is a new way to write programs with undefined behavior that the memory
model does not govern. Programmers who embark upon using a new library feature
are burdened with using it CORRECTLY, otherwise they get undefined behavior and
commonly derived properties of the memory model may not hold true."
Although I appreciate that some users have indicated their preference for the
"half-fence" semantics proposed by Cray, I wonder if they would still support
that preference if they were told that the performance of the library call
might be impacted by this restriction?
Original comment by gary.funck
on 13 Aug 2012 at 5:14
Hi Gary. Thanks for offering another perspective. I would like to point out
that the async plus the gsync is not equivalent to the existing upc_mem* call
because the existing call guarantees only local completion whereas the gsync is
global completion (and the gsync is where the half fence is). So by converting
an existing upc_mem* call to an async plus an lsync, which is equivalent,
there is no new performance hit.
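Troy's conversion can be written out directly; the _nb and sync names follow the forms used elsewhere in this thread and are assumptions:

```c
#include <upc_relaxed.h>

void convert(shared void *dst, const void *src, size_t nbytes)
{
    /* Before: blocking call; guarantees only local completion. */
    upc_memput(dst, src, nbytes);

    /* After: the equivalent split-phase form.  The local sync waits
       only for local completion, so it imposes nothing stronger than
       the blocking call did; the half-fence lives in the global sync
       (gsync), which this conversion never invokes. */
    upc_handle_t h = upc_memput_nb(dst, src, nbytes);
    upc_sync_local(h);   /* "lsync": local completion only */
}
```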
Original comment by johnson....@gmail.com
on 13 Aug 2012 at 5:52
Correction: "introduce a degree of anachronism" should read "introduce a degree
of asynchronism". (My mis-use of the spell checker.)
Original comment by gary.funck
on 13 Aug 2012 at 5:58
In Comment 75, Troy wrote (in part): "So by converting an existing upc_mem*call
call to an async plus an lsync, which is equivalent, there is no new
performance hit."
Understood. Thanks for the clarification.
Original comment by gary.funck
on 13 Aug 2012 at 6:00
Gary Funck wrote (in part):
> If the decision is made to maintain the "half fence" (a term coined in
comment #24), and then Berkeley
> holds to their position of maintaining their bupc_* calls as a response, I
would view this as a failing of
> the UPC language design and specification process. In fact, as a UPC
compiler and runtime implementer,
> my reaction is: "why bother?" when considering the implementation of the new
required library.
Gary,
Berkeley UPC has an obligation to backwards compatibility for our existing
users' codes which demands that we retain these entry points at least for some
reasonable deprecation period during which users can convert their codes to the
standardized interfaces. This would be true regardless of what semantics are
chosen for the standard. How close or far the standard's semantics are from
those of our current interfaces will determine if there are two distinct
implementations within our runtime, or just a layer of #defines.
I fully expect that Cray has a similar commitment to keeping their users' codes
operating unchanged, at least in the near term. I don't think any of us are
retaining legacy interfaces for the purpose of undermining the standardization
process.
> Although I appreciate that some users have indicated their preference for the
"half-fence" semantics
> proposed by Cray, I wonder if they would still support that preference if
they were told that the
> performance of the library call might be impacted by this restriction?
It is our hope that the signaling-put (which will have both blocking and async
flavors) will prove applicable to the same usage case (producer-consumer
notification) that the "half-fence" or "gsync" are intended to address.
Quoting now from Steven in comment #40:
> The "half-fence" that we proposed on the global sync formally provides acquire semantics on relaxed
> accesses. This is necessary to permit pairwise synchronization with a remote thread via relaxed
> operations to notify that thread that the non-blocking operation is complete.
While the signaling-put is intended to "package" the relaxed (blocking or
non-blocking) operation and the pairwise synchronization into a single
operation.
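A sketch of how such a packaged operation might look follows; upc_memput_signal and the flag protocol are purely illustrative (issue 80 holds the actual proposal):

```c
#include <upc_relaxed.h>
#define N 64

shared [N] double data[N*THREADS];
strict shared int flags[THREADS];      /* one notification flag per thread */

void producer(int peer, const double *src)
{
    /* Hypothetical signaling put: delivers the payload and then sets the
       peer's flag, with the payload-before-flag ordering packaged inside
       the call rather than relying on a separate strict write or fence. */
    upc_memput_signal(&data[N*peer], src, N * sizeof(double),
                      &flags[peer], 1);
}

void consumer(void)
{
    while (!flags[MYTHREAD]) ;         /* strict reads: spin until signaled */
    /* data[N*MYTHREAD] .. data[N*MYTHREAD+N-1] is now safe to read */
}
```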
Original comment by phhargr...@lbl.gov
on 13 Aug 2012 at 6:10
An addition to my comment 74: If a consensus can't be reached where both Cray
and Berkeley agree to deprecate their flavors of the async. mem. op libraries
in favor of the new standard required library, I would move to defer inclusion
of the async. mem. op. library in version 1.3 of the specification.
A separate editorial note: I understand the advantage of vendors offering new
libraries prefixed with something that identifies the vendor, to avoid
confusion that the new library appears to be part of the UPC standard, for
example. But from my perspective this has created a Tower of Babel where for
example, we have separately named libraries for returning cpu timer ticks, or
we have async. mem. op. libraries with differing semantics (and differing
names). What I would much prefer is that vendors conferred and then drafted a
single proposed library that appears in the Proposed library section of the UPC
specification. It is this form of the library that users might experiment with
so that ultimately the definition is finalized and moved to "required" or
"optional". Users who use the proposed library accept the responsibility to
change their code as the library definition evolves before it is finalized.
Original comment by gary.funck
on 13 Aug 2012 at 6:12
Paul wrote (in part): "Berkeley UPC has an obligation to backwards
compatibility for our existing users' codes which demands that we retain these
entry points at least for some reasonable deprecation period during which users
can convert their codes to the standardized interfaces. This would be true
regardless of what semantics are chosen for the standard. How close or far the
standard's semantics are from those of our current interfaces will determine if
there are two distinct implementations within our runtime, or just a layer of
#defines."
I understand that those vendor-specific interfaces will have to be preserved --
for a while. The question is: when/if those old vendor-specific interfaces are
deprecated and then removed. My guess is that if they offer distinct
capabilities separate from the new standard library that they will never be
deprecated. It is this potential for divergence that bothers me and in my view
significantly lessens the value of a standard definition.
Original comment by gary.funck
on 13 Aug 2012 at 6:20
Gary wrote (in part):
> My guess is that if they offer distinct capabilities separate from the new
standard library that they will
> never be deprecated. It is this potential for divergence that bothers me and
in my view significantly
> lessens the value of a standard definition.
There must be a balance struck between the individual implementer's freedom to
"innovate" and the "good of the community". I suspect that the economic
realities faced by commercial vendors will make them less quick to remove the
implementation-specific interfaces than the not-for-profits. So, there may be
no "right" answer. However, that won't prevent me from putting forth a set of
SUGGESTIONS:
+ As long as a new library is Optional there should be no "pressure" for
removal of a vendor-specific (near) equivalent. This is intended to encourage
head-to-head comparison which will help form a community consensus as to
whether the Optional library should be promoted to Required status or
potentially dropped if a better solution is to be found by embracing an
alternative in a future Spec revision.
+ Once a library has reached Required status, the first compiler release by a
given vendor to claim to support the new spec version should document any
"competing" vendor-specific library/feature as deprecated.
+ Once a library has reached Required status, a vendor should REMOVE
"competing" extensions with their first compiler to claim to support the NEXT
spec, or after 2 calendar years (whichever is sooner). Example: a compiler for
UPC spec v1.4 drops vendor-specific libs which compete with Required libs
introduced in the 1.3 spec.
Since there is no UPC "branding" there is no manner by which we can "enforce"
these suggestions, but I'd like to think we can all agree to behave as members
of a cooperative community.
Thoughts?
Original comment by phhargr...@lbl.gov
on 13 Aug 2012 at 6:48
I spent a lot of time thinking about this over the weekend (in between moving
stuff into the new apartment, so I missed all the new comments...). I did end
up finding a significant use case for allowing non-blocking calls to continue
beyond a fence that I could not come up with a good way of restructuring to
avoid the fence. In light of that, I sat down with Troy this morning and
discussed what we could do to make this use-case work. We still have some
reservations about the Berkeley proposal, but I think Cray can get a lot closer
to it now that we have a good concrete use-case to work with. We haven't fully
fleshed things out yet, but here are the major changes that we're considering at
the moment:
1. Remove the *_gsync() and "half-fence" from the non-blocking proposal. There
would only be a single sync call mostly matching the semantics of the original
Berkeley proposal. As Dan pointed out, most of the desire for this call on
Cray's part is subsumed by the signalling put proposal from Berkeley (Could
this be added to the issue tracker? I'd also like to see a signalling
get in there...). I still think that the functionality provided by the
"half-fence" could be generally useful in the future, so I'll make a separate
proposal for language level support for the addition of less restrictive
fences. And yes, that proposal will include the proposed formal changes to the
memory model.
2. Permit non-blocking operations to continue beyond a fence with some
restrictions.
3. Only the calling/initiating thread would be explicitly restricted from
accessing memory locations involved in the call until the sync. I'm still
working out the formal modifications to the memory model, but the gist of it is
that the read and write operations involved in a non-blocking call occur at
some unspecified time between the initiation and the sync, and are treated as
relaxed (or local for the non-shared arguments to memget/memput) operations,
thus they can be re-ordered with respect to other relaxed operations, including
those occurring "after" the sync in the source. However, all threads must
agree on their order with respect to strict operations issued by the calling
thread, and thus must also agree on the set of relaxed operations with which
they can be re-ordered.
I think these changes would suffice to enable the functionality and performance
that we're looking for in non-blocking routines, hopefully without
significantly burdening implementations or adding undue complexity to memory
access rules. Importantly, this restricts the possible results of overlapping
accesses to involved memory locations rather than making such programs illegal,
which should make debugging and formal analysis of programs using the
non-blocking operations simpler.
Programs using the existing Berkeley routines could switch to using the
standard ones with no changes. Programs using the existing Cray routines might
require changes, though I think we could safely make this change as the
routines were mostly used to avoid the overheads caused by the same-address
restriction.
This is still a rough idea, so comments are welcome.
Original comment by sdvor...@cray.com
on 13 Aug 2012 at 4:44
To help clarify comment 82, consider the following short UPC program:
#include <upc_relaxed.h>

relaxed shared [2] int src[2*THREADS];
relaxed shared [2] int dst[2*THREADS];
strict shared int phase = 0;

int main()
{
    int i;
    upc_handle_t handle;

    if ( THREADS < 2 ) return -1;

    upc_forall(i=0;i<2*THREADS;++i;&src[i]) src[i] = dst[i] = MYTHREAD;
    upc_barrier;

    if ( MYTHREAD == 0 ) {
        phase = 1;
        handle = upc_memcpy_nb( &dst[2], &src[0], 2 * sizeof(int) );
        phase = 2;
        upc_sync( handle );
        phase = 3;
    }
    else {
        int my_phase[2], my_dst[2];
        my_phase[0] = phase;
        my_dst[0] = dst[2];
        my_dst[1] = dst[3];
        my_phase[1] = phase;
    }
    return 0;
}
Obviously, if a thread's first read of phase (into my_phase[0]) observes the
value '3', then that thread is guaranteed to observe the value '0' for both
locations of dst, as the sync call occurs before thread 0's strict write of the
value '3' into phase. Likewise, if a thread's final read of phase (into
my_phase[1]) observes the value '0', then that thread is guaranteed to observe
the value '1' for both locations of dst, as the initiation call occurs after
thread 0's strict write of the value '1' into phase.
If my_phase[0] reads any of '0', '1', or '2' from phase and the value read into
my_phase[1] doesn't prevent it, or my_phase[1] reads any of '1', '2', or '3'
and the value read into my_phase[0] doesn't prevent it, then the values read
into my_dst[0] and my_dst[1] could be either '0' or '1' (or something else
entirely, as noted in UPC 1.2 B.3.2.1 paragraph 2). This is due to the relaxed
operations associated with the upc_memcpy_nb() occurring at an unspecified time
between the initiation and the sync; the values are thus only guaranteed
outside of that region.
However, the values must be consistent across threads. Therefore, the
following behavior is disallowed:
Thread 1 observes
my_dst[0] is 0
my_phase[1] is 1
Thread 2 observes
my_phase[0] is 2
my_dst[0] is 1
Because all threads must observe the operations occur in the same order with
respect to strict operations on the calling thread, it is not permitted for
thread 2 to observe the original value of dst[2] after observing the value '2'
for phase if thread 1 has observed the updated value of dst[2] before observing
the strict write of the value '2' into phase.
Formally, thread 1 and thread 2's orderings include (with the first argument
representing the thread issuing the operation)
T1: RW(T1,dst[2],1) SW(T0,phase,1) RW(T0,dst[2],0) RR(T1,dst[2],0)
SR(T1,phase,1) SW(T0,phase,2)
T2: RW(T1,dst[2],1) SW(T0,phase,1) SW(T0,phase,2) SR(T2,phase,2)
RR(T2,dst[2],1) RW(T0,dst[2],0)
Since the thread 1 and thread 2's orderings disagree with when RW(T0,dst[2],0)
occurs relative to SW(T0,phase,2), the orderings are incompatible, and thus
disallowed.
The following similar behavior is allowed though:
Thread 1
my_dst[0] is 0
my_phase[1] is 2
Thread 2 observes
my_phase[0] is 2
my_dst[0] is 1
There are now at least two potential orderings for thread 1, one matching the
scenario above, where the write to phase happens after the write to dst,
T1: RW(T1,dst[2],1) SW(T0,phase,1) RW(T0,dst[2],0) RR(T1,dst[2],0)
SW(T0,phase,2) SR(T1,phase,2)
and one where the write to phase happens before the write to dst
T1: RW(T1,dst[2],1) SW(T0,phase,1) SW(T0,phase,2) RW(T0,dst[2],0)
RR(T1,dst[2],0) SR(T1,phase,2)
Notice that both generate the observed behavior, as thread 1 doesn't have
enough information to show where RW(T0,dst[2],0) occurred relative to
SW(T0,phase,2). Thread 2 still has
T2: RW(T1,dst[2],1) SW(T0,phase,1) SW(T0,phase,2) SR(T2,phase,2)
RR(T2,dst[2],1) RW(T0,dst[2],0)
which is compatible with the second option for thread 1's ordering. Thus,
there is a valid ordering on all threads and the behavior is therefore allowed.
Original comment by sdvor...@cray.com
on 13 Aug 2012 at 7:59
"1. Remove the *_gsync() and "half-fence" from the non-blocking proposal.
There would only be a single sync call mostly matching the semantics of the
original Berkeley proposal."
I'm glad to hear we may be making some progress here.
"As Dan pointed out, most of the desire for this call on Cray's part is
subsumed by the signalling put proposal from Berkeley (Could this be added to
the issue tracker? I'd also like to see a signalling get in there...)."
See issue 80: http://code.google.com/p/upc-specification/issues/detail?id=80
"However, all threads must agree on their order with respect to strict
operations issued by the calling thread,"
I still strongly believe it's a mistake to go down that road. It's simpler and
less confusing to state the contents of the destination buffer are undefined
for the entire transfer interval (until after the sync), which also gives
implementations more flexibility wrt in-flight operations. I don't see a
motivation for accommodating programs that want to touch the transfer buffers
(from ANY thread) before the explicit sync.
Regarding your example in comment 83:
If non-zero threads read my_phase[1]==0, then the reads of dest are guaranteed
to have seen the initial values. We agree there.
However in all other cases, the Berkeley semantics state that if non-zero
threads read anything except '3' into my_phase[0], then the subsequent reads of
dst are reading undefined values (because the program failed to ensure they
occurred after the successful upc_sync). The contents of the destination buffer
remain undefined for ALL threads until after the upc_sync, which in practice
means other threads cannot safely read the contents until the initiating thread
has used a strict operation or other synchronization method to signal that the
upc_sync has successfully completed.
Furthermore, "undefined" should not be read as "the prior or subsequent value",
or "something I can explain via the memory model", it really is COMPLETELY
undefined (ie garbage). In particular, there is no guarantee that two different
threads should read the same undefined values, or that anything read should
look anything like the original or final values. Once a program has strayed
into reading undefined values, the memory model provides no guarantees
whatsoever, and there is no need for the undefined values to be "explainable"
in terms of the memory model. As a consequence of this, VALID programs can
never even observe "where" in the transfer interval the "unspecified" data
transfer operations occurred, let alone agree on where they occurred.
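Under these semantics the only portable way for another thread to read the destination is to wait for a post-sync signal from the initiator. A minimal sketch, reusing the shapes from the example in comment 83 (names assumed):

```c
#include <upc_relaxed.h>

shared [2] int dst[2*THREADS];
strict shared int done = 0;

void initiator(shared const void *src)
{
    upc_handle_t h = upc_memcpy_nb(&dst[2], src, 2 * sizeof(int));
    /* dst[2..3] is undefined for ALL threads during this interval */
    upc_sync(h);
    done = 1;            /* strict write publishes the completed transfer */
}

void reader(int *out)
{
    while (!done) ;      /* strict reads: wait for the initiator's signal */
    out[0] = dst[2];     /* now guaranteed to see the transferred values */
    out[1] = dst[3];
}
```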
This philosophy gives implementations the maximal freedom within the transfer
interval to perform the transfer via the most efficient means possible, with
guaranteed freedom from conflicting accesses by perverse programs. We want the
destination buffer to remain completely "off-limits" until the explicit sync
succeeds. This flexibility enables a number of optimizations that would not
otherwise be possible. We expect anything less than this to have a measurable
performance impact on the common case for platforms of interest, and therefore
to be unacceptable.
"restricts the possible results of overlapping accesses to involved memory
locations rather than making such programs illegal, which should make debugging
and formal analysis of programs using the non-blocking operations simpler."
This is a valid consideration, but in my opinion should be completely secondary
to any performance-related concerns. The advanced user elected to use this
advanced interface solely to gain performance - that's the entire motivation
for the added complexity of this interface. That being said, there are various
potential approaches to debugging the use of this library - the simplest being
to replace non-blocking calls back to their blocking equivalents and see what
changes. Another approach, taken by dynamic race condition detectors, would be
to explicitly write "trap" values into the destination buffers during the
transfer interval and signal when threads touch them.
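The first approach can be a build-time substitution; a sketch, where UPC_COMPLETE_HANDLE is a hypothetical "already complete" handle value:

```c
/* Debug build: map each non-blocking call back to its blocking equivalent
   so the transfer interval collapses to a point.  Any behavior change then
   implicates code that touched the buffers mid-transfer.  The call names
   match the forms used earlier in this thread; the handle value is made up. */
#ifdef DEBUG_NB
  #define UPC_COMPLETE_HANDLE     ((upc_handle_t)0)   /* hypothetical */
  #define upc_memcpy_nb(d, s, n)  (upc_memcpy((d), (s), (n)), UPC_COMPLETE_HANDLE)
  #define upc_sync(h)             ((void)(h))
#endif
```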
Original comment by danbonachea
on 16 Aug 2012 at 12:57
"However in all other cases, the Berkeley semantics state that if non-zero
threads read anything except '3' into my_phase[0], then the subsequent reads of
dst are reading undefined values (because the program failed to ensure they
occurred after the successful upc_sync). The contents of the destination buffer
remain undefined for ALL threads until after the upc_sync, which in practice
means other threads cannot safely read the contents until the initiating thread
has used a strict operation or other synchronization method to signal that the
upc_sync has successfully completed."
This implies that the non-blocking forms, if synced before any fences, are NOT
equivalent to the "blocking" forms. Threads can touch memory involved in
upc_mem{cpy|get|put}() during the operation, and there are well-defined, though
very loose, rules regarding the results. This should also be true of the
non-blocking variants.
"In particular, there is no guarantee that two different threads should read
the same undefined values, or that anything read should look anything like the
original or final values."
The memory model already explicitly permits this: 'Furthermore, because the
shared writes implied by the library call have unspecified size, thread 1 may
even read intermediate values into local_z0 and local_z1 which correspond to
neither the initial nor the final values for those shared objects.'
"This philosophy gives implementations the maximal freedom within the transfer
interval to perform the transfer via the most efficient means possible, with
guaranteed freedom from conflicting accesses by perverse programs. We want the
destination buffer to remain completely "off-limits" until the explicit sync
succeeds. This flexibility enables a number of optimizations that would not
otherwise be possible. We expect anything less than this to have a measurable
performance impact on the common case for platforms of interest, and therefore
unacceptable."
As defined, it permits using a (non-UPC) thread to do the transfer
asynchronously. It permits use of RDMA hardware. It permits memory reordering
on SMPs. It permits caching/aggregation, assuming flushing occurs on fences,
as is required for regular relaxed accesses. What non-blocking optimizations
does it prohibit?
"That being said, there are various potential approaches to debugging the use
of this library - the simplest being to replace non-blocking calls back to
their blocking equivalents and see what changes. Another approach, taken by
dynamic race condition detectors, would be to explicitly write "trap" values
into the destination buffers during the transfer interval and signal when
threads touch them."
Neither of which can conclusively answer the question "Is my program a valid
UPC program?" in all cases. No other memory operation in UPC places USER
restrictions on other threads' memory operations. There are restrictions
placed on the implementation, but never the user. The code may not behave the
way the user expects or intends, but it is still a valid UPC program.
Prohibiting the calling thread from accessing the memory is ok, as it is
relatively easy to prove that the user's code is valid or invalid with that
restriction.
Original comment by sdvor...@cray.com
on 16 Aug 2012 at 3:23
"This implies that the non-blocking forms, if synced before any fences, are NOT
equivalent to the "blocking" forms. Threads can touch memory involved in
upc_mem{cpy|get|put}() during the operation, and there are well-defined, though
very loose, rules regarding the results. This should also be true of the
non-blocking variants."
I disagree. B.3.2.1 specifies that blocking upc_mem* updates the destination
buffer with "relaxed shared writes of unspecified size and ordering", which
implies that threads concurrently reading the destination buffer without
synchronization are effectively reading garbage (it may not look anything like
the initial or final values). This is exactly the same as a Berkeley
upc_sync(memput_async()), i.e. an empty transfer interval. In both cases the
destination values read by other unsynchronized threads are completely
indeterminate. Equivalence of undefined values is meaningless.
"As defined, it permits using a (non-UPC) thread to do the transfer
asynchronously. It permits use of RDMA hardware. It permits memory reordering
on SMPs. It permits caching/aggregation, assuming flushing occurs on fences,
as is required for regular relaxed accesses. What non-blocking optimizations
does it prohibit?"
It suffers from the same performance pitfall as the original proposal - namely
that random strict operations issued by the caller for unrelated computations
during the transfer interval globally destroy the overlap potential. This is
especially a problem in multi-module/multi-author/multi-domain applications
where one module wants to initiate some communication in preparation for the
next time step, then call a different, independent module to compute on its own
independent data in the overlap interval. If the programmer has explicitly
asserted a given transfer can proceed independently of everything until the
explicit sync and has promised not to touch the transfer buffers until then, we
should not stall on the first random strict operation in the transfer interval
to await completion (thereby silently defeating the programmer's attempt at
overlap).
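To make the overlap scenario concrete, here is a minimal sketch using the proposed explicit-handle interface (upc_memcpy_nb/upc_sync, per the names in this proposal); the buffers and unrelated_module_step() are hypothetical placeholders:

```c
/* UPC sketch (fragment, not standalone): overlap across a module boundary.
 * dst/src are shared pointers, and unrelated_module_step() is a
 * hypothetical function from an independently-authored module. */
upc_handle_t h = upc_memcpy_nb(dst, src, nbytes); /* initiate for next time step */
unrelated_module_step();  /* independent computation; may issue strict ops */
upc_sync(h);              /* explicit sync: transfer interval ends here */
```

Under the Berkeley semantics a strict operation inside unrelated_module_step() need not stall on the outstanding transfer; under the Cray semantics it would silently complete it, defeating the overlap.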
This pitfall creates the possibility of "performance bugs" whenever the
programmer calls a function they did not write during the transfer interval, if
that function happens to include a strict operation. The user has taken pains
to arrange an overlapped transfer and everything in his code is kosher, but the
unrelated library call is stalling to complete his overlapped transfer for no
reason. This is a "global" kind of bug that crosses all module boundaries, and
may be impossible for the programmer to diagnose in the case of separate
compilation. This seems far more surprising and problematic to me than programs
which "break the contract" of the non-blocking transfer interval and issue
unsynchronized reads to the destination that result in undefined values; that
is usually a "local" kind of bug because escape analysis can at least tell you
which modules may possibly be touching the data buffers, and in common cases it
should only be the current module.
"Neither of which can conclusively answer the question "Is my program a valid
UPC program?" in all cases."
UPC is not a type-safe language. That property is impossible to prove for most
interesting programs, even without this library. That's not even a goal of
C-based languages. Programmers who want/need that type of guarantee should be
using a type-safe, high-level language. Otherwise, they need to fully
understand the semantics of their language/library and use their brains to
avoid straying into undefined behavior.
"No other memory operation in UPC places USER restrictions on other threads'
memory operations. There are restrictions placed on the implementation, but
never the user. The code may not behave the way the user expects or intends,
but it is still a valid UPC program. Prohibiting the calling thread from
accessing the memory is ok, as it is relatively easy to prove that the user's
code is valid or invalid with that restriction."
Disagree again. Even in plain 1.2, any program which contains data races can
result in reading indeterminate values. Every read-after-write data race
between threads (ie without intervening synchronization) results in
indeterminate values being read.
It's important to note the bulk of the formal memory model is concerned with
individual reads and writes of MATCHING scalar size, where word-tearing is
implicitly assumed to not occur (this is a weakness of the current model which
remains to be solved - issue 61). Under those conditions and assumptions the
"indeterminate" values returned by those racing reads are guaranteed to return
either the initial or final value. All of the memory model examples in the
appendix work under that regime. However once you throw byte-oriented bulk
transfers into a race, that small assurance also goes out the window and racing
reads may observe values which are neither the initial nor final value, and are
just plain undefined. This is why B.3.2.1 specifies that byte-oriented library
calls make relaxed read/writes of "unspecified SIZE and order", to make it
clear that conflicting reads (of even blocking upc_mem* calls) will get
undefined values (this should perhaps be made more explicit). This is ugly but
it reflects the reality of implementation - it may be possible (in most cases)
to prevent word-tearing for scalar read/writes, but it's unavoidable for
untyped bulk transfers.
It's impossible in general to prove a lack of data races, even using runtime
techniques. This is just another form of data race - the calling thread is
performing writes of undefined size and order throughout the entire transfer
interval, and if another thread reads the destination during that interval (ie
without synchronization protecting it from doing so) then it gets undefined
garbage values. Even if we adopted the Cray proposal and forced strict
operations to terminate the transfer interval, you can still easily construct
data races where unsynchronized threads read garbage from the destination
before the caller issues a strict op. This potential for data races is a
fundamental property of UPC 1.2 (and most shared-memory programming), and
something the programmer must already keenly avoid. The only difference with
the new library is the destination buffer contains undefined values for the
several lines of code until the sync, rather than the duration of the single
line of code of the blocking upc_mem*.
I think we're just avoiding the core difference here, which is when the
transfer interval ends. Arguments about races and consistency apply equally to
conflicting unsynchronized accesses in the transfer interval regardless of
where it ends. Berkeley wants the transfer interval to terminate only at the
explicit sync. Cray wants it to additionally terminate at any earlier random
strict operation issued by the calling thread during the transfer interval.
Performance and modularity concerns both argue for the Berkeley version. The
Cray version seems to primarily be motivated by perverse programs that issue an
explicitly async transfer and then want to read guaranteed results before it
has been explicitly synced. I have yet to see a realistic example of a program
that needs this additional guarantee, that cannot be rewritten to explicitly
sync the transfer before the synchronizing strict op (or use a signalling
memput instead, for that usage case). The secondary Cray motivation seems to be
programs that "accidentally" violate the contract of the library interface and
want stronger semantics to aid in debugging, but Berkeley is not willing to
sacrifice the performance of correctly-synchronized production codes to
accommodate the behavior of insufficiently-synchronized buggy programs.
Original comment by danbonachea
on 16 Aug 2012 at 10:29
"It suffers from the same performance pitfall as the original proposal - namely
that random strict operations issued by the caller for unrelated computations
during the transfer interval globally destroy the overlap potential."
I think you're misunderstanding what I was proposing, which is probably my
fault ;). I'm proposing that the relaxed writes involved in a non-blocking
transfer may occur at any time between the initialization and sync, but that
all threads must agree on when any given individual relaxed write occurs
relative to fences on the calling thread. The writes are still completely
independent and of unspecified size, so all this really says is that if you
observe the final result at a particular location before a fence, then observe
the fence, you are guaranteed that any relaxed writes you issue TO THAT
LOCATION after observing the fence are "correctly" ordered on all other
threads, even if the sync hasn't actually occurred yet. It doesn't
tell you anything about any other memory locations involved, as they are
independent of one another.
"Disagree again. Even in plain 1.2, any program which contains data races can
result in reading indeterminate values. Every read-after-write data race
between threads (ie without intervening synchronization) results in
indeterminate values being read."
UPC still permits valid programs to contain data races though. Our point of
contention was that we understood your proposal to say that programs with data
races on memory involved in non-blocking transfers are not valid UPC programs.
We have no problems with the results of the race being indeterminate, but we do
not want to say that such programs are no longer UPC programs merely because of
the race condition.
"Even if we adopted the Cray proposal and forced strict operations to terminate
the transfer interval"
This is completely misunderstanding what I was proposing. All I intended was
that once a thread (other than the calling thread) observes the final result at
a particular memory location AND a fence on the calling thread that provably
occurred after it, then all threads are guaranteed to observe the final result at
that memory location IF they have also observed that fence.
Original comment by sdvor...@cray.com
on 17 Aug 2012 at 3:38
"I'm proposing that the relaxed writes involved in a non-blocking transfer may
occur at any time between the initialization and sync, but that all threads
must agree on when any given individual relaxed write occurs relative to fences
on the calling thread.... if you observe the final result at a particular
location before a fence, then observe the fence, you are guaranteed that
any relaxed writes you issue TO THAT LOCATION after observing the fence are
"correctly" ordered on all other threads"
There are several implicit assumptions in your statements above that I think
are problematic:
1. It assumes the initial and final values of the given memory location differ,
so there's a change to notice at all. I think you'd agree that when they are
the same value, no conclusion can be drawn.
2. It assumes lack of word-tearing, which can easily occur between the
byte-oriented library writes and any reads larger than a byte. Consider this
case of a 4-byte memory "location" that appears somewhere in the middle of the
destination buffer:
Byte address: 0 1 2 3
initial value (hex): 00 00 00 00
final value (hex): 00 77 00 00
If a conflicting 4-byte read from a remote thread gets back "0x00770000", that
LOOKS like the final value has been written, but the library may in fact still
be writing the last two bytes (as a separate "relaxed write of unspecified
size"). If the thread mistakenly assumes the transfer is "done" and issues a
concurrent write to the word, then the value it writes may later get clobbered
when the library writes the second half of the bytes.
3. It implicitly assumes that any given byte in the destination buffer is only
written once, and that there aren't multiple "relaxed writes of unspecified
size and order" that affect the same byte.
4. Finally, it assumes the destination bytes are only written with the "new"
data and don't temporarily hold "something else entirely".
I see no reason to restrict implementations to require 3 and 4, simply to
accommodate "perverse" programs that want to try and infer transfer completion
using data races, rather than using the explicit sync call and proper,
guaranteed-correct synchronization. We have a real example of at least one
network that can accelerate transfers by using the destination buffer as
"scratch space" for storing temporary metadata while the transfer is in-flight.
Looking forward to the non-blocking collectives, it will likely be standard
practice to utilize destination buffers as scratch space during the transfer
interval, and programs attempting the type of inference you describe by peeking
at the destination will do so at their own peril. As mentioned in an earlier
comment, we should at least attempt to devise a general semantic philosophy
that will also make sense for the closely-related non-blocking collectives.
For all these reasons I continue to argue that the contents of the destination
buffer should remain fully and truly undefined until after the successful
explicit upc_sync (and for other threads, until they've used strict operations
or other provided synchronization features to guarantee upc_sync has returned
on the caller). The same goes for the restriction that the source buffer remain
unmodified until the transfer is upc_sync'd. Bulk synchronous programs will
probably accomplish such synchronization trivially at the next barrier.
Programs that need more fine-grained point-to-point data delivery with
synchronization should be using the signalling put library. We should stop
worrying about the behavior of programs that read and act upon undefined values.
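The recommended discipline can be sketched in UPC as follows (assuming the proposal's upc_memput_nb/upc_sync names; the flag-based signalling here is purely illustrative — a signalling put would fold the data and the flag into one operation):

```c
/* UPC sketch (fragment): guaranteed-correct delivery under the proposal. */
strict shared int ready = 0;  /* illustrative completion flag */

/* on the initiating thread: */
upc_handle_t h = upc_memput_nb(dst, src, nbytes); /* hypothetical nb put */
upc_sync(h);   /* destination contents are defined only after this returns */
ready = 1;     /* strict write issued AFTER the sync */

/* on a consuming thread: */
while (!ready) ;  /* strict reads observe the initiator's fence */
/* relaxed reads of dst are now guaranteed to see the transferred values */
```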
"We have no problems with the results of the race being indeterminate, but we
do not want to say that such programs are no longer UPC programs merely because
of the race condition."
Programs that read undefined values and then act upon them in any non-trivial
way will exhibit undefined behavior (garbage in, garbage out). When we're
dealing with untyped bulk library transfers racing with conflicting
unsynchronized scalar accesses, the indeterminacy goes way beyond
"initial-value or final-value", the contents of the destination buffer are
completely undefined.
For those who like analogies, my best "simple analogy" is a program which does
something like:
char *garbage = malloc(20);
printf("%s",garbage);
The contents of that uninitialized memory are technically "indeterminate" (ISO
C99 7.20.3.3), but many/most interesting actions consuming those values will
lead to undefined behavior (in this case, anything from zero output to a
segmentation fault).
Original comment by danbonachea
on 17 Aug 2012 at 6:15
You're still misunderstanding my intention. When I said observed the final
value, I meant the final value, not some intermediate value that happens to be
the same. More formally, all I'm proposing is that the set StrictOnThreads(M),
which enforces the ordering of relaxed operations issued by a thread relative
to strict operations on that thread, must be globally consistent and complete,
just like it currently is, rather than allowing threads to define it
differently or excluding some pairs from it. Thus, from the perspective of the
memory model, for each relaxed operation involved in the transfer Rop_nb, for
every strict operation Sop_ct on the calling thread between the initiation and
the sync, either (Rop_nb, Sop_ct) is a member of StrictOnThreads(M) OR (Sop_ct,
Rop_nb) is a member of StrictOnThreads(M) and it is undefined which is (as long
as it is consistent with the other pairs in StrictOnThreads(M)).
"Programs that read undefined values and then act upon them in any non-trivial
way will exhibit undefined behavior (garbage in, garbage out). When we're
dealing with untyped bulk library transfers racing with conflicting
unsynchronized scalar accesses, the indeterminacy goes way beyond
"initial-value or final-value", the contents of the destination buffer are
completely undefined."
Yes, we agree on this completely. All we're asking is that it not be ERRONEOUS
for UPC programs to access the memory before the sync as you said in comment
30, but rather that it results in undefined values read/written (note, not
completely undefined behavior). If those undefined values are used later in
the program, as in your printf example, it may or may not lead to undefined
behavior, depending on how they are later used.
Original comment by sdvor...@cray.com
on 17 Aug 2012 at 1:41
"from the perspective of the memory model, for each relaxed operation involved
in the transfer Rop_nb, for every strict operation Sop_ct on the calling thread
between the initiation and the sync, either (Rop_nb, Sop_ct) is a member of
StrictOnThreads(M) OR (Sop_ct, Rop_nb) is a member of StrictOnThreads(M) and it
is undefined which is (as long as it is consistent with the other pairs in
StrictOnThreads(M))."
I agree with the statement above.
However, because the relaxed operations that comprise Rop_nb are of unspecified
size (leading to word tearing), can occur anywhere in the transfer interval,
might overlap or conflict with each other (writing the same location multiple
times), or even temporarily write a completely unrelated value to the
destination, there is effectively no useful property that can be inferred by
reading the undefined values in the destination buffer during the transfer
interval.
I want to avoid giving programmers the impression they can accomplish anything
useful by reading the undefined values during the transfer interval. I'll admit
my language in comment 30 is a bit too strong - the implementation shouldn't
crash the program for simply issuing these "sketchy" reads (or otherwise
invalidate the unrelated parts of the program), but the observed values are
effectively meaningless. That being said, I'd like to specify that conflicting
writes in the transfer interval that overwrite the destination buffer lead to
"undefined behavior" - not only "undefined results" in the destination, because
implementations that use it as scratch space might easily crash or do other
nasty things if their metadata is suddenly overwritten by random code while
the library "owns" the buffer.
The only point in specifying the transfer in terms of relaxed ops is to provide
well-defined semantics to code OUTSIDE the transfer interval; so the calling
thread is guaranteed to observe updated values via relaxed reads immediately
after returning from upc_sync, and if it signals a second thread using a strict
write issued AFTER the upc_sync, that thread can observe the fence using a
strict read and subsequently also see guaranteed updated values in the
destination via relaxed reads. Converse statements can also be made wrt relaxed
writes of the source before the initiation.
Original comment by danbonachea
on 17 Aug 2012 at 3:08
"However, because the relaxed operations that comprise Rop_nb are of
unspecified size (leading to word tearing), can occur anywhere in the transfer
interval, might overlap or conflict with each other (writing the same location
multiple times), or even temporarily write a completely unrelated value to the
destination, there is effectively no useful property that can be inferred by
reading the undefined values in the destination buffer during the transfer
interval."
Exactly. All I was trying to get at in comment 82 was that StrictOnThreads(M)
should still be globally consistent and complete (and that we'll probably have
to patch up the definition of Precedes() to work this in). I admittedly wasn't
very clear about that, but at the time, I was still going through the memory
model trying to figure out exactly how to say what I meant. B.3.2.1 would
still prevent the programmer from making use of this fact in any meaningful
way, since as you pointed out in comment 88, they could observe an intermediate
value that happens to match the final value.
Original comment by sdvor...@cray.com
on 17 Aug 2012 at 3:24
"That being said, I'd like to specify that conflicting writes in the transfer
interval that overwrite the destination buffer lead to "undefined behavior" -
not only "undefined results" in the destination, because implementations that
use it as scratch space might easily crash or do other nasty things if their
metadata is suddenly overwritten by random code while the library "owns" the
buffer."
Consider the following program:
#include <upc.h>
relaxed shared int array[ELEMS_PER_THREAD*THREADS];
strict shared int phase = 0;
int main()
{
    upc_forall( int i=0; i<ELEMS_PER_THREAD*THREADS; i++; &array[i] )
        array[i] = MYTHREAD;
    upc_barrier;
    upc_memcpy( &array[1], &array[MYTHREAD], ELEMS_PER_THREAD * sizeof(int) );
    upc_barrier;
    return 0;
}
This is a legal UPC program, even if it is completely useless, and has no
undefined behavior. However, the final values of the elements of array with
affinity to thread 1 are undefined because the relaxed operations involved in
the upc_memcpy() are of indeterminate size and could be arbitrarily reordered.
If I change the upc_memcpy() to a non-blocking form with an immediate sync:
#include <upc.h>
relaxed shared int array[ELEMS_PER_THREAD*THREADS];
strict shared int phase = 0;
int main()
{
    upc_forall( int i=0; i<ELEMS_PER_THREAD*THREADS; i++; &array[i] )
        array[i] = MYTHREAD;
    upc_barrier;
    upc_handle_t handle = upc_memcpy_nb( &array[1], &array[MYTHREAD], ELEMS_PER_THREAD * sizeof(int) );
    upc_sync( handle );
    upc_barrier;
    return 0;
}
It should be equivalent to the first program. With your undefined behavior
clause though, this program is not equivalent to the first, because it now has
undefined behavior and could crash or initiate global thermonuclear war(*).
That is unacceptable.
* http://www.imdb.com/title/tt0086567/
Original comment by sdvor...@cray.com
on 17 Aug 2012 at 5:37
"It should be equivalent to the first program. With your undefined behavior
clause though, this program is not equivalent to the first, because it now has
undefined behavior and could crash"
I agree this would be an additional small usage restriction, and figured you
probably wouldn't agree (which is why I prefaced it with "I'd LIKE to
specify"). If we were to add this restriction on conflicting, unsynchronized
writes to the destination it would make sense to also add a similar restriction
to the blocking variants for the same reason, but I'm guessing this will be a
hard sell.
This corner case unfortunately probably prohibits implementations that use the
destination buffer for scratch space, unless they are Very Careful to prevent
corrupted metadata from spreading garbage beyond the destination buffer. But
this is not game-breaking for us; I think we can agree to work without that
restriction. The main reason I brought it up was to expose another way in which
the destination buffer contents are completely undefined during the transfer
interval.
Original comment by danbonachea
on 17 Aug 2012 at 5:47
All "brand new" library proposals are targeted for starting in the "Optional"
library document. Promotion to the "Required" document comes later after at
least 6 months residence in the ratified Optional document, and other
conditions described in the Appendix A spec process.
Original comment by danbonachea
on 17 Aug 2012 at 5:53
Set default Consensus to "Low".
Original comment by gary.funck
on 19 Aug 2012 at 11:26
Official change proposal mailed to the lists on 10/4/2012.
Description:
------------------
Attached is an updated proposal for the non-blocking library spec, also
committed in the proposal area of SVN. I believe it represents our current
semantic consensus, based on the issue thread and my telecon notes. I've spent
many hours polishing the verbiage and I believe it is complete, but it's
entirely possible I've missed something or left an ambiguity - constructive
comments are welcome. I fully expect it will undergo minor revisions before
being merged into the optional library document, but in the interests of
expediency and in accordance with our official procedures I'm starting the
mandatory 4 week comment period now.
Some highlights:
* The bulk of the semantics are specified under the common requirement section,
to avoid tedious duplication. It defines several important terms used
throughout. Each function section merely includes semantics specific to that
function and a reference to the common requirements.
* I've used the term "transfer" throughout to denote the non-blocking
operations, since "copy" could be misread to exclude put and get, and
upc_memset is in no way a "copy".
* There are currently two code examples - providing an example for every
function seemed like overkill and didn't add anything useful. This is
consistent with library examples in the C99 spec (which are somewhat rare),
e.g. see fprintf and fscanf.
* Tweaked the spelling of the header file (upc_nb.h) and the feature macro
(__UPC_NB__)
* Section organization and synopses updated to match the core lib spec.
Original comment by danbonachea
on 4 Oct 2012 at 10:43
Attachments:
Thanks for all of the hard work that went into this proposal. It looks great!
Here's some feedback:
Pg. 5, #10: The semantics with respect to source memory should also be
specified. It seems like the intent is to allow concurrent reads from source
memory -- for comparison concurrent reads from source memory are erroneous in
MPI (e.g. for Isend/Irecv or RMA operations).
Pg. 5, #12: Still trying to process this paragraph.
Pg. 6, #4: Suggest s/shall/must/ to strengthen this statement.
Pg. 6, #4: What happens if a UPC thread exits (e.g. returns from main) without
completing explicit or implicit transfers? Given the current text, this is an
error and anything can happen (hang, crash, catch fire, etc).
Pg. 11, 7.4.5.3, #2: Args are missing in upc_memput_nbi in this paragraph.
Should be upc_memput_nbi(dst, src, n).
Sec. 7.4.6: The handle is an IN parameter. It might be helpful if this handle
were INOUT and were reset to UPC_COMPLETE_HANDLE once it has been synchronized
(since the old value is invalid). This is helpful if the user is trying to
complete an array of handles or is polling on a handle location that gets
reused. The user can, of course, implement this behavior with the current
interface, so this is not a strong suggestion.
Sec. 7.4.6: Error handling -- what happens if the user passes an invalid
handle? If we return a complete/incomplete flag, there's no opportunity to
return an error. Have we adequately defined what happens if the handle is
invalid?
Sec. 7.4.7.1/2 #4: This statement suggests that implicit handle synchronization
functions are forbidden from progressing explicit handle operations; I don't
think this is the intended meaning. A clearer statement might be:
The upc_synci(_attempt) function does not complete explicit-handle transfers.
Original comment by james.di...@gmail.com
on 5 Oct 2012 at 9:18
I'm leery of 7.4.2.10 and 7.4.2.11. These to me seem to conflict with the
relationship of the non-blocking routines to the existing routines in the
event that a remote thread attempts to access memory involved in the transfer
during the transfer interval and there are no fences in the transfer interval.
I thought we'd agreed that we would simply note that the relaxed operations
making up the transfer (as defined in B.3.2.1 of the 1.2 spec) occur at an
unspecified time in the transfer interval, specifically so that the
non-blocking routines would match the existing routines' behavior in this case.
Original comment by sdvor...@cray.com
on 5 Oct 2012 at 9:36
Responses to Jim and Steve:
"Pg. 5, #10: The semantics with respect to source memory should also be
specified. It seems like the intent is to allow concurrent reads from source
memory -- for comparison concurrent reads from source memory are erroneous in
MPI (e.g. for Isend/Irecv or RMA operations)."
Regarding MPI, their stated rationale for prohibiting concurrent reads of
source memory is to support RDMA on non cache-coherent systems, which I believe
are firmly outside the scope of UPC.
The source buffer is const-qualified, and concurrent reads of it are implicitly
permitted. I'm adding the following paragraph to explicitly state this for
clarity, and also clarify what happens when multiple concurrent transfers
overlap in memory:
\np The source memory specified in a transfer is not modified by the transfer.
Concurrent reads of source memory areas by any thread are permitted and behave as usual.
Multiple concurrent transfers initiated by any thread are permitted to specify overlapping source memory areas.
If a transfer specifies destination memory which overlaps its own source, or the source or destination of a
concurrent transfer initiated by any thread, the resulting values in all destination memory specified
by the affected transfers are indeterminate.
"Pg. 6, #4: Suggest s/shall/must/ to strengthen this statement."
This sentence appears very early in the semantic descriptions while definitions
are still being established.
I intentionally prefaced the sentence with "Generally" and did not use "shall",
because it's not a binding restriction - specifically in the case when the
explicit-handle initiation returns UPC_COMPLETE_HANDLE, the operation is
already complete and no sync call is required. However this is an unusual
corner case and I wanted to provide a conceptual overview paragraph to
familiarize the reader with the broad form of the interface, unclouded by such
corner-cases, before getting into the actual nitty-gritty of requirements.
"Pg. 6, #4: What happens if a UPC thread exits (e.g. returns from main) without
completing explicit or implicit transfers? Given the current text, this is an
error and anything can happen (hang, crash, catch fire, etc)."
This is deliberate. As an implementor I don't have a strong feeling about this,
but I wanted to allow other implementations freedom on this point. I can
imagine implementations on loosely-coupled systems that might produce an ugly
error if an RDMA packet arrives for a non-existing process. This is analogous
to MPI_Finalize, probably for similar reasons. I think the alternative would be
to augment exit() and friends with clauses similar to the current ones about
flushing all open file streams, but that seems like it might be nasty to
implement and could impact the performance of the common case (which I would
consider unacceptable for this library).
"Pg. 11, 7.4.5.3, #2: Args are missing in upc_memput_nbi in this paragraph.
Should be upc_memput_nbi(dst, src, n)."
Fixed - thanks for catching that!
"The handle is an IN parameter. It might be helpful if this handle were INOUT
and were reset to UPC_COMPETE_HANDLE once it has been synchronized (since the
old value is invalid). This is helpful if the user is trying to complete an
array of handles or is polling on a handle location that gets reused. The user
can, of course, implement this behavior with the current interface, so this is
not a strong suggestion."
We've gone back and forth on this detail in various drafts. As you say the user
can easily write a wrapper to provide one interface in terms of the other, so
it's mostly a matter of taste. I prefer this interface for reasons of
uniformity - the init functions return the handle by value, so the sync
function also takes it by value. This is directly analogous to the by-value
interface of malloc() and free() (and other UPC interfaces that mimic those).
The C99 libraries seem to generally favor by-value calling conventions for
types that are logically values, and there are very few examples of
by-reference interfaces for logical values (excluding types like FILE and
strings that are naturally pointers). If we ever add functions for syncing an
entire array of handles as a group, we would use by-reference calling
conventions - that's one of the enhancements we're considering for a future
version of this interface.
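For instance, the INOUT behavior Jim describes can be layered on top with a two-line wrapper (a sketch, assuming only the proposal's upc_sync and UPC_COMPLETE_HANDLE):

```c
/* Sync a transfer, then invalidate the caller's handle slot so that
 * polling loops over reused slots see a well-defined "complete" value. */
void upc_sync_and_reset(upc_handle_t *h) {
    upc_sync(*h);             /* blocks until the transfer completes */
    *h = UPC_COMPLETE_HANDLE; /* the old handle value is no longer valid */
}
```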
"what happens if the user passes an invalid handle? If we return a
complete/incomplete flag, there's no opportunity to return an error. Have we
adequately defined what happens if the handle is invalid?"
This is a violation of a "shall" requirement, which by C99 Sec. 4 means
behavior is undefined. That in turn implies by C99 3.4.3 that the
implementation is permitted to do anything from ignoring it, to crashing the
program and launching the nukes.
"The upc_synci(_attempt) function does not complete explicit-handle transfers."
I like this suggested wording and will add it.
Steve wrote:
"I'm leery of 7.4.2.10 and 7.4.2.11. These to me seem to conflict with the
relationship of the non-blocking routines to the existing routines in the
event that a remote thread attempts to access memory involved in the transfer
during the transfer interval and there are no fences in the transfer interval.
I thought we'd agreed that we would simply note that the relaxed operations
making up the transfer (as defined in B.3.2.1 of the 1.2 spec) occur at an
unspecified time in the transfer interval specifically so that the non-blocking
routines would match the existing routines' behavior in this case."
I don't understand your objection. If the source or destination is modified by
any thread during the non-blocking transfer interval, the resulting value in
the destination is indeterminate. Similarly, reads of the destination memory
during the transfer interval will get indeterminate results until the operation
is synchronized. Note this is stronger than "undefined behavior" - the
implementation is not permitted to crash if the program violates these rules,
but the result is garbage. In comments 88-90 I believe we agreed upon these
properties, which are also true of every non-blocking interface I've ever read
and should be familiar to advanced users. 7.4.2.10-11 state these properties
directly and unambiguously, rather than leaving it to be inferred by the
reader. Unlike the blocking interface, it's very easy to write a program with
the nb library that violates these rules, and I want to make it very clear that
such a program gets indeterminate results. The reader does not need a deep
understanding of the memory model and word tearing to understand the
straightforward and important restrictions stated in 7.4.2.10-11. These
paragraphs make it unambiguously clear that programs like the one in comment 83
read undefined values when they violate the rules.
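A short sketch of the kind of rule-violating program 7.4.2.10-11 are aimed at, again assuming the proposed explicit-handle names (upc_memget_nb, upc_handle_t, upc_sync); both accesses below conflict with the in-flight transfer:

```
/* Sketch only: local_buf is the destination of an in-flight get. */
upc_handle_t h = upc_memget_nb(local_buf, remote_src, nbytes);

local_buf[0] = 42;     /* write to the destination inside the transfer
                          interval: destination contents become
                          indeterminate (7.4.2.10) */
int x = local_buf[1];  /* read during the interval: an indeterminate
                          value, but the implementation may not crash
                          (7.4.2.11) */
upc_sync(h);           /* even after sync, the contents remain garbage
                          because of the earlier conflicting write */
```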
Paragraph 12 defines the effects of the transfer interval as a whole with
respect to surrounding operations that are guaranteed to be ordered outside the
interval. Specifically, it explains that on-thread data dependencies are
preserved for conflicting relaxed accesses issued by the initiating thread
before the init or after the sync. Also, it specifies that the effects of the
transfer interval as a whole behave like relaxed accesses when there are strict
operations outside the transfer interval that synchronize with other threads
that then touch the buffers. Realistically, this paragraph is mostly targeted at
compiler/optimizer writers, showing how the transfers fit into the memory model
to prevent ambiguities. Footnote 2 summarizes the most important implications
of this paragraph for end users.
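The on-thread ordering guarantee described above can be sketched as follows (assuming the same proposed names; accesses issued by the initiating thread strictly before the init or strictly after the sync are ordered outside the transfer interval):

```
/* Sketch only: the initiating thread's own dependencies are preserved. */
src[0] = 1;                               /* issued before the init:
                                             ordered before the transfer,
                                             so the transfer sees it */
upc_handle_t h = upc_memput_nb(dst, src, nbytes);
upc_sync(h);                              /* closes the transfer interval */
int v = dst[0];                           /* issued after the sync: reads
                                             the transferred data, and the
                                             interval as a whole behaves
                                             like relaxed accesses w.r.t.
                                             surrounding strict operations */
```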
Original comment by danbonachea
on 6 Oct 2012 at 7:11
"I don't understand your objection."
My objection is merely that there are no equivalent statements to 7.4.2.10 and
7.4.2.11 for the "blocking" library routines. Instead, this behavior is
described in the memory model. I therefore believe these statements are
unnecessary for the non-blocking routines, and that the memory model
(specifically B.3.2.1) should simply be updated to note that relaxed operations
that make up a non-blocking routine occur at an unspecified time during the
transfer interval. Otherwise, it may appear that the non-blocking routines
have stronger restrictions than the "blocking" routines do.
Original comment by sdvor...@cray.com
on 8 Oct 2012 at 10:23
Original issue reported on code.google.com by
yzh...@lbl.gov
on 22 May 2012 at 11:41