Intrepid / upc-specification

Automatically exported from code.google.com/p/upc-specification

Library: non-blocking memory copy extensions #41

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This is to log the UPC non-blocking memory copy library extensions.

For more information, please see
https://sites.google.com/a/lbl.gov/upc-proposals/extending-the-upc-memory-copy-library-functions

Original issue reported on code.google.com by yzh...@lbl.gov on 22 May 2012 at 11:41

GoogleCodeExporter commented 9 years ago
Alternatively, we could simply make 7.4.2.10 and 7.4.2.11 apply to the existing 
"blocking" routines as well.

Original comment by sdvor...@cray.com on 8 Oct 2012 at 10:28

GoogleCodeExporter commented 9 years ago
"My objection is merely that there are no equivalent statements to 7.4.2.10 and 
7.4.2.11 for the "blocking" library routines...Alternatively, we could simply 
make 7.4.2.10 and 7.4.2.11 apply to the existing "blocking" routines as well."

I don't object to adding some clarifying paragraphs to B.3.2.1, however I think 
it's important that these properties are directly stated in the nb library 
section. In the blocking case, there are by definition no conflicting accesses 
from the initiating thread, which automatically eliminates the easiest way for 
a programmer to "mess up". Programmers familiar with shared-memory programming 
already understand that synchronization is required when multiple threads touch 
the same data, so the data races that can arise when using the blocking library 
should be less surprising. Non-blocking transfers introduce new ways you can 
create a subtle data race and end up with indeterminate values, so I think it 
makes sense to be very clear about when that occurs. 

Also, as Jim rightly pointed out, it's worth clarifying that concurrent reads of 
source memory are permitted, since MPI's NB transfers notably prohibit that. 
Together these paragraphs neatly summarize the conditions under which 
conflicting operations are permitted and when they lead to indeterminate 
values. This provides all the information needed by the average user of this 
library, who will not need to consult the memory model and puzzle out the 
implications to decide if his program is correct.

Original comment by danbonachea on 8 Oct 2012 at 11:49
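
The distinction drawn above can be made concrete with a short sketch. This is illustrative only: it assumes the proposed explicit-handle interface (upc_memput_nb, upc_sync from <upc_nb.h>) and requires a UPC 1.3-style compiler; the array names and sizes are made up for the example.

```c
/* Illustrative sketch -- requires a UPC compiler providing <upc_nb.h>;
   buffer names and sizes are hypothetical. */
#include <upc.h>
#include <upc_nb.h>

#define N 1024
shared [N] double dst[N * THREADS];   /* one block of N per thread */
double src[N];                        /* private source buffer */

void demo(void) {
    /* Initiate a non-blocking put to the next thread's block;
       the transfer interval begins here. */
    upc_handle_t h = upc_memput_nb(&dst[N * ((MYTHREAD + 1) % THREADS)],
                                   src, N * sizeof(double));

    double peek = src[0];  /* OK: concurrent reads of the source buffer
                              are permitted inside the transfer interval */
    /* src[0] = 42.0; */   /* ERROR: a conflicting write inside the transfer
                              interval would make the transferred values
                              indeterminate -- the subtle race described above */

    upc_sync(h);           /* ends the transfer interval */
    src[0] = 42.0;         /* OK: the interval is over; src may be reused */
    (void)peek;
}
```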

GoogleCodeExporter commented 9 years ago
I conferred with Pavan and he confirms my recollection that the MPI standard 
changed w.r.t. ISEND buffer reads.  In MPI-1 and MPI-2.0, the user was not 
allowed to touch the ISEND buffer before the request was completed.  However, 
because many users violated this prohibition and no implementation actually 
changed the send buffer before the request completed, MPI-2.1 and later 
standards dropped it; it is now instead a restriction on the implementation, 
which must not modify the ISEND buffer before the request is completed.

Jim is, of course, still correct about users' expectations based upon MPI-1, 
which is obviously the version that is most widely known.

Original comment by jeff.science@gmail.com on 9 Oct 2012 at 2:35

GoogleCodeExporter commented 9 years ago
Responding to comment 99, from Dan:

--quote--

"Pg. 6, #4: Suggest s/shall/must/ to strengthen this statement."

This sentence appears very early in the semantic descriptions while definitions 
are still being established.
I intentionally prefaced the sentence with "Generally" and did not use "shall", 
because it's not a binding restriction - specifically in the case when the 
explicit-handle initiation returns UPC_COMPLETE_HANDLE, the operation is 
already complete and no sync call is required. However this is an unusual 
corner case and I wanted to provide a conceptual overview paragraph to 
familiarize the reader with the broad form of the interface, unclouded by such 
corner-cases, before getting into the actual nitty-gritty of requirements.

--quote--

I think we might not have lined up on the text to which I was referring.  I was 
looking at 7.4.3 #4 and I think you were looking at 7.4.2 #4.  Shall is proper 
legalese in 7.4.3 #4, you should ignore my suggestion.  7.4.2 #4 looks fine.

Original comment by james.di...@gmail.com on 9 Oct 2012 at 3:45
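
The UPC_COMPLETE_HANDLE corner case quoted above might look like this in code. A sketch only, assuming the proposed <upc_nb.h> interface (upc_memget_nb, upc_sync, UPC_COMPLETE_HANDLE) and a UPC 1.3-style compiler; variable names are hypothetical.

```c
/* Sketch of the corner case: if initiation returns UPC_COMPLETE_HANDLE,
   the operation is already complete and no sync call is required.
   local_buf, remote_src, and nbytes are illustrative names. */
upc_handle_t h = upc_memget_nb(local_buf, remote_src, nbytes);
if (h != UPC_COMPLETE_HANDLE) {
    upc_sync(h);   /* the usual case: wait for completion */
}
/* Either way, local_buf now holds the transferred data. */
```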

GoogleCodeExporter commented 9 years ago
"I don't object to adding some clarifying paragraphs to B.3.2.1, however I 
think it's important that these properties are directly stated in the nb 
library section."

I just want it to be clear that the blocking and non-blocking routines have 
exactly the same semantics regarding remote threads touching the buffers during 
the transfer interval.  Perhaps a footnote could be added to these paragraphs 
indicating that this is a direct consequence of the memory model (B.3.2.1) that 
also applies to the blocking routines, but is explicitly called out here 
because of the split nature of the transfer interval?

"In the blocking case, there are by definition no conflicting accesses from the 
initiating thread, which automatically eliminates the easiest way for a 
programmer to "mess up"."

This is only true if there is no threading layer (OpenMP, OpenACC, pthreads, 
etc) underneath UPC threads.  While that is outside the scope of the UPC spec, 
it is important to keep in mind as mixing programming models is quite common in 
HPC.

Original comment by sdvor...@cray.com on 9 Oct 2012 at 3:17

GoogleCodeExporter commented 9 years ago
"This is only true if there is no threading layer (OpenMP, OpenACC, pthreads, 
etc) underneath UPC threads.  While that is outside the scope of the UPC spec, 
it is important to keep in mind as mixing programming models is quite common in 
HPC."

I agree with this.  I think it's important to allow UPC threads, which may be 
mapped to OS processes, to interoperate nicely with OS threads (e.g., pthreads) 
whenever possible.  We have several applications using UPC+OpenMP/Pthreads, 
which is the most scalable way to use a NUMA multi-core cluster in our 
experiments so far. 

Original comment by yzh...@lbl.gov on 9 Oct 2012 at 4:47

GoogleCodeExporter commented 9 years ago
I've not heard anyone in HPC talk about OpenMP or OpenACC as compilation 
targets, except perhaps from DSLs.  However, I think more explicit APIs like 
Pthreads and OpenCL are relevant.  It may also be prudent to think about 
user-level threads, e.g. Qthreads, as possible back-end components for UPC.  
Does Kyle Wheeler follow the UPC spec discussion?

Original comment by jeff.science@gmail.com on 11 Oct 2012 at 1:32

GoogleCodeExporter commented 9 years ago
Just added the footnote suggested by Steve in comment 105, as SVN r174:

--- upc-lib-nb-mem-ops.tex      (revision 173)
+++ upc-lib-nb-mem-ops.tex      (working copy)
@@ -131,7 +131,10 @@
 performed by a set of relaxed shared reads and relaxed shared writes of
 unspecified size and order, issued at unspecified times anywhere within the transfer
 interval by the initiating thread. Conflicting accesses {\em inside} the transfer interval
-have undefined results, as specified in the preceding paragraphs.  
+have undefined results, as specified in the preceding paragraphs.~%
+\footnote{The restrictions described in the three preceding paragraphs are a direct consequence of
+[UPC Language Specifications, Section B.3.2.1], and also apply to the blocking \memstar functions.
+They are explicitly stated here for clarity.}
 Here {\em inside} and {\em outside} are defined by the {\tt Precedes()} program order for
 accesses issued by the initiating thread; accesses issued by other threads are considered {\em inside}
 unless every possible and valid $<_{strict}$ relationship orders them outside the transfer interval.~%

Original comment by danbonachea on 18 Oct 2012 at 9:57

GoogleCodeExporter commented 9 years ago
FYI, in MPI-3 one-sided communication, non-blocking puts and gets do not pass 
synchronization points.

Quoted from mpi30-report.pdf from the MPI Forum 
(www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf)

Page 431, Line 9-14:

"The end of the epoch, or explicit bulk synchronization using
MPI_WIN_FLUSH, MPI_WIN_FLUSH_ALL, MPI_WIN_FLUSH_LOCAL, or 
MPI_WIN_FLUSH_LOCAL_ALL, also indicates completion of the RMA operations. How- 
ever, users must still wait or test on the request handle to allow the MPI 
implementation to clean up any resources associated with these requests; in 
such cases the wait operation will complete locally. "

For comparison, MPI_Win_flush_all is roughly the same as upc_fence, and 
MPI_Rput/MPI_Rget are the counterparts of upc_memput_nb/upc_memget_nb.

Original comment by yzh...@lbl.gov on 29 Nov 2012 at 5:02

GoogleCodeExporter commented 9 years ago
Re: Comment 109

It's worth adding that /all/ one-sided operations in MPI are non-blocking, and 
all outstanding operations are completed by passive target flush/lock 
operations at the target that is synchronized.  Request-generating operations 
(added in MPI-3) are not an exception; however, the user is still required to 
clean up the request object that was returned by MPI.

Original comment by james.di...@gmail.com on 29 Nov 2012 at 5:42

GoogleCodeExporter commented 9 years ago
"all outstanding operations are completed by passive target flush/lock 
operations" should say "can be completed by passive...".  Obviously, they can 
also be completed by active target operations.

Original comment by jeff.science@gmail.com on 29 Nov 2012 at 6:14

GoogleCodeExporter commented 9 years ago
Jim and Jeff: thanks for the clarification.
This means that MPI_Put and MPI_Get actually behave like upc_memput_nbi and 
upc_memget_nbi, which are non-blocking memcpy operations without explicit 
handles. 

Original comment by yzh...@lbl.gov on 29 Nov 2012 at 6:39
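
The analogy drawn above can be sketched as follows. This assumes the proposed implicit-handle ("nbi") interface from <upc_nb.h> (upc_memput_nbi, upc_memget_nbi, upc_synci) and a UPC 1.3-style compiler; the buffer names and sizes are hypothetical.

```c
/* Sketch of the analogy: implicit-handle (nbi) operations, like MPI_Put
   and MPI_Get, return no handle; a single bulk sync completes them all,
   much as MPI_Win_flush_all does.  dst_a/src_a/na etc. are illustrative. */
upc_memput_nbi(dst_a, src_a, na);  /* counterpart of MPI_Put */
upc_memget_nbi(dst_b, src_b, nb);  /* counterpart of MPI_Get */
upc_synci();                       /* completes ALL outstanding implicit-
                                      handle operations by this thread */
```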

GoogleCodeExporter commented 9 years ago
This PendingApproval change appeared in the SC12 Draft 3 release.
It was officially ratified at the 11/29 telecon.

Original comment by danbonachea on 29 Nov 2012 at 8:03