Intrepid / upc-specification

Automatically exported from code.google.com/p/upc-specification
0 stars 1 forks source link

Progress guarantee of upc_notify and upc_wait #4

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
UPC 1.2 6.6.1 defines upc_notify and upc_wait as collective statements, but does
not specify whether upc_notify is permitted to wait on other threads.  For
example, an implementation of upc_notify for a barrier tree may wait on all
child threads to check in before notifying the parent thread.  The main GWU
Unified Testing Suite contains a test named compound_test1 with the following
pattern:

upc_lock( shared_lock );
upc_notify;
upc_unlock( shared_lock );
upc_wait;

This code may deadlock with such a upc_notify implementation.  If the intent of
the specification is that upc_notify shall not wait for any reason, then that
point should be clarified so that users will be assured that this code is legal.

Original issue reported on code.google.com by johnson....@gmail.com on 13 Mar 2012 at 4:40

GoogleCodeExporter commented 9 years ago
My opinion is that this code should not be legal, as I can't think of a case 
where this serves a useful purpose. 

As for the fix, my suggestion is an explicit statement in section 3.7 that 
states implementations may presume all threads enter each collective operation 
before any must leave.  I think this fixes it for all collectives, not just 
upc_notify.  I would not object to a statement in upc_lock/upc_unlock that 
notes that because of this a lock/unlock pair may not contain a collective 
operation.

Any other views on this?

Original comment by w...@uuuuc.us on 16 Mar 2012 at 10:30

GoogleCodeExporter commented 9 years ago
I propose adding to the descriptions of upc_lock and upc_lock_attempt (7.2.4.5 
and 7.2.4.6 in UPC 1.2):

7  The calling thread may not enter a collective operation while holding a lock.

I propose modifying the collectives description (7.3 in UPC 1.2):

5  Collective functions may not be called between upc_notify and the 
corresponding upc_wait, or while holding a lock.

Original comment by johnson....@gmail.com on 20 Mar 2012 at 7:54

GoogleCodeExporter commented 9 years ago
Some of the comments in issue 12 
(http://code.google.com/p/upc-specification/issues/detail?id=12) are relevant 
here.

In general I think we need to "fix" this entirely within the definition of 
upc_notify/upc_wait and the unspecified synchronization that may occur within 
any operation our document specifies as "collective". Relying upon spec 
language added to upc_lock and friends just means the same issue arises when 
the user arranges for any other kind of inter-thread synchronization around a 
collective op (eg using strict operations).

Original comment by danbonachea on 15 Jun 2012 at 8:46

GoogleCodeExporter commented 9 years ago
OK. Same problem as in issue 12. We don't control the people who write things 
like non-blocking barriers, and therefore it is wise not to expose UPC to the 
vagaries of any particular implementation of barrier. 

With my IBM hat on, I would have liked to continue the crusade, and could have 
used a stricter UPC language standard as a stick to beat up our PAMI guys. But 
I accept that this will not happen everywhere. 

Original comment by ga10...@gmail.com on 15 Jun 2012 at 2:47

GoogleCodeExporter commented 9 years ago
Hoping for 1.3.  This issue keeps coming up for us.

Original comment by johnson....@gmail.com on 15 Jun 2012 at 5:59

GoogleCodeExporter commented 9 years ago
I believe this case is already covered by the definition of "collective", 
although in order to make it clearer I propose the following clarification to 
that definition:

--- upc-terms-and-defs.tex      (revision 80)
+++ upc-terms-and-defs.tex      (working copy)
@@ -87,14 +87,16 @@
 \sterm{collective}%
      constraint placed on some language
      operations which requires evaluation of such operations to be
-     matched\footnote{A collective operation need not provide any
+     matched across all threads.%
+     \footnote{A collective operation need not provide any
      actual synchronization between threads, unless otherwise noted.
      The collective requirement simply states a relative ordering property
      of calls to collective operations that must be maintained in the
      parallel execution trace for all executions of any legal program.
      Some implementations may include unspecified synchronization between
      threads within collective operations, but programs must not rely upon
-     such unspecified synchronization for correctness.} across all threads.
+     the presence or absence of
+     such unspecified synchronization for correctness.} 
      The behavior of collective operations is undefined unless all threads
      execute the same sequence of collective operations.

upc_notify is elsewhere defined as a "collective" operation, so the program in 
comment 0 is invalid because its correctness relies upon "the absence of such 
unspecified synchronization" in the upc_notify call, and hence the program may 
deadlock on some implementations (and that deadlock is the programmer's fault).

As a closing note, the spec does NOT prohibit holding a lock while executing a 
collective operation, because that would be too strong a condition. For 
example, this code similar to comment 0 holds a lock during upc_notify:

upc_lock_t *shared_lock = upc_all_lock_alloc();
if (MYTHREAD == 0) { upc_lock( shared_lock ); }
upc_notify;
if (MYTHREAD == 0) { upc_unlock( shared_lock ); }

this code will not deadlock because it does not rely on the absence of 
synchronization in upc_notify for correctness - only one thread is acquiring 
the lock before entering the collective, and there is no cycle of dependencies 
preventing other threads from reaching the collective call.

Original comment by danbonachea on 13 Aug 2012 at 2:55

GoogleCodeExporter commented 9 years ago
Here's a more realistic example of using a lock across a barrier, to 
dynamically "elect" a single representative thread who arrives at the barrier 
early to do some additional work:

while (1) {
  int first_thread = upc_lock_attempt( shared_lock );
  if (first_thread) {
    // I reached the barrier first, do some serialized work to prepare for the next phase
    // eg read in file data for the next iteration
  }
  upc_notify;
  upc_wait;
  if (first_thread) { upc_unlock( shared_lock ); }
  // some work
  upc_barrier;
  // some more work 
}

Original comment by danbonachea on 13 Aug 2012 at 3:14

GoogleCodeExporter commented 9 years ago
I am okay with the language in Comment #6 (as opposed to Comment #2) based on 
the rationale in Comment #3.

Original comment by johnson....@gmail.com on 13 Aug 2012 at 2:46

GoogleCodeExporter commented 9 years ago
Mass-edit to mark all issues with an officially-announced resolution diff as 
"PendingApproval"

Original comment by danbonachea on 17 Aug 2012 at 3:25

GoogleCodeExporter commented 9 years ago
May or may not be directly relevant, but during a meeting yesterday I 
confronted Pavan Balaji (him of MPI3 fame) with this exact problem - MPI 
formulation - and he was fairly sure that MPI3 would survive MPI barrier forced 
through a critical section.

IIRC Thorsten Hoefler, and the IBM MPI guys, said the exact opposite :)

Original comment by ga10...@gmail.com on 17 Aug 2012 at 5:56

GoogleCodeExporter commented 9 years ago
Set Consensus:High on Pending Approval issues.

Original comment by gary.funck on 19 Aug 2012 at 11:31

GoogleCodeExporter commented 9 years ago
Change proposed in comment #6 has been applied in SVN r103.

Original comment by danbonachea on 13 Sep 2012 at 10:25

GoogleCodeExporter commented 9 years ago
For the record:
Public comment period for this change was 8/14/2012 - 9/14/2012
No substantial objections were raised or recorded during that period.
At the 9/7/2012 telecon, it was announced this change was imminent, feedback 
was explicitly solicited, and none was received.
The change was integrated into a working draft distributed on 9/13/2012 for 
consideration and draft ratification at the 9/14/2012 telecon.

Original comment by danbonachea on 14 Sep 2012 at 7:26

GoogleCodeExporter commented 9 years ago
Changes distributed as Draft 1.1 were ratified during the 9/14/2012 
teleconference, for inclusion in the next public draft.

Original comment by danbonachea on 14 Sep 2012 at 9:46