Intrepid / upc-specification

Automatically exported from code.google.com/p/upc-specification

UPC progress guarantees #48

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What are the forward progress guarantees of UPC?

Background
==========

At the risk of oversimplifying the matter, let us say that there are two major 
ways in modern networks to handle incoming messages: explicit (polling) vs. 
implicit (interrupt-based, special progress threads, or anything else invisible 
to the programmer, acting behind the scenes).

MPI has, for its part, gathered ample evidence that polling-based progress is 
"good enough", even for things like MPI-2 one-sided communication.

On the other hand, most PGAS languages tacitly assume that one-sided operations 
are truly one-sided. The problem is one of implementation: even with today's 
modern networks, asynchronous progress guarantees are tricky.

So where does UPC stand?
=========================

Absolute progress guarantee: are we ready to say that UPC should be able to 
make steady forward progress while UPC thread X is engaged in a long-lasting 
computation and other threads are accessing data affine to X? What would 
implementors say?

upc_poll(): Berkeley UPC (and IBM xlUPC) provide the upc_poll() function that 
users can call explicitly to make progress. Is a UPC program allowed to 
deadlock/livelock due to the programmer's failure to call upc_poll() at the 
appropriate time? Should we look at upc_poll() as a tool for performance 
*optimization* or as a way to avoid embarrassing deadlocks on certain 
architectures?
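
For concreteness, a minimal sketch of the usage pattern in question (the loop 
body, function name, and poll interval are made up; upc_poll() is the vendor 
extension mentioned above, assumed to be declared by the implementation's 
headers):

  #include <upc.h>
  #include <stddef.h>

  /* Hypothetical long-running local computation on one thread; other threads
     may be accessing data with affinity to this thread the whole time. */
  void long_local_work(double *a, size_t n)
  {
      for (size_t i = 0; i < n; i++) {
          a[i] = a[i] * a[i] + 1.0;   /* purely local work, no UPC operations */
          if (i % 4096 == 0)
              upc_poll();             /* explicit progress call so incoming
                                         one-sided operations get serviced */
      }
  }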

Original issue reported on code.google.com by ga10...@gmail.com on 23 May 2012 at 1:47

GoogleCodeExporter commented 9 years ago
Tagging Usability and Performance.  Will defer to Owner on Milestone version, 
but suggest 2.0.

Cray UPC handles this issue by having upc_fence poll and explaining to the user 
that they may want to fence periodically during a long-running computation if 
they are doing things that aren't truly one-sided in our implementation (e.g., 
upc_free() of data with affinity to a different PE).

Original comment by johnson....@gmail.com on 15 Jun 2012 at 6:13

GoogleCodeExporter commented 9 years ago
IBM and Berkeley[1] say: spin on upc_poll()
Cray says: spin on upc_fence()

In the Berkeley case, upc_fence() would also work, but we provide upc_poll() to 
"make progress" without also having the fencing property (it becomes a no-op in 
the "pure pthreads" case where there is no network to poll).

So, I am in favor of DISCUSSING whether we want upc_poll() in the language spec.  
There should be no problem with
  #define upc_poll upc_fence
as a trivially correct implementation.
To me the crux of the discussion is whether the inclusion of upc_poll() is 
useful, or just a horrible substitute for a true progress guarantee.

My initial thought is that if we believe that MPI's experience "proves" that 
explicit polling is good enough, then upc_poll() would just be an optimization, 
as George suggests.  HOWEVER, I don't think the current UPC specification does 
anything that precludes writing CORRECT code which {dead,live}locks on an 
implementation that does not provide true asynchronous progress.  So, I would 
argue that as the spec currently stands, any implementation (my own included) 
which requires insertion of poll/fence calls to ensure progress is BROKEN.  
Therefore, I would argue that upc_poll() does NOT belong in the specification 
(as a mechanism to work around implementation limitations).
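
A minimal sketch of the kind of code I have in mind (a hypothetical strict-flag 
handshake, assuming THREADS >= 2): it is correct under the specification, yet 
it can livelock on an implementation where a remote write only completes once 
the target thread enters the runtime.

  #include <upc.h>

  strict shared int flag[THREADS];        /* flag[i] has affinity to thread i */

  int main(void)
  {
      flag[MYTHREAD] = 0;
      upc_barrier;
      if (MYTHREAD == 1) {
          flag[0] = 1;                    /* strict remote write to thread 0 */
      } else if (MYTHREAD == 0) {
          /* Correct under the spec, yet if the remote write above only lands
             when thread 0 polls the network, this loop spins forever unless an
             explicit upc_poll()/upc_fence is added to its body. */
          while (flag[0] == 0)
              /* spin */ ;
      }
      return 0;
  }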

[1] I've mentioned before that Berkeley avoids placing our extensions into the 
upc_* namespace.  The case of upc_poll() predates our realization of how doing 
this can lead to later headaches.

Original comment by phhargr...@lbl.gov on 18 Jun 2012 at 10:32

GoogleCodeExporter commented 9 years ago
We need to identify what parts of UPC require polling for progress in current 
implementations.  That information is useful to this discussion whether or not 
explicit polling is added to the UPC spec.  (If polling is added, then users 
need guidance on when they should poll.  If polling is not added, then we need 
the information to figure out how we can live without polling.)

For example, Cray does not need polling to handle Get, Put, or atomic memory 
(AMO) operations, but we need it to handle the following:

1) upc_global_{lock_}alloc - One thread calls upc_global_{lock_}alloc and all 
threads must perform an allocation.
2) upc_free - One thread calls upc_free to deallocate memory that has affinity 
to a different thread.
3) upc_global_exit - One thread terminates all threads.

In all of these cases, one thread does something that requires action by other 
threads.  For (1), we discourage users from calling the function and advise 
them to use upc_all_{lock_}alloc for better performance.  For (2), we see it in 
test cases for upc_free, but have not seen it in a real application.  For (3), 
generally it is called after an error is detected and the function's 
description makes no guarantee about how quickly it must terminate the 
application, so performance is not a concern provided that the original thread 
continues to respond to other threads until they have exited.
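
For illustration, a minimal sketch of case (2) above (hypothetical code, not 
taken from any particular implementation or test suite):

  #include <upc.h>

  /* One thread frees memory that was allocated by another thread and that has
     affinity to every thread. */
  shared int *shared p;   /* the pointer itself lives in shared space (thread 0) */

  int main(void)
  {
      if (MYTHREAD == 0)
          p = (shared int *)upc_global_alloc(THREADS, sizeof(int));
      upc_barrier;
      if (THREADS > 1 && MYTHREAD == 1)
          upc_free(p);    /* releases blocks with affinity to all threads; on an
                             implementation that needs polling here, the other
                             threads must reach a polling point before the
                             deallocation can complete */
      upc_barrier;
      return 0;
  }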

I'll note that we used to require polling for upc_memcpy in the case where 
neither the source nor the destination had affinity to the calling thread, 
because we used to perform a direct transfer from the source to the 
destination.  It turned out, however, that when a user actually writes such 
code, they expect the calling thread to perform the copy itself via temporary 
buffering.
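
A sketch of the kind of code meant here (hypothetical layout and sizes, and it 
assumes at least three threads): the calling thread is neither the source nor 
the destination of the copy.

  #include <upc.h>

  #define BLK 64

  /* Block i of each array has affinity to thread i. */
  shared [BLK] int src[BLK*THREADS];
  shared [BLK] int dst[BLK*THREADS];

  int main(void)
  {
      upc_barrier;
      if (THREADS > 2 && MYTHREAD == 2)
          /* Neither the source (thread 0's block) nor the destination
             (thread 1's block) is affine to the calling thread. */
          upc_memcpy(&dst[BLK], &src[0], BLK * sizeof(int));
      upc_barrier;
      return 0;
  }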

Original comment by johnson....@gmail.com on 19 Jun 2012 at 4:26

GoogleCodeExporter commented 9 years ago
Tagged for the version 1.4 specification milestone.

Although we may be able to reach consensus on the progress guarantees and the 
need (or lack thereof) for polling, I doubt that there is sufficient time to 
re-work implementations in the near term, for example, if the decision is made 
to remove user-level polling requirements across the board.

Original comment by gary.funck on 2 Jul 2012 at 4:07