Intrepid / upc-specification

Automatically exported from code.google.com/p/upc-specification

Improve UPC data layout options #8

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Some have observed that there seems to be little use for block sizes other than
0 and 1.  Should the language be simplified by disallowing or deprecating all
other block sizes?

Block size 1 is the default block size and is widely used.  Block size 0
("indefinite") is frequently used for a pointer-to-shared so that it can point
to data with affinity to a single thread.  The other block sizes can be emulated
by using a struct with block size 1, for the common case when the block size
evenly divides the array extent.  For example,

    shared [2] int X[10*THREADS];

    // versus

    struct S {
        int data[2];
    };
    shared struct S Y[5*THREADS];

The first declaration is smaller and permits direct access to the elements of X.
It requires understanding what block size 2 means in terms of data distribution.
Changing either the block size or the array extent may require considering how
they affect each other, such as whether the block size will continue to evenly
divide the array extent.

The second declaration is more verbose and requires accessing the elements via
two subscripts (e.g., Y[i].data[j]), but it uses the familiar default
distribution.  Members can be added to the struct or elements can be added to
the array without as much consideration for how one will affect the other.

Additionally, moving to only block sizes 0 and 1 would have the following 
implementation benefits:
  + Elimination of block size [*] and every issue associated with it.
  + Zero becomes the only valid phase, so pointer-to-shared arithmetic and
representation are simplified.

Original issue reported on code.google.com by johnson....@gmail.com on 19 Mar 2012 at 4:28

GoogleCodeExporter commented 9 years ago
Regarding the potential performance benefits of dropping the layout qualifier 
(block size specifier), some implementations (BUPC, for example) use two 
differing internal representations for pointers-to-shared.  When the block size 
is 0 or 1, the phase is known to be zero, therefore no phase field is 
allocated.  The larger internal representation with the phase field is reserved 
for pointers-to-shared that have a block size > 1.

The GUPC compiler always allocates the space for the phase field, but will not 
use the phase value for shared types with block size <= 1.  Although there is 
some storage overhead and slight inefficiency due to ensuring that the stored 
phase value is zero, there is no additional computational overhead for shared 
pointer arithmetic involving types with block size <= 1.

Based on the above observations, although there are some definite language 
simplifications derived from removing block sizes > 1, there need not be a 
storage efficiency or performance impact for block sizes <= 1.

Original comment by gary.funck on 19 Mar 2012 at 7:56

GoogleCodeExporter commented 9 years ago
Cray also always allocates space for the phase, but does not consider the phase 
in any code generated for block sizes <= 1.

When I wrote "implementation benefits," I was not really speaking about 
performance, but rather the complexity of the compiler or run-time code that 
handles the pointer-to-shared arithmetic.  The code is rather simple for block 
sizes <= 1, but gets to be tricky for block sizes > 1, especially when one must 
factor in unknown signs (at compile time) and the UPC division and mod 
operations that don't match the standard C ops.  Compiler-generated code for 
ptr + n where n has an unknown sign and ptr has a block size > 1 is ugly; the 
code inside a compiler to implement it isn't much fun either.  I suppose some 
implementations may push off the work to a run-time call, but using a function 
call for something as "simple" as pointer addition really feels wrong to me.  
The Cray compiler generates inline code for pointer-to-shared addition and 
we're interested in keeping it simple.

Original comment by johnson....@gmail.com on 20 Mar 2012 at 3:03

GoogleCodeExporter commented 9 years ago
If anything is done in our current round of spec changes with respect to 
block sizes > 1, then I would say "deprecate" is the strongest action we can 
take.  To remove block sizes > 1 completely would break too many applications.

Even the idea of deprecating them bothers me significantly.
As an implementer I can agree w/ Troy's desire to keep the PTS arithmetic code 
as simple as possible.  However, the proposed alternative seems to be structs 
or user-provided-arithmetic (via macros perhaps).  Not to be insulting to Troy 
or to our user base, but the idea that we are going to get higher 
performance/quality pointer arithmetic from a UPC end-user than from the UPC 
compiler seems ridiculous to me.

So, I vote to "Allow".

Original comment by phhargr...@lbl.gov on 22 May 2012 at 12:29

GoogleCodeExporter commented 9 years ago
I can see both sides of the issue, but I would still like to see blocking 
factors gone. I have two major arguments for simplification, and a potential 
way to deal with Paul's argument.

Arguments for restricting blocking factors
=========================================

(1) Language clarity benefit. Maybe you don't appreciate how much simpler UPC 
would become:

* Cleaner syntax, obviously. Well, maybe except for [0].
* No more trouble with [*] blocking factor, thread-dependent blocking factors, 
maximum blocking factor and so on. The UPC type system compresses to something 
essentially C's own type system.
* The concept of "phase" disappears from the language, including upc_phaseof.
* All the funky special cases in the collective definitions: gone.
* Type casts become simpler to behold. The old rule of "phase shall be zero 
after cast" can go. No more trouble with actual to formal parameter 
translations in function calls. No more trouble with writing functions that 
hard-code the blocking factor.

(2) Implementation benefits. What Troy said :) In addition [pure selfish 
thought], on the PowerPC architecture, getting rid of a modulo/integer division 
pair is no mean feat.

How to deal with the backwards compatibility issue
==================================================

Paul rightly feels that the suggested change is drastic and will result in at 
least some code that will not work anymore. Oh, and he doesn't like deprecation 
either. Darn.

So how about a source-to-source translator that transforms fixed blocking 
factor code into BF==1 code? Gary's original message has almost the complete 
blueprint for the transformation.

For codes with array indices the transformation would be fairly trivial. For 
codes with pointers-to-shared the transformation would have to generate a 
"pointer increment" function to allow pointer arithmetic to happen according to 
the original program's notions. This pointer increment function would then be 
inlined, essentially re-adding the complexity that Troy saved by simplifying 
the runtime. Thus, the runtime would be clean and high performance, but if the 
programmer wants to keep their hairy old code they can do that at a cost.

The source-to-source translator could transparently deal with casts to local, 
since the actual layout of data in memory would not have changed - only the 
indexing functions using pointers-to-shared would have been modified.

Original comment by ga10...@gmail.com on 24 May 2012 at 3:29

GoogleCodeExporter commented 9 years ago
This seems the appropriate place to make this point: I think we are missing the 
real issue here.

The way UPC handles distributed arrays is awkward at best.  This proposal and 
several other proposals are tinkering around the edges, rather than proposing 
any sort of wholesale change that actually improves the expressibility of the 
language.  Many of the proposals are aimed at eliminating some implementation 
challenges, or perhaps adding restrictions to eliminate confusing cases.  I 
agree with Paul that these do not seem to be of significant benefit, 
particularly to users of the language.

Perhaps, rather than fiddling with block size related changes, people could 
propose new ways of specifying array geometry, perhaps that build on cyclic and 
indefinite pointer arithmetic, perhaps not.  That might be of more benefit to 
users than removing existing functionality.

For the question here, I'll vote "Allow".

Original comment by brian.wibecan on 25 May 2012 at 9:46

GoogleCodeExporter commented 9 years ago
Brian wrote:
> I agree with Paul that these do not seem to be of significant benefit, 
particularly to users of the language. 

I would go so far as to say that dropping the current distributed array 
layouts would be creating a NEW language.  What would your response be if I 
asked that arrays be removed from C entirely, since users can achieve the same 
things using only pointers?  It is not a perfect analogy, of course, but my 
point is that distributed array layouts in UPC are too fundamental a feature 
of the language to remove.

I second Brian's interest in perhaps ADDING mechanisms for better 
controlling/using array layouts.

Original comment by phhargr...@lbl.gov on 25 May 2012 at 10:11

GoogleCodeExporter commented 9 years ago
I also vote for maintaining the status quo with respect to block sizes > 1.  I 
also support Brian's suggestion that an "out of the box" proposal that 
supersedes and generalizes layout qualifiers, overcoming the limitations of 
block sizes (whatever they may be), might be a more productive avenue of 
inquiry.

A couple of things in this regard:
1) I have heard comments to the effect that: "if you're developing a library, 
you can't use block sizes other than 0, because block sizes are constrained to 
be compile-time constants".  Perhaps issue #40 (block sizes as an attribute of 
a VLA) would provide sufficient generality to meet that objection.

2) Although there has been a rather persistent stated concern that block sizes 
> 1 are both confusing and not very useful, apart from the stated 
compiler/runtime implementation issues and the library development limitation 
mentioned in 1) above, I am not aware of any further elucidation of why 
eliminating block sizes > 1 would be a good thing.  If there are issues other 
than implementation issues related to block sizes > 1, I'd suggest that they 
be added to this issue as comments, so that we can better understand the 
problem.

3) Given that Co-array Fortran programs also provide distributed arrays, is 
there anything about UPC's block sizes (layout qualifiers) that either improves 
the fit between UPC and Co-array Fortran, or limits interoperability?  (This 
topic might be worth a separate issue to track the discussion.)

disclaimer: I happen to like UPC array blocking factors, and think they should 
be used more, not less.  If there are UPC language issues that limit their use, 
I'd prefer to see those limitations addressed rather than throwing out block 
sizes.  That said, there is also a great deal of appeal to the minimalist 
argument of simplifying the language where possible/practical.

Original comment by gary.funck on 25 May 2012 at 10:45

GoogleCodeExporter commented 9 years ago
Regarding Comment #7:

3) Fortran has fewer (basically one) data distribution options than UPC, so 
mapping a distribution from Fortran->UPC is easier than UPC->Fortran.  Fewer 
options can be viewed as a weakness or a strength.  I believe that it is a 
strength, partially because I think too many options are confusing and 
partially because of what Brian wrote in Comment #5 (i.e., the distribution 
options are fine but their presentation could be improved).

Original comment by johnson....@gmail.com on 5 Jun 2012 at 8:30

GoogleCodeExporter commented 9 years ago
Marking 2.0 and Usability, and changing the title to better reflect the issues 
being discussed.  This is an issue where everyone seems to agree that something 
should be done, but we need more time to form better proposals.

Original comment by johnson....@gmail.com on 15 Jun 2012 at 6:09