chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Getting closer to hardware vector instructions directly from Chapel - a Vector Class #15812

Open damianmoz opened 4 years ago

damianmoz commented 4 years ago

Some modern instruction set architectures (ISAs) such as x86-64, Power9, MIPS and SPARCv9 support SIMD instructions. Others, like ARMv8 and RISC-V, support far more complete vector extensions (VEs), while the NEC SX-Aurora Tsubasa is a full vector ISA. All of these, with a good compiler, allow varying degrees of vectorization.

While compiler optimizations exploit these instructions transparently to the user, many languages provide mechanisms such as compiler vector intrinsics that allow users to get closer to, or even directly use, these instructions. Compiler intrinsics are NOT generic.

It is proposed to provide a vector class for this purpose within Chapel, although it should be designed to give Chapel far more transparency than compiler intrinsics. This class will help take some of the workload off the compiler and result in faster code in a broad range of scenarios.

This is totally different from what C++ calls a vector class, which is a specialized list.

It may have some aspects in common with what C++ calls a valarray class.

Has anybody already (in Chapel), drafted the design of such a class, designed such a class or even implemented such a class?

For performance reasons, the major driver for this proposal, this class should be able to map very closely to a hardware vector register, in some sense of the word 'map'.

Some ideas for discussion appear below, but these need suggestions, insight and augmentation from others. The list is deliberately incomplete and has some deliberate/known flaws to trigger discussion, so feel free to comment.

This design needs to be considered in a somewhat portable context. So this means looking at SIMD in ISAs like x86-64, Power9, MIPS and SPARC, the potentially more powerful/compact 'scalable' vectors of ARMv8 SVE (like the new A64FX) and the RISC-V VE, or the fully blown vector ISA (VISA) of a NEC SX-Aurora Tsubasa. It should definitely be considered from the perspective of Fortran users.

Is the Tsubasa too novel to work? Do we stick to SIMD? Do we design it so that extension to using predicates is easy? Do we ditch SPARC because it is dying, or MIPS because its licensing model is not yet fixed? And these questions are just the start.

Is it a two-level design, i.e. one layer that maps directly to hardware registers and another that builds on top of that and is able to address not only SIMD but also the true VEs and even a VISA?

Introducing GPUs into this discussion is likely to be highly counterproductive.

Simple details of the class are to follow. This issue needs keywords attached, and maybe a change of subject.

mppf commented 4 years ago

I'm pretty sure a vector type would need to be a value type, i.e. a record vs a class. I'm also curious to what extent existing types could handle the data storage portion. Tuples like 4*int - or the stack-allocated fixed-size array type like c_array(int, 4) - could potentially handle the type portion. Would it use one of those or a different type?

GCC has for a long time supported intrinsics for vector operations. However I have personally found these to be not very useful because of the limited number of operations available. That leads me to think that in order to be useful such a mechanism needs to support lots of operations. But, to be portable, it needs to support a small number of operations. There is a tension there that I'm not sure it's possible to resolve.

Would operators like + between two of the vector types generate vector code? Or would users need to call a special function like vectorizedAdd?

There is a strategy for this that one can do now. One can use C interoperability features, such as extern blocks, to write inline functions that include intrinsics or assembly for a particular architecture, and then call these from Chapel code. While this strategy is not particularly easy, it is more certain to work, because any hardware-supported vector instruction/intrinsic is likely to be available in this way (and the Chapel project does not need to do anything at all in order to make it available).

damianmoz commented 4 years ago

I'm pretty sure a vector type would need to be a value type, i.e. a record vs a class.

You are probably right. My brain is in C mode currently, which means it has pushed knowledge of Chapel and C++ to the back of the neural cell. I had this project on my back burner, but when your GSoC student wanted to call the new specialized list a vector, I figured I needed to come out and claim the name 'vector' before it got hijacked (as it had been in C++ with the STL).

I'm also curious to what extent existing types could handle the data storage portion. Tuples like 4*int - or the stack-allocated fixed-size array type like c_array(int, 4) - could potentially handle the type portion. Would it use one of those or a different type?

No idea at this stage. I had not gotten into details, and I was hoping for some help at the detail level anyway, as I figured others' input was going to be invaluable. By definition, my, or anybody's, HPC experience with vectors is limited. I am sure my knowledge and manual coding of SIMD is polluted by my past experience with vectors on Cray, Convex and NEC machines. That experience is a bit old anyway, so it could be corrupted by just the sheer age of that knowledge.

GCC has for a long time supported intrinsics for vector operations.

Yes. In my next big posting (or should it be a file I attach?) there are references, two of which are:

[1] Fog, Agner, https://www.agner.org/optimize/vectorclass.pdf
[2] Intel, https://software.intel.com/content/www/us/en/develop/articles/fortran-array-data-and-arguments-and-vectorization.html

However I have personally found these to be not very useful because of the limited number of operations available.

Probably enough for testing concepts. We can always write our own inline asm with C.

That leads me to think that in order to be useful such a mechanism needs to support lots of operations. But, to be portable, it needs to support a small number of operations. There is a tension there that I'm not sure it's possible to resolve.

Yessss! Therein lies the dilemma.

This is early days. I suggest we err on the side of small.

My thought was to first write an implementation in Chapel and then see what assembler is created by the back end. Mind you, that assumes we have a nice way to read the assembler for individual routines without a whole lot of other assembler, i.e. like what is produced by

    g++ -S -O3 -fno-math-errno file.c

I have not figured out how to do that in Chapel cleanly yet (and please do not say that we should be disassembling the executable). Then rework the 'vectorclass' as noted above to reflect the Chapel for x86-64.

And then

    for cpu in [ X86-64, SomethingElse like an ARM with SVE ] do
    {
        for tries in 1..4 do
        {
            sit back;
            think a lot;
            get feedback;
            tweak both Chapel and C++ based on feedback and hard thinking;
            if no more feedback then break;
        }
        write report (because that exposes brain fade);
        ponder a bit more;
        exploit it in something, e.g. some piece of Linear Algebra code;
    }

Would operators like + between two of the vector types generate vector code? Or would users need to call a special function like vectorizedAdd?

The former.

There is a strategy for this that one can do now. One can use C interoperability features such as extern blocks, to write inline functions that include intrinsics or assembly for a particular architecture, and then call these from Chapel code.

This is what I am trying to avoid.

Think 'elegant'. Think 'Chapel'

While this strategy is not particularly easy, it is more certain to work, because any hardware-supported vector instruction/intrinsic is likely to be available in this way (and the Chapel project does not need to do anything at all in order to make it available).

Yes to 'not particularly easy'.

Yes to 'more certain to work'.

But then you have to write one for X86-64, one for A64FX, and so on. And it means that you are not leveraging the benefits of the LLVM backend to the Chapel compiler or Chapel itself.

So, I would really like to try to avoid this approach if I can.

damianmoz commented 4 years ago

The following is arguably incomplete but is more than enough work for now. Hopefully it covers all the areas where problems will occur and is sufficient to engender all the required discussion. It is really being released too early, but it needs to be released into the wild now for reasons beyond my control.

Chapel needs at least a type that can fit into a hardware vector (or SIMD) register, or a small set thereof. This type could have the form within a declaration like:

    var v : vector(real(w), n);

where n is a small param integer and is limited to being an integral power of 2 at this stage.

So we could have:

    var x : vector(real(w), 2); // max using real(64) with SSE2
    var x : vector(real(w), 4); // max using real(32) with SSE2
    var x : vector(real(w), 8); // max using real(64) and AVX512
    var x : vector(real(w), 16); // max using real(32) and AVX512
    var x : vector(real(w), 32); // max using real(64) and ARMv8 + SVE
    var x : vector(real(w), 64); // max using real(32) and ARMv8 + SVE ?
    var x : vector(real(w), 128); // a bit of an orphan
    var x : vector(real(w), 256);   // max using real(64) NEC-SX for hardware
    var x : vector(real(w), 512);   // max using real(32) NEC-SX for hardware

What are the implications of this for a RISC-V's Vector Extensions or even ARMv8's Scalable Vector Extensions (SVE) ? There are some arguments to limit n to not exceed 256.

Does the choice of n need to be more limited in size to make it more manageable?

At this stage, integer vectors are being ignored to keep discussions short.

But, is handling integer vectors more important than floating point ones?

Defining the Algol 68 niceties:

    proc lwb(x : [?D] ?T) : int return D.dim(1).low;

    proc upb(x : [?D] ?T) : int return D.dim(1).high;

Or does this need to be dim(0) in a 0-based world? Grumble, grumble!

So given 'm' and 'n' are small integers, we could have

    var v : vector(real(w), n);
    var u : vector(real(w), n);
    var w : vector(real(w), n);
    var x : [1..n] real(w);
    var y : [1..m] real(w);
    var t : real(w);

    assert(m < n);

The three most basic methods to be defined for a vector are:

    size:   v.size returns the number of elements of v, n
    core:   v.core returns the type of any v[i], real(w)
    wipe:   v.wipe(), v[0:n-1] = 0:v.core()

The assignment implied above is not real. It is likely to be an XOR.

There needs to be way to get data into and out of these data structures.

    pack:   v.pack(x), v = x assuming nx == n
    data:   v.data(x), x = v assuming nx == n
    pack:   v.pack(x, keymap), [i in 0:n-1] v[i] = x[lwb(x) + keymap[i]]
    data:   v.data(x, keymap), [i in 0:n-1] x[lwb(x)+keymap[i]] = v[i]

The last two are gather/scatter operations.

We may need something that handles bitmap indexing into an array. Or is this adding too much complexity to an already difficult design problem?

    trim:   v.trim(k), v[k:n-1] = 0:v.core()
    load:   v.load(y), v[0:m-1] = y[1..m], v.trim(m)

We may need to pull out slices or a single element, or inject a slice or an element into a vector. These operations will be suboptimal, probably not vectorizable, but definitely should be provided for compact coding.

    get:    v.get(i, m), returns v[i;m]
    get:    v.get(i), return v[i]

    put:    v.put(i, m, x), returns v[i;m] = x[lwb(x);m]
    put:    v.put(i, t), returns v[i] = t

This allows us to have

    head:   v.head is the same as v.get(0, v.size() >> 1)
    tail:   v.tail is the same as v.get(v.size() >> 1, v.size() >> 1)

There is a need for a broadcast operation:

    send:   v.send(t), v[0;n] = t
    send:   v.send(t, i), v[i;n] = t
    send:   v.send(t, i, m), v[i;m] = t

We need the orphan operation (which is probably highly inefficient)

    swap:   v.swap(i, j) effectively swaps v[i] and v[j]

The following operations are defined between 2 vector operands:

    +   add
    -   subtract
    *   multiply
    /   divide
    +=  add
    -=  subtract
    *=  multiply
    /=  divide
    ==
    !=
    <=
    >=
    <
    >

There are 7 operations which return a vector:

    abs:    v.abs() returns the element by element absolute value
    neg:    v.neg() returns the element by element negated value
    sqrt:   v.sqrt() returns the element by element SQuare RooT
    muie:   v.muie() returns the element by element MUltiplicative InversE
    min:    v.min(u) returns the element by element minimum of 'u' with 'v'
    max:    v.max(u) returns the element by element maximum of 'u' with 'v'
    fma:    v.fma(u, w) returns the element by element FMA - v + u * w

Reductions but implemented as a method, not a Chapel reduction:

    maxval: v.maxval returns the maximum over 'v'
    minval: v.minval returns the minimum over 'v'
    maxloc: v.maxloc returns the index of the maximum over 'v'
    minloc: v.minloc returns the index of the minimum over 'v'
    sum:    v.sum computes the naive sum of all v[i], i in [0..v.size-1]
    dot:    v.dot(u) computes the naive sum of all v[i]*u[i], i in [0..v.size-1]

and their equivalents that consider absolute values:

    amaxval:    v.amaxval returns the maximum of |v[i]| over 'v'
    aminval:    v.aminval returns the minimum of |v[i]| over 'v'
    amaxloc:    v.amaxloc returns the index of the maximum of |v[i]| over 'v'
    aminloc:    v.aminloc returns the index of the minimum of |v[i]| over 'v'

Note that some instruction set architectures (ISAs) have totally broken FMIN/FMAX instructions that will stab us in the back here. The culprits are Intel and RISC-V. Other ISAs may also do this.

There is whole load of grief in permutations, blends, and friends:

    [rl]shift
    c[rl]shift
    permute
    blend
    lookup
    gather
    scatter

They should be addressed in the scope of this project but are left as placeholders here.

damianmoz commented 4 years ago

You said

I'm pretty sure a vector type would need to be a value type, i.e. a record vs a class.

Yes. You are correct. Brain-fade on my part.

You asked

I'm also curious to what extent existing types could handle the data storage portion.

Me too. Not sure how to evaluate the various options.

And also

Tuples like 4*int or the stack-allocated fixed-size array type like c_array(int, 4) could potentially handle the type portion. Would it use one of those or a different type?

I really have not gone into any detail there. I was more focused on the interface, and whether it was complete enough to be useful to others and clean enough to be implementable. Also, what is the benefit of a c_array(real(32), 4) over a

    var x : [0..3] real(32);

This array 'x' already has the built-in method x.eltType to return the core type, and it can already be vectorized, or assisted with vectorizeOnly. This gives me some hope that at least a prototype could be done in pure Chapel. By the time we got to such a prototype, say by the time of 1.23, would the Chapel compiler be further advanced in its ability to vectorize with reduced overhead?

Even then, if you then took this reference and replaced bits with compiler intrinsics where needed, would this hybrid get closer to the desired performance? At least if you could then contemplate this hybrid definition, you would have something that might give you pointers on what is needed in the compiler to allow the module that implements this vector type to be done in pure Chapel. Resorting to C for the actual details is not what I was thinking about for the long term, but rather tweaking how you interact with LLVM. Remember, I know close to zero about that back end, so I could be rambling incoherently at this point.

A lot of my ideas need to be better informed by how Chapel currently handles vectorization. I do not have that knowledge yet, and I figure it will only materialize with experimentation or assistance from those far more enlightened/experienced on the topic than I.

I am throwing up ideas at this stage. Enough from me. Others can comment for a while.

mppf commented 4 years ago

But then you have to write one for X86-64, one for A64FX, and so on. And it means that you are not leveraging the benefits of the LLVM backend to the Chapel compiler or Chapel itself.

Right, but actually I think that problem specifically could be solved by writing "assembly" in LLVM instead of specific assembly for each target. (I'm not saying that would be elegant.)

For more information about LLVM IR you might look at

https://llvm.org/docs/LangRef.html#add-instruction
https://llvm.org/docs/LangRef.html#vector-operations

(and imagine you are reading an instruction set reference).

What are the implications of this for a RISC-V's Vector Extensions or even ARMv8's Scalable Vector Extensions (SVE) ? There are some arguments to limit n to not exceed 256.

From https://developer.arm.com/docs/101726/0210/learn-about-the-scalable-vector-extension-sve -

Unlike other SIMD architectures, SVE does not define the size of the vector registers, but constrains it to a range of possible values, from a minimum of 128 bits up to a maximum of 2048 in 128-bit wide units.

So I am not quite sure that, with SVE, the programmer should be saying what the vector length is at all. As a result, I think that the effective way to use SVE looks more like saying "This loop is vectorizable" rather than using vector types. However, there are SVE C extensions that allow one to write vector intrinsics, and those intrinsics work with types that have unknown size at compile time.

pack: v.pack(x), v = x assuming nx == n

I'd probably provide the ability to go between tuples and vectors rather than arrays and vectors. (But there are issues if the tuple size is not known at compile time, which might not go so well with SVE).

damianmoz commented 4 years ago

I agree with you that we probably have to write "assembly" in LLVM. But when I read the LLVM docs, my eyes glaze over and I feel so far out of my comfort zone that I was wanting somebody to agree with me. Also, I was unsure whether it could be done purely at the LLVM level or whether I was going to be condemning some poor bunny to even more work on the compiler. I always feel guilty when I suggest the latter.

That is why I suggested that we do an implementation of the interface purely in Chapel, even if it is very suboptimal from a vectorization perspective. Get it right, tested, and used, and then we attack the back end. What do you think? Is this the sort of thing @ben-albrecht is into?

All I know of ARMv8's SVE is what I have read or heard on YouTube. It will be a challenge reconciling the ARM approach (and that of RISC-V) with other vendors' SIMD ISAs, maintaining transparency, and achieving optimality. There will be issues with supporting the various x86-64 SIMD instruction sets depending on whether a chipset supports SSE2, AVX2 or AVX-512. And do you want the penalty of the clock-speed drop when you use AVX-512? I think we only have to worry about 128-bit SIMD ISAs on other architectures. Do we need to expose some sort of base/minimal "vector length" to Chapel?

I notice you keep mentioning tuples. What is the magic there? There must be more than meets my eye. Anyway, in my brevity, I did not provide an overloaded pack that copies a tuple into a vector, even if it is mainly used at declaration time of the vector. For example

    var f0 : real(32) = ....
    .....
    var v = vector(f0, f1, f2, f3);

would create a 128-bit aligned vector of four 32-bit floats that would map to a vector on just about any machine with SVE or SIMD.

For simplicity at this stage of the process, I was going to stick with the vector length being a param, at least on the Chapel side. On the back end, using the more generic vector may reduce complexity.

Note that from my perspective, I would always be using these vectors as slices of larger 1-D arrays or rows or columns of 2-D arrays, which some people call, respectively, vectors and matrices, but then we end up being caught in a semantic war. So they will generally align with vectors rather than tuples.

As a selfish aside, and because I use slices so much, I was wondering why Chapel could not use "start:finish:stride" notation instead of "start:finish by stride". It always looks messy. Just curious.

damianmoz commented 4 years ago

I need to go back to fundamentals as I have been going around in circles.

How is a tuple stored and manipulated internally? How do you guarantee it will get vectorized?

I looked at c_array and my head spun. I have no idea what it does internally. I cannot even see where the data is defined in the implementation. See below.

proc main()
{
    var t : c_array(real(64), 4);
    var s : c_array(real(64), 4);
    var x : [1..4] real(64) = [ 1.0, 2.0, 3.0, 4.5 ];
    var y : [1..4] real(64);

    for i in 0..3 do t[i] = x[i + 1];
    for i in 0..3 do s[i] = sin(x[i + 1]);

    writeln(t);

    for i in vectorizeOnly ( 0..3 ) do s[i] += t[i];

    writeln(s);
    for i in 0..3 do y[i] = s[i];
    writeln(y);
}
mppf commented 4 years ago

The reason I was bringing up tuples and c_array is that an array (at least today) is necessarily heap allocated and that doesn't make so much sense for relatively small things (like 4 reals).

How is a tuple stored and manipulated internally? How do you guarantee it will get vectorized?

Tuples are allocated on the stack and the compiler can choose the memory layout. There's currently no particular guarantee of vectorization (but the later parts of compilation are pretty good at pattern-matching sequences of related operations, as with say adding two tuples element by element).

I looked at c_array and my head spun. I have no idea what it does internally. I cannot even see where the data is defined in the implementation.

c_array is supported by the compiler. It is akin to declaring say double arr[4] in C - in that it is allocated on the stack if it is a local variable - but the Chapel version is slightly more value-y, because something like var copy = myCArray; will copy the elements. The size must be param (i.e. known to the Chapel compiler).

mppf commented 4 years ago

(This part is off topic, so if you want to delve into it further we need to move it to another issue)

I was wondering why Chapel could not use "start:finish:stride" notation instead of "start:finish by stride" It always looks messy

you mean start..finish by stride, right?

: is for specifying types and writing casts. I'm not so sure that the corresponding thing - start..finish..stride - would make any sense, personally.

damianmoz commented 4 years ago

My feeling is that we need something more like c_array than a tuple. I cannot yet understand enough of the internals of this record to be sure that it is totally adequate, although some things might be overkill. As I said, I cannot figure out how the space is created or even how the indexing operator finds that space. Or is that some magic in the compiler itself? But, at this stage, it is not crucial unless it hits a brick wall. I am curious as to the implications that c_array has in terms of any redundant temporary vectors that may get created in intermediate expressions. Remember that what we are trying to do here is to have code that helps the back end keep things in vector registers.

I think I will proceed from here using c_array although knowing more details about how it works would make me feel more comfortable. Does it need any compiler support, and if so, what is it?

Knowing that a Chapel array is always allocated off the heap is interesting and useful, but off-topic. I would never have guessed that and cannot remember reading it anywhere. Or at least it never jumped out at me if it was in what I was reading.

Note that in the CPtr.chpl source file, the comments prior to the definition refer to c_array as a class.

mppf commented 4 years ago

Or is that some magic in the compiler itself?

Yes, that's what I was trying to say.

Does it need any compiler support, and if so, what is it?

Are you talking about the C compiler? (It definitely needs chpl compiler support but should be portable across C compilers and the LLVM backend).

damianmoz commented 4 years ago

How do I write an (operator) proc that takes two c_arrays, both of element type real(w) and length n, adds them element-wise, and returns the result in a temporary of the same type?

I have tried several variations and I keep going nowhere.

What I am trying to achieve is:

proc +(u : [1..4] ?T, v : [1..4] T) : [1..4] T
{
    var t : [1..4] T;

    for i in vectorizeOnly(1..4) do t[i] = u[i] + v[i];

    return t;
}

That said, why does Chapel want to use this operator proc for anything other than a 1-D array having a domain of [1..4]? Or conversely, how do I force Chapel to use the above operator proc for the said 1-D array type?

bradcray commented 4 years ago

Here's a + overload that adds c_arrays (TIO):

proc +(x: c_array(real, ?n), y: c_array(real, n)) {
  var z: c_array(real, n);

  forall i in 0..#n do
    z[i] = x[i] + y[i];

  return z;
}

config param size = 10;

var a, b: c_array(real, size);
for i in 0..#size {
  a[i] = 3.0;
  b[i] = i:real;
}

var c = a + b;
writeln(c);
bradcray commented 4 years ago

why does Chapel want to use this operator proc for anything other than a 1-D array having a domain of [1..4]? Or conversely, how do I force Chapel to use the above operator proc for the said 1-D array type?

I'm not certain what else you're finding it gets applied to, but guessing (let me know if I'm wrong): Chapel currently only resolves array arguments based on static type information such as (1) the array's rank / dimensionality, (2) its element type. Thus, any 1D arrays will dispatch to this overload regardless of index set, though any that don't have the index set 1..4 will result in an execution-time error. The reason for this is that though the formal arguments' indices are specified with param indices (1 and 4), Chapel doesn't have any notion of param ranges or domains today, so makes no attempt to reason about the index sets of the actuals and formals in resolving the calls. This is something we'd like to improve upon in the future (supporting param arrays, ranges, and domains; and potentially using them to constrain legal calls for cases like this).

As examples, here are a bunch of cases that don't work, with explanations as to why:

proc +(u : [1..4] ?T, v : [1..4] T) : [1..4] T
{
    var t : [1..4] T;

    writeln("In Damian's + overload");

    for i in vectorizeOnly(1..4) do t[i] = u[i] + v[i];

    return t;
}

var A, B: [1..4] real;

writeln(A+B);

var C, D: [0..3] real;

// Matches Damian's overload, but generates an error due to domain              
// mismatch.  This is because Chapel doesn't have param ranges and              
// domains, so doesn't attempt to reason about index sets at                    
// compile-time, only execution time.                                           
//                                                                              
// writeln(C+D);                                                                

var E, F: [1..5] real;

// Same as above: execution-time index set mismatch                             
//                                                                              
// writeln(E+F);                                                                

var G, H: c_array(real, 4);

// + on c_arrays not supported by default                                       
//                                                                              
// writeln(G+H);                                                                

var I, J: 4*real;

// OK, but doesn't use Damian's overload (different types)                      
//                                                                              
writeln(I+J);

And here's a more general + overload that does work and is more flexible:

proc +(u : [?D] ?T, v : [D] T) : [D] T
{
    var t : [D] T;

    writeln("In Damian's + overload");

    for i in vectorizeOnly(D) do t[i] = u[i] + v[i];

    return t;
}

var A, B: [1..4] real;

writeln(A+B);

var C, D: [0..3] real;

 writeln(C+D);

var E, F: [1..5] real;

writeln(E+F);

var G, H: c_array(real, 4);

// + on c_arrays not supported by default                                       
//                                                                              
// writeln(G+H);                                                                

var I, J: 4*real;

// OK, but doesn't use Damian's overload (different types)                      
//                                                                              
writeln(I+J);

This more flexible approach could do things like take one code path if the array size was 4 and another code path if it wasn't, but currently that would result in an execution-time conditional for similar reasons: because the compiler doesn't try to reason about the size / membership of index sets at present.

damianmoz commented 4 years ago

Thanks heaps for this.

proc +(x: c_array(real, ?n), y: c_array(real, n)) {
  var z: c_array(real, n);

  forall i in 0..#n do
    z[i] = x[i] + y[i];

  return z;
}

I had actually tried a generic version of this, i.e. with real(?w) and with just a for loop with vectorizeOnly(0..n) and it gave me a compiler error. I am using 1.20 so I figured I was doing something stupid and never reported it.

Several questions for my guidance. Why did you not specify a return type, or can Chapel always assume that from what you are doing? If vectorized code is desired from the loop, is the forall a better choice, especially considering n is small? Is the length of a c_array always known at compile time? Is using the count operator # likely to help or hinder vectorization, or neither?

If Chapel was not a compiler but an introspective AI tool and it saw real(w, n), what would it think was being requested?

I personally think that the c_array approach as originally suggested by Michael is the way to go for now, although real(w, n) and matching int(w, n) types might be the way to go in the long term, especially if you want the compiler to be smarter about how it looks at vectorization.

Thanks for the thoughts on using arrays with vectorization. I was trying to see if there was a way to force the compiler to use one operator + for short arrays and another for longer arrays. I had actually proven your approach worked for all arrays, but I did not want to mess with the default Chapel operator for addition for 1-D arrays, just for the short ones. I will pursue that a bit more. If I use the Chapel compiler and save the intermediate C files, is there a way to then manually compile these with a "-S" flag to produce the assembler? An earlier suggestion to disassemble what is nearly a 10MB executable and go fishing through the result sounds like a sledgehammer to crack a nut to me, and less than productive. Surely you look at the assembler that Chapel generates yourselves?

mppf commented 4 years ago

If I use the Chapel compiler and save the intermediate C files, is there a way to then manually compile these with a "-S" flag to produce the assembler? An earlier suggestion to disassemble what is nearly a 10MB executable and go fishing through the result sounds like a sledgehammer to crack a nut to me, and less than productive. Surely you look at the assembler that Chapel generates yourselves?

We could make a compilation flag that just does this (though it would actually disassemble the generated binary), and if that has significant value to you, I recommend creating a feature request issue.

Assuming you are using something compatible with GCC, one thing you can do is this:

chpl hello.chpl  --savec=tmp --ccflags -save-temps

That will create _main.s with all of your program in assembly. Which is not really different...

However you can use the experimental --incremental to request the C compiler be used to compile separately:

chpl hello.chpl  --savec=tmp --ccflags -save-temps --incremental

For me, this created _main.s which contained the assembly listing for just that module. Perhaps this strategy will be a useful workaround.

bradcray commented 4 years ago

Why did you not specify a return type, or can Chapel always assume that from what you are doing?

I tend to rely a lot on Chapel's type inference (where, in this case, it will look at the return z; statement and know that the return type matches z's type), but there's no downside to giving the return type, and if I were developing an official library routine, I typically would put the return type.

If vectorized code is desired from the loop, is the forall a better choice, especially considering n is small?

I haven't studied vectorization much, and wasn't thinking about it in sketching out this routine (was focusing more on getting the procedure's pattern correct). Specifically, I don't personally have much experience with what will or will not produce good vectorized code with our compiler today. For small n, you're probably right that we wouldn't want a forall (today) because it would likely spawn off numCores tasks, which would be overkill (ultimately, I'd like to hope that our compiler would do a good job of throttling that forall for fixed-size arrays like these). So if I only wanted vectorization, not multitasking, I'd probably start with just a for loop and see whether the back-end compiler (C or LLVM) could vectorize it. If it didn't, I'd try using forall + vectorizeOnly(), but I know there's been design uncertainty around vectorizeOnly(), so have gotten into the habit of not using it myself.

Is the length of a c_array always known at compile time?

Yeah, it is. Chapel's c_array type is similar to a stack-allocated C array, so has to be fixed size. We've discussed also adding some sort of user-facing "C heap-allocated array" type that would be arbitrary size, but that currently is future work (we have our own internal mechanisms for getting heap memory, but they aren't particularly nice; of course one could also use a c_ptr() type to get a similar effect using pointers).

Is using the count operator # likely to help or hinder vectorization or neither?

I don't believe it should have any effect, particularly in this case where n is known to the compiler. But using 0..n-1 would remove any question around this, so might be worthwhile.

If Chapel was not a compiler but an introspective AI tool and it saw real(w, n), what would it think was being requested?

I'm not understanding this question. Are you saying it might be interpreted as some sort of "is w real" logic predicate?

I personally think that the c_array approach as originally suggested by Michael is the way to go for now, although a real(w, n) and matching int(w, n) types might be the way to go in the long term, especially if you want the compiler to be smarter at how it looks at vectorization.

Works for me.

An earlier suggestion to disassemble what is nearly a 10MB executable and go fishing through the result sounds like a sledgehammer to crack a nut to me. And less than productive. Surely you look at the assembler that Chapel generates yourselves?

Michael already answered this, but I'll say that, personally, I rarely do. Other developers have, however, and I believe have taken the approach of (a) having the back-end compiler save the assembly using -S or an equivalent (rather than disassembling the binary), and (b) putting "obvious" markers surrounding their code around the part that they wanted to focus on in the assembly to help them search for and focus in on the kernel of the computation. For example, if there were some pattern that would be trivial to recognize in the assembly, you could put code that generated that pattern at the start and end of a custom + overload in order to focus in on how its guts were implemented.

damianmoz commented 4 years ago

Thanks @mppf , I think that either

chpl hello.chpl  --savec=tmp --ccflags -save-temps

or even the experimental --incremental will achieve what I need.

damianmoz commented 4 years ago

Thanks for the insightful feedback @bradcray. I will take your point on vectorizeOnly.

As to my poorly written point on a real(w) type or int(w) declaration for a conventional single data (SD) base type, it looks as though such a declaration was almost carefully designed, pre-ordained, or even begging to be extended to declare a multiple data (MD) near-base type

real(w, n)

of n real(w) elements that can be used in SIMD instructions.

Such an extension appears to be a far clearer and more compact way of specifying such MD near-base types than either the more verbose GCC style

typedef int v4si __attribute__ ((vector_size (16)));

or the Sierra LLVM extension

int varying(4) v = { 2, 4, 6, 8 };

or even LLVM's own vector instructions (of which I have no experience).

That suggested Chapel extension (or something like it) is just a thought for future compiler enhancements (which go beyond my skill set). While it only adds one extra argument to the type declaration, this extension needs serious compiler support, and in that respect is quite different from the c_array approach being proposed.

For now, a c_array approach should provide a basis to find out what vectorizations are feasible with the current Chapel compiler.

Also, I will investigate if the latest compiler handles a generic type in the c_array parameter to the + operator proc. Or maybe that internal error is still there.

bradcray commented 4 years ago

it looks like such a declaration was almost carefully designed, pre-ordained or even begging, to be extended to be able to declare a multiple data (MD) near-base type real(w, n) of n real(w) elements that can be used in SIMD instructions.

It seems to me that real(*) is most naturally interpreted as "a single real value parameterized in such and such a way" and that to get multiple real values, one should use n*real(w) or one of the other collection types (array, c_array, record, etc.).

damianmoz commented 3 years ago

Does any other language have a high level concept that allows direct access to objects which can map directly to a SIMD or vector register? Such SIMD instructions have been around for over 2 decades.

I am talking of something that hides things at a higher level than gcc's

typedef double v4f64 __attribute__ ((vector_size(32)));

or other compiler's equivalents. These would need to scale from systems which only have SIMD for 32 bit (or smaller) quantities through to the NEC Tsubasa which handles a vector of 256 real(64) quantities.

And I am curious about what we need from such data structures to help us exploit GPUs. Do these require us to put our brains back into the mindset of old 1970s-style vectors, which worked directly on data in memory (rather than in registers)?

damianmoz commented 3 years ago

Not a lot of action/feedback happening here.

The idea is to treat short, fixed-length arrays as special. With apologies for concentrating on real(w) arrays, this would ultimately allow something like a dot product to be written generically as

proc dot(r : range, const ref u : [] ?Real, const ref v : [] Real)
{
    param VL = vector(Real, 0).optimalSize;
    type vReal = vector(Real, VL);
    const size = r.size;
    const f = r.first;
    const tailSize = size & (VL - 1);
    const bodySize = size - tailSize;
    const body = f .. #bodySize by VL;
    const tail = f + bodySize .. r.last;
    var s : vReal;

    s.wipe(); // clear the vector accumulator
    for i in body
    {
        const ref _u = u[i..#VL]:vReal, _v = v[i..#VL]:vReal;

        s += _u * _v;
    }

    var t = s.vsum(); // compute the horizontal sum across the entire vector

    for i in tail
    {
        t += u[i] * v[i];
    }
    return t;
}

I do not pretend that this algorithm is optimal. But if the compiler knows that there is effectively a 1D-array of a fixed length (that fits exactly into a vector register), then it should be able to optimize things without having to do a lot of work.

Note that the vector type can also return a minimumSize and a maximumSize, which may indeed be the same (as with, say, an IBM Power ISA) but are all unique with an AVX-512 ISA. I would propose that

type vReal = vector(Real, 0);

defines a type which has the optimal length anyway from which one can deduce

param VL = vReal.size;

Am I right that the concept of length has disappeared in favour of size? Then again, vectors have always had a length. But then again, an array with a base type of T has (in Chapel) an eltType of T. We probably should be aiming for consistency rather than using names that map directly to reality.

mppf commented 3 years ago

Does any other language have a high level concept that allows direct access to objects which can map directly to a SIMD or vector register?

This seems like a question that you (or anybody) could answer. I am not able to investigate this question right now because it would be too distracting from my other goals.

Not a lot of action/feedback happening here.

I already mentioned my concerns that having programmers write code specific to one vector length won't work for the newer ARM vector extensions. It seems to me that it's better if we can avoid code specific to vector length. The way that I know how to do that is to focus on vectorizing loops (and trying to be clear about when that is possible) rather than on creating a vector type. However it might be possible to solve by creating a vector type.

And I am curious about what we need to such data structures help us to exploit GPUs.

I am currently trying to figure out how to both vectorize and target GPUs in a way that is understandable to Chapel programmers. I do not feel I have all the answers there yet but I am pretty sure that it needs to look more like creating vectorizeable/GPU-kernel-izeable loops rather than data types with a fixed vector width.

What I would (personally) need to be convinced that we need vector types is:

damianmoz commented 3 years ago

I was hoping for some input from somebody else. As you note, your goals are (rightly) elsewhere. To answer my own question about high level concepts,

This seems like a question that you (or anybody) could answer.

My limited experience is that these do not exist, unless you count the __attribute__ ((vector_size(N))) seen in some C compilers and some of the C++ classes built on top of those. But none are what I would call complete, although many satisfy the design aims of their author. You get a hint of fundamental vector operations in the 1980s VectorC, where vector operations were done directly to (and from) memory; there is always the valarray container in C++, but that seems to have lost its way. And there is

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0193r0.html

which is the SIMD extension proposed for C++, but that seems to be moving very slowly. The initial ideas for it, which came from Boost, were rejected about 6 years ago. It seems a non-trivial problem, or at least getting agreement on one certainly is. There is Krentz's thesis, although that is trying to be far more complete than what I am suggesting.

Being starved of time today, I will for now respond only to your question:

a strategy for how a Chapel user could use the vector types without knowing the vector width at compile time

I will address the rest of your reply over the coming week. I notice the overlap with the recently raised #16403 issue although I question my ability to make any meaningful input to that conversation. I did look at #16404 to which I can at least add my 2c and feel like I have made a tiny contribution.

For your last question, but only in the context of my example quoted above, what you want can be achieved by changing that param to const. Or have I misunderstood your question?

mppf commented 3 years ago

I was hoping for some input from somebody else.

Ah. In that event it's probably going to work better to post something on a mailing list (or Gitter or the new Discourse site) because only the people involved already in this issue are likely to be notified when a new post shows up.

damianmoz commented 10 months ago

This seems to have died. Or is it just a case of lots of other, more pressing things.

bradcray commented 9 months ago

The latter.