How a user runs an Emu function

calebwin / emu

The write-once-run-anywhere GPGPU library for Rust

https://calebwin.github.io/emu

MIT License


How a user runs an Emu function #11

Closed calebwin closed 5 years ago

calebwin commented 5 years ago

A function in Emu operates on a "work-item" (work-item is a term OpenCL uses; I loosely use it here but we can refer to it differently if we come up with a better name).

multiply(global_buffer [f32], scalar f32) {
    global_buffer[get_global_id(0)] *= scalar;
}

With the above function, a work-item corresponds to a particular index in global_buffer. So the work can be thought of as a 1d grid with dimensions equal to the length of global_buffer. Let's consider another function.

multiply_matrices(m i32, n i32, k i32, global_a [f32], global_b [f32], global_c [f32]) {
    let row: i32 = get_global_id(0);
    let col: i32 = get_global_id(1);

    let acc: f32 = 0.0;

    for i in 0..k {
        acc += global_a[i*m + row] * global_b[col*k + i];
    }

    global_c[col * m + row] = acc;
}

When this function is run, a work-item corresponds to a pair of indices: a row index into global_a and a column index into global_b. So the work in this case is a 2d grid of m by n work-items, one for each element of global_c.
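Reading the indexing in the function above, row ranges over 0..m and col over 0..n, so the 2d grid can be pictured with a CPU-only sketch in plain Rust. This is not Emu's runtime: `run_2d` and `multiply_matrices_item` are hypothetical names, the `row` and `col` parameters stand in for get_global_id(0) and get_global_id(1), and the column-major layout is an assumption inferred from the index arithmetic.

```rust
/// Kernel body for one work-item. `row` and `col` stand in for
/// get_global_id(0) and get_global_id(1). Assumed column-major layout:
/// a is m x k, b is k x n, c is m x n.
fn multiply_matrices_item(
    m: usize,
    k: usize,
    a: &[f32],
    b: &[f32],
    c: &mut [f32],
    row: usize,
    col: usize,
) {
    let mut acc: f32 = 0.0;
    for i in 0..k {
        acc += a[i * m + row] * b[col * k + i];
    }
    c[col * m + row] = acc;
}

/// Hypothetical launcher: visits every (row, col) pair of an m x n grid.
/// A GPU would run these work-items in parallel; running them one by one
/// gives the same result because each writes a different element of c.
fn run_2d(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    for col in 0..n {
        for row in 0..m {
            multiply_matrices_item(m, k, a, b, c, row, col);
        }
    }
}

fn main() {
    // 2 x 2 identity times [[1, 2], [3, 4]], both stored column-major.
    let a: Vec<f32> = vec![1.0, 0.0, 0.0, 1.0];
    let b: Vec<f32> = vec![1.0, 3.0, 2.0, 4.0];
    let mut c: Vec<f32> = vec![0.0; 4];
    run_2d(2, 2, 2, &a, &b, &mut c);
    println!("{:?}", c); // [1.0, 3.0, 2.0, 4.0]
}
```

The grid size here comes from m and n rather than from the length of any single buffer, which is why a 1d launch keyed on one parameter's length does not fit this function.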

Now here's the thing - both of these functions can be ultimately run with a binding to OpenCL. But only the first function can be run with the build! macro. This is because functions you intend to run with the build! macro operate on 1d grids of work where the dimension is by default the length of the first parameter to the function.
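As a rough illustration of the 1d case, here is a CPU-only sketch in plain Rust of what "one work-item per element of the first parameter" means. It is not what build! actually generates: `run_1d` and `multiply_item` are hypothetical names, and the `global_id` parameter stands in for get_global_id(0).

```rust
/// Kernel body for one work-item; `global_id` stands in for get_global_id(0).
fn multiply_item(global_buffer: &mut [f32], scalar: f32, global_id: usize) {
    global_buffer[global_id] *= scalar;
}

/// Hypothetical launcher: the grid is 1d and its size is the length of the
/// first parameter. A GPU would run the work-items in parallel; running them
/// sequentially gives the same result because each touches its own element.
fn run_1d(global_buffer: &mut [f32], scalar: f32) {
    for global_id in 0..global_buffer.len() {
        multiply_item(global_buffer, scalar, global_id);
    }
}

fn main() {
    let mut data: Vec<f32> = vec![1.0, 2.0, 3.0];
    run_1d(&mut data, 2.0);
    println!("{:?}", data); // [2.0, 4.0, 6.0]
}
```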

This is an important thing to note and I think it can help us answer the following key questions.

How should Emu functions be ultimately called by a user?
How should a user be using get_global_id()?
A user has a bunch of data - how do we support mapping and filtering and reducing?

calebwin commented 5 years ago

But I think the one main question we need to answer is - once you have a function written in Emu,

How do you run the Emu function?

You have a bunch of data stored in vectors and you want to take this function and somehow run it on the data. What would be the most sensible way that you should be able to do that?

calebwin commented 5 years ago

emu! {
    multiply(data [f32], coeff f32) {
        data[...] *= coeff;
    }
}

build! {
    fn multiply(data: &mut Vec<f32>, coeff: &f32);
}

fn main() {
    let mut data = vec![9.8, 3.8, 2.9, 4.6, 4.8];

    multiply(&mut data, &2.0);

    println!("{:?}", data);
}

emu! {
    multiply_matrices(rows [i32], cols [i32], m i32, n i32, k i32, a [f32], b [f32], c [f32]) {
        let row: i32 = rows[...];
        let col: i32 = cols[...];

        let acc: f32 = 0.0;

        for i in 0..k {
            acc += a[i*m + row] * b[col*k + i];
        }

        c[col * m + row] = acc;
    }
}

build! {
    fn multiply_matrices(rows: &mut Vec<i32>, cols: &mut Vec<i32>, m: &i32, n: &i32, k: &i32, a: &mut Vec<f32>, b: &mut Vec<f32>, c: &mut Vec<f32>);
}

fn main() {
    let m: i32 = 3;
    let n: i32 = 3;
    let k: i32 = 3;

    let mut a = vec![3.7, 4.5, 9.0, 3.7, 4.5, 9.0, 3.7, 4.5, 9.0];
    let mut b = vec![3.7, 4.5, 9.0, 3.7, 4.5, 9.0, 3.7, 4.5, 9.0];
    let mut c = vec![0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0];

    multiply_matrices(&mut vec![0, 1, 2], &mut vec![0, 1, 2], &m, &n, &k, &mut a, &mut b, &mut c).unwrap();

    println!("{:?}", c);
}

This last function requires vectors of indices to be passed in. However, it should be possible for emu! to optimize them away. In the add function below, for example, emu! could remove idx and replace i with get_global_id(0). Information about this optimization can then be passed on to the build! macro, which can make sure it doesn't transfer idx to the GPU, and LLVM should then be able to eliminate the vector of indices, since it is no longer used anywhere in the Rust program.

emu! {
    add(a [f32], b [f32], c [f32], idx [i32]) {
        let i: i32 = idx[...];
        c[i] = a[i] + b[i];
    }
}
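The proposed optimization rests on one assumption: idx holds the identity sequence 0, 1, ..., n-1, so idx[get_global_id(0)] is just get_global_id(0). A CPU-only sketch in plain Rust (hypothetical function names, not Emu output) shows the two versions agreeing under that assumption:

```rust
/// Version with an explicit index array, mirroring `let i = idx[...]`.
/// One work-item per element of idx.
fn add_with_idx(a: &[f32], b: &[f32], c: &mut [f32], idx: &[i32]) {
    for global_id in 0..idx.len() {
        let i = idx[global_id] as usize;
        c[i] = a[i] + b[i];
    }
}

/// Version after the proposed rewrite: idx is gone and the work-item id is
/// used directly. Only equivalent if idx was the identity sequence.
fn add_with_global_id(a: &[f32], b: &[f32], c: &mut [f32]) {
    for global_id in 0..c.len() {
        c[global_id] = a[global_id] + b[global_id];
    }
}

fn main() {
    let a: Vec<f32> = vec![1.0, 2.0, 3.0];
    let b: Vec<f32> = vec![10.0, 20.0, 30.0];
    let idx: Vec<i32> = vec![0, 1, 2]; // assumed identity sequence
    let mut c1 = vec![0.0_f32; 3];
    let mut c2 = vec![0.0_f32; 3];
    add_with_idx(&a, &b, &mut c1, &idx);
    add_with_global_id(&a, &b, &mut c2);
    assert_eq!(c1, c2);
    println!("{:?}", c1); // [11.0, 22.0, 33.0]
}
```

If a caller passed any other contents in idx (a permutation, a subset), the rewrite would change behavior, so the macro would need some way to know the array is the identity.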

Ogeon commented 5 years ago

Hi! I was just looking at this project the other day and I think it looks very interesting! I just started wondering if something like "named holes", or holes in variables, would be an alternative to the index arrays. Basically extending the syntax to allow this:

emu! {
    add(a: [f32], b: [f32], c: [f32]) {
        let i: i32 = ...; // expands to i = get_global_id(0)
        c[i] = a[i] + b[i];
    }
}

or even something like this:

emu! {
    add(a: [f32], b: [f32], c: [f32]) {
        c[...i] = a[...i] + b[...i]; // implicitly defines i as get_global_id(0)
    }
}

I'm not sure if the last one would be too magic, though. I think the benefit for keeping it as some kind of special syntax item is that you can make it compile to whatever is best, rather than having to follow the semantics of something else, like with an array.

calebwin commented 5 years ago

These are good ideas.

I think holes are a neat idea but I want to make the holes system more structured and understandable for a user. Right now, the following...

function multiply(x [f32]) {
    x[..] *= x[..];
}

...compiles to...

function multiply(x [f32]) {
    x[get_global_id(0)] *= x[get_global_id(0)];
}

...which means that the program only has 1 hole. But it could also be compiled to...

function multiply(x [f32]) {
    x[get_global_id(0)] *= x[get_global_id(1)];
}

...which means that the program has 2 holes. Which is correct? This is another issue we need to address, alongside the one you bring up, and named holes or holes in variables are a potential solution. One caveat with holes in variables: wherever the variable is used, it must index arrays of the same length. In the example you give...

emu! {
    add(a: [f32], b: [f32], c: [f32]) {
        let i: i32 = ...; // expands to i = get_global_id(0)
        c[i] = a[i] + b[i];
    }
}

...Emu needs to infer the size of the 0th dimension of the grid of work-items, and it could be the length of a, b, or c. Since i is used to index all of them, they must all be the same length.

Ogeon commented 5 years ago

Good points. I interpreted the holes to be independent from each other, but maybe I just didn't read closely enough. Either way, the current syntax doesn't make it clear if they are independent or not, as you say. My second example makes it more clear and has the hole attached to the arrays. Not saying it's the best syntax, or trying to push it in any way. It has drawbacks such as i coming out of nowhere.
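To make the ambiguity concrete, here is a small plain-Rust sketch that lists which (destination, source) index pairs each reading of x[..] *= x[..] would produce for an array of length n. The function names are hypothetical and nothing here is Emu code; each pair (d, s) stands for one work-item performing x[d] *= x[s].

```rust
/// Reading 1: both holes share get_global_id(0), so there are n work-items
/// and each one pairs an element with itself.
fn shared_hole_pairs(n: usize) -> Vec<(usize, usize)> {
    (0..n).map(|i| (i, i)).collect()
}

/// Reading 2: the holes are independent (get_global_id(0) and
/// get_global_id(1)), so there are n * n work-items.
fn independent_hole_pairs(n: usize) -> Vec<(usize, usize)> {
    (0..n)
        .flat_map(|i| (0..n).map(move |j| (i, j)))
        .collect()
}

fn main() {
    println!("{:?}", shared_hole_pairs(2)); // [(0, 0), (1, 1)]
    println!("{:?}", independent_hole_pairs(2)); // [(0, 0), (0, 1), (1, 0), (1, 1)]
}
```

Under the second reading several work-items write to the same destination element, so the two readings differ in more than just grid size.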

As for inferring the size, it could be defined to take the length of the shortest array, just like the .zip iterator does, or it could even be a runtime error to not make sure the arrays have equal length. But I don't know how that would work if add in the example is called from another emu function...
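The two policies mentioned above can be sketched in plain Rust. This is only an illustration of the choice, not anything Emu implements: `grid_len_shortest` and `grid_len_strict` are hypothetical helpers that take the lengths of the arrays sharing an index.

```rust
/// Policy 1: take the shortest length, the way Iterator::zip stops at the
/// shorter of its two inputs. Returns 0 when no lengths are given.
fn grid_len_shortest(lens: &[usize]) -> usize {
    lens.iter().copied().min().unwrap_or(0)
}

/// Policy 2: treat mismatched lengths as a runtime error.
fn grid_len_strict(lens: &[usize]) -> Result<usize, String> {
    match lens.split_first() {
        None => Ok(0),
        Some((&first, rest)) => {
            if rest.iter().all(|&len| len == first) {
                Ok(first)
            } else {
                Err(format!("arrays have different lengths: {:?}", lens))
            }
        }
    }
}

fn main() {
    let a = [1.0_f32, 2.0, 3.0];
    let b = [10.0_f32, 20.0];
    // zip yields only as many pairs as the shorter input has elements.
    println!("{}", a.iter().zip(b.iter()).count()); // 2
    println!("{}", grid_len_shortest(&[a.len(), b.len()])); // 2
    println!("{}", grid_len_strict(&[a.len(), b.len()]).is_err()); // true
}
```

The first policy silently skips trailing elements; the second surfaces the mismatch to the caller, at the cost of a check before every launch.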

I guess another alternative would be to introduce some kind of length parameter in the array type. I'm borrowing the Rust syntax here, but what if this would be possible:

emu! {
    add<const N: i32>(a: [f32; N], b: [f32; N], c: [f32; N]) {
        let i: i32 = ...N; // Not sure about this syntax... Too close to ranges? maybe hole(N) instead?
        c[i] = a[i] + b[i]; // Would be a compile error if i has the wrong length
    }
}

It makes the compiler much more complex, I guess, and it's maybe not the direction you want to go. It would probably solve the enforcing/assuming equal length problem, but maybe not the whole inferring length problem. I don't know enough about how it works to say. I'm just a passerby who had an idea to share. 🙂

calebwin commented 5 years ago

I appreciate all of your input. I think I have an idea for how holes should work.

All holes should be independent. So if you want to square all elements in an array or you want to add two arrays, you use a function for enumerating all indices of an array.

function multiply(data [f32], coeff f32) {
    data[..] *= coeff;
}

function square(data [f32]) {
    let i: i32 = enumerate(data)[..];
    data[i] *= data[i];
}

function add(a [f32], b [f32]) {
    let i: i32 = enumerate(a)[..];
    a[i] += b[i];
}

The thing I like about this is that for each of these functions there is clearly only 1 hole. And there really should be only 1 hole. Thoughts?
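One way to read the enumerate proposal, sketched on the CPU in plain Rust: enumerate(data) yields the indices 0..len, and the single hole ranges over that index list. This is an interpretation of the comment above, not Emu's implementation; the Rust `enumerate` helper and the sequential loop standing in for the hole are assumptions.

```rust
/// Hypothetical stand-in for Emu's proposed enumerate: all indices of `data`.
fn enumerate(data: &[f32]) -> Vec<i32> {
    (0..data.len() as i32).collect()
}

/// CPU reading of `square`: the loop plays the role of the one hole,
/// with one iteration per work-item.
fn square(data: &mut [f32]) {
    let indices = enumerate(data);
    for global_id in 0..indices.len() {
        let i = indices[global_id] as usize; // let i: i32 = enumerate(data)[..];
        data[i] *= data[i];
    }
}

fn main() {
    let mut data: Vec<f32> = vec![1.0, 2.0, 3.0];
    square(&mut data);
    println!("{:?}", data); // [1.0, 4.0, 9.0]
}
```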

Ogeon commented 5 years ago

Looks like a good middle ground, as far as I can tell. I would have to dive more into this to say how it works in practice, but it seems better than the index arrays (fewer mistakes and fewer function parameters, etc.) while still being essentially the same.

calebwin commented 5 years ago

OK. Emu v0.3.0 completely gets rid of "holes" or "enumerate" and replaces it all with simple for loops and pure Rust.

Closing!