calebwin / emu

The write-once-run-anywhere GPGPU library for Rust
https://calebwin.github.io/emu
MIT License
1.59k stars 53 forks

Initial overhead #21

Closed hugues31 closed 4 years ago

hugues31 commented 5 years ago

Hello!

First, thanks for this crate and your contribution to the Rust community. It is amazingly simple to use, even for someone like me who has never touched OpenCL.

I tried running a simple benchmark program and found the results unsatisfying. I suppose the example is so trivial that the cost of initializing the OpenCL environment on each call dominates and slows down the whole function.

The program (based on your example) is:

#![feature(test)]

extern crate em;
extern crate ocl;

extern crate test;

use em::emu;

emu! {
    function logistic(x [f32]) {
        x[..] = 1 / (1 + pow(E, -x[..]));
    }

    pub fn logistic(x: &mut Vec<f32>);
}

pub fn logistic_cpu(x: &mut Vec<f32>) {
    // Apply the logistic function in place so the CPU version actually
    // mutates its input, matching the behavior of the Emu version.
    for value in x.iter_mut() {
        *value = 1.0 / (1.0 + std::f32::consts::E.powf(-*value));
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    #[bench]
    fn logistic_opencl(b: &mut Bencher) {
        let mut test_data = vec![0.9, 4.9, 4.8, 3.9, 1.3, 4.8, 9.13, -0.16, 81.20, -16.0, 0.9, 4.9, 4.8, 3.9, 1.3, 4.8, 9.13, -0.16, 81.20, -16.0, 0.9, 4.9, 4.8, 3.9, 1.3, 4.8, 9.13, -0.16, 81.20, -16.0, 0.9, 4.9, 4.8, 3.9, 1.3, 4.8, 9.13, -0.16, 81.20, -16.0, 0.9, 4.9, 4.8, 3.9, 1.3, 4.8, 9.13, -0.16];
        b.iter(|| logistic(&mut test_data));
        println!("OpenCL : {:?}", test_data);
    }

    #[bench]
    fn logistic_non_opencl(c: &mut Bencher) {
        let mut test_data = vec![0.9, 4.9, 4.8, 3.9, 1.3, 4.8, 9.13, -0.16, 81.20, -16.0, 0.9, 4.9, 4.8, 3.9, 1.3, 4.8, 9.13, -0.16, 81.20, -16.0, 0.9, 4.9, 4.8, 3.9, 1.3, 4.8, 9.13, -0.16, 81.20, -16.0, 0.9, 4.9, 4.8, 3.9, 1.3, 4.8, 9.13, -0.16, 81.20, -16.0, 0.9, 4.9, 4.8, 3.9, 1.3, 4.8, 9.13, -0.16];
        c.iter(|| logistic_cpu(&mut test_data));
        println!("non OpenCL : {:?}", test_data);
    }
}

And the results are:

test tests::logistic_non_opencl ... bench:         561 ns/iter (+/- 66)
test tests::logistic_opencl     ... bench:  72,081,552 ns/iter (+/- 4,863,815)

My initial intention was to write a recurrent network as efficiently as possible. Do you think Emu is a good choice?

Qwarctick commented 5 years ago

Same problem for me. I noticed the transfer overhead between the CPU and GPU, as mentioned in the README.md.
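
One way to check where the time actually goes is to time the setup and the launch separately with the ocl crate directly (which Emu uses under the hood). A minimal sketch, with a purely illustrative no-op kernel:

use ocl::ProQue;
use std::time::Instant;

fn main() -> ocl::Result<()> {
    let t0 = Instant::now();
    // One-time cost: platform/device discovery, context creation,
    // and compiling the kernel source.
    let pro_que = ProQue::builder()
        .src("__kernel void noop(__global float* buf) { }")
        .dims(1 << 10)
        .build()?;
    println!("setup + program build: {:?}", t0.elapsed());

    let buffer = pro_que.create_buffer::<f32>()?;
    let kernel = pro_que
        .kernel_builder("noop")
        .arg(&buffer)
        .build()?;

    let t1 = Instant::now();
    // Per-call cost: enqueueing one kernel launch (plus any transfers you add).
    unsafe { kernel.enq()?; }
    pro_que.queue().finish()?;
    println!("single launch: {:?}", t1.elapsed());

    Ok(())
}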

calebwin commented 4 years ago

Just to let y'all know, Emu 0.3.0 makes reading, writing, and launching explicit operations.

That doesn't eliminate the overhead, of course. The O(N) cost of moving data on and off the GPU will always exist, and it is not specific to Emu, which is why I'm closing this. However, if you do computationally intensive work per element, or your algorithm is O(N^2) or worse, that overhead generally becomes negligible as the data grows. What the change does mean is that the overhead is now explicit in the code: generally speaking, more code = more overhead.
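
To put rough numbers on that using the benchmark above: the CPU handles 48 elements in ~561 ns, so even if the GPU compute were completely free, a fixed ~72 ms per call would only break even at millions of elements. A quick back-of-the-envelope sketch:

fn main() {
    // From the benchmark above: ~561 ns for 48 elements on the CPU,
    // and ~72 ms per GPU call, treated here as pure fixed overhead.
    let cpu_ns_per_elem = 561.0_f64 / 48.0; // ~11.7 ns per element
    let gpu_fixed_overhead_ns = 72_000_000.0_f64;

    // Break-even point if the GPU compute itself were free
    // (a deliberately generous assumption in the GPU's favor):
    let break_even = gpu_fixed_overhead_ns / cpu_ns_per_elem;
    println!("break-even at roughly {:.0} elements", break_even); // ~6.2 million
}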

As for making the overhead explicit: it should be more or less obvious that the following introduces unnecessary overhead.

#[gpu_use(add, multiply)]
fn main() {
    let mut data = vec![0.0; 1000];

    gpu_do!(load(data));
    data = add(data, 0.5);
    gpu_do!(read(data));
    gpu_do!(load(data));
    data = multiply(data, 2.0);
    gpu_do!(read(data));

    println!("{:?}", data);
}

Instead, you can be more efficient and just do this:

#[gpu_use(add, multiply)]
fn main() {
    let mut data = vec![0.0; 1000];

    gpu_do!(load(data));
    data = add(data, 0.5);
    data = multiply(data, 2.0);
    gpu_do!(read(data));

    println!("{:?}", data);
}
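
And if you are doing something iterative like the recurrent network mentioned above, the same idea extends naturally: pay for one load and one read, and keep the data on the GPU across all the launches in between. A sketch in the same style, assuming a multiply helper like the one above:

#[gpu_use(multiply)]
fn main() {
    let mut data = vec![1.0; 1000];

    gpu_do!(load(data));
    // One load/read pair amortized over many launches: the data never
    // round-trips between CPU and GPU inside the loop.
    for _ in 0..100 {
        data = multiply(data, 1.01);
    }
    gpu_do!(read(data));

    println!("{:?}", data);
}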

I really like this: performance should feel ergonomic.