UoB-HPC / BabelStream

STREAM, for lots of devices written in many programming models

Rust implementation #95

Closed tom91136 closed 2 years ago

tom91136 commented 3 years ago

This PR adds a standalone Rust implementation of the BabelStream benchmark and partially addresses #78.

Supported program arguments and output format should be identical to the C++ version. Parallelism is implemented using Rayon; a single-threaded version is also implemented but not currently used.

Support for platforms other than CPU will be added in a separate PR.
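
For a flavour of the Rayon approach, here is a minimal sketch of the STREAM triad kernel as a parallel iterator (illustrative only, not necessarily the exact code in this PR):

use rayon::prelude::*;

// Sketch of the STREAM triad, a[i] = b[i] + scalar * c[i], with Rayon:
// each element is written by exactly one task, so the only synchronisation
// is the fork/join of the parallel iterator itself.
fn triad(a: &mut [f64], b: &[f64], c: &[f64], scalar: f64) {
    a.par_iter_mut()
        .zip(b.par_iter().zip(c.par_iter()))
        .for_each(|(a, (b, c))| *a = *b + scalar * *c);
}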

tom91136 commented 3 years ago

Thanks @andy-thomason ! Yep, I'll mirror the original comments in the C++ version.

64 commented 3 years ago

Consider passing target-cpu=native to rustc (this is similar to -march=native). You can do this in the build.rustflags option in a .cargo/config.toml file (see https://doc.rust-lang.org/cargo/reference/config.html).
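
Concretely, a minimal .cargo/config.toml for this would be something like:

# Tune codegen for the build machine's CPU (analogous to -march=native).
[build]
rustflags = ["-C", "target-cpu=native"]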

EDIT: You may also want to run cargo fmt

tom91136 commented 3 years ago

There's an ongoing issue with NUMA awareness. Currently looking at possible solutions, don't merge yet.

andy-thomason commented 3 years ago

It would be interesting to see if Rayon gets NUMA support. They would need to split the thread pool (more of a crossbeam thing).

tom91136 commented 3 years ago

Sorry, it turns out I'd forgotten about a big chunk of stashed work from a while ago containing a Crossbeam version and flags for pinning/malloc. With those commits, the suggestions for rustfmt and target-cpu=native have also been applied, thanks @64! @andy-thomason The new Crossbeam version uses mutable chunks for each thread; I wonder if there's a more idiomatic Rust way of doing it.

@tomdeakin I've cleaned everything up and added CI for it. You might want to skim through it again, there's a standalone README as well.

andy-thomason commented 3 years ago

You might want to try out the good old thread pool + atomic variable scheduler.

The problem with crossbeam/rayon is that they tend to use disjoint sections of memory, which puts a heavy load on the memory controller shared amongst all the threads. They are also quite bulky, but very good when threads are blocked.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

fn main() {
    // A shared atomic counter hands out chunk indices to the pool.
    let next_work_item = Arc::new(AtomicUsize::new(0));
    let chunk_size = 1024;
    let job_size = 1235678;

    let threads = (0..8)
        .map(|_tid| {
            let next_work_item = next_work_item.clone();
            std::thread::spawn(move || loop {
                // Claim the next chunk; fetch_add makes the claim race-free.
                let work_item = next_work_item.fetch_add(1, Ordering::Acquire);
                let imin = work_item * chunk_size;
                if imin >= job_size {
                    break;
                }
                let imax = (imin + chunk_size).min(job_size);
                for _i in imin..imax {
                    // do something.
                }
            })
        })
        .collect::<Vec<_>>();

    for t in threads {
        t.join().unwrap();
    }
}

64 commented 3 years ago

Hm, I thought Rust uses malloc by default anyway?

tom91136 commented 3 years ago

Hm, I thought Rust uses malloc by default anyway?

I must have been living under a rock! I thought Rust might still use jemalloc in certain cases (one of the crates brought in jemallocator, which is probably why I thought that). The malloc option/experiment is mainly there to match what the native C version would do, that is, to prevent Rust from touching the uninitialised memory in any way.
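
For reference, the system allocator has been Rust's default since 1.32; a crate that wants jemalloc now has to opt in explicitly, e.g. via the jemallocator crate:

// Explicit opt-in: route all heap allocations through jemalloc.
#[global_allocator]
static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;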

tom91136 commented 3 years ago

You might want to try out the good old thread pool + atomic variable scheduler. [...]

I'll give this a go.

tomdeakin commented 2 years ago

Reviewed, and will check the last suggestion.

tom91136 commented 2 years ago

@andy-thomason If I understand correctly, with std::thread::spawn my data must be in the form of some Arc<Mutex<T>>, and since we're working on chunks, I came up with something like this:

    use std::sync::{Arc, Mutex};

    let threads = 2;

    let xs = vec![1, 2, 3, 4];
    // chunks() takes the chunk length, so derive it from the thread count.
    let chunk_size = (xs.len() + threads - 1) / threads;
    let cs = xs
        .chunks(chunk_size)
        .map(|x| Arc::new(Mutex::new(x.to_vec())))
        .collect::<Vec<_>>();

    let ts = (0..threads)
        .map(|t| {
            let tc = Arc::clone(&cs[t]);
            std::thread::spawn(move || {
                let mut data = tc.lock().unwrap();
                for v in data.iter_mut() {
                    *v = 0;
                }
            })
        })
        .collect::<Vec<_>>();

    for t in ts {
        t.join().unwrap();
    }

This seems quite verbose compared to crossbeam's scope. There's also an unavoidable heap allocation unless the vectors are 'static. Does crossbeam::thread::scope actually have control over the memory that gets captured?
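
For comparison, a rough sketch of the same zeroing with scoped threads (assuming the crossbeam 0.8 scope API), which can borrow the chunks directly:

    let mut xs = vec![1, 2, 3, 4];
    let chunk_size = (xs.len() + threads - 1) / threads;
    crossbeam::thread::scope(|s| {
        // Scoped threads may borrow xs; the scope joins them before returning.
        for chunk in xs.chunks_mut(chunk_size) {
            s.spawn(move |_| {
                for v in chunk.iter_mut() {
                    *v = 0;
                }
            });
        }
    })
    .unwrap();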

andy-thomason commented 2 years ago

crossbeam::thread::scope uses pointers and unsafe impl Send/Sync internally. You can do the same using the standard library, but if you are just starting with Rust then use crossbeam.

crossbeam::thread::scope is externally safe because the tasks are guaranteed to terminate before the lifetime of the borrowed references ends, whereas std::thread must be joined separately, so the reference may be dangling after the spawn call.

You can't safely send a reference across a thread boundary, but you can send an owned Vec, for example, or make your own wrapper which implements Send and Sync.
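
As a minimal sketch of such a wrapper (hypothetical SendPtr type; soundness rests entirely on the chunks being disjoint and on joining the threads before the Vec is used again):

use std::thread;

// Hypothetical wrapper: a raw pointer + length for one disjoint chunk.
// Raw pointers are !Send, so we assert Send ourselves.
struct SendPtr(*mut f64, usize);
unsafe impl Send for SendPtr {}

fn main() {
    let mut xs = vec![0.0f64; 1024];
    let half = xs.len() / 2;
    let base = xs.as_mut_ptr();
    let parts = vec![
        SendPtr(base, half),
        SendPtr(unsafe { base.add(half) }, xs.len() - half),
    ];

    let handles = parts
        .into_iter()
        .map(|p| {
            thread::spawn(move || {
                // Rebuild a mutable slice over this thread's disjoint chunk.
                let SendPtr(ptr, len) = p;
                let chunk = unsafe { std::slice::from_raw_parts_mut(ptr, len) };
                for v in chunk.iter_mut() {
                    *v = 1.0;
                }
            })
        })
        .collect::<Vec<_>>();

    for h in handles {
        h.join().unwrap();
    }
}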

andy-thomason commented 2 years ago

I've made an example of sharing a mutable slice with the standard library here:

https://github.com/atomicincrement/multithread-std/blob/main/src/main.rs

Check out the Rustonomicon.

Note that if you care about NUMA and other things, you will need to work a bit harder than using rayon. But rayon/crossbeam are still very good general-purpose libraries.

tom91136 commented 2 years ago

@andy-thomason Thanks for that, I was able to implement the unsafe version in the latest commit using your example. I've rerun some of the benchmarks with different combinations of the new options (--init, --pin, etc.); here are the results on a dual-socket Xeon machine: [benchmark chart]

(--init corresponds to alloc in the chart)

The results here are very similar to what we're getting for Julia: the pinning doesn't consider the topology of the NUMA nodes and just pins threads based on a linear thread id. If we set the OMP version to use close placement, the results are similar to Rust or Julia.

There are some weird performance drops for Arc when --pin and --init are set; I'll have to look into that later.

In the end, what did the trick is a combination of manual thread pinning and leaving the Vec uninitialised:

// with_capacity_in requires the nightly allocator_api.
let mut xs = Vec::with_capacity_in(size, allocator);
unsafe {
    // The elements stay uninitialised, so no pages are faulted in here;
    // the first touch happens later on the (pinned) worker threads.
    xs.set_len(size);
}
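
To spell out the first-touch side, here is a hypothetical sketch (assuming the core_affinity crate for pinning, not necessarily what the PR uses): each pinned thread initialises its own chunk, so the pages fault in on that thread's NUMA node.

// Sketch: pin, then first-touch, so pages land on the local NUMA node.
fn first_touch(xs: &mut [f64], nthreads: usize) {
    let chunk = (xs.len() + nthreads - 1) / nthreads;
    let core_ids = core_affinity::get_core_ids().unwrap();
    crossbeam::thread::scope(|s| {
        for (t, part) in xs.chunks_mut(chunk).enumerate() {
            let id = core_ids[t % core_ids.len()];
            s.spawn(move |_| {
                core_affinity::set_for_current(id); // pin before touching
                for v in part.iter_mut() {
                    *v = 0.0; // the first write faults the page locally
                }
            });
        }
    })
    .unwrap();
}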

It's also quite interesting that the uninitialised Arc, Crossbeam, and Unsafe implementations all achieve the same performance.

I think this is ready for merge unless there's anything that stands out (cc @andy-thomason @64). We'll be doing more experiments on Rust's performance as this gets merged.