Thanks @andy-thomason ! Yep, I'll mirror the original comments in the C++ version.
Consider passing `target-cpu=native` to rustc (this is similar to `-march=native`). You can do this in the `build.rustflags` option in a `.cargo/config.toml` file (see https://doc.rust-lang.org/cargo/reference/config.html).
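A minimal `.cargo/config.toml` along those lines might look like this (one common spelling of the flag):

```toml
# .cargo/config.toml — ask rustc to target the host CPU,
# analogous to -march=native in C/C++ compilers.
[build]
rustflags = ["-C", "target-cpu=native"]
```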
EDIT: You may also want to run `cargo fmt`.
There's an ongoing issue with NUMA awareness. Currently looking at possible solutions; don't merge yet.
It would be interesting to see if Rayon gets NUMA support; they would need to split the thread pool (more a crossbeam thing).
Sorry, turns out I'd forgotten I'd stashed a big chunk of work from a while ago which contains a Crossbeam version and flags for pinning/malloc.
With those commits, the suggestions for rustfmt and `target-cpu=native` have also been addressed, thanks @64!
@andy-thomason The new Crossbeam version uses mutable chunks for each thread; I wonder if there's a more idiomatic Rust way of doing it.
@tomdeakin I've cleaned everything up and added CI for it. You might want to skim through it again; there's a standalone README as well.
You might want to try out the good old thread pool + atomic variable scheduler.
The problem with crossbeam/rayon is that they tend to use disjoint sections of memory, which puts a heavy load on the memory controller shared amongst all the threads. They are also quite bulky, but very good when threads are blocked.
```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

fn main() {
    // Shared counter: each thread atomically claims the next chunk index.
    let next_work_item = Arc::new(AtomicUsize::new(0));
    let chunk_size = 1024;
    let job_size = 1235678;
    let threads = (0..8)
        .map(|_tid| {
            let next_work_item = next_work_item.clone();
            std::thread::spawn(move || loop {
                let work_item = next_work_item.fetch_add(1, Ordering::Acquire);
                let imin = work_item * chunk_size;
                if imin > job_size {
                    break;
                }
                let imax = (imin + chunk_size).min(job_size);
                for _i in imin..imax {
                    // do something.
                }
            })
        })
        .collect::<Vec<_>>();
    for t in threads.into_iter() {
        t.join().unwrap();
    }
}
```
Hm, I thought Rust uses malloc by default anyway?
> Hm, I thought Rust uses malloc by default anyway?
I must have been living under a rock! I thought Rust might still use jemalloc in certain cases (one of the crates brought in `jemallocator`, which is probably why I thought that). The malloc option/experiment is mainly there to match what the native C version would do, that is, to prevent Rust from touching that uninitialised memory in any way.
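For reference, a minimal sketch (not from this PR): opting into jemalloc is something a crate does explicitly via `#[global_allocator]`, e.g. with the `jemallocator` crate; without it, Rust uses the system allocator (i.e. malloc).

```rust
// Sketch: switching the global allocator to jemalloc via the
// jemallocator crate. Without this, Rust defaults to std::alloc::System.
#[global_allocator]
static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;

fn main() {
    // All heap allocations below now go through jemalloc.
    let v = vec![0u8; 1024];
    println!("{}", v.len());
}
```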
> You might want to try out the good old thread pool + atomic variable scheduler. […]
I'll give this a go.
Reviewed, and will check the last suggestion.
@andy-thomason If I understand `std::thread::spawn` correctly, then my data must be in the form of some `Arc<Mutex<T>>`, and since we're working on chunks, I came up with something like this:
```rust
use std::sync::{Arc, Mutex};

fn main() {
    let threads = 2;
    let xs = vec![1, 2, 3, 4];
    // chunks() takes a chunk *size*, so divide the length by the thread
    // count; each chunk is then copied into its own Arc<Mutex<Vec<_>>>.
    let cs = xs
        .chunks(xs.len() / threads)
        .map(|x| Arc::new(Mutex::new(x.to_vec())))
        .collect::<Vec<_>>();
    let ts = (0..threads)
        .map(|t| {
            let tc = Arc::clone(&cs[t]);
            std::thread::spawn(move || {
                let mut data = tc.lock().unwrap();
                for v in data.iter_mut() {
                    *v = 0;
                }
            })
        })
        .collect::<Vec<_>>();
    for t in ts.into_iter() {
        t.join().unwrap();
    }
}
```
This seems quite verbose compared to crossbeam's `scope`. There's also an unavoidable heap allocation unless the vectors are `'static`.
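For comparison, a minimal sketch of the same chunked zeroing with crossbeam's scoped threads (assuming `crossbeam` as a dependency) might look like this; the scope joins all threads before `xs` goes out of scope, so plain mutable borrows suffice:

```rust
fn main() {
    let mut xs = vec![1, 2, 3, 4];
    // Scoped threads may borrow from the enclosing stack frame, so each
    // thread can take a disjoint &mut chunk with no Arc/Mutex involved.
    crossbeam::thread::scope(|s| {
        for chunk in xs.chunks_mut(2) {
            s.spawn(move |_| {
                for v in chunk.iter_mut() {
                    *v = 0;
                }
            });
        }
    })
    .unwrap();
    assert!(xs.iter().all(|&v| v == 0));
}
```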
Does `crossbeam::thread::scope` actually have control over the memory that gets captured?
`crossbeam::thread::scope` uses pointers and `unsafe impl Send/Sync` internally. You can do the same using the standard library, but if you are just starting with Rust then use crossbeam.
`crossbeam::thread::scope` is externally safe because the tasks terminate before the borrowed references' lifetimes end, whereas `std::thread` must be joined separately, so a reference may be left dangling after the thread call.
You can't safely send a reference across a thread boundary, but you can share a `Vec`, for example, or make your own wrapper which implements `Send` and `Sync`.
I've made an example of sharing a mutable slice with the standard library here:
https://github.com/atomicincrement/multithread-std/blob/main/src/main.rs
Check out the Rustonomicon.
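The pattern in that example boils down to something like the following sketch (hypothetical names, condensing the idea rather than reproducing the linked code): a `Copy` handle around a raw pointer, unsafely marked `Send`/`Sync`, where safety rests on each thread touching only a disjoint range and on joining before the buffer is freed.

```rust
// Sketch: a hand-rolled wrapper that lets a raw pointer cross thread
// boundaries. Hypothetical names; safety is the caller's responsibility:
// threads must use disjoint ranges and be joined before the buffer dies.
#[derive(Clone, Copy)]
struct SharedSlice {
    ptr: *mut f64,
    len: usize,
}

unsafe impl Send for SharedSlice {}
unsafe impl Sync for SharedSlice {}

impl SharedSlice {
    /// Reborrow a sub-range as a mutable slice.
    /// Callers must guarantee the ranges handed to different threads
    /// never overlap, otherwise this is a data race.
    unsafe fn range_mut(&self, from: usize, to: usize) -> &mut [f64] {
        assert!(from <= to && to <= self.len);
        std::slice::from_raw_parts_mut(self.ptr.add(from), to - from)
    }
}
```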
Note that if you care about NUMA and other things then you will need to work a bit harder than just using rayon. But rayon and crossbeam are still very good generalised libraries.
@andy-thomason Thanks for that, I was able to implement the unsafe version in the latest commit using your example.
I've rerun some of the benchmarks with different combinations of the new options (`--init`, `--pin`, etc.). Here are the results on a dual-socket Xeon machine (`--init` corresponds to `alloc` in the chart):
The results here are very similar to what we're getting for Julia; the pinning doesn't consider the topology of the NUMA nodes and just pins threads based on a linear thread id. If we set the OMP version to use close placement, the results will be similar to Rust or Julia.
There are some weird performance drops for Arc when `--pin` and `--init` are set; I'll have to look into that later.
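For reference, linear-id pinning of the kind described above might look like this sketch using the `core_affinity` crate (an assumption; the PR's actual pinning mechanism may differ):

```rust
// Sketch: pin one worker per core, in linear core-id order. As noted
// above, this ignores NUMA topology entirely.
fn main() {
    let core_ids = core_affinity::get_core_ids().unwrap();
    let handles: Vec<_> = core_ids
        .into_iter()
        .map(|id| {
            std::thread::spawn(move || {
                // Bind this thread to core `id` before doing any work,
                // so first-touch allocations land on the local node.
                core_affinity::set_for_current(id);
                // ... worker body ...
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```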
In the end, what did the trick was a combination of manual thread pinning and leaving the `Vec` uninitialised:
```rust
// Reserve capacity but skip zero-initialisation, so the pages are first
// touched by the pinned worker threads (with_capacity_in is nightly-only).
let mut xs = Vec::with_capacity_in(size, allocator);
unsafe {
    // Safety: every element is written before it is ever read.
    xs.set_len(size);
}
```
It's also quite interesting that the Unsafe implementation achieves the same performance as the uninitialised Arc and Crossbeam implementations.
I think this is ready for merge unless there's anything that stands out (cc @andy-thomason @64). We'll be doing more experiments on Rust's performance once this gets merged.
This PR adds a standalone Rust implementation of the BabelStream benchmark and partially addresses #78.
The supported program arguments and output format should be identical to the C++ version's. Parallelism is implemented using Rayon; a single-threaded version is also implemented but not currently used.
Support for platforms other than CPU will be added in a separate PR.
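For flavour, a BabelStream-style triad kernel with Rayon might look something like this sketch (illustrative only; the PR's actual kernels live in the linked source):

```rust
use rayon::prelude::*;

// Sketch of the STREAM triad kernel, a[i] = b[i] + scalar * c[i],
// parallelised with Rayon's data-parallel iterators.
fn triad(a: &mut [f64], b: &[f64], c: &[f64], scalar: f64) {
    a.par_iter_mut()
        .zip(b.par_iter().zip(c.par_iter()))
        .for_each(|(a, (b, c))| *a = *b + scalar * *c);
}

fn main() {
    let n = 1 << 20;
    let mut a = vec![0.0; n];
    let b = vec![1.0; n];
    let c = vec![2.0; n];
    triad(&mut a, &b, &c, 0.4);
    assert!((a[0] - 1.8).abs() < 1e-12);
}
```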