fdkong opened 9 years ago
Suppose we have two threads. We could partition the buffer like this:
|---------- thread 0 -----------|----------- thread 1 ------------|
That's what OpenMP will do by default. But we could also interleave them:
|- 0 -|- 1 -|- 0 -|- 1 -|- 0 -|- 1 -|- 0 -|- 1 -|- 0 -|- 1 -|- 0 -|- 1 -|
If the block size is less than a cache line, you'll likely see false sharing. If the block size is less than a page and you run on a NUMA machine, you'll pull pages from only one memory bus.
Thanks, Jed.
Does false sharing happen because threads try to write to the same cache line when the block size is small? Is that right?
Fande Kong notifications@github.com writes:
Does false sharing happen because threads try to write to the same cache line when the block size is small? Is that right?
Yes.
If N=10 and b=2, we get the mapping 0 2 4 6 8 1 3 5 7 9. If we run two threads on this range, it makes sense: the first thread runs on 0 2 4 6 8 and the second on 1 3 5 7 9. But if we increase b to 5, we get another mapping, 0 5 1 6 2 7 3 8 4 9, and with two threads the first runs on 0 5 1 6 2 while the second runs on 7 3 8 4 9. How should we understand these two sequences? The second one looks weird. What we are trying to do is make all threads work within a block. Is that right?
And if we think of b as the number of threads, all these situations make sense. So what is b?
I would say your example is b=1, t=5 (where t is the number of threads). The b=2, t=3 block-cyclic permutation would give a partition like (0 1) (6 7) | (2 3) (8 9) | (4 5) (10 11).
OpenMP will automatically partition the iteration range i ∈ {0..N-1}. My suggestion was to permute this index i via a function j(i) to distribute work differently. If the block size b is smaller than a cache line, each cache line will have a contribution from more than one thread, likely reducing performance.
I am afraid I am not clear on what b is. Also, how are we supposed to create false sharing? steam.c doesn't seem to have any functionality for it.
b is the block size. I suggested a permutation on the index space that you can use to make each thread responsible for a different part than the default mapping. False sharing occurs when two threads attempt to access the same cache line. If you choose a decomposition that causes multiple threads to write parts of the same cache line, you'll likely see that contention. (I'm envisioning a plot of performance versus block size, where block size ranges from 1 element (8 bytes for double) to at least 4 KiB, the default "small" page size.)
Hi Jed,
What do you mean by interleaving threads? From my understanding, we make the threads run sequentially at a fine-grained level so that cache access conflicts are reduced. 'N' is the number of threads and 'b' is the block size; we run the threads in the order 0, 1, ..., N-1 or in another mapped order j(i). Please correct me! Also, what should we use to measure the program's performance: compute time, cache hit rate, or something else?