fdkong opened 9 years ago
Suppose we have two threads. We could partition the buffer like this:
|---------- thread 0 -----------|----------- thread 1 ------------|
That's what OpenMP will do by default. But we could also interleave them:
|- 0 -|- 1 -|- 0 -|- 1 -|- 0 -|- 1 -|- 0 -|- 1 -|- 0 -|- 1 -|- 0 -|- 1 -|
If the block size is less than a cache line, you'll likely see false sharing. If the block size is less than a page and you run on a NUMA machine, you'll pull pages from only one memory bus.
Thanks, Jed.
Does false sharing happen because threads try to write to the same cache line when the block size is small? Is that right?
Fande Kong notifications@github.com writes:
Does false sharing happen because threads try to write to the same cache line when the block size is small? Is that right?
Yes.
If N=10 and b=2, we get the mapping 0 2 4 6 8 1 3 5 7 9. If we run two threads on this range, it makes sense: the first thread runs on 0 2 4 6 8 and the second on 1 3 5 7 9. But if we increase b to 5, we get another mapping, 0 5 1 6 2 7 3 8 4 9, and with two threads the first runs on 0 5 1 6 2 while the second runs on 7 3 8 4 9. How should we understand these two sequences? The second one looks weird. What we are trying to do is make all threads work within a block. Is that right?
And if we think of b as the number of threads, all these situations make sense. So what is b?
I would say your example is b=1, t=5 (where t is the number of threads). The b=2, t=3 block-cyclic permutation would give a partition like (0 1) (6 7) | (2 3) (8 9) | (4 5) (10 11).
OpenMP will automatically partition the iteration range i ∈ {0..N-1}. My suggestion was to permute this index i via a function j(i) to distribute work differently. If the block size b is smaller than a cache line, each cache line will have a contribution from more than one thread, likely reducing performance.
I am afraid I am not clear on what b is. Also, how are we supposed to create false sharing? steam.c doesn't seem to have any functionality for it.
b is the block size. I suggested a permutation on the index space that you can use to make each thread responsible for a different part than the default mapping. False sharing occurs when two threads attempt to access the same cache line. If you choose a decomposition that causes multiple threads to write parts of the same cache line, you'll likely see that contention. (I'm envisioning a plot of performance versus block size, where block size ranges from 1 element (8 bytes for double) to at least 4 KiB, the default "small" page size.)
Hi Jed,
What do you mean by interleaving threads? From my understanding, we make the threads run sequentially at a fine-grained level so that cache access conflicts are reduced. 'N' is the number of threads and 'b' is the block size; we run the threads in the order 0, 1, ..., N-1 or in another mapped order j(i). Please correct me! Also, what should we use to measure the program's performance: compute time, cache hit rate, or something else?