CUBoulder-HPCPerfAnalysis / memory

Experiments with memory performance
MIT License

Changed stream.c to use the block cyclic algorithm for the dot product; ... #14

Closed by dmdu 9 years ago

dmdu commented 9 years ago

...added a script that ran various experiments and gathered data, plus an R script for graphing.

Gathered the results on a 4-socket high-memory (1TB RAM) machine.

Username, Machinename, CPU name, CPU GHz, CPU Cores, CPU Cores used, L1 cache (MB), L2 cache (MB), L3 cache (MB), Array Length (MB)

dmitry, dav01, Xeon(R) CPU E7- 4870, 2.40, 10, Up to 32, 0.64, 2.56, 30.72, 76.3

I was surprised to see a spike for Block=1 and ThreadCount=32. I expected "false sharing" to show up there as low performance. If there is a bug in my code, I was not able to find it. I would appreciate any advice on what I can try next to investigate and troubleshoot.

jedbrown commented 9 years ago

I'm guessing you mean SKIP=1, in which case the mapping is the identity (i.e., each thread has its own contiguous region). Also note that Dot is a read-only kernel, so you never have writes, except to the (private) counter variables. The typical acute "false sharing" scenario is when counters are placed on the same cache line. To suffer from this phenomenon, you need multiple threads to write to (different parts of) the same cache line.
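To make the cache-line point concrete, here is a minimal layout sketch in C. It is not taken from stream.c: the names `packed_acc`, `padded_acc`, `same_cache_line`, the 64-byte `CACHE_LINE`, and `NTHREADS` are all hypothetical, and the line size is an assumption that is typical for x86 but should be checked on the target machine. It only shows which per-thread accumulators would land on the same line; it is not a benchmark.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes (hypothetical) */
#define NTHREADS 4    /* hypothetical thread count for illustration */

/* Packed layout: 8-byte accumulators sit back to back, so several
 * threads' slots fall on one cache line. Concurrent writes to these
 * slots are the layout that invites false sharing. */
typedef struct {
    double sum;
} packed_acc;

/* Padded layout: aligning the member to the line size rounds the
 * struct up to a full line, so each thread's slot has its own line. */
typedef struct {
    alignas(CACHE_LINE) double sum;
} padded_acc;

/* Align the packed array to a line boundary so the layout is predictable. */
static alignas(CACHE_LINE) packed_acc packed[NTHREADS];
static padded_acc padded[NTHREADS];

/* Returns nonzero if both addresses fall within the same cache line. */
static int same_cache_line(const void *a, const void *b)
{
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}
```

Calling `same_cache_line(&packed[0], &packed[1])` and `same_cache_line(&padded[0], &padded[1])` shows the difference between the two layouts. In a read-only kernel such as Dot, where each thread accumulates into a private local variable, neither layout is written concurrently during the loop, which is consistent with no false-sharing penalty appearing in the measurements.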