troore closed this issue 2 months ago
There's no facility to pin threads for memory bandwidth testing in non-NUMA mode because it is not needed. You can use other utilities to set affinity, like taskset on Linux or start /b /affinity <mask> on Windows, to ensure the test only runs on certain physical cores.
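For example, on Linux one could launch the whole test under taskset with a core list covering the physical cores, e.g. taskset -c 0-7 ./MemoryBandwidth, and on Windows with something like start /b /affinity ff MemoryBandwidth.exe. The binary names, core list, and mask here are only illustrative; the right values depend on how logical CPUs map to physical cores on the machine under test.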
Can taskset guarantee the precise affinity? E.g., if we pin 2 threads to 2 physical cores (SMT2, 4 logical cores) with taskset, can we guarantee that the 2 threads are scheduled on physical cores 0 and 1, rather than both being scheduled on physical core 0, resulting in different L1/L2 bandwidth results?

At one point I had an option to put the first thread on core 0, the second thread on core 1, and so on, but found that it made no difference compared to setting affinity through taskset or start /b /affinity for the whole process. Operating systems today are SMT-aware and are good at preferring to load separate physical cores before loading SMT threads.

If you have a problem with the operating system not being SMT-aware, you can use taskset or start /b /affinity to exclude SMT sibling threads. I haven't seen it be a problem on any recent Windows or Linux install.
NUMA gets special handling because each thread allocates memory from a designated pool of memory, and has to be pinned to a core close to that pool.
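For Linux, a minimal sketch of what excluding SMT siblings could look like in code (illustrative only, not part of the benchmark; it derives the sibling groups from the standard sysfs topology files):

    /* Sketch: build an affinity mask with one logical CPU per physical core,
     * so SMT siblings are excluded. Error handling is minimal for brevity. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        long nprocs = sysconf(_SC_NPROCESSORS_ONLN);
        cpu_set_t mask;
        CPU_ZERO(&mask);
        for (long cpu = 0; cpu < nprocs; cpu++) {
            char path[128];
            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%ld/topology/thread_siblings_list", cpu);
            FILE *f = fopen(path, "r");
            if (!f) continue;
            int first = -1;
            fscanf(f, "%d", &first);  /* lowest-numbered sibling of this physical core */
            fclose(f);
            if (first == (int)cpu)    /* keep only one logical CPU per core */
                CPU_SET(cpu, &mask);
        }
        /* Applies to the calling thread; worker threads created afterwards
           inherit this mask. */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
            perror("sched_setaffinity");
        return 0;
    }

The same core list could instead be passed to taskset -c when launching the test, which is the approach recommended above.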
Makes sense, thanks. I think this issue is solved.
Hi @clamchowder,
I want to pin threads to CPUs when measuring bandwidth, but I found that there seems to be no such facility in non-NUMA mode. So I just borrowed this part from CoherenceLatency:
    void *ReadBandwidthTestThread(void *param) {
        BandwidthTestThreadData* bwTestData = (BandwidthTestThreadData*)param;
        if (hardaffinity) {
            sched_setaffinity(gettid(), sizeof(cpu_set_t), &global_cpuset);
        } else {
            // I add the following lines:
            cpu_set_t cpuset;
            CPU_ZERO(&cpuset);
            CPU_SET(bwTestData->processorIndex, &cpuset);
            sched_setaffinity(gettid(), sizeof(cpu_set_t), &cpuset);
            fprintf(stderr, "thread %ld set affinity %d\n", gettid(), bwTestData->processorIndex);
        }
        ...
    }
Besides, the processorIndex is calculated by thread_idx % nprocs, according to the processor-to-core-id mapping from /proc/cpuinfo.
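As a point of reference, a small standalone sketch (not part of the benchmark) that prints that processor-to-core-id mapping from /proc/cpuinfo could look like this:

    /* Sketch: print the logical-processor -> physical-core-id mapping that
     * /proc/cpuinfo exposes on x86 Linux. */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f) { perror("/proc/cpuinfo"); return 1; }
        char line[256];
        int processor = -1, val;
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "processor : %d", &val) == 1)
                processor = val;                      /* logical CPU number */
            else if (sscanf(line, "core id : %d", &val) == 1)
                printf("logical cpu %2d -> core id %d\n", processor, val);
        }
        fclose(f);
        return 0;
    }

Logical CPUs that report the same core id are SMT siblings, so on an 8-core/16-thread part two logical CPUs will map to each core id.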
I tested on an AMD Ryzen 7 5800X CPU, where there is only one NUMA node (8 physical cores, 16 logical cores), so I didn't enable NUMA.
I got the following results:
In the figure above, "auto" means I ran the original MemoryBandwidth code, while "manual" means I added the CPU_SET and sched_setaffinity calls as the code snippet shows. The left and right figures show the 8-thread and 16-thread results, respectively.

My question is: why are the "manual" bandwidth results lower than the "auto" results for 8 threads, while "manual" catches up at 16 threads?
Thanks, troore
Hi @clamchowder,
I've just reopened this issue because I am still unable to explain the left figure of the original post (the comparison between the auto and manual thread bindings), since I think 8 threads should be enough to fully utilize L1 bandwidth before the first slope.

I tried both taskset -c 0-7 and sched_setaffinity but got similar results. The affinity masks of auto and manual are ffff and ff__ respectively. I cannot explain why the manual thread binding is lower than auto.
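One way to see where each thread actually ends up at run time is to have every worker log the CPU it is running on; a minimal sketch (assuming glibc's sched_getcpu, with the bandwidth loop elided) could look like this:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <pthread.h>

    static void *worker(void *arg) {
        long idx = (long)arg;
        /* Report which logical CPU this thread is running on right now;
           without pinning, the value may change during the run. */
        printf("thread %ld on logical cpu %d\n", idx, sched_getcpu());
        /* ... bandwidth measurement loop would go here ... */
        return NULL;
    }

    int main(void) {
        pthread_t t[8];
        for (long i = 0; i < 8; i++) pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 8; i++) pthread_join(t[i], NULL);
        return 0;
    }

If the manual pinning works as intended, the eight threads should report eight distinct physical cores; if two of them land on SMT siblings of the same core, that alone could explain a lower L1/L2 bandwidth figure.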
Could you reproduce the results and help explain?
Thanks, troore
Please don't do any affinity setting unless you're willing to investigate and debug the effects on your own time. If you choose to do that, tools like perf and performance counters can help you understand what's going on.
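As one illustration of reading a performance counter directly (a sketch using the standard Linux perf_event_open interface, not tied to this benchmark; running the test under perf stat gives similar information with no code changes):

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdint.h>

    /* Thin wrapper; glibc does not provide one for this syscall. */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags) {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* last-level cache misses */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = perf_event_open(&attr, 0, -1, -1, 0);  /* this thread, any CPU */
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... region of interest, e.g. one bandwidth test pass ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count = 0;
        read(fd, &count, sizeof(count));
        printf("cache misses: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }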
Affinity setting is not supported in general, and was only done to work around issues on certain platforms.