Closed: anthony-crystalpeak closed this issue 3 months ago
Note, I get all false entries in the support matrix:
CpuBindingSupport {
    set_current_process: false,
    get_current_process: false,
    set_process: false,
    get_process: false,
    set_current_thread: false,
    get_current_thread: false,
    set_thread: false,
    get_thread: false,
    get_current_process_last_cpu_location: false,
    get_process_last_cpu_location: false,
    get_current_thread_last_cpu_location: false,
}
MemoryBindingSupport {
    set_current_process: false,
    get_current_process: false,
    set_process: false,
    get_process: false,
    set_current_thread: false,
    get_current_thread: false,
    set_area: false,
    get_area: false,
    get_area_memory_location: false,
    allocate_bound: false,
    first_touch_policy: false,
    bind_policy: false,
    interleave_policy: false,
    next_touch_policy: false,
    migrate_flag: false,
}
Binding the CPU works fine, but binding the memory fails with:
Failed to bind memory to node!: BadFlags(ParameterError(MemoryBindingFlags(THREAD | MIGRATE)))
Invocations:
/* Fails */
topology.bind_memory(&cpuset,
MemoryBindingPolicy::Interleave,
MemoryBindingFlags::THREAD |
MemoryBindingFlags::MIGRATE)
.expect("Failed to bind memory to node!");
/* Succeeds */
topology.bind_cpu(cpuset,
CpuBindingFlags::PROCESS |
CpuBindingFlags::STRICT)
.expect("Failed to bind process to node!");
I tried using the vendored library instead:
hwlocality = { git="https://github.com/HadrienG2/hwlocality", branch="main", features=["hwloc-latest", "vendored", "proptest"]}
and got the same results. Unfortunately I have to move on, but I'm happy to lend a hand if you could give me some pointers @HadrienG2 :>
Please disregard this message for now and check the next one first.
Wait a minute, there are things that puzzle me about the two error messages that you are getting, and if my intuition about these is right, the problem may be much simpler than I thought.
assertion failed: support::MemoryBindingSupport::default().set_current_thread() == true
Why are you testing the value of set_current_thread() within MemoryBindingSupport::default()? That memory binding support matrix should indeed be all-false: it's a Rust-generated default value with all support bits set to false, not actual OS support bits read from hwloc.
If you want to know what your system actually supports, you need to use topology.feature_support() or topology.supports().
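To illustrate the distinction, here is a sketch with a simplified stand-in struct (not the real hwlocality type): deriving Default on a struct of bools always yields all-false, regardless of what the OS supports.

```rust
// Simplified stand-in mirroring the shape of hwlocality's
// MemoryBindingSupport; NOT the real type. Deriving Default sets every
// field to false, no matter what the OS actually supports.
#[derive(Default)]
pub struct FakeMemoryBindingSupport {
    pub set_current_thread: bool,
}

fn main() {
    // This is why the assertion in the issue failed: a default value is a
    // plain Rust constant, not a support query against a live Topology.
    let not_a_real_query = FakeMemoryBindingSupport::default();
    assert!(!not_a_real_query.set_current_thread);
    // Real support bits come from topology.feature_support() or
    // topology.supports() on an actual hwlocality Topology.
    println!("default() is all-false by construction");
}
```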
Failed to bind memory to node!: BadFlags(ParameterError(MemoryBindingFlags(THREAD | MIGRATE)))
This looks a lot like a bug that I fixed in the process of implementing memory binding tests (which are not finished yet). Can you switch to the test-memory-binding development branch of hwlocality and see if it helps?
Ha! That totally was it: topology.feature_support() worked as expected.
That branch didn't throw any errors. Oddly enough, binding memory to the same cpuset used in bind_cpu() resulted in lower performance. Using bind_cpu() alone worked better, so I left out the memory binding entirely.
I appreciate the help. It might be worth noting in the docs for (Memory|Cpu)BindingSupport that the type should not be constructed directly, and that the methods you listed should be used instead.
Again, thank you very much. This library is an awesome addition to rust :>
Oddly enough, binding memory to the same cpuset used in bind_cpu() resulted in lower performance. Using bind_cpu() alone worked better, so I left out the memory binding entirely.
Here you might be witnessing Linux's first touch NUMA policy in action.
Linux does not allocate memory at malloc() time, but does so lazily, the first time memory pages are accessed, on the NUMA node that the accessing CPU belongs to. This means that, assuming you used bind_cpu() to pin execution to CPUs within a single NUMA node...
- If you called bind_cpu() early on, before allocating or accessing memory, then the memory you subsequently allocate will be automatically bound to the right NUMA node, and bind_memory() is unnecessary overhead.
- If you allocated and accessed memory before calling bind_cpu(), and later tried to bind it to the target NUMA node using something like bind_memory_area(), that will trigger a migration of memory from one NUMA node to another. This is a very costly process, which will only be worthwhile if you make many accesses to the memory after the migration completes.
Overall, on Linux, if you are already binding CPUs to a single NUMA node using bind_cpu(), the use cases for memory binding are pretty niche, because the system memory allocator will mostly do the right thing automatically. It's not completely useless though: for example, you can use it to selectively allocate HBM or DDR memory on Intel's Xeon Phi and Sapphire Rapids CPUs.
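A minimal sketch of the "pin first, allocate second" ordering described above, with a hypothetical pin_to_node() standing in for hwlocality's topology.bind_cpu() call:

```rust
// Hypothetical stand-in for topology.bind_cpu(...): a real program would
// pin the current thread to the CPUs of one NUMA node via hwlocality here.
fn pin_to_node(_node: usize) {}

// Pin BEFORE allocating or touching the data. Under Linux's first-touch
// policy, each page is placed on the NUMA node of the CPU that first
// writes it, so this buffer ends up node-local with no bind_memory() call.
fn make_node_local_buffer(node: usize, len: usize) -> Vec<u8> {
    pin_to_node(node); // step 1: bind execution
    vec![0u8; len]     // step 2: allocate, then first-touch on the bound CPUs
}

fn main() {
    let buffer = make_node_local_buffer(0, 1 << 20);
    assert_eq!(buffer.len(), 1 << 20);
}
```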
I appreciate the help, it might be worth noting on the docs for (Memory|Cpu)BindingSupport to not construct the type directly, and instead to use the methods you listed.
That's a good idea, will do so and tag a new alpha after merging the memory binding test branch (which, as you've seen, is rather high-priority on my merging TODO list :smile: ).
Overall, on Linux, if you are already binding CPUs to a single NUMA node using bind_cpu(), the use cases for memory binding are pretty niche, because the system memory allocator will mostly do the right thing automatically.
That totally makes sense. My concern was with fork: I have a process that maps in a bunch of pages which are filled, and then the process forks off workers, and I wanted those pages to migrate to each new process's NUMA node. Perhaps Linux already does this? But that would probably conflict with the copy-on-write semantics of fork, since you'd have to duplicate the pages to all of the various NUMA nodes anyway.
Hopefully, when a page is written, the new CoW page is allocated on the correct node; and since all of the non-written hot-path data can be stored in L3 caches, maybe it doesn't really matter that it's not on the correct node?
Either way, performance numbers don't lie! Another interesting tidbit is that using EPYC's feature where each socket can report itself as 4 NUMA nodes (8 total across both sockets) resulted in drastically worse performance for my workload. That was surprising to me. Leaving it at the traditional 1 node per socket (2 total) was best for performance.
If you bound the original process to a single NUMA node before the fork, and bind the forked processes to other NUMA nodes, then I would expect the following to happen:
From my understanding, what you were trying to do is to bind the memory on each process before reading from or writing to it. I'm not sure what will happen then, it depends on how hard the Linux fork/CoW implementation tries to avoid duplicating the pages and this is something that I don't know. But one thing that I guess could happen, and would explain your performance results, is that the fork/CoW implementation may not consider NUMA node migration as a memory write from the perspective of copy-on-write.
Should that be the case, what will happen is that each process will simultaneously attempt to migrate the same physical memory pages to its own NUMA node, resulting in lots of conflicting inter-NUMA-node traffic and thus a performance disaster. It would also make sense, then, that configuring the system in a mode that exposes more NUMA nodes makes the problem worse: there are more conflicting inter-node migrations to carry out.
If this is the problem, then I think these are your options for ultimately reaching a scenario where each process is accessing a copy of the same data from its own NUMA node:
- Don't wait for fork() to do the right thing, and just clone() the buffer yourself in the forked process, then access the clone instead of the original in your computation. The buffer clone will be allocated in NUMA-local memory, and thus everything will eventually be fine.
Thanks! I'm going to close this since the original issue is solved on test-memory-binding
Cheers!!!
Hi,
I'm trying to bind and migrate all of my thread's memory to a NUMA node, but I'm getting an unsupported-feature error despite my system supporting it.
How should I go about binding and migrating my thread's memory to the desired NUMA node?
I'm using main in the above test with the latest feature gate.
P.S. - this library is awesome, makes setting up locality a breeze. Thanks for your hard work :>