HadrienG2 / hwlocality

Rust bindings to the Open MPI Portable Hardware Locality ("hwloc") library, covering version 2.0 and above.
MIT License

Disagreement of membind:set_thisthread_membind #129

Closed anthony-crystalpeak closed 3 months ago

anthony-crystalpeak commented 3 months ago

Hi,

I'm trying to bind and migrate all of my thread's memory to a numanode, but I'm getting an unsupported feature error despite my system supporting it.

assertion failed: support::MemoryBindingSupport::default().set_current_thread() == true
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
$ pacman -Q | grep hwloc
hwloc 2.10.0-1

$ hwloc-info --support | grep thisthread_membind
membind:set_thisthread_membind = 1
membind:get_thisthread_membind = 1

How should I go about binding and migrating my thread's memory to the desired numa node?

I'm using main in the above test with the latest feature gate.

hwlocality = { git="https://github.com/HadrienG2/hwlocality", branch="main", features=["hwloc-latest"]}

P.S. - this library is awesome, makes setting up locality a breeze. Thanks for your hard work :>

anthony-crystalpeak commented 3 months ago

Note: I get all-false entries in the support matrix:

CpuBindingSupport {
    set_current_process: false,
    get_current_process: false,
    set_process: false,
    get_process: false,
    set_current_thread: false,
    get_current_thread: false,
    set_thread: false,
    get_thread: false,
    get_current_process_last_cpu_location: false,
    get_process_last_cpu_location: false,
    get_current_thread_last_cpu_location: false,
}
MemoryBindingSupport {
    set_current_process: false,
    get_current_process: false,
    set_process: false,
    get_process: false,
    set_current_thread: false,
    get_current_thread: false,
    set_area: false,
    get_area: false,
    get_area_memory_location: false,
    allocate_bound: false,
    first_touch_policy: false,
    bind_policy: false,
    interleave_policy: false,
    next_touch_policy: false,
    migrate_flag: false,
}

Binding the CPU works fine, but binding the memory fails with:

Failed to bind memory to node!: BadFlags(ParameterError(MemoryBindingFlags(THREAD | MIGRATE)))

Invocations:

/* Fails */
topology.bind_memory(
    &cpuset,
    MemoryBindingPolicy::Interleave,
    MemoryBindingFlags::THREAD | MemoryBindingFlags::MIGRATE,
).expect("Failed to bind memory to node!");

/* Succeeds */
topology.bind_cpu(
    cpuset,
    CpuBindingFlags::PROCESS | CpuBindingFlags::STRICT,
).expect("Failed to bind process to node!");
anthony-crystalpeak commented 3 months ago

I tried using the vendored library instead:

hwlocality = { git="https://github.com/HadrienG2/hwlocality", branch="main", features=["hwloc-latest", "vendored", "proptest"]}

and got the same results. Unfortunately I have to move on, but I'm happy to lend a hand if you could give me some pointers @HadrienG2 :>

HadrienG2 commented 3 months ago

Please disregard this message for now and check the next one first.

Previous message:

That all-false support matrix is quite surprising indeed! On my Linux systems (which I assume is what you are also using, given the pacman invocation in the OP), I normally get the same support matrix as via the hwloc CLI tools. In fact, that Linux support matrix is so reproducible that I baked it into the hwlocality unit tests.

So given that the hwloc CLI tools produce the expected result on your machine, this suggests that the `libhwloc` they link to functions correctly, and thus I think either of the following must be true:

1. `hwlocality` and the hwloc CLI tools do not link against the same `libhwloc` (and the one that `hwlocality` links to has a problem).
2. They do link to the same `libhwloc`, but something bad is happening at the dynamic linking interface between `hwlocality` and `libhwloc`, while for some reason the hwloc CLI tools link to `libhwloc` just fine on their side.

Hypothesis 1 can be tested easily with `ldd`, along with a few other easy linking issues (e.g. not linking to the same libc), so I guess we should start with that. Please run the following commands and report their output:

```bash
# Create a local clone of hwlocality and compile one of the examples
git clone --depth=1 https://github.com/HadrienG2/hwlocality
cd hwlocality
cargo build --example allocate_bound

# Check what the example links against
ldd target/debug/examples/allocate_bound

# Check what the hwloc CLI links against
ldd $(which hwloc-info)
```

While you're at it, you can also run `cargo test` in that `hwlocality` repository clone, and hopefully that will give you a simpler reproducer of the weird hwloc behavior than your entire app. If the tests pass there, well, that's also very interesting and we'll need to figure out what your app is doing differently.
---

Assuming `ldd` reports the same link-time dependencies for the `hwlocality` examples and `hwloc-info`, the next step would be to investigate what the linker is doing when linking `hwlocality` with `libhwloc`. That's unfortunately a bit at the limit of my area of expertise, but if you are using an unusual linker (gold, mold, wild...), one easy first step would be to switch rustc/cargo back to a more standard linker like good old `ld.bfd` and see if this makes the problem go away.

If so, we're most likely dealing with a linker bug, and you should go to the linker's bug tracker for advice on how to reduce the test case for them. If you're using bfd already, or it still fails with bfd, then I guess we're in for a fun linker debugging session with `binutils` and `elfutils`... but I'll get into that if we get there ;)
HadrienG2 commented 3 months ago

Wait a minute, there are things that puzzle me about the two error messages that you are getting, and if my intuition about these is right, the problem may be much simpler than I thought.


assertion failed: support::MemoryBindingSupport::default().set_current_thread() == true

Why are you testing the value of set_current_thread() within MemoryBindingSupport::default()? This memory binding support matrix should indeed be all-false: it's a Rust-generated default value with all support bits set to false, not actual OS support bits read from hwloc.

If you want to know what your system actually supports, you need to use topology.feature_support() or topology.supports().
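A minimal sketch of the distinction, in case it helps a future reader (module paths and signatures are my best reading of the hwlocality API discussed above, not verified against a specific release):

```rust
use hwlocality::topology::{
    support::{FeatureSupport, MemoryBindingSupport},
    Topology,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let topology = Topology::new()?;

    // Wrong: Default::default() is a plain Rust struct with every support
    // bit set to false. It never asked the OS anything.
    assert!(!MemoryBindingSupport::default().set_current_thread());

    // Right: query the topology for what the OS actually supports.
    let supported = topology.supports(
        FeatureSupport::memory_binding,
        MemoryBindingSupport::set_current_thread,
    );
    println!("set_thisthread_membind supported: {supported}");
    Ok(())
}
```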


Failed to bind memory to node!: BadFlags(ParameterError(MemoryBindingFlags(THREAD | MIGRATE)))

This looks a lot like a bug that I fixed in the process of implementing memory binding tests (which are not finished yet). Can you switch to the test-memory-binding development branch of hwlocality and see if it helps?

anthony-crystalpeak commented 3 months ago

Ha! That totally was it, topology.feature_support() worked as expected.

That branch didn't throw any errors. Oddly enough, binding memory to the same cpuset used in bind_cpu() resulted in lower performance. Using bind_cpu() alone worked better, so I left out the memory binding entirely.

I appreciate the help, it might be worth noting on the docs for (Memory|Cpu)BindingSupport to not construct the type directly, and instead to use the methods you listed.

Again, thank you very much. This library is an awesome addition to rust :>

HadrienG2 commented 3 months ago

Oddly enough, binding memory to the same cpuset used in bind_cpu() resulted in lower performance. Using bind_cpu() alone worked better, so I left out the memory binding entirely.

Here you might be witnessing Linux's first touch NUMA policy in action.

Linux does not allocate memory at malloc() time, but does so lazily the first time each memory page is accessed, on the NUMA node that the CPU accessing the memory belongs to. This means that, assuming you used bind_cpu() to pin execution to CPUs within a single NUMA node, the pages your threads touch first will naturally end up allocated on that node, without any explicit memory binding.

Overall, on Linux, if you are already binding CPUs to a single NUMA node using bind_cpu(), the use cases for memory binding are pretty niche, because the system memory allocator will mostly do the right thing automatically. It's not completely useless though, for example you can use it to selectively allocate HBM or DDR memory on Intel's Xeon Phi and Sapphire Rapids CPUs.
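The first-touch behavior described above is why the usual pattern is to have each worker initialize the data it will use. A std-only sketch of that pattern (no hwlocality calls; pinning the workers beforehand with bind_cpu() is assumed, and the worker count, buffer size, and function name are made up for illustration):

```rust
use std::thread;

/// Each worker allocates and initializes its own buffer. Under Linux's
/// first-touch policy, the initializing writes fault the pages in on the
/// NUMA node of the CPU the worker runs on (pinning the workers with
/// bind_cpu() beforehand is what makes that node deterministic).
fn first_touch_checksum(n_workers: u8, bytes_per_worker: usize) -> u64 {
    let handles: Vec<_> = (0..n_workers)
        .map(|worker_id| {
            thread::spawn(move || {
                // vec! reserves address space; the writes below are the
                // "first touch" that actually allocates physical pages.
                let mut buffer = vec![0u8; bytes_per_worker];
                for byte in buffer.iter_mut() {
                    *byte = worker_id;
                }
                buffer.iter().map(|&b| u64::from(b)).sum::<u64>()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    println!("checksum: {}", first_touch_checksum(4, 1 << 20));
}
```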

I appreciate the help, it might be worth noting on the docs for (Memory|Cpu)BindingSupport to not construct the type directly, and instead to use the methods you listed.

That's a good idea, will do so and tag a new alpha after merging the memory binding test branch (which, as you've seen, is rather high-priority on my merging TODO list :smile: ).

anthony-crystalpeak commented 3 months ago

Overall, on Linux, if you are already binding CPUs to a single NUMA node using bind_cpu(), the use cases for memory binding are pretty niche, because the system memory allocator will mostly do the right thing automatically.

That totally makes sense. My concern was with fork: I have a process that maps in a bunch of pages which are filled, and then the process forks off workers, and I wanted those pages to migrate to each new process's NUMA node. Perhaps Linux already does this? But that would probably conflict with the copy-on-write semantics of fork, since you'd have to duplicate the pages to all of the various NUMA nodes anyway.

Hopefully, when a page is written, the new CoW page is allocated on the correct node; and since the non-written hot-path data can be served from L3 caches, maybe it doesn't really matter that it's not on the correct node?

Either way, performance numbers don't lie! Another interesting tidbit: using Epyc's feature where each socket can report itself as 4 NUMA nodes (8 total across both sockets) resulted in drastically worse performance for my workload. That was surprising to me. Leaving it at the traditional 1 node per socket (2 total) was best for performance.

HadrienG2 commented 3 months ago

If you bound the original process to a single NUMA node before the fork, and bind the forked processes to other NUMA nodes, then I would expect the following to happen:

From my understanding, what you were trying to do is to bind the memory on each process before reading from or writing to it. I'm not sure what happens then: it depends on how hard the Linux fork/CoW implementation tries to avoid duplicating the pages, and that is something I don't know. But one thing that I guess could happen, and that would explain your performance results, is that the fork/CoW implementation may not consider NUMA node migration a memory write from the perspective of copy-on-write.

Should that be the case, each process would simultaneously attempt to migrate the same physical memory pages to its own NUMA node, resulting in lots of conflicting inter-NUMA-node traffic and thus a performance disaster. It would also make sense, then, that configuring the system in a mode that exposes more NUMA nodes makes the problem worse: there are more conflicting inter-node migrations to carry out.


If this is the problem, then I think these are your options for ultimately reaching a scenario where each process is accessing a copy of the same data from its own NUMA node:

anthony-crystalpeak commented 3 months ago

Thanks! I'm going to close this since the original issue is solved on test-memory-binding

Cheers!!!