Closed. BrendanCunningham closed this issue 1 month ago.
@BrendanCunningham Internal ticket has been created to investigate this issue. Thanks!
Hi @BrendanCunningham, while the driver is not yet public, is it possible to provide some code with your caching logic that reproduces the issue? I've reached out to our kernel driver team, and my understanding is that the buffer should not be freed before rdma_put_pages is called, so this should be working in theory. Of course, this would mean that the free_callback is vestigial, so I'd like to figure out if this buffer can be freed somewhere else, as a search through our codebase does not show anywhere this would be used.
@schung-amd
Hi @BrendanCunningham, while the driver is not yet public, is it possible to provide some code with your caching logic that reproduces the issue?
Yes. I misspoke before; our AMD DMA code is not ready but it is public. hfi1/pin_amd.c has our send-side AMD DMA support, i.e. construct packets using DMA instead of CPU-driven copies.
I've reached out to our kernel driver team, and my understanding is that the buffer should not be freed before rdma_put_pages is called, so this should be working in theory.
Does that mean that DMA addresses for DMA-mapped AMD pages obtained with rdma_get_pages() are guaranteed to remain valid until after we explicitly call rdma_put_pages()?
Of course, this would mean that the free_callback is vestigial, so I'd like to figure out if this buffer can be freed somewhere else, as a search through our codebase does not show anywhere this would be used.
Does amdgpu have a pin+DMA-mapped page cache to avoid unpinning and repinning frequently DMA-mapped pages, within the context of a user process?
Thanks for the quick response!
Yes. I misspoke before; our AMD DMA code is not ready but it is public. hfi1/pin_amd.c has our send-side AMD DMA support, i.e. construct packets using DMA instead of CPU-driven copies.
Perfect, thanks. I'll try to reproduce what you're seeing here.
Does that mean that DMA addresses for DMA-mapped AMD pages obtained with rdma_get_pages() are guaranteed to remain valid until after we explicitly call rdma_put_pages()?
Yes, this is what I have been told. When calling rdma_get_pages(), the amdgpu VRAM manager pins VRAM pages for the user which are not released until rdma_put_pages() is called. When there is VRAM pressure, the VRAM manager stops user processes and migrates pages to relieve the pressure, but pages pinned by rdma_get_pages() are exempt and will not be evicted by the VRAM manager, so your cached addresses should remain valid. However, as this does not appear to match your observations, I'm discussing with the internal team for further understanding on this.
Does amdgpu have a pin+DMA-mapped page cache to avoid unpinning and repinning frequently DMA-mapped pages, within the context of a user process?
Do you mean a cache that can be accessed by the user to retrieve these pages, to eliminate the need for a cache on your end? I'll discuss this with the internal team and update you when I have more information.
Thanks for the quick response!
Yes. I misspoke before; our AMD DMA code is not ready but it is public. hfi1/pin_amd.c has our send-side AMD DMA support, i.e. construct packets using DMA instead of CPU-driven copies.
Perfect, thanks. I'll try to reproduce what you're seeing here.
I have a workaround in pin_amd.c that disables the cache. This workaround is not public yet. It makes two changes:

1. In insert_amd_pinning(), do not insert struct amd_pintree_node objects into the pintree; instead, call rdma_get_pages() for every ROCm VA range passed into it. Manage struct amd_pintree_node lifetime with struct kref and suitable kref_put() destructors instead of the atomic_t refcount-based code used now.
2. Call rdma_put_pages() as soon as our hardware signals packet send completion, as opposed to waiting for cache eviction.

With this workaround, we no longer send incorrect data.
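For reference, the kref-based lifetime management mentioned above follows the standard kernel pattern. The sketch below is illustrative only and assumes a simplified node layout; it is not the actual struct amd_pintree_node or pin_amd.c code.

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>

/* Hypothetical, simplified pinning object; the real struct amd_pintree_node
 * carries the VA range, the mapping handle from rdma_get_pages(), etc. */
struct pin_node {
	struct kref refcount;
	/* ... DMA-mapping state for one ROCm VA range ... */
};

/* Release function: runs exactly once, when the last reference is dropped. */
static void pin_node_release(struct kref *kref)
{
	struct pin_node *node = container_of(kref, struct pin_node, refcount);

	/* Unpin here, e.g. hand the mapping back with rdma_put_pages(). */
	kfree(node);
}

static struct pin_node *pin_node_alloc(void)
{
	struct pin_node *node = kzalloc(sizeof(*node), GFP_KERNEL);

	if (node)
		kref_init(&node->refcount);	/* count starts at 1 */
	return node;
}

/* Take a reference for each in-flight SDMA request that uses this mapping. */
static inline void pin_node_get(struct pin_node *node)
{
	kref_get(&node->refcount);
}

/* Drop a reference; pin_node_release() runs when the count reaches zero,
 * e.g. after the hardware signals packet send completion. */
static inline void pin_node_put(struct pin_node *node)
{
	kref_put(&node->refcount, pin_node_release);
}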
Does that mean that DMA addresses for DMA-mapped AMD pages obtained with rdma_get_pages() are guaranteed to remain valid until after we explicitly call rdma_put_pages()?

Yes, this is what I have been told. When calling rdma_get_pages(), the amdgpu VRAM manager pins VRAM pages for the user which are not released until rdma_put_pages() is called. When there is VRAM pressure, the VRAM manager stops user processes and migrates pages to relieve the pressure, but pages pinned by rdma_get_pages() are exempt and will not be evicted by the VRAM manager, so your cached addresses should remain valid. However, as this does not appear to match your observations, I'm discussing with the internal team for further understanding on this.
To be absolutely clear, we are not having problems with pages obtained with rdma_get_pages() being migrated. Just that it would be a problem for us if DMA-mapped pages were migrated underneath us or the DMA addresses otherwise became invalid.

The problem that we do have is that since our free_callback is not called, we can't maintain our VA-range:DMA-mapping (struct amd_p2p_info) cache properly and end up with stale cache entries, which can then lead to us using the wrong DMA addresses.
Does amdgpu have a pin+DMA-mapped page cache to avoid unpinning and repinning frequently DMA-mapped pages, within the context of a user process?

Do you mean a cache that can be accessed by the user to retrieve these pages, to eliminate the need for a cache on your end? I'll discuss this with the internal team and update you when I have more information.
I mean a cache in the amdgpu driver for other drivers like ours, not for userspace. I.e. is there a cache behind rdma_get_pages()? But yes, to eliminate the need for a cache on our end.
@schung-amd Also, is it safe to call rdma_put_pages() from an interrupt context? My workaround involves calling rdma_put_pages() in our hardware completion handler, which is called via interrupt.
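If it turns out that rdma_put_pages() cannot be called in interrupt context, one common fallback is to defer the unpin to process context via a workqueue. A minimal sketch, with hypothetical names and assuming the completion handler can carry a small work item per mapping:

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/slab.h>

struct deferred_unpin {
	struct work_struct work;
	void *p2p_data;		/* mapping handle obtained from rdma_get_pages() */
};

/* Runs in process context, where sleeping and blocking locks are allowed. */
static void deferred_unpin_worker(struct work_struct *work)
{
	struct deferred_unpin *d = container_of(work, struct deferred_unpin, work);

	/* Safe point to call rdma_put_pages() on d->p2p_data, then free d. */
	kfree(d);
}

/* Called from the hardware send-completion interrupt handler. */
static void on_send_complete(struct deferred_unpin *d)
{
	INIT_WORK(&d->work, deferred_unpin_worker);
	schedule_work(&d->work);	/* defer the unpin out of IRQ context */
}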
Hi @BrendanCunningham, thanks for following up on this! Sorry for the delay, I'm trying to collect more information from our internal teams before providing answers because I don't have a complete understanding of DMA, so some of these points and followup questions may be unclear; feel free to correct me or clarify on anything. So far my understanding is:

* We guarantee that pages pinned by rdma_get_pages() are resident until rdma_put_pages() is called;
* free_callback() is never called, as you have already observed in testing and also in the codebase, and is probably not needed;
* I don't think there is a cache in the driver you can use, but still inquiring about this; and
* I don't know how your cached addresses are becoming invalid.
Are you relying purely on the callback to inform when to remove entries from your cache, even when invoking rdma_put_pages()? In this case, the callback will never be called so the cache will still have stale entries. Or are the addresses changing underneath the hood somewhere, and you are expecting free_callback to be invoked? From your initial post, it sounds like pages are being freed without being handled by the cache, but if my understanding is correct then this should not be happening. Again, my understanding of DMA is incomplete, so there may be something obvious here I'm missing.
I will inquire regarding calling rdma_put_pages() in an interrupt context. I suspect it is not safe to do so, but I'll let you know what the internal teams suggest as soon as I have that information.
Hi @BrendanCunningham, thanks for following up on this! Sorry for the delay, I'm trying to collect more information from our internal teams before providing answers because I don't have a complete understanding of DMA, so some of these points and followup questions may be unclear; feel free to correct me or clarify on anything. So far my understanding is:
* We guarantee that pages pinned by rdma_get_pages() are resident until rdma_put_pages() is called;
* free_callback() is never called, as you have already observed in testing and also in the codebase, and is probably not needed;
* I don't think there is a cache in the driver you can use, but still inquiring about this; and
* I don't know how your cached addresses are becoming invalid.
Are you relying purely on the callback to inform when to remove entries from your cache, even when invoking rdma_put_pages()? In this case, the callback will never be called so the cache will still have stale entries.
Correct; that is the problem: we are given a ROCm buffer, we add a cache entry, then at some point that buffer is freed, and a new ROCm buffer with different physical pages but the same or overlapping virtual address range as the old buffer is passed into our driver, and we fetch the stale cache entry.
Or are the addresses changing underneath the hood somewhere, and you are expecting free_callback to be invoked?
Edit: No, the DMA addresses are not changing underneath us; I only asked if the addresses were guaranteed stable to rule that out as a source of error.
From your initial post, it sounds like pages are being freed without being handled by the cache, but if my understanding is correct then this should not be happening. Again, my understanding of DMA is incomplete, so there may be something obvious here I'm missing.
I will inquire regarding calling rdma_put_pages() in an interrupt context. I suspect it is not safe to do so, but I'll let you know what the internal teams suggest as soon as I have that information.
Okay, thanks.
@BrendanCunningham Still gathering information re: calling rdma_put_pages() from an interrupt context; the internal team initially recommends against calling it, but is digging into the code to check.
Correct; that is the problem: we are given a ROCm buffer, we add a cache entry, then at some point that buffer is freed, and a new ROCm buffer with different physical pages but the same or overlapping virtual address range as the old buffer is passed into our driver, and we fetch the stale cache entry.
Is your driver in control of freeing the buffer, or is something else freeing the buffer? Is it possible to place the free_callback logic to modify your cache entries directly where you are calling rdma_put_pages()?
@BrendanCunningham Still gathering information re: calling rdma_put_pages() from an interrupt context; the internal team initially recommends against calling it, but is digging into the code to check.
Correct; that is the problem: we are given a ROCm buffer, we add a cache entry, then at some point that buffer is freed, and a new ROCm buffer with different physical pages but the same or overlapping virtual address range as the old buffer is passed into our driver, and we fetch the stale cache entry.
Is your driver in control of freeing the buffer, or is something else freeing the buffer? Is it possible to place the free_callback logic to modify your cache entries directly where you are calling rdma_put_pages()?
Our driver is not in control of freeing the buffer; we can only react to others (userspace or amdgpu) freeing the buffers and only if those other entities call free_callback.
Absent free_callback, our only good/safe option is disabling our caching code.
According to the internal team, your driver should have full control over the lifetime of the buffers; amdgpu guarantees that your buffers are resident until the driver calls rdma_put_pages() on them, and by design userspace should have no way of freeing them safely outside of your driver's control. Do you have any logs showing how/when your buffers are being freed without the driver knowing?
We will update with logs soon.
Here are pr_debug() printout logs from two hosts in a 2 rank, 2 host job with our driver (hfi1) with our AMD DMA cache enabled:
Note the absence of invalidate_sdma_pages_gpu lines in either log; those events would only be printed from our free_callback function, which is not called. unpin_sdma_pages_gpu lines are not the same as invalidate_sdma_pages_gpu. In these logs, unpin_sdma_pages_gpu lines are printed when the user file descriptor is torn down and any entries in our SDMA cache are freed.
Here is the job log:
The job fails on a data validation error because our driver cannot maintain correct DMA cache entries.
Here is the script I ran on both hosts after loading our driver:
#!/bin/bash
set -x
alias ddcmd='echo $* > /proc/dynamic_debug/control'
shopt -s expand_aliases
dmesg -n debug
dmesg -C
ddcmd '-p; module hfi1 file pin_amd.c +p'
ddcmd 'module amdgpu +p'
Thanks for the logs! I'll pass them on to the internal team for more insight. As discussed, I wouldn't expect the callback to be called anywhere, as the internal team has stated that it is not used, so the absence of invalidate_sdma_pages_gpu is expected. Apologies if these seem like trivial questions, but to confirm: where unpin_sdma_pages_gpu is being called, is this happening outside of your control (i.e. from amdgpu, which should be guaranteed to not happen, or from userspace somehow)? Is it possible to modify your cache inside unpin_sdma_pages_gpu?
Thanks for the logs! I'll pass them on to the internal team for more insight. As discussed, I wouldn't expect the callback to be called anywhere, as the internal team has stated that it is not used, so the absence of invalidate_sdma_pages_gpu is expected. Apologies if these seem like trivial questions, but to confirm: where unpin_sdma_pages_gpu is being called, is this happening outside of your control (i.e. from amdgpu, which should be guaranteed to not happen, or from userspace somehow)?
unpin_sdma_pages_gpu is under our control.

Note that the actual function in our code is unpin_amd_node(); unpin_sdma_pages_gpu is the trace event emitted by the tracepoint in unpin_amd_node(). But to avoid further confusion, I'll stick to calling it unpin_sdma_pages_gpu() since that's what's in our logs and it is only used in unpin_amd_node().
Is it possible to modify your cache inside unpin_sdma_pages_gpu?
No, it is not possible; we will only call unpin_sdma_pages_gpu() after we have evicted the cache entry from our cache and the refcount for that pinning object hits 0.

Even if amdgpu did call our free_callback, our code wouldn't call unpin_sdma_pages_gpu() in that path, on the assumption that amdgpu will unpin after we return and that calling into amdgpu from inside the free_callback path could deadlock.

Our free_callback would evict the cache entry from the cache as soon as we get the lock for our cache tree. This is to prevent calls into our driver from other CPUs from getting the old cache entry. Our free_callback would then wait for all outstanding I/O on our cache object to complete before returning.
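The evict-then-wait behavior described above is sketched below for clarity. This is a rough illustration of the pattern only; the lock, tree, and field names are assumptions, not the actual pin_amd.c structures.

#include <linux/spinlock.h>
#include <linux/rbtree.h>
#include <linux/wait.h>
#include <linux/atomic.h>

struct cache_node {
	struct rb_node entry;		/* node in the per-context cache tree */
	atomic_t inflight;		/* outstanding SDMA I/O using this mapping */
	wait_queue_head_t idle_wq;	/* woken when inflight drops to zero */
};

static DEFINE_SPINLOCK(cache_lock);	/* protects the cache rb-tree */
static struct rb_root cache_root = RB_ROOT;

/* What a free_callback-style invalidation would do for one cache node. */
static void invalidate_cache_node(struct cache_node *node)
{
	/* 1. Remove the entry so lookups from other CPUs can no longer find it. */
	spin_lock(&cache_lock);
	rb_erase(&node->entry, &cache_root);
	spin_unlock(&cache_lock);

	/* 2. Wait for all outstanding I/O on this mapping to complete. */
	wait_event(node->idle_wq, atomic_read(&node->inflight) == 0);

	/* 3. Only now return, so the pages can safely be unpinned afterwards. */
}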
Want to make sure we are all on the same page here and that I understand things. I think what I'm hearing is that the high level issue is our driver doesn't know when the VA changes. The PA is fine, those pages are pinned. The put/get ops handle that.
Basically what happens is this:

1. The user allocates a ROCm buffer; it gets virtual address VA1 backed by GPU physical pages PA1.
2. hfi1 pins and DMA-maps those pages and caches the VA1 -> PA1 mapping.
3. The user frees that buffer; hfi1 is not notified, so the cache entry remains.
4. The user allocates a new ROCm buffer that reuses VA1 but is backed by different physical pages, PA2.
5. hfi1 looks up VA1, finds the stale VA1 -> PA1 entry, and uses the wrong DMA addresses.

There is no mechanism right now for the hfi1 driver to know that the VA1 mapping should now be pointing at PA2.

The ask here is that the AMD driver use the callback to inform the driver that we need to handle this scenario in (5).
So while @schung-amd you are correct, the AMD driver doesn't NEED to call the callback. Other users of the buffer do need it to be called. Otherwise we can't cache those VA<->PA mappings. Now for things that use a separate memory registration scheme this may not be an issue, but for our pseudo on-demand paging scheme it is.
@BrendanCunningham Please correct any of my misunderstanding in the above.
@ddalessa no corrections; that is a good summary
Thanks for the clarification, I discussed this with the internal team. Unfortunately, we do not provide support for this. From our point of view, hfi1 needs to have full control over the VA, and the user freeing the memory outside of hfi1's control is an access violation. One option is to provide an interface by which users can notify hfi1 that they are going to free the VA, so that hfi1 can modify its cache appropriately before the VA is freed. In any scenario, users freeing the VA without notifying hfi1 should be treated as a memory access violation, and we do not provide a mechanism to protect you from that.
The free_callback would not help here, as it is meant for scenarios where amdgpu decides on its own that it needs to remove the buffer and must notify hfi1 (although we guarantee that this does not occur so this is not called, as you note); a user freeing the VA does not fall into this category.
Thanks for the clarification, I discussed this with the internal team. Unfortunately, we do not provide support for this. From our point of view, hfi1 needs to have full control over the VA, and the user freeing the memory outside of hfi1's control is an access violation. One option is to provide an interface by which users can notify hfi1 that they are going to free the VA, so that hfi1 can modify its cache appropriately before the VA is freed. In any scenario, users freeing the VA without notifying hfi1 should be treated as a memory access violation, and we do not provide a mechanism to protect you from that.
We can implement such a notification interface for hfi1, though we would prefer it if there were an automatic mechanism to be notified of ROCm-VA:GPU-page unmappings.
I have two further questions:
Are all ROCm-VA:GPU-page mappings mapped into the process virtual address space? If so, can we monitor those ROCm-VA:GPU-page unmappings with mmu_interval_notifier?

How does ROCm/ROCR call into amdgpu to unmap a ROCm-VA:GPU-page mapping (e.g. when doing hipFree())?
We are interested in hooking hipFree() to then notify hfi1 of a VA:page unmapping and want to make sure that we can find other places that might result in a VA:page unmapping, whether they be from userspace or within the kernel.
The free_callback would not help here, as it is meant for scenarios where amdgpu decides on its own that it needs to remove the buffer and must notify hfi1 (although we guarantee that this does not occur so this is not called, as you note); a user freeing the VA does not fall into this category.
Okay.
Sorry for the delay, I've gotten some answers from the internal team.
Are all ROCm-VA:GPU-page mappings mapped into the process virtual address space? If so, can we monitor those ROCm-VA:GPU-page unmappings with mmu_interval_notifier?
Memory allocated by hipMalloc is device memory and shows up as a device backed memory region in the process VMA. This is fully controlled by amdgpu, and there will be no MMU notifications for these buffers.
How does ROCm/ROCR call into amdgpu to unmap a ROCm-VA:GPU-page mapping (e.g. when doing hipFree())?

We are interested in hooking hipFree() to then notify hfi1 of a VA:page unmapping and want to make sure that we can find other places that might result in a VA:page unmapping, whether they be from userspace or within the kernel.
The path is HIP calls --> common language runtime --> rocm runtime --> via ioctl --> amdgpu driver. This can be done, but it sounds like you'll have to insert these hooks into clr (https://github.com/ROCm/clr), and the user will have to install your modified version. If this is the path you want to take, we can reach out to the HIP team for more guidance on this if necessary.
Sorry for the delay, I've gotten some answers from the internal team.
Are all ROCm-VA:GPU-page mappings mapped into the process virtual address space? If so, can we monitor those ROCm-VA:GPU-page unmappings with mmu_interval_notifier?
We have modified our driver, hfi1, to monitor ROCm VA ranges for UNMAP with mmu_interval_notifier and remove those entries from our cache.
Prior to this change, our reproducer failed. With this change, our reproducer now passes.
Memory allocated by hipMalloc is device memory and shows up as a device backed memory region in the process VMA. This is fully controlled by amdgpu, and there will be no MMU notifications for these buffers.
How does ROCm/ROCR call into amdgpu to unmap a ROCm-VA:GPU-page mapping (e.g. when doing hipFree())? We are interested in hooking hipFree() to then notify hfi1 of a VA:page unmapping and want to make sure that we can find other places that might result in a VA:page unmapping, whether they be from userspace or within the kernel.

The path is HIP calls --> common language runtime --> rocm runtime --> via ioctl --> amdgpu driver. This can be done, but it sounds like you'll have to insert these hooks into clr (https://github.com/ROCm/clr), and the user will have to install your modified version. If this is the path you want to take, we can reach out to the HIP team for more guidance on this if necessary.
This is not the path we want to take.
Is it possible that all ROCm VA ranges are in fact mapped into the process with the MMU? If not, under which cases would they not be, i.e. which cases do we have to consider doing an unmap ioctl for?
We have modified our driver, hfi1 to monitor ROCm VA ranges for UNMAP with mmu_interval_notifier and remove those entries from our cache.
Prior to this change, our reproducer failed. With this change, our reproducer now passes.
Is it possible that all ROCm VA ranges are in fact mapped into the process with the MMU? If not, under which cases would they not be, i.e. which cases do we have to consider doing an unmap ioctl for?
Interesting, this isn't what I would expect based on the info I was given, but hopefully this is reliable behavior. I'll ask the internal team about this and get back to you.
To clarify, in your reproducer (and I assume in the intended usecase), the memory that the VAs point to is device memory allocated with hipMalloc and freed with hipFree?
Our reproducer is osu_bibw built from the OSU Microbenchmarks 7.4 source code available here.

I am 99% certain that the only ROCm calls in the benchmark code are HIP runtime calls and that the only AMD GPU device allocations are done via hipMalloc().
I have double-checked with one of our userspace library developers; he says that our userspace library does not allocate any ROCm buffers of its own and that it only passes the ROCm buffers the library is given to our driver.
To clarify, in your reproducer (and I assume in the intended usecase), the memory that the VAs point to is device memory allocated with hipMalloc and freed with hipFree?
Yes, I'm 99% certain that the VAs come solely from ROCm buffers allocated with hipMalloc().
Thanks for the confirmation. Is it possible to share the code you're inserting to use the mmu notifier?
Yes, hfi1/pin_amd.c amd_node_mmu_register() is where our driver subscribes with mmu_interval_notifier for notifications about the ROCm VA range it was given.
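For readers following along, registering an mmu_interval_notifier for a VA range and reacting to UNMAP events looks roughly like the sketch below. The names are hypothetical; this is not the actual amd_node_mmu_register() implementation.

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>
#include <linux/mm.h>
#include <linux/sched.h>

struct va_cache_entry {
	struct mmu_interval_notifier notifier;
	/* ... cached DMA mapping state for this ROCm VA range ... */
};

/* Invoked by the MM whenever the monitored VA range is invalidated. */
static bool va_cache_invalidate(struct mmu_interval_notifier *mni,
				const struct mmu_notifier_range *range,
				unsigned long cur_seq)
{
	struct va_cache_entry *entry =
		container_of(mni, struct va_cache_entry, notifier);

	mmu_interval_set_seq(mni, cur_seq);

	if (range->event == MMU_NOTIFY_UNMAP) {
		/* The VA range is being unmapped (e.g. munmap() from hipFree());
		 * evict `entry` from the cache here so later lookups miss. */
	}
	return true;
}

static const struct mmu_interval_notifier_ops va_cache_mni_ops = {
	.invalidate = va_cache_invalidate,
};

/* Subscribe for notifications on [start, start + length) of the current mm. */
static int va_cache_register(struct va_cache_entry *entry,
			     unsigned long start, unsigned long length)
{
	return mmu_interval_notifier_insert(&entry->notifier, current->mm,
					    start, length, &va_cache_mni_ops);
}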
@BrendanCunningham After discussing this further with the internal team, this does work as you've observed; ROCm (libhsakmt) calls mmap in hipMalloc() and munmap in hipFree(), and when munmap is called, hfi1 should get mmu notification of it. You should be able to rely on this behaviour. Let me know if you have any follow-up questions about this or if there's anything else you need.
This is satisfactory for us then. Thanks for your help.
Problem Description
TLDR
amdgpu does not call the free callback that our Linux kernel driver passes into rdma_get_pages(). This results in stale ROCm-buffer:DMA-mapped-page cache entries and wrong cache entry lookups. Ultimately, the wrong pages are used in constructing packets and the wrong data is sent.

Additionally, our driver requires assurances that once acquired, DMA addresses to AMD GPU pages remain valid until either the driver calls rdma_put_pages(), or our page invalidation callback has been called and we can delay DMA-unmapping until our outstanding I/O operations with those pages have completed.

Detailed description
Our Linux kernel driver (hfi1) for our HPC fabric interface card (HFI) calls rdma_get_pages() in amdgpu to get DMA-mapped AMD GPU pages for a ROCm buffer. It does this so it can pass the DMA addresses into the HFI's packet construction engine (SDMA) to fill the packet payloads using DMA.
To avoid repeated rdma_get_pages() calls for the same buffer, hfi1 maintains a cache of ROCm-virtual-address-range:DMA-mapped-AMD-GPU-pages ("VA:DMA" entry or mapping). hfi1 has a cache per user context (userspace process). This is hfi1’s AMD-DMA cache.
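To illustrate the shape of such a cache, here is a much-simplified sketch. The list-based storage, field names, and lookup rule are assumptions for illustration, not the actual pin_amd.c data structures; it only shows why a stale entry leads to wrong DMA addresses.

#include <linux/list.h>
#include <linux/types.h>

struct vadma_entry {
	struct list_head list;
	u64 va;				/* start of the cached ROCm VA range */
	u64 len;
	struct amd_p2p_info *p2p;	/* DMA-mapped GPU pages from rdma_get_pages() */
};

static LIST_HEAD(vadma_cache);

/*
 * Lookup is keyed purely on virtual addresses. If a buffer was freed without
 * the free callback firing, a later allocation that reuses the same VA range
 * matches this stale entry and its old (now wrong) DMA addresses.
 */
static struct vadma_entry *vadma_lookup(u64 va, u64 len)
{
	struct vadma_entry *e;

	list_for_each_entry(e, &vadma_cache, list)
		if (va >= e->va && va + len <= e->va + e->len)
			return e;
	return NULL;
}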
hfi1 provides rdma_get_pages() with a free callback function.
hfi1 evicts entries from the DMA cache in two cases:
In our experience however, hfi1's free callback is never called. As a result, hfi1 cannot remove the VA:DMA entry for a ROCm buffer when the ROCm buffer is freed.
This causes a problem: When a new ROCm buffer gets a virtual address that overlaps with the VA range of a previously allocated-and-freed ROCm buffer that was passed into hfi1, hfi1 will find the old ROCm buffer's VA:DMA entry in its cache. hfi1 will pass the stale entry's DMA address into the HFI's SDMA engine. The adapter will then construct the packet from the wrong pages with the wrong data.
We are able to reproduce this problem in testing using osu_multi_lat --warmup 0 --validation-warmup 0 -i 1 -c -m 1:4200000 H D. Our driver with AMD-DMA support is not yet public.

We can work around this by disabling hfi1's AMD-DMA cache. With hfi1's AMD-DMA cache disabled, we do an rdma_put_pages() as soon as we have detected that the packet was sent. However, this may come at a performance cost.
Additionally, hfi1 needs assurances that DMA addresses for DMA-mapped pages will remain valid until either:

* hfi1 calls rdma_put_pages() for those pages, or
* hfi1's page invalidation callback has been called and hfi1 can delay DMA-unmapping until its outstanding I/O operations with those pages have completed.
Operating System
Red Hat Enterprise Linux 9.4 (Plow)
CPU
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
GPU
AMD Instinct MI100
ROCm Version
ROCm 6.2.0
ROCm Component
ROCK-Kernel-Driver
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
Additional Information
No response