BrendanCunningham commented 3 months ago

Problem Description

TLDR

amdgpu does not call the free callback that our Linux kernel driver passes into rdma_get_pages(). Results in stale ROCm-buffer:DMA-mapped-page cache entries and wrong cache entry lookup. Ultimately, wrong pages are used in constructing packet and wrong data sent.

Additionally, our driver requires assurances that once acquired, DMA addresses to AMD GPU pages remain valid until either driver calls rdma_put_pages() or only after our page invalidation callback has been called and we can delay DMA-unmapping until our outstanding I/O operations with those pages have completed.

Detailed description

Our Linux kernel driver (hfi1) for our HPC fabric interface card (HFI) calls rdma_get_pages() in amdgpu to get DMA-mapped AMD GPU pages for a ROCm buffer. It does this so it can pass the DMA addresses into the HFI's packet construction engine (SDMA) to fill the packet payloads using DMA.

To avoid repeated rdma_get_pages() calls for the same buffer, hfi1 maintains a cache of ROCm-virtual-address-range:DMA-mapped-AMD-GPU-pages ("VA:DMA" entry or mapping). hfi1 has a cache per user context (userspace process). This is hfi1’s AMD-DMA cache.

hfi1 provides rdma_get_pages() with a free callback function.

hfi1 evict entries from the DMA cache in two cases:

When the cache is full. In this case, hfi1 calls amdgpu's rdma_put_pages().
When hfi1’s free callback is called, presumably because the ROCm buffer has been freed.
- In this case, our free callback waits for outstanding I/O involving those pages to complete before returning.

In our experience however, hfi1's free callback is never called. As a result, hfi1 cannot remove the VA:DMA entry for a ROCm buffer when the ROCm buffer is freed.

This causes a problem: When a new ROCm buffer gets a virtual address that overlaps with the VA range of a previously allocated-and-freed ROCm buffer that was passed into hfi1, hfi1 will find the old ROCm buffer's VA:DMA entry in its cache. hfi1 will pass the stale entry's DMA address into the HFI's SDMA engine. The adapter will then construct the packet from the wrong pages with the wrong data.

We are able to reproduce this problem in testing using osu_multi_lat --warmup 0 --validation-warmup 0 -i 1 -c -m 1:4200000 H D. Our driver with AMD-DMA support is not yet public.

We can work around this by disabling hfi1's AMD-DMA cache. With hfi1's AMD-DMA cache disabled, we do an rdma_put_pages() as soon as we have detected that the packet was sent. However this may come at a performance cost.

Additionally, hfi1 needs assurances DMA addresses for DMA-mapped pages will remain valid until either:

hfi1 calls rdma_put_pages(), OR
after amdgpu calls hfi1’s free callback for the GPU pages and hfi1 returns from that callback.
- It is important that hfi1 be able to delay returning until the outstanding I/O has completed.

Operating System

Red hat Enterprise Linux 9.4 (Plow)

CPU

Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.2.0

ROCm Component

ROCK-Kernel-Driver

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

[37mROCk module version 6.8.5 is loaded[0m
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  Uuid:                    CPU-XX                             
  Marketing Name:          Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3600                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            44                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    32286776(0x1eca838) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32286776(0x1eca838) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32286776(0x1eca838) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  Uuid:                    CPU-XX                             
  Marketing Name:          Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3600                               
  BDFID:                   0                                  
  Internal Node ID:        1                                  
  Compute Unit:            44                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    33015296(0x1f7c600) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    33015296(0x1f7c600) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33015296(0x1f7c600) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 3                  
*******                  
  Name:                    gfx908                             
  Uuid:                    GPU-95386651081add54               
  Marketing Name:          AMD Instinct MI100                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      8192(0x2000) KB                    
  Chip ID:                 29580(0x738c)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1502                               
  BDFID:                   1280                               
  Internal Node ID:        2                                  
  Compute Unit:            120                                
  SIMDs per CU:            4                                  
  Shader Engines:          8                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 67                                 
  SDMA engine uCode::      18                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***

Additional Information

No response

ppanchad-amd commented 3 months ago

@BrendanCunningham Internal ticket has been created to investigate this issue. Thanks!

schung-amd commented 3 months ago

Hi @BrendanCunningham, while the driver is not yet public, is it possible to provide some code with your caching logic that reproduces the issue? I've reached out to our kernel driver team, and my understanding is that the buffer should not be freed before rdma_put_pages is called, so this should be working in theory. Of course, this would mean that the free_callback is vestigial, so I'd like to figure out if this buffer can be freed somewhere else, as a search through our codebase does not show anywhere this would be used.

BrendanCunningham commented 3 months ago

@schung-amd

Hi @BrendanCunningham, while the driver is not yet public, is it possible to provide some code with your caching logic that reproduces the issue?

Yes. I misspoke before; our AMD DMA code is not ready but it is public. hfi1/pin_amd.c has our send-side AMD DMA support, i.e. construct packets using DMA instead of CPU-driven copies.

I've reached out to our kernel driver team, and my understanding is that the buffer should not be freed before rdma_put_pages is called, so this should be working in theory.

Does that mean that DMA addresses for DMA-mapped AMD pages obtained with rdma_get_pages() are guaranteed to remain valid until after we explicitly call rdma_put_pages()?

Of course, this would mean that the free_callback is vestigial, so I'd like to figure out if this buffer can be freed somewhere else, as a search through our codebase does not show anywhere this would be used.

Does amdgpu have pin+DMA-mapped page cache to avoid unpinning and repinning frequently DMA-mappeed pages, within the context of a user process?

schung-amd commented 3 months ago

Thanks for the quick response!

Yes. I misspoke before; our AMD DMA code is not ready but it is public. hfi1/pin_amd.c has our send-side AMD DMA support, i.e. construct packets using DMA instead of CPU-driven copies.

Perfect, thanks. I'll try to reproduce what you're seeing here.

Does that mean that DMA addresses for DMA-mapped AMD pages obtained with rdma_get_pages() are guaranteed to remain valid until after we explicitly call rdma_put_pages()?

Yes, this is what I have been told. When callingrdma_get_pages(), the amdgpu VRAM manager pins VRAM pages for the user which are not released until rdma_put_pages() is called. When there is VRAM pressure, the VRAM manager stops user processes and migrates pages to relieve the pressure, but pages pinned by rdma_get_pages() are exempt and will not be evicted by the VRAM manager, so your cached addresses should remain valid. However, as this does not appear to match your observations, I'm discussing with the internal team for further understanding on this.

Does amdgpu have pin+DMA-mapped page cache to avoid unpinning and repinning frequently DMA-mappeed pages, within the context of a user process?

Do you mean a cache that can be accessed by the user to retrieve these pages, to eliminate the need for a cache on your end? I'll discuss this with the internal team and update you when I have more information.

BrendanCunningham commented 3 months ago

Thanks for the quick response!

Yes. I misspoke before; our AMD DMA code is not ready but it is public. hfi1/pin_amd.c has our send-side AMD DMA support, i.e. construct packets using DMA instead of CPU-driven copies.

Perfect, thanks. I'll try to reproduce what you're seeing here.

I have a workaround in pin_amd.c that disables the cache. This workaround is not public yet. It makes two changes:

In insert_amd_pinning(), do not insert struct amd_pintree_node objects into the pintree.
1. Makes it so our driver calls rdma_get_pages() for every ROCm VA range passed into it.
Manage struct amd_pintree_node lifetime with struct kref and suitable kref_put() destructors instead of the atomic_t refcount based code now.
1. Makes it so our driver does rdma_put_pages() as soon as our hardware signals packet send completion. As opposed to waiting for cache eviction.

With this workaround, we no longer send incorrect data.

Does that mean that DMA addresses for DMA-mapped AMD pages obtained with rdma_get_pages() are guaranteed to remain valid until after we explicitly call rdma_put_pages()?

Yes, this is what I have been told. When callingrdma_get_pages(), the amdgpu VRAM manager pins VRAM pages for the user which are not released until rdma_put_pages() is called. When there is VRAM pressure, the VRAM manager stops user processes and migrates pages to relieve the pressure, but pages pinned by rdma_get_pages() are exempt and will not be evicted by the VRAM manager, so your cached addresses should remain valid. However, as this does not appear to match your observations, I'm discussing with the internal team for further understanding on this.

To be absolutely clear, we are not having problems with pages obtained with rdma_get_pages() being migrated. Just that it would be a problem for us if DMA-mapped pages were migrated underneath us or the DMA addresses otherwise became invalid.

The problem that we do have is that since our free_callback is not called, we can't maintain our VA-range:DMA-mapping (struct amd_p2p_info) cache properly and end up with stale cache entries. Which can then lead to us using the wrong DMA addresses.

Does amdgpu have pin+DMA-mapped page cache to avoid unpinning and repinning frequently DMA-mappeed pages, within the context of a user process?

Do you mean a cache that can be accessed by the user to retrieve these pages, to eliminate the need for a cache on your end? I'll discuss this with the internal team and update you when I have more information.

I mean a cache in the amdgpu driver for other drivers like ours, not for userspace. I.e. is there a cache behind rdma_get_pages()? But yes, to eliminate the need for a cache on our end.

BrendanCunningham commented 2 months ago

@schung-amd Also, is it safe to call rdma_put_pages() from an interrupt context? My workaround involves calling rdma_put_pages() in our hardware completion handler, which is called via interrupt.

schung-amd commented 2 months ago

Hi @BrendanCunningham, thanks for following up on this! Sorry for the delay, I'm trying to collect more information from our internal teams before providing answers because I don't have a complete understanding of DMA, so some of these points and followup questions may be unclear; feel free to correct me or clarify on anything. So far my understanding is:

We guarantee that pages pinned by rdma_get_pages() are resident until rdma_put_pages() is called;
free_callback() is never called, as you have already observed in testing and also in the codebase, and is probably not needed;
I don't think there is a cache in the driver you can use, but still inquiring about this; and
I don't know how your cached addresses are becoming invalid.

Are you relying purely on the callback to inform when to remove entries from your cache, even when invoking rdma_put_pages()? In this case, the callback will never be called so the cache will still have stale entries. Or are the addresses changing underneath the hood somewhere, and you are expecting free_callback to be invoked? From your initial post, it sounds like pages are being freed without being handled by the cache, but if my understanding is correct then this should not be happening. Again, my understanding of DMA is incomplete, so there may be something obvious here I'm missing.

I will inquire regarding calling rdma_put_pages() in an interrupt context. I suspect it is not safe to do so, but I'll let you know what the internal teams suggest as soon as I have that information.

BrendanCunningham commented 2 months ago

Hi @BrendanCunningham, thanks for following up on this! Sorry for the delay, I'm trying to collect more information from our internal teams before providing answers because I don't have a complete understanding of DMA, so some of these points and followup questions may be unclear; feel free to correct me or clarify on anything. So far my understanding is:
* We guarantee that pages pinned by rdma_get_pages() are resident until rdma_put_pages() is called;

* free_callback() is never called, as you have already observed in testing and also in the codebase, and is probably not needed;

* I don't think there is a cache in the driver you can use, but still inquiring about this; and

* I don't know how your cached addresses are becoming invalid.
Are you relying purely on the callback to inform when to remove entries from your cache, even when invoking rdma_put_pages()? In this case, the callback will never be called so the cache will still have stale entries.

Correct; that is the problem. That we are given a ROCm buffer, we add a cache entry, then at some point that buffer is freed and a new ROCm buffer with different physical pages but the same or overlapping virtual address range as the old buffer is passed into our driver and we fetch the stale cache entry.

Or are the addresses changing underneath the hood somewhere, and you are expecting free_callback to be invoked?

Edit: No, the DMA addresses are not changing underneath us; I only asked if the addresses were guaranteed stable to rule that out as a source of error.

From your initial post, it sounds like pages are being freed without being handled by the cache, but if my understanding is correct then this should not be happening. Again, my understanding of DMA is incomplete, so there may be something obvious here I'm missing.

I will inquire regarding calling rdma_put_pages() in an interrupt context. I suspect it is not safe to do so, but I'll let you know what the internal teams suggest as soon as I have that information.

Okay, thanks.

schung-amd commented 2 months ago

@BrendanCunningham Still gathering information re: calling rdma_put_pages() from an interrupt context; the internal team initially recommends against calling it, but is digging into the code to check.

Correct; that is the problem. That we are given a ROCm buffer, we add a cache entry, then at some point that buffer is freed and a new ROCm buffer with different physical pages but the same or overlapping virtual address range as the old buffer is passed into our driver and we fetch the stale cache entry.

Is your driver in control of freeing the buffer, or is something else freeing the buffer? Is it possible to place the free_callback logic to modify your cache entries directly where you are calling rdma_put_pages()?

BrendanCunningham commented 2 months ago

@BrendanCunningham Still gathering information re: calling rdma_put_pages() from an interrupt context; the internal team initially recommends against calling it, but is digging into the code to check.

Correct; that is the problem. That we are given a ROCm buffer, we add a cache entry, then at some point that buffer is freed and a new ROCm buffer with different physical pages but the same or overlapping virtual address range as the old buffer is passed into our driver and we fetch the stale cache entry.

Is your driver in control of freeing the buffer, or is something else freeing the buffer? Is it possible to place the free_callback logic to modify your cache entries directly where you are calling rdma_put_pages()?

Our driver is not in control of freeing the buffer; we can only react to others (userspace or amdgpu) freeing the buffers and only if those other entities call free_callback.

Absent free_callback, our only good/safe option is disabling our caching code.

schung-amd commented 2 months ago

According to the internal team, your driver should have full control over the lifetime of the buffers; amdgpu guarantees that your buffers are resident until the driver calls rdma_put_pages() on them, and by design userspace should have no way of freeing them safely outside of your driver's control. Do you have any logs showing how/when your buffers are being freed without the driver knowing?

ddalessa commented 2 months ago

We will update with logs soon.

BrendanCunningham commented 2 months ago

Here are pr_debug() printout logs from two hosts in a 2 rank, 2 host job with our driver (hfi1) with our AMD DMA cache enabled:

Note the absence of invalidate_sdma_pages_gpu lines in either log; those events would only be printed from our free_callback function, which is not called. unpin_sdma_pages_gpu lines are not the same as invalidate_sdma_pages_gpu. In these logs, unpin_sdma_pages_gpu lines are printed when the user file descriptor is torn down and any entries in our SDMA cache are freed.

Here is the job log:

run.log

The job fails on a data validation error because our driver cannot maintain correct DMA cache entries.

Here is the script I ran on both hosts after loading our driver:

#!/bin/bash
set -x
alias ddcmd='echo $* > /proc/dynamic_debug/control'
shopt -s expand_aliases
dmesg -n debug
dmesg -C
ddcmd '-p; module hfi1 file pin_amd.c +p'
ddcmd 'module amdgpu +p'

schung-amd commented 2 months ago

Thanks for the logs! I'll pass them on to the internal team for more insight. As discussed, I wouldn't expect the callback to be called anywhere, as the internal team has stated that it is not used, so the absence of invalidate_sdma_pages_gpu is expected. Apologies if these seem like trivial questions, but to confirm: where unpin_sdma_pages_gpu is being called, is this happening outside of your control (i.e. from amdgpu, which should be guaranteed to not happen, or from userspace somehow)? Is it possible to modify your cache inside unpin_sdma_pages_gpu?

BrendanCunningham commented 2 months ago

Thanks for the logs! I'll pass them on to the internal team for more insight. As discussed, I wouldn't expect the callback to be called anywhere, as the internal team has stated that it is not used, so the absence of invalidate_sdma_pages_gpu is expected. Apologies if these seem like trivial questions, but to confirm: where unpin_sdma_pages_gpu is being called, is this happening outside of your control (i.e. from amdgpu, which should be guaranteed to not happen, or from userspace somehow)?

unpin_sdma_pages_gpu is under our control.

Note that the actual function in our code is unpin_amd_node(); unpin_sdma_pages_gpu is the trace event emitted by the tracepoint in unpin_amd_node(). But to avoid further confusion, I'll stick to calling it unpin_sdma_pages_gpu() since that's what's in our logs and is only used in unpin_amd_node().

Is it possible to modify your cache inside unpin_sdma_pages_gpu?

No, it is not possible; we will only call unpin_sdma_pages_gpu() after we have evicted the cache entry from our cache and the refcount for that pinning object hits 0.

Even if amdgpu did call our free_callback, our code wouldn't call unpin_sdma_pages_gpu() in that path on the assumption that amdgpu will unpin after we return and that calling into amdgpu from inside the free_callback path could deadlock.

Our free_callback would evict the cache entry from the cache as soon as we get the lock for our cache tree. This is to prevent calls into our driver from other CPUs from getting the old cache entry. Our free_callback would then wait for all outstanding I/O on our cache object to complete before returning.

ddalessa commented 2 months ago

Want to make sure we are all on the same page here and that I understand things. I think what I'm hearing is that the high level issue is our driver doesn't know when the VA changes. The PA is fine, those pages are pinned. The put/get ops handle that.

Basically what happens is this:

User allocates a buffer on GPU (probably with something like hipMalloc()). This gets the user a pointer, call it VA1 for the buffer
User goes to ask the hfi1 driver to send that buffer telling it the VA1
Now hfi1 goes to pin those pages and effectively cache a VA1 to PA1 mapping. This is a time consuming operation and why we need to cache.
User does a hipFree() or something else happens to free that VA.
AMD Driver can free that mapping. The pages are still pinned. The hfi can still access the memory. This is not the problem.
User does another allocation and gets another buffer somehow this VA is the same as it was in step 1 so consider it VA1 still.
User asks hfi1 to send the new buffer again with VA1, and the hfi1 looks up the address associated in our cache and finds PA1, never pinning PA2. The end result is hfi1 sends data from the wrong PA.

There is no mechanism right now for the hfi1 driver to know that VA1 mapping should now be pointing at PA2.

The ask here is that the AMD driver use the call back to inform the driver that we need to handle this scenario in (5).

So while @schung-amd you are correct, the AMD driver doesn't NEED to call the callback. Other users of the buffer do need it to be called. Otherwise we can't cache those VA<->PA mappings. Now for things that use a separate memory registration scheme this may not be an issue, but for our pseudo on-demand paging scheme it is.

@BrendanCunningham Please correct any of my misunderstanding in the above.

BrendanCunningham commented 2 months ago

@ddalessa no corrections; that is a good summary

schung-amd commented 2 months ago

Thanks for the clarification, I discussed this with the internal team. Unfortunately, we do not provide support for this. From our point of view, hfi1 needs to have full control over the VA, and the user freeing the memory outside of hfi1's control is an access violation. One option is to provide an interface by which users can notify hfi1 that they are going to free the VA, so that hfi1 can modify its cache appropriately before the VA is freed. In any scenario, users freeing the VA without notifying hfi1 should be treated as a memory access violation, and we do not provide a mechanism to protect you from that.

The free_callback would not help here, as it is meant for scenarios where amdgpu decides on its own that it needs to remove the buffer and must notify hfi1 (although we guarantee that this does not occur so this is not called, as you note); a user freeing the VA does not fall into this category.

BrendanCunningham commented 2 months ago

Thanks for the clarification, I discussed this with the internal team. Unfortunately, we do not provide support for this. From our point of view, hfi1 needs to have full control over the VA, and the user freeing the memory outside of hfi1's control is an access violation. One option is to provide an interface by which users can notify hfi1 that they are going to free the VA, so that hfi1 can modify its cache appropriately before the VA is freed. In any scenario, users freeing the VA without notifying hfi1 should be treated as a memory access violation, and we do not provide a mechanism to protect you from that.

We can implement such a notification interface for hfi1, though we would prefer it if there were an automatic mechanism to be notified of ROCm-VA:GPU-page unmappings.

I have two further questions:

Are all ROCm-VA:GPU-page mappings mapped into the process virtual address space? If so, can we monitor those ROCm-VA:GPU-page unmappings with mmu_interval_notifier?
How does ROCm/ROCR call into amdgpu to unmap a ROCm-VA:GPU-page mapping (e.g. when doing hipFree())?

We are interested in hooking hipFree() to then notify hfi1 of a VA:page unmapping and want to make sure that we can find other places that might result in a VA:page unmapping, whether they be from userspace or within the kernel.

The free_callback would not help here, as it is meant for scenarios where amdgpu decides on its own that it needs to remove the buffer and must notify hfi1 (although we guarantee that this does not occur so this is not called, as you note); a user freeing the VA does not fall into this category.

Okay.

schung-amd commented 1 month ago

Sorry for the delay, I've gotten some answers from the internal team.

Are all ROCm-VA:GPU-page mappings mapped into the process virtual address space? If so, can we monitor those ROCm-VA:GPU-page unmappings with mmu_interval_notifier?

Memory allocated by hipMalloc is device memory and shows up as a device backed memory region in the process VMA. This is fully controlled by amdgpu, and there will be no MMU notifications for these buffers.

How does ROCm/ROCR call into amdgpu to unmap a ROCm-VA:GPU-page mapping (e.g. when doing hipFree())?

We are interested in hooking hipFree() to then notify hfi1 of a VA:page unmapping and want to make sure that we can find other places that might result in a VA:page unmapping, whether they be from userspace or within the kernel.

The path is HIP calls --> common language runtime --> rocm runtime --> via ioctl --> amdgpu driver. This can be done, but it sounds like you'll have to insert these hooks into clr (https://github.com/ROCm/clr), and the user will have to install your modified version. If this is the path you want to take, we can reach out to the HIP team for more guidance on this if necessary.

BrendanCunningham commented 1 month ago

Sorry for the delay, I've gotten some answers from the internal team.

Are all ROCm-VA:GPU-page mappings mapped into the process virtual address space? If so, can we monitor those ROCm-VA:GPU-page unmappings with mmu_interval_notifier?

We have modified our driver, hfi1 to monitor ROCm VA ranges for UNMAP with mmu_interval_notifier and remove those entries from our cache.

Prior to this change, our reproducer failed. With this change, our reproducer now passes.

Memory allocated by hipMalloc is device memory and shows up as a device backed memory region in the process VMA. This is fully controlled by amdgpu, and there will be no MMU notifications for these buffers.

How does ROCm/ROCR call into amdgpu to unmap a ROCm-VA:GPU-page mapping (e.g. when doing hipFree())? We are interested in hooking hipFree() to then notify hfi1 of a VA:page unmapping and want to make sure that we can find other places that might result in a VA:page unmapping, whether they be from userspace or within the kernel.

The path is HIP calls --> common language runtime --> rocm runtime --> via ioctl --> amdgpu driver. This can be done, but it sounds like you'll have to insert these hooks into clr (https://github.com/ROCm/clr), and the user will have to install your modified version. If this is the path you want to take, we can reach out to the HIP team for more guidance on this if necessary.

This is not the path we want to take.

Is it possible that all ROCm VA ranges are in-fact mapped into the process with the MMU? If not, under which cases would they not be, i.e. which cases do we have to consider doing an unmap ioctl for?

schung-amd commented 1 month ago

We have modified our driver, hfi1 to monitor ROCm VA ranges for UNMAP with mmu_interval_notifier and remove those entries from our cache.

Prior to this change, our reproducer failed. With this change, our reproducer now passes.

Is it possible that all ROCm VA ranges are in-fact mapped into the process with the MMU? If not, under which cases would they not be, i.e. which cases do we have to consider doing an unmap ioctl for?

Interesting, this isn't what I would expect based on the info I was given, but hopefully this is reliable behavior. I'll ask the internal team about this and get back to you.

schung-amd commented 1 month ago

To clarify, in your reproducer (and I assume in the intended usecase), the memory that the VAs point to is device memory allocated with hipMalloc and freed with hipFree?

BrendanCunningham commented 1 month ago

Our reproducer is osu_bibw built from the OSU Microbenchmarks 7.4 source code available here.

I am 99% certain that the only ROCm calls in the benchmark code are HIP runtime calls and that the only AMD GPU device allocations are done via hipMalloc().

I have double-checked with one of our userspace library developers; he says that our userspace library does not allocate any ROCm buffers of its own and that it only passes the ROCm buffers the library is given to our driver.

To clarify, in your reproducer (and I assume in the intended usecase), the memory that the VAs point to is device memory allocated with hipMalloc and freed with hipFree?

Yes, I'm 99% certain that the VAs come solely from ROCm buffers allocated with hipMalloc().

schung-amd commented 1 month ago

Thanks for the confirmation. Is it possible to share the code you're inserting to use the mmu notifier?

BrendanCunningham commented 1 month ago

Yes, hfi1/pin_amd.c amd_node_mmu_register() is where our driver subscribes with mmu_interval_notifier for notifications about the ROCm VA range it was given.

schung-amd commented 1 month ago

@BrendanCunningham After discussing this further with the internal team, this does work as you've observed; ROCm (libhsakmt) calls mmap in hipMalloc() and munmap in hipFree(), and when munmap is called, hfi1 should get mmu notification of it. You should be able to rely on this behaviour. Let me know if you have any follow-up questions about this or if there's anything else you need.

BrendanCunningham commented 1 month ago

This is satisfactory for us then. Thanks for your help.

ROCm / ROCK-Kernel-Driver

[Issue]: free_callback passed into rdma_get_pages() never called #169

Problem Description

TLDR

Detailed description

Operating System

CPU

GPU

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

Additional Information