ROCm / ROCK-Kernel-Driver

AMDGPU Driver with KFD used by the ROCm project. Also contains the current Linux Kernel that matches this base driver

[Issue]: free_callback passed into rdma_get_pages() never called #169

Open BrendanCunningham opened 4 weeks ago

BrendanCunningham commented 4 weeks ago

Problem Description

TLDR

amdgpu never calls the free callback that our Linux kernel driver passes into rdma_get_pages(). This results in stale ROCm-buffer:DMA-mapped-page cache entries and incorrect cache-entry lookups. Ultimately, the wrong pages are used to construct packets and the wrong data is sent.

Additionally, our driver needs assurance that, once acquired, DMA addresses for AMD GPU pages remain valid until either our driver calls rdma_put_pages(), or our page-invalidation callback has been called and we have been able to delay DMA-unmapping until our outstanding I/O operations on those pages have completed.

Detailed description

Our Linux kernel driver (hfi1) for our HPC fabric interface card (HFI) calls rdma_get_pages() in amdgpu to get DMA-mapped AMD GPU pages for a ROCm buffer. It does this so it can pass the DMA addresses into the HFI's packet construction engine (SDMA) to fill the packet payloads using DMA.

To avoid repeated rdma_get_pages() calls for the same buffer, hfi1 maintains a per-user-context (userspace process) cache of ROCm-virtual-address-range:DMA-mapped-AMD-GPU-pages entries ("VA:DMA" entries or mappings). This is hfi1's AMD-DMA cache.

hfi1 provides rdma_get_pages() with a free callback function.

hfi1 evicts entries from its AMD-DMA cache in two cases (a simplified sketch of this flow follows the list):

  1. When the cache is full. In this case, hfi1 calls amdgpu's rdma_put_pages().
  2. When hfi1’s free callback is called, presumably because the ROCm buffer has been freed.
    • In this case, our free callback waits for outstanding I/O involving those pages to complete before returning.
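
To make the intended flow concrete, here is a minimal sketch of the registration and eviction paths described above. Everything prefixed hfi1_ below, the struct hfi1_amd_cache_entry type, and the exact rdma_get_pages() argument list are simplified or hypothetical; only rdma_get_pages(), rdma_put_pages(), struct amd_p2p_info, and the free callback itself are the amdgpu interface pieces discussed in this issue.

/*
 * Sketch only: hypothetical helper names, and the rdma_get_pages()
 * argument list is approximate (see amdgpu's amd_rdma interface for
 * the authoritative prototype).
 */
struct hfi1_amd_cache_entry {
	u64 va;                   /* start of the cached ROCm VA range */
	u64 len;                  /* length of the range               */
	struct amd_p2p_info *p2p; /* DMA-mapped GPU pages from amdgpu  */
	/* ... interval-tree linkage, refcount, ...                    */
};

/* Eviction case 2: amdgpu is expected to call this when the buffer is freed. */
static void hfi1_amd_free_callback(void *client_priv)
{
	struct hfi1_amd_cache_entry *entry = client_priv;

	hfi1_wait_for_outstanding_io(entry); /* drain SDMA using these pages */
	hfi1_cache_remove(entry);            /* drop the stale VA:DMA entry  */
	/* amdgpu unpins/unmaps the pages after this returns. */
}

static int hfi1_amd_cache_add(struct hfi1_ctx *ctx, u64 va, u64 len)
{
	struct hfi1_amd_cache_entry *entry;
	int ret;

	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
	if (!entry)
		return -ENOMEM;
	entry->va = va;
	entry->len = len;

	/* Pin + DMA-map the GPU pages and register the free callback. */
	ret = rdma_get_pages(va, len, ctx->pid, &entry->p2p,
			     hfi1_amd_free_callback, entry);
	if (ret) {
		kfree(entry);
		return ret;
	}

	/*
	 * Eviction case 1: if the cache is full, the insert path evicts a
	 * victim entry and calls rdma_put_pages() on it before inserting.
	 */
	hfi1_cache_insert(ctx, entry);
	return 0;
}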

In our experience, however, hfi1's free callback is never called. As a result, hfi1 cannot remove the VA:DMA entry for a ROCm buffer when that buffer is freed.

This causes a problem: When a new ROCm buffer gets a virtual address that overlaps with the VA range of a previously allocated-and-freed ROCm buffer that was passed into hfi1, hfi1 will find the old ROCm buffer's VA:DMA entry in its cache. hfi1 will pass the stale entry's DMA address into the HFI's SDMA engine. The adapter will then construct the packet from the wrong pages with the wrong data.

We are able to reproduce this problem in testing using osu_multi_lat --warmup 0 --validation-warmup 0 -i 1 -c -m 1:4200000 H D. Our driver with AMD-DMA support is not yet public.

We can work around this by disabling hfi1's AMD-DMA cache. With the cache disabled, we call rdma_put_pages() as soon as we detect that the packet was sent. However, this may come at a performance cost.

Additionally, hfi1 needs assurance that DMA addresses for DMA-mapped pages will remain valid until either:

  1. hfi1 calls rdma_put_pages(), or
  2. hfi1's page-invalidation callback has been called and hfi1's outstanding I/O operations involving those pages have completed (hfi1 may delay DMA-unmapping until then).

Operating System

Red Hat Enterprise Linux 9.4 (Plow)

CPU

Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.2.0

ROCm Component

ROCK-Kernel-Driver

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.8.5 is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  Uuid:                    CPU-XX                             
  Marketing Name:          Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3600                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            44                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    32286776(0x1eca838) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32286776(0x1eca838) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32286776(0x1eca838) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  Uuid:                    CPU-XX                             
  Marketing Name:          Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3600                               
  BDFID:                   0                                  
  Internal Node ID:        1                                  
  Compute Unit:            44                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    33015296(0x1f7c600) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    33015296(0x1f7c600) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33015296(0x1f7c600) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 3                  
*******                  
  Name:                    gfx908                             
  Uuid:                    GPU-95386651081add54               
  Marketing Name:          AMD Instinct MI100                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      8192(0x2000) KB                    
  Chip ID:                 29580(0x738c)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1502                               
  BDFID:                   1280                               
  Internal Node ID:        2                                  
  Compute Unit:            120                                
  SIMDs per CU:            4                                  
  Shader Engines:          8                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 67                                 
  SDMA engine uCode::      18                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Additional Information

No response

ppanchad-amd commented 3 weeks ago

@BrendanCunningham Internal ticket has been created to investigate this issue. Thanks!

schung-amd commented 3 weeks ago

Hi @BrendanCunningham, while the driver is not yet public, is it possible to provide some code with your caching logic that reproduces the issue? I've reached out to our kernel driver team, and my understanding is that the buffer should not be freed before rdma_put_pages is called, so this should be working in theory. Of course, this would mean that the free_callback is vestigial, so I'd like to figure out if this buffer can be freed somewhere else, as a search through our codebase does not show anywhere this would be used.

BrendanCunningham commented 2 weeks ago

@schung-amd

Hi @BrendanCunningham, while the driver is not yet public, is it possible to provide some code with your caching logic that reproduces the issue?

Yes. I misspoke before; our AMD DMA code is not ready but it is public. hfi1/pin_amd.c has our send-side AMD DMA support, i.e. construct packets using DMA instead of CPU-driven copies.

I've reached out to our kernel driver team, and my understanding is that the buffer should not be freed before rdma_put_pages is called, so this should be working in theory.

Does that mean that DMA addresses for DMA-mapped AMD pages obtained with rdma_get_pages() are guaranteed to remain valid until after we explicitly call rdma_put_pages()?

Of course, this would mean that the free_callback is vestigial, so I'd like to figure out if this buffer can be freed somewhere else, as a search through our codebase does not show anywhere this would be used.

Does amdgpu have a pin+DMA-mapped page cache to avoid unpinning and repinning frequently DMA-mapped pages, within the context of a user process?

schung-amd commented 2 weeks ago

Thanks for the quick response!

Yes. I misspoke before; our AMD DMA code is not ready but it is public. hfi1/pin_amd.c has our send-side AMD DMA support, i.e. construct packets using DMA instead of CPU-driven copies.

Perfect, thanks. I'll try to reproduce what you're seeing here.

Does that mean that DMA addresses for DMA-mapped AMD pages obtained with rdma_get_pages() are guaranteed to remain valid until after we explicitly call rdma_put_pages()?

Yes, this is what I have been told. When calling rdma_get_pages(), the amdgpu VRAM manager pins VRAM pages for the user, which are not released until rdma_put_pages() is called. When there is VRAM pressure, the VRAM manager stops user processes and migrates pages to relieve the pressure, but pages pinned by rdma_get_pages() are exempt and will not be evicted by the VRAM manager, so your cached addresses should remain valid. However, as this does not appear to match your observations, I'm discussing with the internal team for further understanding on this.

Does amdgpu have a pin+DMA-mapped page cache to avoid unpinning and repinning frequently DMA-mapped pages, within the context of a user process?

Do you mean a cache that can be accessed by the user to retrieve these pages, to eliminate the need for a cache on your end? I'll discuss this with the internal team and update you when I have more information.

BrendanCunningham commented 2 weeks ago

Thanks for the quick response!

Yes. I misspoke before; our AMD DMA code is not ready but it is public. hfi1/pin_amd.c has our send-side AMD DMA support, i.e. construct packets using DMA instead of CPU-driven copies.

Perfect, thanks. I'll try to reproduce what you're seeing here.

I have a workaround in pin_amd.c that disables the cache. This workaround is not public yet. It makes two changes:

  1. In insert_amd_pinning(), do not insert struct amd_pintree_node objects into the pintree.
    • This makes our driver call rdma_get_pages() for every ROCm VA range passed into it.
  2. Manage struct amd_pintree_node lifetime with struct kref and suitable kref_put() destructors, instead of the current atomic_t-refcount-based code.
    • This makes our driver call rdma_put_pages() as soon as our hardware signals packet-send completion, instead of waiting for cache eviction.

With this workaround, we no longer send incorrect data.
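
For reference, change 2 amounts to something like the following sketch (field and function names here are hypothetical, not the actual pin_amd.c code; only rdma_put_pages(), struct amd_p2p_info, and the struct amd_pintree_node name come from the code discussed above):

/* Sketch: kref-managed lifetime for a pinned range (hypothetical fields). */
struct amd_pintree_node {
	struct kref ref;
	struct amd_p2p_info *p2p;
	/* ... */
};

static void amd_pintree_node_release(struct kref *ref)
{
	struct amd_pintree_node *node =
		container_of(ref, struct amd_pintree_node, ref);

	rdma_put_pages(&node->p2p); /* unpin/unmap once the last user is done */
	kfree(node);
}

/* Called when the hardware signals packet-send completion. */
static void hfi1_amd_send_complete(struct amd_pintree_node *node)
{
	kref_put(&node->ref, amd_pintree_node_release);
}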

Does that mean that DMA addresses for DMA-mapped AMD pages obtained with rdma_get_pages() are guaranteed to remain valid until after we explicitly call rdma_put_pages()?

Yes, this is what I have been told. When calling rdma_get_pages(), the amdgpu VRAM manager pins VRAM pages for the user, which are not released until rdma_put_pages() is called. When there is VRAM pressure, the VRAM manager stops user processes and migrates pages to relieve the pressure, but pages pinned by rdma_get_pages() are exempt and will not be evicted by the VRAM manager, so your cached addresses should remain valid. However, as this does not appear to match your observations, I'm discussing with the internal team for further understanding on this.

To be absolutely clear, we are not having problems with pages obtained via rdma_get_pages() being migrated. It would only be a problem for us if DMA-mapped pages were migrated underneath us or the DMA addresses otherwise became invalid.

The problem we do have is that, since our free_callback is not called, we cannot maintain our VA-range:DMA-mapping (struct amd_p2p_info) cache properly and end up with stale cache entries, which can then lead to us using the wrong DMA addresses.

Does amdgpu have a pin+DMA-mapped page cache to avoid unpinning and repinning frequently DMA-mapped pages, within the context of a user process?

Do you mean a cache that can be accessed by the user to retrieve these pages, to eliminate the need for a cache on your end? I'll discuss this with the internal team and update you when I have more information.

I mean a cache in the amdgpu driver for other drivers like ours, not for userspace. I.e. is there a cache behind rdma_get_pages()? But yes, to eliminate the need for a cache on our end.

BrendanCunningham commented 1 week ago

@schung-amd Also, is it safe to call rdma_put_pages() from an interrupt context? My workaround involves calling rdma_put_pages() in our hardware completion handler, which is called via interrupt.
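
If it turns out not to be safe, one option would be to defer the call to a workqueue so that rdma_put_pages() runs in process context. A minimal sketch (hypothetical names, not the actual hfi1 completion path):

/* Sketch: defer rdma_put_pages() from the completion interrupt to a workqueue. */
struct hfi1_amd_put_work {
	struct work_struct work;
	struct amd_p2p_info *p2p;
};

static void hfi1_amd_put_pages_worker(struct work_struct *work)
{
	struct hfi1_amd_put_work *pw =
		container_of(work, struct hfi1_amd_put_work, work);

	rdma_put_pages(&pw->p2p); /* now running in process context */
	kfree(pw);
}

/* Called from the hardware send-completion interrupt handler. */
static void hfi1_amd_complete_from_irq(struct amd_p2p_info *p2p)
{
	struct hfi1_amd_put_work *pw = kmalloc(sizeof(*pw), GFP_ATOMIC);

	if (!pw)
		return; /* error handling elided in this sketch */

	pw->p2p = p2p;
	INIT_WORK(&pw->work, hfi1_amd_put_pages_worker);
	schedule_work(&pw->work);
}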

schung-amd commented 1 week ago

Hi @BrendanCunningham, thanks for following up on this! Sorry for the delay, I'm trying to collect more information from our internal teams before providing answers because I don't have a complete understanding of DMA, so some of these points and followup questions may be unclear; feel free to correct me or clarify on anything. So far my understanding is:

* We guarantee that pages pinned by rdma_get_pages() are resident until rdma_put_pages() is called;

* free_callback() is never called, as you have already observed in testing and also in the codebase, and is probably not needed;

* I don't think there is a cache in the driver you can use, but still inquiring about this; and

* I don't know how your cached addresses are becoming invalid.

Are you relying purely on the callback to inform when to remove entries from your cache, even when invoking rdma_put_pages()? In this case, the callback will never be called so the cache will still have stale entries. Or are the addresses changing underneath the hood somewhere, and you are expecting free_callback to be invoked? From your initial post, it sounds like pages are being freed without being handled by the cache, but if my understanding is correct then this should not be happening. Again, my understanding of DMA is incomplete, so there may be something obvious here I'm missing.

I will inquire regarding calling rdma_put_pages() in an interrupt context. I suspect it is not safe to do so, but I'll let you know what the internal teams suggest as soon as I have that information.

BrendanCunningham commented 1 week ago

Hi @BrendanCunningham, thanks for following up on this! Sorry for the delay, I'm trying to collect more information from our internal teams before providing answers because I don't have a complete understanding of DMA, so some of these points and followup questions may be unclear; feel free to correct me or clarify on anything. So far my understanding is:

* We guarantee that pages pinned by rdma_get_pages() are resident until rdma_put_pages() is called;

* free_callback() is never called, as you have already observed in testing and also in the codebase, and is probably not needed;

* I don't think there is a cache in the driver you can use, but still inquiring about this; and

* I don't know how your cached addresses are becoming invalid.

Are you relying purely on the callback to inform when to remove entries from your cache, even when invoking rdma_put_pages()? In this case, the callback will never be called so the cache will still have stale entries.

Correct; that is the problem: we are given a ROCm buffer, we add a cache entry, then at some point that buffer is freed, and a new ROCm buffer with different physical pages but the same or overlapping virtual address range as the old buffer is passed into our driver, and we fetch the stale cache entry.

Or are the addresses changing underneath the hood somewhere, and you are expecting free_callback to be invoked?

Edit: No, the DMA addresses are not changing underneath us; I only asked if the addresses were guaranteed stable to rule that out as a source of error.

From your initial post, it sounds like pages are being freed without being handled by the cache, but if my understanding is correct then this should not be happening. Again, my understanding of DMA is incomplete, so there may be something obvious here I'm missing.

I will inquire regarding calling rdma_put_pages() in an interrupt context. I suspect it is not safe to do so, but I'll let you know what the internal teams suggest as soon as I have that information.

Okay, thanks.

schung-amd commented 6 days ago

@BrendanCunningham Still gathering information re: calling rdma_put_pages() from an interrupt context; the internal team initially recommends against calling it, but is digging into the code to check.

Correct; that is the problem: we are given a ROCm buffer, we add a cache entry, then at some point that buffer is freed, and a new ROCm buffer with different physical pages but the same or overlapping virtual address range as the old buffer is passed into our driver, and we fetch the stale cache entry.

Is your driver in control of freeing the buffer, or is something else freeing the buffer? Is it possible to place the free_callback logic to modify your cache entries directly where you are calling rdma_put_pages()?

BrendanCunningham commented 6 days ago

@BrendanCunningham Still gathering information re: calling rdma_put_pages() from an interrupt context; the internal team initially recommends against calling it, but is digging into the code to check.

Correct; that is the problem: we are given a ROCm buffer, we add a cache entry, then at some point that buffer is freed, and a new ROCm buffer with different physical pages but the same or overlapping virtual address range as the old buffer is passed into our driver, and we fetch the stale cache entry.

Is your driver in control of freeing the buffer, or is something else freeing the buffer? Is it possible to place the free_callback logic to modify your cache entries directly where you are calling rdma_put_pages()?

Our driver is not in control of freeing the buffer; we can only react to others (userspace or amdgpu) freeing the buffers, and only if our free_callback is called when that happens.

Absent free_callback, our only good/safe option is disabling our caching code.

schung-amd commented 6 days ago

According to the internal team, your driver should have full control over the lifetime of the buffers; amdgpu guarantees that your buffers are resident until the driver calls rdma_put_pages() on them, and by design userspace should have no way of freeing them safely outside of your driver's control. Do you have any logs showing how/when your buffers are being freed without the driver knowing?

ddalessa commented 1 day ago

We will update with logs soon.