[Issue]: kmemleak "unreferenced object...(size 32)" reports for memory allocated in amdgpu_vm_update_range()

BrendanCunningham commented 2 weeks ago

Problem Description

After running osu_bibw -m 256:256 D D across two nodes, I see "unreferenced object" memory leak reports like so:

unreferenced object 0xffff940d27c6aae0 (size 32):
  comm "osu_bibw", pid 8915, jiffies 4556610972 (age 18089.050s)
  hex dump (first 32 bytes):
    0f 00 00 00 0f 00 00 00 00 6e d8 00 00 00 00 00  .........n......
    2d 75 84 b1 14 be d7 30 00 00 00 00 00 00 00 00  -u.....0........
  backtrace:
    [<ffffffffa3382c35>] kmalloc_trace+0x25/0x90
    [<ffffffffc0aa96a7>] amdgpu_vm_update_range+0x97/0x890 [amdgpu]
    [<ffffffffc0aaa7ce>] amdgpu_vm_clear_freed+0xde/0x250 [amdgpu]
    [<ffffffffc0cf5da9>] amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu+0x169/0x230 [amdgpu]
    [<ffffffffc0cbc3fc>] kfd_ioctl_unmap_memory_from_gpu+0xec/0x310 [amdgpu]
    [<ffffffffc0cba396>] kfd_ioctl+0x376/0x4d0 [amdgpu]
    [<ffffffffa346fc1d>] __x64_sys_ioctl+0x8d/0xc0
    [<ffffffffa3c8dcac>] do_syscall_64+0x5c/0x90
    [<ffffffffa3e000a6>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8

These reports appear in /sys/kernel/debug/kmemleak on both nodes. There are 2887 of these reports on one node and 2945 reports on the other node.

On the node that has 2887 "unreferenced object" reports, there are 2887 occurrences of amdgpu_vm_update_range in the kmemleak output.

On the other node, which has 2945 "unreferenced object" reports, there are 2940 occurrences of amdgpu_vm_update_range in the kmemleak output. The other 5 reports trace through nfs code and are all 16 bytes in size (size 16).

All of the other "unreferenced object" reports between the two nodes are 32 bytes in size (size 32).

I have not gone through every report but, given that the number of occurrences of amdgpu_vm_update_range matches the number of "unreferenced object ... (size 32)" reports on both nodes, I strongly suspect that this is a single repeating leak, or small variants of it.

This is running osu_bibw D D with Open MPI on top of a driver for our HPC interconnect card with support for sending packets generated from ROCm buffers using a DMA engine. No calls from our driver (hfi1) appear in either kmemleak report. There are nearly as many (2881 and 2935) occurrences of kfd_ioctl+ in both kmemleak files as there are unreferenced object occurrences so I suspect that these leaks are occurring under an ioctl from ROCm userspace into amdgpu.

These leaks do not seem to affect the stability or functionality of the system, but I am only running short tests: one benchmark every few minutes to every few hours.
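Because the tests are short, one way to check whether the leak adds up over a longer run is to compare the active object count of the 32-byte kmalloc cache before and after. The following is a minimal sketch, assuming the usual /proc/slabinfo version 2.1 column layout (name, active_objs, num_objs, objsize, ...). The helper names and the snapshot values are illustrative and were not taken from the nodes in this report.

```python
from typing import Optional


def active_objects(slabinfo_text: str, cache: str = "kmalloc-32") -> Optional[int]:
    """Return the <active_objs> column for the named slab cache, or None if absent."""
    for line in slabinfo_text.splitlines():
        fields = line.split()
        if fields and fields[0] == cache:
            return int(fields[1])
    return None


def growth(before: str, after: str, cache: str = "kmalloc-32") -> Optional[int]:
    """Difference in active objects between two /proc/slabinfo snapshots."""
    first = active_objects(before, cache)
    last = active_objects(after, cache)
    if first is None or last is None:
        return None
    return last - first


# Illustrative snapshots (made-up numbers, trimmed to the leading columns).
BEFORE = (
    "slabinfo - version: 2.1\n"
    "# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>\n"
    "kmalloc-32 10000 12032 32 128 1\n"
)
AFTER = BEFORE.replace("kmalloc-32 10000", "kmalloc-32 12887")

delta = growth(BEFORE, AFTER)
print(delta, "objects,", delta * 32, "bytes")
```

On a real node the two snapshots would come from reading /proc/slabinfo as root before and after a long run. The kmalloc-32 cache is shared by every 32-byte allocation in the kernel, so this only indicates a trend and does not attribute the growth to amdgpu.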

Operating System

Red Hat Enterprise Linux 9.4 (Plow)

CPU

Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.2.0

ROCm Component

No response

Steps to Reproduce

Hardware prerequisites:

Two nodes equipped with:
- At least one MI100 each
- At least one Cornelis Networks Omni-Path 100 Host Fabric Adapter each
- Connected to each other, either back-to-back or via an Omni-Path switch

Software prerequisites:

Open MPI 5.0.5 built with ROCm 6.2.0 support
libfabric with OPX provider with ROCm SDMA support
hfi1 with AMD SDMA support
- hfi1 with AMD SDMA support can be found here

1. As root, echo clear > /sys/kernel/debug/kmemleak on both nodes.
2. Run osu_bibw -m 256:256 D D across two nodes.
3. As root, echo scan > /sys/kernel/debug/kmemleak on both nodes.
4. As root, cat /sys/kernel/debug/kmemleak > kmemleak-$(hostname)-256.txt.
5. On each node, run dmesg -wT to monitor for when kmemleaks have been detected with a message like so: [Tue Oct 1 17:09:58 2024] kmemleak: 2980 new suspected memory leaks (see /sys/kernel/debug/kmemleak) This may take a few minutes.
6. grep -Ec '^unreferenced object' kmemleak-*256.txt; make note of number of hits from each file.
7. grep -c amdgpu_vm_update_range kmemleak-*-256.txt; note number of hits from each file and compare to hits from same file in step 6.
8. Expectation is that number of hits in step 7 will be same or close to number of hits in step 6 for the same file.
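
The two grep counts above can also be produced per report rather than per line, which additionally gives the breakdown by object size. The following is a minimal sketch, assuming the kmemleak text layout shown in the problem description. The function names are illustrative, and the second report in the sample (including hypothetical_nfs_alloc) is made up purely to exercise the parser.

```python
import re
from collections import Counter

HEADER = re.compile(r"^unreferenced object 0x[0-9a-f]+ \(size (\d+)\):")
FRAME = re.compile(r"\]\s+([\w.]+)\+0x")  # "[<addr>] symbol+0xoff/0xlen"


def parse_reports(text):
    """Split kmemleak output into reports with their size and backtrace symbols."""
    reports = []
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            reports.append({"size": int(header.group(1)), "frames": []})
            continue
        frame = FRAME.search(line)
        if frame and reports:
            reports[-1]["frames"].append(frame.group(1))
    return reports


def summarize(text, func="amdgpu_vm_update_range"):
    """Return (total reports, reports per size, reports whose backtrace has func)."""
    reports = parse_reports(text)
    sizes = Counter(r["size"] for r in reports)
    hits = sum(1 for r in reports if func in r["frames"])
    return len(reports), dict(sizes), hits


# First report is abbreviated from this issue; the second is a made-up example.
SAMPLE = """\
unreferenced object 0xffff940d27c6aae0 (size 32):
  comm "osu_bibw", pid 8915, jiffies 4556610972 (age 18089.050s)
  backtrace:
    [<ffffffffa3382c35>] kmalloc_trace+0x25/0x90
    [<ffffffffc0aa96a7>] amdgpu_vm_update_range+0x97/0x890 [amdgpu]
unreferenced object 0xffff940d00000010 (size 16):
  comm "example", pid 1, jiffies 1 (age 1.000s)
  backtrace:
    [<ffffffff00000000>] hypothetical_nfs_alloc+0x10/0x20
"""

total, by_size, with_func = summarize(SAMPLE)
print(total, by_size, with_func)
```

Counting per report avoids over-counting if a function were ever to appear twice in one backtrace, which grep -c would count as two hits.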

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.8.5 is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  Uuid:                    CPU-XX                             
  Marketing Name:          Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3600                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            44                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    32287400(0x1ecaaa8) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32287400(0x1ecaaa8) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32287400(0x1ecaaa8) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  Uuid:                    CPU-XX                             
  Marketing Name:          Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3600                               
  BDFID:                   0                                  
  Internal Node ID:        1                                  
  Compute Unit:            44                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    33007252(0x1f7a694) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    33007252(0x1f7a694) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33007252(0x1f7a694) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 3                  
*******                  
  Name:                    gfx908                             
  Uuid:                    GPU-95386651081add54               
  Marketing Name:          AMD Instinct MI100                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      8192(0x2000) KB                    
  Chip ID:                 29580(0x738c)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1502                               
  BDFID:                   1280                               
  Internal Node ID:        2                                  
  Compute Unit:            120                                
  SIMDs per CU:            4                                  
  Shader Engines:          8                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 67                                 
  SDMA engine uCode::      18                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***

Additional Information

kmemleak reports

BrendanCunningham commented 2 weeks ago

Two other things I didn't note in my initial report:

1. I observed these problems on a 6.5.0 development kernel built with CONFIG_DEBUG_KMEMLEAK=y; the driver code I pointed you to is meant to be built against a distro kernel (e.g. 5.14 in RHEL 9.4). Distro kernels may not be built with CONFIG_DEBUG_KMEMLEAK=y.
2. I first observed this problem a few weeks back and narrowed the likely culprit to tlb_cb = kmalloc(sizeof(*tlb_cb), GFP_KERNEL); in amdgpu_vm_update_range(), in the sources under /usr/src/amdgpu-6.8.5-2009582.el9/ on my MI100 nodes.

It looks like that struct is supposed to be freed in amdgpu_vm_tlb_seq_cb(). When I saw this problem, I noticed that both amdgpu_vm_update_range and amdgpu_vm_tlb_seq_cb are in /sys/kernel/tracing/available_filter_functions.

So I did echo function > /sys/kernel/tracing/current_tracer, limited the function tracer to just amdgpu_vm_update_range and amdgpu_vm_tlb_seq_cb, and ran the reproducer.

After running the reproducer, I saw many occurrences of amdgpu_vm_update_range but no occurrences of amdgpu_vm_tlb_seq_cb in /sys/kernel/tracing/trace for either node.
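
The tracing steps above can be scripted roughly as follows. This is a minimal sketch that relies only on the standard tracefs files set_ftrace_filter and current_tracer. The helper names are illustrative, the dry run writes into a scratch directory instead of /sys/kernel/tracing so that it can be run without root, and the sample trace lines are made up in the usual function-tracer layout.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

TRACED = ["amdgpu_vm_update_range", "amdgpu_vm_tlb_seq_cb"]
ENTRY = re.compile(r":\s+([\w.]+)\s+<-")  # "timestamp: function <-caller"


def configure_function_tracer(tracing_dir, functions):
    """Limit the function tracer to the given functions, then enable it."""
    root = Path(tracing_dir)
    (root / "set_ftrace_filter").write_text(" ".join(functions) + "\n")
    (root / "current_tracer").write_text("function\n")


def count_hits(trace_text, functions):
    """Count function-tracer entries per traced function."""
    counts = Counter()
    for line in trace_text.splitlines():
        if line.startswith("#"):
            continue  # header lines of the trace file
        entry = ENTRY.search(line)
        if entry and entry.group(1) in functions:
            counts[entry.group(1)] += 1
    return {name: counts[name] for name in functions}


# Dry run against a scratch directory; on a real node pass /sys/kernel/tracing.
with tempfile.TemporaryDirectory() as scratch:
    configure_function_tracer(scratch, TRACED)
    written_filter = (Path(scratch) / "set_ftrace_filter").read_text().split()
    written_tracer = (Path(scratch) / "current_tracer").read_text().strip()

# Illustrative trace content, not copied from the nodes in this report.
SAMPLE_TRACE = """\
# tracer: function
        osu_bibw-8915    [003] ..... 18089.050001: amdgpu_vm_update_range <-amdgpu_vm_clear_freed
        osu_bibw-8915    [003] ..... 18089.050950: amdgpu_vm_update_range <-amdgpu_vm_clear_freed
"""
hits = count_hits(SAMPLE_TRACE, TRACED)
print(hits)
```

A count of zero for amdgpu_vm_tlb_seq_cb alongside a non-zero count for amdgpu_vm_update_range corresponds to the observation described above.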

ppanchad-amd commented 1 day ago

Hi @BrendanCunningham. Internal ticket has been created to investigate your issue. Thanks!

tcgu-amd commented 1 day ago

Hi @BrendanCunningham, thanks for reporting the issue! This is curious for sure, and we will try our best to reproduce it. Meanwhile, here is some speculation regarding your investigation:

It looks like that struct is supposed to be freed in amdgpu_vm_tlb_seq_cb(). When I saw this problem, I noticed that both amdgpu_vm_update_range and amdgpu_vm_tlb_seq_cb are in /sys/kernel/tracing/available_filter_functions.

So I did echo function > /sys/kernel/tracing/current_tracer, limited the function tracer to just amdgpu_vm_update_range and amdgpu_vm_tlb_seq_cb, and ran the reproducer. After running the reproducer, I saw many occurrences of amdgpu_vm_update_range but no occurrences of amdgpu_vm_tlb_seq_cb in /sys/kernel/tracing/trace for either node.

The way this works is that tlb_cb is passed to amdgpu_vm_tlb_flush here, then immediately set to NULL afterwards. However, amdgpu_vm_tlb_flush doesn't hold on to tlb_cb either; instead, it only passes a reference to its member, &tlb_cb->cb, to the amdgpu_vm_tlb_seq_cb function here, which gets executed only when the DMA fence gets signaled. This means that technically nothing is pointing to tlb_cb at the end of the scope of amdgpu_vm_update_range. I am not exactly sure how kmemleak keeps track of memory leaks, but if it does so by reference counting, then it is likely to raise a false alarm at that point. My suggestion would be to run a longer test and see whether actual memory consumption builds up over time.
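
The ownership hand-off described above can be modeled with a toy example. This is only a sketch of the pattern, not the driver code: Fence, update_range and the record names are hypothetical stand-ins, and the set named live plays the role of allocations that have not been freed yet.

```python
class Fence:
    """Toy stand-in for a DMA fence: keeps callbacks until it is signaled."""

    def __init__(self):
        self._callbacks = []

    def add_callback(self, callback):
        self._callbacks.append(callback)

    def signal(self):
        pending, self._callbacks = self._callbacks, []
        for callback in pending:
            callback()


def update_range(fence, live, record_id):
    """Allocate a record and hand it to the fence; the caller keeps no reference."""
    live.add(record_id)  # models the kmalloc of tlb_cb

    def seq_cb():  # models the callback that frees the record
        live.discard(record_id)

    fence.add_callback(seq_cb)
    # From here on, only the fence's callback list can reach the record.


live = set()
signaled, never_signaled = Fence(), Fence()
update_range(signaled, live, "record-1")
update_range(never_signaled, live, "record-2")
signaled.signal()
print(sorted(live))
```

In this model a record is released only when its fence runs the callback, so a record attached to a fence that never signals stays allocated. Whether that is what happens in the driver, or whether the report is a false alarm, is the open question in this thread.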

Hope this helps. Thanks!
