Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

GPU memory in pod and memory in host are different, and GPU memory in pod obviously beyond the limit #446

Open lut777 opened 2 months ago

lut777 commented 2 months ago

1. Issue or feature description

I deployed HAMi following the documentation.

I then deployed a PyTorch pod with the following spec:

apiVersion: v1
kind: Pod
metadata:
  name: pytorch-pod
spec:
  hostPID: true
  containers:
    - name: tf-container
      image: pytorch-mnist-test
      imagePullPolicy: IfNotPresent
      command: ["/bin/bash", "-c", "python /project/mnist.py"]
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 50

'hostPID: true' is set to make the GPU memory usage easier to inspect.

Now in the pod, nvidia-smi shows this:

image

But outside the pod, the result is this:

image

The tricky part is: which GPU memory figure is right, 782 or 1660? The HAMi-core log does not say much:

image

I thought I could get the right GPU memory figure in cuMemAlloc_v2, but I failed.

So I have 3 questions:

1: Which GPU memory usage is correct, 782 or 1660? If the latter, why is the limit exceeded?
2: How can I determine the real GPU memory usage?
3: I can't find LD_PRELOAD in the pod. Is that correct? I didn't find any error message in the webhook.

2. Steps to reproduce the issue

Apply the pod spec and check the result inside and outside the pod.
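For a more compact comparison than the full nvidia-smi -a dump, the per-process figures can also be queried as CSV. The following is only a sketch and not part of the original report: it assumes the nvidia-smi query option --query-compute-apps=pid,used_memory together with --format=csv,noheader,nounits, and the helper names parse_compute_apps and query_usage are hypothetical.

```python
import subprocess

# nvidia-smi query that prints one "pid, used_memory" line per compute process.
# With "nounits" the memory column is a bare number in MiB.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Map PID -> used GPU memory (MiB) from the CSV output of QUERY."""
    usage = {}
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Skip blank lines and entries such as "[N/A]" that are not numeric.
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            continue
        usage[int(parts[0])] = int(parts[1])
    return usage


def query_usage(prefix=()):
    """Run QUERY, optionally behind a command prefix, and parse the result."""
    result = subprocess.run(
        [*prefix, *QUERY], capture_output=True, text=True, check=True
    )
    return parse_compute_apps(result.stdout)
```

On the GPU node, query_usage() would return the host view. From a machine where kubectl is configured, query_usage(("kubectl", "exec", "pytorch-pod", "--")) would return the pod view. With hostPID: true the PIDs in both views should line up, so the two dictionaries can be compared key by key.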

3. Information to attach (optional if deemed irrelevant)

nvidia-smi -a in the pod:

[root@k8s-master ~]# kubectl exec -ti pytorch-pod bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@pytorch-pod:/project# nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Mon Aug 19 10:14:50 2024
Driver Version                            : 535.104.05
[HAMI-core Msg(24953:139932371863360:libvgpu.c:836)]: Initializing.....
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:00:0C.0
    Product Name                          : Tesla V100S-PCIE-32GB
    Product Brand                         : Tesla
    Product Architecture                  : Volta
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : N/A
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1561221010428
    GPU UUID                              : GPU-12226bc4-7201-3ff9-2299-ea29f419bcc1
    Minor Number                          : 1
    VBIOS Version                         : 88.00.98.00.01
    MultiGPU Board                        : No
    Board ID                              : 0xc
    Board Part Number                     : 900-2G500-0040-000
    GPU Part Number                       : 1DF6-907-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G500.0212.00.02
        OEM Object                        : 1.1
        ECC Object                        : 5.0
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Pass-Through
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x00
        Device                            : 0x0C
        Domain                            : 0x0000
        Device Id                         : 0x1DF610DE
        Bus Id                            : 00000000:00:0C.0
        Sub System Id                     : 0x13D610DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
                Device Current            : 3
                Device Max                : 3
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 8000 KB/s
        Rx Throughput                     : 45000 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 1000 MiB
        Reserved                          : 266 MiB
        Used                              : 782 MiB
        Free                              : 30837 MiB
    BAR1 Memory Usage
        Total                             : 32768 MiB
        Used                              : 6 MiB
        Free                              : 32762 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 15 %
        Memory                            : 6 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : N/A
        OFA                               : N/A
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : 0
                Total                     : 0
        Aggregate
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : 0
                Total                     : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 37 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 90 C
        GPU Slowdown Temp                 : 87 C
        GPU Max Operating Temp            : 83 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 35 C
        Memory Max Operating Temp         : 85 C
    GPU Power Readings
        Power Draw                        : 50.37 W
        Current Power Limit               : 250.00 W
        Requested Power Limit             : 250.00 W
        Default Power Limit               : 250.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 250.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 1245 MHz
        SM                                : 1245 MHz
        Memory                            : 1107 MHz
        Video                             : 1125 MHz
    Applications Clocks
        Graphics                          : 1245 MHz
        Memory                            : 1107 MHz
    Default Applications Clocks
        Graphics                          : 1245 MHz
        Memory                            : 1107 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1597 MHz
        SM                                : 1597 MHz
        Memory                            : 1107 MHz
        Video                             : 1432 MHz
    Max Customer Boost Clocks
        Graphics                          : 1597 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 18040
            Type                          : C
            Name                          : python
            Used GPU Memory               : 1660 MiB

[HAMI-core Msg(24953:139932371863360:multiprocess_memory_limit.c:468)]: Calling exit handler 24953

And the result outside the pod:


[root@gpuserver154 ~]# nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Mon Aug 19 18:15:51 2024
Driver Version                            : 535.104.05
CUDA Version                              : 12.2

Attached GPUs                             : 2
GPU 00000000:00:0B.0
    Product Name                          : Tesla V100S-PCIE-32GB
    Product Brand                         : Tesla
    Product Architecture                  : Volta
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : N/A
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1561221010375
    GPU UUID                              : GPU-8100dd2f-04fe-71e3-5f8a-d6bf79d655d0
    Minor Number                          : 0
    VBIOS Version                         : 88.00.98.00.01
    MultiGPU Board                        : No
    Board ID                              : 0xb
    Board Part Number                     : 900-2G500-0040-000
    GPU Part Number                       : 1DF6-907-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G500.0212.00.02
        OEM Object                        : 1.1
        ECC Object                        : 5.0
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Pass-Through
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x00
        Device                            : 0x0B
        Domain                            : 0x0000
        Device Id                         : 0x1DF610DE
        Bus Id                            : 00000000:00:0B.0
        Sub System Id                     : 0x13D610DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
                Device Current            : 3
                Device Max                : 3
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 32768 MiB
        Reserved                          : 266 MiB
        Used                              : 0 MiB
        Free                              : 32501 MiB
    BAR1 Memory Usage
        Total                             : 32768 MiB
        Used                              : 2 MiB
        Free                              : 32766 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : N/A
        OFA                               : N/A
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : 0
                Total                     : 0
        Aggregate
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : 0
                Total                     : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 32 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 90 C
        GPU Slowdown Temp                 : 87 C
        GPU Max Operating Temp            : 83 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 29 C
        Memory Max Operating Temp         : 85 C
    GPU Power Readings
        Power Draw                        : 27.19 W
        Current Power Limit               : 250.00 W
        Requested Power Limit             : 250.00 W
        Default Power Limit               : 250.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 250.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 135 MHz
        SM                                : 135 MHz
        Memory                            : 1107 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1245 MHz
        Memory                            : 1107 MHz
    Default Applications Clocks
        Graphics                          : 1245 MHz
        Memory                            : 1107 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1597 MHz
        SM                                : 1597 MHz
        Memory                            : 1107 MHz
        Video                             : 1432 MHz
    Max Customer Boost Clocks
        Graphics                          : 1597 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes                             : None

GPU 00000000:00:0C.0
    Product Name                          : Tesla V100S-PCIE-32GB
    Product Brand                         : Tesla
    Product Architecture                  : Volta
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : N/A
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1561221010428
    GPU UUID                              : GPU-12226bc4-7201-3ff9-2299-ea29f419bcc1
    Minor Number                          : 1
    VBIOS Version                         : 88.00.98.00.01
    MultiGPU Board                        : No
    Board ID                              : 0xc
    Board Part Number                     : 900-2G500-0040-000
    GPU Part Number                       : 1DF6-907-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G500.0212.00.02
        OEM Object                        : 1.1
        ECC Object                        : 5.0
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Pass-Through
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x00
        Device                            : 0x0C
        Domain                            : 0x0000
        Device Id                         : 0x1DF610DE
        Bus Id                            : 00000000:00:0C.0
        Sub System Id                     : 0x13D610DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
                Device Current            : 3
                Device Max                : 3
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 8000 KB/s
        Rx Throughput                     : 58000 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 32768 MiB
        Reserved                          : 266 MiB
        Used                              : 1664 MiB
        Free                              : 30837 MiB
    BAR1 Memory Usage
        Total                             : 32768 MiB
        Used                              : 6 MiB
        Free                              : 32762 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 14 %
        Memory                            : 6 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : N/A
        OFA                               : N/A
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : 0
                Total                     : 0
        Aggregate
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : 0
                Total                     : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 37 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 90 C
        GPU Slowdown Temp                 : 87 C
        GPU Max Operating Temp            : 83 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 34 C
        Memory Max Operating Temp         : 85 C
    GPU Power Readings
        Power Draw                        : 49.90 W
        Current Power Limit               : 250.00 W
        Requested Power Limit             : 250.00 W
        Default Power Limit               : 250.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 250.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 1245 MHz
        SM                                : 1245 MHz
        Memory                            : 1107 MHz
        Video                             : 1125 MHz
    Applications Clocks
        Graphics                          : 1245 MHz
        Memory                            : 1107 MHz
    Default Applications Clocks
        Graphics                          : 1245 MHz
        Memory                            : 1107 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1597 MHz
        SM                                : 1597 MHz
        Memory                            : 1107 MHz
        Video                             : 1432 MHz
    Max Customer Boost Clocks
        Graphics                          : 1597 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 18040
            Type                          : C
            Name                          : python
            Used GPU Memory               : 1660 MiB

[root@gpuserver154 ~]#

chaunceyjiang commented 2 months ago

which GPU memory usage is correct? 782 or 1660? if the latter, why is the limit exceeded?

It looks like 1660 is the right value; I'm not sure why 782 is showing up. My guess is that you might be using cudaMallocAsync, see https://github.com/Project-HAMi/HAMi/issues/409.
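One way to check the cudaMallocAsync guess for a PyTorch workload is to look at the PYTORCH_CUDA_ALLOC_CONF environment variable, whose backend option selects between the native allocator and cudaMallocAsync. The sketch below is an illustration with a hypothetical helper name (allocator_backend). It only reads the environment setting, so it would not detect application code that calls cudaMallocAsync directly.

```python
import os


def allocator_backend(environ=os.environ):
    """Return the backend named in PYTORCH_CUDA_ALLOC_CONF, or 'native' if unset.

    The variable is a comma-separated list of option:value pairs,
    for example "backend:cudaMallocAsync,max_split_size_mb:128".
    """
    conf = environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
    for option in conf.split(","):
        key, _, value = option.partition(":")
        if key.strip() == "backend" and value.strip():
            return value.strip()
    # PyTorch falls back to its native caching allocator when no backend is set.
    return "native"


print("configured allocator backend:", allocator_backend())
```

Recent PyTorch releases also expose torch.cuda.get_allocator_backend(), which reports the backend actually in use from inside the training process.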

I can't find LD_PRELOAD in the pod, is that correct?

Sounds like you're looking for /etc/ld.so.preload?
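Both preload mechanisms can be checked from inside the pod in one pass. The sketch below is an illustration, with hypothetical helper names (parse_preload_list, preload_entries). It lists libraries named in the LD_PRELOAD environment variable and in /etc/ld.so.preload, which the dynamic loader reads as a whitespace- or colon-separated list with '#' comments.

```python
import os


def parse_preload_list(text):
    """Split a preload file on whitespace or ':' after dropping '#' comments."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]
        names.extend(line.replace(":", " ").split())
    return names


def preload_entries(preload_file="/etc/ld.so.preload", environ=os.environ):
    """Return (source, library) pairs from LD_PRELOAD and the preload file."""
    env_value = environ.get("LD_PRELOAD", "")
    entries = [("LD_PRELOAD", name) for name in env_value.replace(":", " ").split()]
    try:
        with open(preload_file) as fh:
            entries += [(preload_file, name) for name in parse_preload_list(fh.read())]
    except OSError:
        pass  # a missing or unreadable file contributes no entries
    return entries


for source, library in preload_entries():
    print(f"{source}: {library}")
```

If the output lists a HAMi-core library only under /etc/ld.so.preload, an empty LD_PRELOAD variable would be expected rather than a sign of a failed injection. The exact library path depends on the HAMi installation and is not assumed here.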

archlitchi commented 2 months ago

Yes, 1660 is correct.

lut777 commented 2 months ago

Well, if the real GPU memory usage is 1660, then the problem is really serious, because the GPU memory limit is 1000. This is a bug; I guess some CUDA API interfaces are not handled by HAMi-core. I will try to get more information and post it later.
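The size of the overshoot can be read straight from the in-pod nvidia-smi -a report quoted above, which carries two figures that disagree: the FB Memory Usage block and the per-process Used GPU Memory. The sketch below is an illustration with a hypothetical helper name (summarize). The sample text is trimmed from the report in this issue, and the parser only looks at the first GPU in a report.

```python
import re

# Trimmed from the in-pod "nvidia-smi -a" output posted in this issue.
SAMPLE = """\
    FB Memory Usage
        Total                             : 1000 MiB
        Reserved                          : 266 MiB
        Used                              : 782 MiB
        Free                              : 30837 MiB
    Processes
        Process ID                        : 18040
            Used GPU Memory               : 1660 MiB
"""


def summarize(report):
    """Extract FB total/used and the sum of per-process usage, all in MiB."""
    fb = re.search(
        r"FB Memory Usage\s+Total\s+: (\d+) MiB.*?Used\s+: (\d+) MiB",
        report,
        re.S,
    )
    if fb is None:
        raise ValueError("no FB Memory Usage block found")
    per_process = [int(m) for m in re.findall(r"Used GPU Memory\s+: (\d+) MiB", report)]
    return {
        "total": int(fb.group(1)),
        "used": int(fb.group(2)),
        "process_sum": sum(per_process),
    }


summary = summarize(SAMPLE)
limit = summary["total"]  # inside the pod, Total shows the nvidia.com/gpumem limit
over = summary["process_sum"] - limit
print(f"FB used {summary['used']} MiB, process sum {summary['process_sum']} MiB, limit {limit} MiB")
print(f"over limit by {over} MiB" if over > 0 else "within limit")
```

For the values in this issue, the per-process figure exceeds the 1000 MiB limit by 660 MiB, while the FB Memory Usage figure stays under it.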