Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

The limitations on gpumem and gpucores are not working correctly. #332

Open thungrac opened 5 months ago

thungrac commented 5 months ago

1. Issue or feature description

Hi HAMi team,

I configured a pod with the image oguzpastirmaci/gpu-burn to run a GPU burn test, but while the burn is running it still uses 100% of the GPU (100 cores instead of the configured 20) and exceeds the memory limit (1332 MiB instead of the configured 1000 MiB).

2. Steps to reproduce the issue

Deploy the pod

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  runtimeClassName: nvidia
  containers:   
    - name: ubuntu-container1
      image: oguzpastirmaci/gpu-burn
      imagePullPolicy: IfNotPresent
      command: ["bash", "-c", "sleep 86400"]
      env:
      - name: ACTIVE_OOM_KILLER
        value: "true"
      - name: GPU_CORE_UTILIZATION_POLICY
        value: force
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 20
        limits:
          memory: 1Gi
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 20

Then run the burn:

./gpu_burn 1000

3. Information to attach (optional if deemed irrelevant)

Common error checking:

nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Thu May 30 08:07:14 2024
Driver Version                            : 550.78
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:0B:00.0
    Product Name                          : NVIDIA GeForce RTX 3090
    Product Brand                         : GeForce
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-29b0561f-a589-0559-8703-ee7dd81d9d38
    Minor Number                          : 0
    VBIOS Version                         : 94.02.42.40.34
    MultiGPU Board                        : No
    Board ID                              : 0xb00
    Board Part Number                     : N/A
    GPU Part Number                       : 2204-300-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G001.0000.03.03
        OEM Object                        : 2.0
        ECC Object                        : N/A
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Pass-Through
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    GSP Firmware Version                  : 550.78
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x0B
        Device                            : 0x00
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x0
        Device Id                         : 0x220410DE
        Bus Id                            : 00000000:0B:00.0
        Sub System Id                     : 0x403B1458
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 10 KB/s
        Rx Throughput                     : 23 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : 98 %
    Performance State                     : P2
    Clocks Event Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 24576 MiB
        Reserved                          : 538 MiB
        Used                              : 1341 MiB
        Free                              : 22699 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 4 MiB
        Free                              : 252 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 100 %
        Memory                            : 25 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
            SRAM Threshold Exceeded       : N/A
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : N/A
            SRAM SM                       : N/A
            SRAM Microcontroller          : N/A
            SRAM PCIE                     : N/A
            SRAM Other                    : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 84 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 88 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Power Draw                        : 249.13 W
        Current Power Limit               : 350.00 W
        Requested Power Limit             : 350.00 W
        Default Power Limit               : 350.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 350.00 W
    GPU Memory Power Readings
        Power Draw                        : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 615 MHz
        SM                                : 615 MHz
        Memory                            : 9501 MHz
        Video                             : 1335 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 9751 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 725.000 mV
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2251991
            Type                          : C
            Name                          : ./gpu_burn
            Used GPU Memory               : 1332 MiB
/etc/containerd/config.toml
version = 2

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

I0530 07:53:56.617573 2234739 register.go:159] nvml registered device id=1, memory=24576, type=NVIDIA GeForce RTX 3090, numa=0

I0530 07:53:56.617641 2234739 register.go:166] "start working on the devices" devices=[{"Index":0,"Id":"GPU-29b0561f-a589-0559-8703-ee7dd81d9d38","Count":10,"Devmem":24576,"Devcore":100,"Type":"NVIDIA-NVIDIA GeForce RTX 3090","Numa":0,"Health":true}]

I0530 07:53:56.622957 2234739 util.go:128] Encoded node Devices: GPU-29b0561f-a589-0559-8703-ee7dd81d9d38,10,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:

I0530 07:53:56.622993 2234739 register.go:176] patch node with the following annos map[hami.io/node-handshake:Reported 2024-05-30 07:53:56.622978072 +0000 UTC m=+391.760752043 hami.io/node-nvidia-register:GPU-29b0561f-a589-0559-8703-ee7dd81d9d38,10,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:]

I0530 07:53:56.635162 2234739 register.go:196] Successfully registered annotation. Next check in 30s seconds...



Additional information that might help better understand your environment and reproduce the bug:

k8s version: v1.27.12+rke2r1
linux OS version: Ubuntu 22.04.4 LTS
linux kernel version: 5.15.0-102-generic
container runtime: containerd://1.7.11-k3s2
nvidia-container-runtime: runc version 1.1.12
nvidia driver: 550.67
cuda version: 12.4
archlitchi commented 5 months ago

What is the output of 'nvidia-smi' inside the container?

thungrac commented 5 months ago

What is the output of 'nvidia-smi' inside the container?

nvidia-smi

[HAMI-core Info(80:140225570006848:hook.c:300)]: loaded nvml libraries
[HAMI-core Msg(80:140225570006848:libvgpu.c:836)]: Initializing.....
[HAMI-core Info(80:140225570006848:hook.c:238)]: Start hijacking
[HAMI-core Info(80:140225570006848:hook.c:267)]: loaded_cuda_libraries
[HAMI-core Info(80:140225570006848:multiprocess_memory_limit.c:122)]: put_device_info finished 1
[HAMI-core Info(80:140225570006848:device.c:102)]: driver version=12040
Fri May 31 04:40:47 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
[HAMI-core Info(80:140225570006848:multiprocess_memory_limit.c:260)]: get_gpu_memory_usage dev=0
[HAMI-core Info(80:140225570006848:multiprocess_memory_limit.c:267)]: dev=0 pid=71 host pid=71 i=931267700
[HAMI-core Info(80:140225570006848:multiprocess_memory_limit.c:267)]: dev=0 pid=76 host pid=0 i=0
[HAMI-core Info(80:140225570006848:multiprocess_memory_limit.c:267)]: dev=0 pid=80 host pid=0 i=0
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:0B:00.0 Off |                  N/A |
| 35%   54C    P2            346W /  350W |     889MiB /   1000MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        71      C   ./gpu_burn                                      0MiB |
+-----------------------------------------------------------------------------------------+
[HAMI-core Msg(80:140225570006848:multiprocess_memory_limit.c:468)]: Calling exit handler 80

nvidia-smi -a

nvidia-smi -a
[HAMI-core Info(81:140237489944384:hook.c:300)]: loaded nvml libraries

==============NVSMI LOG==============

Timestamp                                 : Fri May 31 04:41:10 2024
Driver Version                            : 550.78
[HAMI-core Msg(81:140237489944384:libvgpu.c:836)]: Initializing.....
[HAMI-core Info(81:140237489944384:hook.c:238)]: Start hijacking
[HAMI-core Info(81:140237489944384:hook.c:267)]: loaded_cuda_libraries
[HAMI-core Info(81:140237489944384:multiprocess_memory_limit.c:122)]: put_device_info finished 1
[HAMI-core Info(81:140237489944384:device.c:102)]: driver version=12040
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:0B:00.0
    Product Name                          : NVIDIA GeForce RTX 3090
    Product Brand                         : GeForce
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-29b0561f-a589-0559-8703-ee7dd81d9d38
    Minor Number                          : 0
    VBIOS Version                         : 94.02.42.40.34
    MultiGPU Board                        : No
    Board ID                              : 0xb00
    Board Part Number                     : N/A
    GPU Part Number                       : 2204-300-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G001.0000.03.03
        OEM Object                        : 2.0
        ECC Object                        : N/A
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Pass-Through
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    GSP Firmware Version                  : 550.78
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x0B
        Device                            : 0x00
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x0
        Device Id                         : 0x220410DE
        Bus Id                            : 00000000:0B:00.0
        Sub System Id                     : 0x403B1458
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 16 KB/s
        Rx Throughput                     : 22 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : 88 %
    Performance State                     : P2
    Clocks Event Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : N/A
    FB Memory Usage
[HAMI-core Info(81:140237489944384:multiprocess_memory_limit.c:260)]: get_gpu_memory_usage dev=0
[HAMI-core Info(81:140237489944384:multiprocess_memory_limit.c:267)]: dev=0 pid=71 host pid=71 i=931267700
[HAMI-core Info(81:140237489944384:multiprocess_memory_limit.c:267)]: dev=0 pid=76 host pid=0 i=0
[HAMI-core Info(81:140237489944384:multiprocess_memory_limit.c:267)]: dev=0 pid=81 host pid=0 i=0
        Total                             : 1000 MiB
        Reserved                          : 538 MiB
        Used                              : 889 MiB
        Free                              : 22699 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 4 MiB
        Free                              : 252 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 100 %
        Memory                            : 37 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
            SRAM Threshold Exceeded       : N/A
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : N/A
            SRAM SM                       : N/A
            SRAM Microcontroller          : N/A
            SRAM PCIE                     : N/A
            SRAM Other                    : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 85 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 88 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Power Draw                        : 293.45 W
        Current Power Limit               : 350.00 W
        Requested Power Limit             : 350.00 W
        Default Power Limit               : 350.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 350.00 W
    GPU Memory Power Readings
        Power Draw                        : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 915 MHz
        SM                                : 915 MHz
        Memory                            : 9501 MHz
        Video                             : 1335 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 9751 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 725.000 mV
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 71
            Type                          : C
            Name                          : ./gpu_burn
            Used GPU Memory               : 0 MiB
archlitchi commented 5 months ago

I see. Does the GPU utilization oscillate between 0 and 100% during execution, or does it stay stable at 100%?

thungrac commented 5 months ago

I see. Does the GPU utilization oscillate between 0 and 100% during execution, or does it stay stable at 100%?

It's always stable at 100%.

archlitchi commented 5 months ago

Could you try the TensorFlow/PyTorch benchmarks (https://github.com/tensorflow/benchmarks)? HAMi-core implements the gpucores limitation by blocking new kernels from launching, so if a very large kernel has already been submitted, we can't do anything to limit its utilization.
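
For illustration, here is a minimal sketch of that rate-limiting idea (this is not the actual HAMi-core source; names such as core_budget, refill_budget, and real_cuLaunchKernel are invented for this example): the cuLaunchKernel entry point is intercepted, each launch spends tokens from a shared budget, and a watcher thread periodically refills the budget in proportion to the configured gpucores percentage. A kernel that has already been handed to the driver runs to completion, which is why a single long-running kernel like gpu_burn's can still hold utilization at 100%.

/* Illustrative sketch only -- not the actual HAMi-core implementation.
 * Names like core_budget, refill_budget and real_cuLaunchKernel are made up
 * for this example. Idea: every intercepted cuLaunchKernel spends tokens from
 * a shared budget, a watcher thread refills the budget according to the
 * configured gpucores percentage, and new launches block while the budget is
 * empty. A kernel already submitted to the driver cannot be throttled. */
#include <cuda.h>
#include <pthread.h>
#include <unistd.h>

static long core_budget;                 /* tokens currently available */
static long max_budget;                  /* budget corresponding to 100% utilization */
static int  core_limit_percent;          /* e.g. 20 for nvidia.com/gpucores: 20 */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Watcher thread (started during library init, not shown): periodically
 * refill the budget in proportion to the configured limit. */
static void *refill_budget(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        core_budget += max_budget * core_limit_percent / 100;
        if (core_budget > max_budget)
            core_budget = max_budget;
        pthread_mutex_unlock(&lock);
        usleep(100 * 1000);              /* refill every 100 ms */
    }
    return NULL;
}

/* The real driver entry point, resolved elsewhere via dlsym (not shown). */
extern CUresult real_cuLaunchKernel(CUfunction f,
                                    unsigned gx, unsigned gy, unsigned gz,
                                    unsigned bx, unsigned by, unsigned bz,
                                    unsigned sharedMemBytes, CUstream stream,
                                    void **params, void **extra);

/* Hooked entry point: charge the launch against the budget, block if empty. */
CUresult hooked_cuLaunchKernel(CUfunction f,
                               unsigned gx, unsigned gy, unsigned gz,
                               unsigned bx, unsigned by, unsigned bz,
                               unsigned sharedMemBytes, CUstream stream,
                               void **params, void **extra)
{
    long cost = (long)gx * gy * gz;      /* rough cost: number of thread blocks */

    for (;;) {
        pthread_mutex_lock(&lock);
        if (core_budget > 0) {
            core_budget -= cost;         /* spend tokens for this launch */
            pthread_mutex_unlock(&lock);
            break;
        }
        pthread_mutex_unlock(&lock);
        usleep(1000);                    /* budget exhausted: delay new launches */
    }

    /* Once forwarded, the kernel runs to completion; it cannot be preempted. */
    return real_cuLaunchKernel(f, gx, gy, gz, bx, by, bz,
                               sharedMemBytes, stream, params, extra);
}

Under that model, the "Hijacking cuLaunchKernel" / "launch kernel ..., curr core: ..." debug lines posted further down in this thread appear to be the per-launch accounting, and the utilization-watcher line with limit=20 is the refill side.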

thungrac commented 5 months ago

Could you try the TensorFlow/PyTorch benchmarks (https://github.com/tensorflow/benchmarks)? HAMi-core implements the gpucores limitation by blocking new kernels from launching, so if a very large kernel has already been submitted, we can't do anything to limit its utilization.

I'll try.

Additionally, when setting

export LIBCUDA_LOG_LEVEL=4

the log is:

...
...
...
[HAMI-core Debug(204:139707654555456:libvgpu.c:79)]: into dlsym nvmlEventSetWait_v2
[HAMI-core Debug(204:139707654555456:nvml_entry.c:1479)]: Hijacking nvmlEventSetWait_v2
[HAMI-core Debug(199:139974955491328:hook.c:418)]: Hijacking nvmlDeviceGetCount_v2
[HAMI-core Debug(199:139974955491328:hook.c:422)]: Hijacking nvmlDeviceGetCount_v2
[HAMI-core Debug(199:139974955491328:hook.c:385)]: nvmlDeviceGetHandleByIndex index=0
[HAMI-core Debug(199:139974955491328:multiprocess_utilization_watcher.c:212)]: userutil=796950 currentcores=0 total=4030464 limit=20 share=0
...
...
...

[HAMI-core Debug(186:140393414545408:memory.c:549)]: Hijacking cuLaunchKernel
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:47)]: grid: 2048, blocks: 256
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:48)]: launch kernel 2048, curr core: 3930112
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:53)]: current core: 3930112
[HAMI-core Debug(186:140393414545408:memory.c:549)]: Hijacking cuLaunchKernel
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:47)]: grid: 2048, blocks: 256
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:48)]: launch kernel 2048, curr core: 3928064
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:53)]: current core: 3928064
[HAMI-core Debug(186:140393414545408:memory.c:549)]: Hijacking cuLaunchKernel
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:47)]: grid: 2048, blocks: 256
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:48)]: launch kernel 2048, curr core: 3926016
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:53)]: current core: 3926016
[HAMI-core Debug(186:140393414545408:memory.c:549)]: Hijacking cuLaunchKernel
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:47)]: grid: 2048, blocks: 256
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:48)]: launch kernel 2048, curr core: 3923968
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:53)]: current core: 3923968
[HAMI-core Debug(186:140393414545408:memory.c:549)]: Hijacking cuLaunchKernel
[HAMI-core Debug(186:140393414545408:memory.c:471)]: Hijacking cuMemsetD32Async
[HAMI-core Debug(186:140393414545408:memory.c:385)]: cuMemcpyDtoHAsync_v2,dst=0x7faf96c00000 src=7faf96220600 count=4
[HAMI-core Debug(186:140393414545408:memory.c:387)]: Hijacking cuMemcpyDtoHAsync_v2
[HAMI-core Debug(186:140393324339200:hook.c:418)]: Hijacking nvmlDeviceGetCount_v2
[HAMI-core Debug(186:140393324339200:hook.c:422)]: Hijacking nvmlDeviceGetCount_v2
[HAMI-core Debug(186:140393324339200:hook.c:385)]: nvmlDeviceGetHandleByIndex index=0
[HAMI-core Debug(186:140393324339200:multiprocess_utilization_watcher.c:212)]: userutil=0 currentcores=3921920 total=4030464 limit=20 share=4030464

I copied these logs from inside the pod. I don't know how to capture the full entries because they don't appear in the pod log (via the console), so I couldn't write them to a log file.

haitwang-cloud commented 5 months ago

@thungrac You can use the shell's > or >> operators to redirect output to a file. The > operator creates a new file, or overwrites it if it already exists; the >> operator appends to the end of an existing file, or creates it if it does not exist. If you want to store both standard output (stdout) and standard error (stderr) in the same file, add 2>&1, which redirects standard error (file descriptor 2) to standard output (file descriptor 1): the > part sends stdout to the file, and 2>&1 then sends stderr to the same place. (Note that this is Unix/Linux shell syntax, e.g. bash/sh; the syntax differs in Windows cmd or PowerShell.) Here are concrete example commands:

python train.py config/train_shakespeare_char.py > output.txt 2>&1

or

python train.py config/train_shakespeare_char.py >> output.txt 2>&1

This way, the output.txt file will contain all output from your program.

thungrac commented 5 months ago

@haitwang-cloud many thanks

Here is the output when I run ./gpu_burn 10 with export LIBCUDA_LOG_LEVEL=4 set:

output.txt

z19311 commented 3 months ago

I have the same problem. The memory reported inside the pod by nvidia-smi is wrong, so the limit cannot work: the actual memory usage is larger than the configured limit.

(screenshot attached)