Open thungrac opened 5 months ago
What is the output of 'nvidia-smi' inside the container?
nvidia-smi
[HAMI-core Info(80:140225570006848:hook.c:300)]: loaded nvml libraries
[HAMI-core Msg(80:140225570006848:libvgpu.c:836)]: Initializing.....
[HAMI-core Info(80:140225570006848:hook.c:238)]: Start hijacking
[HAMI-core Info(80:140225570006848:hook.c:267)]: loaded_cuda_libraries
[HAMI-core Info(80:140225570006848:multiprocess_memory_limit.c:122)]: put_device_info finished 1
[HAMI-core Info(80:140225570006848:device.c:102)]: driver version=12040
Fri May 31 04:40:47 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78 Driver Version: 550.78 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
[HAMI-core Info(80:140225570006848:multiprocess_memory_limit.c:260)]: get_gpu_memory_usage dev=0
[HAMI-core Info(80:140225570006848:multiprocess_memory_limit.c:267)]: dev=0 pid=71 host pid=71 i=931267700
[HAMI-core Info(80:140225570006848:multiprocess_memory_limit.c:267)]: dev=0 pid=76 host pid=0 i=0
[HAMI-core Info(80:140225570006848:multiprocess_memory_limit.c:267)]: dev=0 pid=80 host pid=0 i=0
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:0B:00.0 Off | N/A |
| 35% 54C P2 346W / 350W | 889MiB / 1000MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 71 C ./gpu_burn 0MiB |
+-----------------------------------------------------------------------------------------+
[HAMI-core Msg(80:140225570006848:multiprocess_memory_limit.c:468)]: Calling exit handler 80
nvidia-smi -a
[HAMI-core Info(81:140237489944384:hook.c:300)]: loaded nvml libraries
==============NVSMI LOG==============
Timestamp : Fri May 31 04:41:10 2024
Driver Version : 550.78
[HAMI-core Msg(81:140237489944384:libvgpu.c:836)]: Initializing.....
[HAMI-core Info(81:140237489944384:hook.c:238)]: Start hijacking
[HAMI-core Info(81:140237489944384:hook.c:267)]: loaded_cuda_libraries
[HAMI-core Info(81:140237489944384:multiprocess_memory_limit.c:122)]: put_device_info finished 1
[HAMI-core Info(81:140237489944384:device.c:102)]: driver version=12040
CUDA Version : 12.4
Attached GPUs : 1
GPU 00000000:0B:00.0
Product Name : NVIDIA GeForce RTX 3090
Product Brand : GeForce
Product Architecture : Ampere
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : None
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-29b0561f-a589-0559-8703-ee7dd81d9d38
Minor Number : 0
VBIOS Version : 94.02.42.40.34
MultiGPU Board : No
Board ID : 0xb00
Board Part Number : N/A
GPU Part Number : 2204-300-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G001.0000.03.03
OEM Object : 2.0
ECC Object : N/A
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU C2C Mode : N/A
GPU Virtualization Mode
Virtualization Mode : Pass-Through
Host VGPU Mode : N/A
vGPU Heterogeneous Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
GSP Firmware Version : 550.78
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x0B
Device : 0x00
Domain : 0x0000
Base Classcode : 0x3
Sub Classcode : 0x0
Device Id : 0x220410DE
Bus Id : 00000000:0B:00.0
Sub System Id : 0x403B1458
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Device Current : 4
Device Max : 4
Host Max : N/A
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 16 KB/s
Rx Throughput : 22 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 88 %
Performance State : P2
Clocks Event Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Active
Display Clock Setting : Not Active
Sparse Operation Mode : N/A
FB Memory Usage
[HAMI-core Info(81:140237489944384:multiprocess_memory_limit.c:260)]: get_gpu_memory_usage dev=0
[HAMI-core Info(81:140237489944384:multiprocess_memory_limit.c:267)]: dev=0 pid=71 host pid=71 i=931267700
[HAMI-core Info(81:140237489944384:multiprocess_memory_limit.c:267)]: dev=0 pid=76 host pid=0 i=0
[HAMI-core Info(81:140237489944384:multiprocess_memory_limit.c:267)]: dev=0 pid=81 host pid=0 i=0
Total : 1000 MiB
Reserved : 538 MiB
Used : 889 MiB
Free : 22699 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 4 MiB
Free : 252 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 100 %
Memory : 37 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
SRAM Threshold Exceeded : N/A
Aggregate Uncorrectable SRAM Sources
SRAM L2 : N/A
SRAM SM : N/A
SRAM Microcontroller : N/A
SRAM PCIE : N/A
SRAM Other : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 85 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 93 C
GPU Target Temperature : 88 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 293.45 W
Current Power Limit : 350.00 W
Requested Power Limit : 350.00 W
Default Power Limit : 350.00 W
Min Power Limit : 100.00 W
Max Power Limit : 350.00 W
GPU Memory Power Readings
Power Draw : N/A
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 915 MHz
SM : 915 MHz
Memory : 9501 MHz
Video : 1335 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 9751 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 725.000 mV
Fabric
State : N/A
Status : N/A
CliqueId : N/A
ClusterUUID : N/A
Health
Bandwidth : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 71
Type : C
Name : ./gpu_burn
Used GPU Memory : 0 MiB
I see. Does the GPU utilization fluctuate between 0 and 100% during execution, or is it stable at 100%?
It's always stable at 100%.
Could you try the tensorflow/pytorch benchmarks (https://github.com/tensorflow/benchmarks)? HAMi-core implements the gpucores limit by blocking new kernels from launching, so if a very large kernel has already been submitted, there is nothing we can do to limit its utilization.
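For context, here is a minimal sketch of the blocking approach described above. It is not HAMi-core's actual code; the names (core_budget, rate_limit_kernel_launch) and the refill policy are hypothetical, and a real implementation would refill the budget from a watcher thread based on measured utilization.

/* Illustrative sketch only, not HAMi-core's actual code: limit utilization
 * by delaying new kernel launches until a shared "core" budget is available.
 * core_budget would be refilled periodically by a watcher thread. */
#include <stdatomic.h>
#include <unistd.h>

static atomic_long core_budget;   /* shared token budget, refilled elsewhere */

/* Called from a hooked cuLaunchKernel before forwarding to the real driver. */
static void rate_limit_kernel_launch(long grid_size)
{
    for (;;) {
        long available = atomic_load(&core_budget);
        if (available >= grid_size &&
            atomic_compare_exchange_weak(&core_budget, &available,
                                         available - grid_size))
            return;               /* budget reserved, the launch may proceed */
        usleep(100);              /* otherwise block the launching thread    */
    }
}

The key limitation follows directly from this design: the hook can only delay launches it sees, so a kernel that is already running on the GPU cannot be throttled.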
I'll try.
Additionally, when setting
export LIBCUDA_LOG_LEVEL=4
the log is:
...
...
...
[HAMI-core Debug(204:139707654555456:libvgpu.c:79)]: into dlsym nvmlEventSetWait_v2
[HAMI-core Debug(204:139707654555456:nvml_entry.c:1479)]: Hijacking nvmlEventSetWait_v2
[HAMI-core Debug(199:139974955491328:hook.c:418)]: Hijacking nvmlDeviceGetCount_v2
[HAMI-core Debug(199:139974955491328:hook.c:422)]: Hijacking nvmlDeviceGetCount_v2
[HAMI-core Debug(199:139974955491328:hook.c:385)]: nvmlDeviceGetHandleByIndex index=0
[HAMI-core Debug(199:139974955491328:multiprocess_utilization_watcher.c:212)]: userutil=796950 currentcores=0 total=4030464 limit=20 share=0
...
...
...
[HAMI-core Debug(186:140393414545408:memory.c:549)]: Hijacking cuLaunchKernel
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:47)]: grid: 2048, blocks: 256
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:48)]: launch kernel 2048, curr core: 3930112
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:53)]: current core: 3930112
[HAMI-core Debug(186:140393414545408:memory.c:549)]: Hijacking cuLaunchKernel
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:47)]: grid: 2048, blocks: 256
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:48)]: launch kernel 2048, curr core: 3928064
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:53)]: current core: 3928064
[HAMI-core Debug(186:140393414545408:memory.c:549)]: Hijacking cuLaunchKernel
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:47)]: grid: 2048, blocks: 256
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:48)]: launch kernel 2048, curr core: 3926016
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:53)]: current core: 3926016
[HAMI-core Debug(186:140393414545408:memory.c:549)]: Hijacking cuLaunchKernel
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:47)]: grid: 2048, blocks: 256
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:48)]: launch kernel 2048, curr core: 3923968
[HAMI-core Debug(186:140393414545408:multiprocess_utilization_watcher.c:53)]: current core: 3923968
[HAMI-core Debug(186:140393414545408:memory.c:549)]: Hijacking cuLaunchKernel
[HAMI-core Debug(186:140393414545408:memory.c:471)]: Hijacking cuMemsetD32Async
[HAMI-core Debug(186:140393414545408:memory.c:385)]: cuMemcpyDtoHAsync_v2,dst=0x7faf96c00000 src=7faf96220600 count=4
[HAMI-core Debug(186:140393414545408:memory.c:387)]: Hijacking cuMemcpyDtoHAsync_v2
[HAMI-core Debug(186:140393324339200:hook.c:418)]: Hijacking nvmlDeviceGetCount_v2
[HAMI-core Debug(186:140393324339200:hook.c:422)]: Hijacking nvmlDeviceGetCount_v2
[HAMI-core Debug(186:140393324339200:hook.c:385)]: nvmlDeviceGetHandleByIndex index=0
[HAMI-core Debug(186:140393324339200:multiprocess_utilization_watcher.c:212)]: userutil=0 currentcores=3921920 total=4030464 limit=20 share=4030464
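A quick arithmetic check on the debug lines above (my reading, not verified against the HAMi-core source): each hijacked cuLaunchKernel with grid 2048 lowers the reported curr core value by exactly the grid size, so the watcher appears to charge one unit per block of the launched grid.

/* Verifies that successive "curr core" values from the log differ by the
 * grid size reported for each launch. Interpretation only; the accounting
 * details are not confirmed against the HAMi-core source. */
#include <stdio.h>

int main(void)
{
    long curr[] = {3930112, 3928064, 3926016, 3923968}; /* from the log    */
    long grid   = 2048;                                 /* grid per launch */

    for (int i = 1; i < 4; i++)
        printf("launch %d: delta = %ld, grid = %ld\n",
               i, curr[i - 1] - curr[i], grid);
    return 0;
}

Every delta comes out to 2048, matching the grid size, which is consistent with the blocking scheme described earlier: launches are charged against a shared counter (total=4030464, limit=20 in the nvmlDeviceGetHandleByIndex line) rather than preempted once running.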
I copied these logs from the console inside the pod. I don't know how to capture the full entries, since they don't appear in the pod logs, and I'm not sure how to write them to a log file.
@thungrac You can use the shell's > or >> operators to redirect output to a file. The > operator creates a new file, or overwrites it if it already exists; the >> operator appends to an existing file, or creates it if it does not exist. If you want to store both standard output (stdout) and standard error (stderr) in the same file, add 2>&1: the > first redirects stdout to the file, and 2>&1 then redirects stderr (2) to the current stdout (1), i.e. the file you specified. (Note that this is Unix/Linux shell syntax, such as bash/sh; if you are using Windows cmd or PowerShell, the syntax may differ slightly.) Here are the specific commands:
python train.py config/train_shakespeare_char.py > output.txt 2>&1
or
python train.py config/train_shakespeare_char.py >> output.txt 2>&1
This way, the output.txt file will contain all output from your program.
@haitwang-cloud many thanks
Here is the output when I run ./gpu_burn 10 with export LIBCUDA_LOG_LEVEL=4 set.
I have the same problem. The memory reported inside the pod when executing nvidia-smi is wrong, so the limit does not take effect: the actual memory usage is larger than the configured limit.
1. Issue or feature description
Hi HAMi team,
I configured a pod with the image oguzpastirmaci/gpu-burn to test GPU burn, but when running the burn it still uses 100% of the GPU (100 cores instead of the configured 20) and exceeds the memory limit (1332 MiB instead of the configured 1000 MiB).
2. Steps to reproduce the issue
Deploy the pod
Then run the burn:
./gpu_burn 1000
3. Information to attach (optional if deemed irrelevant)
Common error checking:
nvidia-smi -a on your host
/etc/docker/daemon.json
I0530 07:53:56.617573 2234739 register.go:159] nvml registered device id=1, memory=24576, type=NVIDIA GeForce RTX 3090, numa=0
I0530 07:53:56.617641 2234739 register.go:166] "start working on the devices" devices=[{"Index":0,"Id":"GPU-29b0561f-a589-0559-8703-ee7dd81d9d38","Count":10,"Devmem":24576,"Devcore":100,"Type":"NVIDIA-NVIDIA GeForce RTX 3090","Numa":0,"Health":true}]
I0530 07:53:56.622957 2234739 util.go:128] Encoded node Devices: GPU-29b0561f-a589-0559-8703-ee7dd81d9d38,10,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:
I0530 07:53:56.622993 2234739 register.go:176] patch node with the following annos map[hami.io/node-handshake:Reported 2024-05-30 07:53:56.622978072 +0000 UTC m=+391.760752043 hami.io/node-nvidia-register:GPU-29b0561f-a589-0559-8703-ee7dd81d9d38,10,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:]
I0530 07:53:56.635162 2234739 register.go:196] Successfully registered annotation. Next check in 30s seconds...