allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

GPU monitoring failed getting GPU reading, switching off GPU monitoring #1295

Open PriyanshuPansari opened 2 months ago

PriyanshuPansari commented 2 months ago

Describe the bug

Using clearml 1.16.2, I am not receiving resource monitoring logs for either the CPU or the GPU.

Environment

eugen-ajechiloae-clearml commented 2 months ago

Hi @PriyanshuPansari! Can you set the CLEARML_RESMON_DEBUG=1 environment variable and post the stack trace you should then see printed?
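
(For reference, the same variable can also be set from inside the test script before the task is created; a minimal sketch, assuming the resource monitor reads it when Task.init starts the monitoring thread:)

import os

# Inline alternative to exporting the variable in the shell.
# It must be set before Task.init(), which is when the resource monitor starts.
os.environ["CLEARML_RESMON_DEBUG"] = "1"

from clearml import Task

task = Task.init(project_name="ClearML Test", task_name="My First Task")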

PriyanshuPansari commented 2 months ago

test code:

from clearml import Task

def main():
    # Initialize a new task
    task = Task.init(project_name="ClearML Test", task_name="My First Task")

    # Log some parameters
    params = {"learning_rate": 0.001, "batch_size": 32}
    task.connect(params)

    # Simulate some work
    for epoch in range(10):
        # Log metrics
        task.logger.report_scalar("Accuracy", "train", value=epoch*10, iteration=epoch)
        task.logger.report_scalar("Loss", "train", value=10-epoch, iteration=epoch)

    print("ClearML test completed successfully!")

if __name__ == "__main__":
    main()

trace:

(chip8) (base) undead@Pandoras-Box:~/projects/emulation/chip-8$ python test_clearml.py
ClearML Task: overwriting (reusing) task id=54b17fc1740d464fb3b1affcaaf7d85e
2024-07-11 20:36:44,123 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page: https://app.clear.ml/projects/1762cb16737f46e893e46ee3ccdbd94c/experiments/54b17fc1740d464fb3b1affcaaf7d85e/output/log
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
Traceback (most recent call last):
  File "/home/undead/miniconda3/envs/chip8/lib/python3.12/site-packages/clearml/utilities/resource_monitor.py", line 381, in _get_gpu_stats
    gpu_stat = self._gpustat.new_query(per_process_stats=True)
  File "/home/undead/miniconda3/envs/chip8/lib/python3.12/site-packages/clearml/utilities/gpu/gpustat.py", line 648, in new_query
    return GPUStatCollection.new_query(shutdown=shutdown, per_process_stats=per_process_stats,
  File "/home/undead/miniconda3/envs/chip8/lib/python3.12/site-packages/clearml/utilities/gpu/gpustat.py", line 602, in new_query
    return GPUStatCollection._new_query_nvidia(
  File "/home/undead/miniconda3/envs/chip8/lib/python3.12/site-packages/clearml/utilities/gpu/gpustat.py", line 526, in _new_query_nvidia
    gpu_info = get_gpu_info(index, handle)
  File "/home/undead/miniconda3/envs/chip8/lib/python3.12/site-packages/clearml/utilities/gpu/gpustat.py", line 427, in get_gpu_info
    name = _decode(N.nvmlDeviceGetName(handle))
  File "/home/undead/miniconda3/envs/chip8/lib/python3.12/site-packages/clearml/utilities/gpu/pynvml.py", line 1863, in wrapper
    return res.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/undead/miniconda3/envs/chip8/lib/python3.12/site-packages/clearml/utilities/resource_monitor.py", line 287, in _machine_stats
    stats.update(self._get_gpu_stats())
  File "/home/undead/miniconda3/envs/chip8/lib/python3.12/site-packages/clearml/utilities/resource_monitor.py", line 383, in _get_gpu_stats
    gpu_stat = self._gpustat.new_query(per_process_stats=False)
  File "/home/undead/miniconda3/envs/chip8/lib/python3.12/site-packages/clearml/utilities/gpu/gpustat.py", line 648, in new_query
    return GPUStatCollection.new_query(shutdown=shutdown, per_process_stats=per_process_stats,
  File "/home/undead/miniconda3/envs/chip8/lib/python3.12/site-packages/clearml/utilities/gpu/gpustat.py", line 602, in new_query
    return GPUStatCollection._new_query_nvidia(
  File "/home/undead/miniconda3/envs/chip8/lib/python3.12/site-packages/clearml/utilities/gpu/gpustat.py", line 526, in _new_query_nvidia
    gpu_info = get_gpu_info(index, handle)
  File "/home/undead/miniconda3/envs/chip8/lib/python3.12/site-packages/clearml/utilities/gpu/gpustat.py", line 427, in get_gpu_info
    name = _decode(N.nvmlDeviceGetName(handle))
  File "/home/undead/miniconda3/envs/chip8/lib/python3.12/site-packages/clearml/utilities/gpu/pynvml.py", line 1863, in wrapper
    return res.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

ClearML test completed successfully!
(chip8) (base) undead@Pandoras-Box:~/pr
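
(Side note: while this is being investigated, the repeated monitor warnings can be avoided by disabling resource monitoring for the task; a sketch assuming the auto_resource_monitoring argument of Task.init exposed by recent clearml 1.x releases:)

from clearml import Task

# Disables the CPU/GPU resource-monitor thread for this task only;
# scalars and parameters reported explicitly are still logged.
task = Task.init(
    project_name="ClearML Test",
    task_name="My First Task",
    auto_resource_monitoring=False,
)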

eugen-ajechiloae-clearml commented 2 months ago

@PriyanshuPansari what is the output of running the nvidia-smi command in the terminal? It looks like pynvml can't find the name of your GPU.

PriyanshuPansari commented 2 months ago

Fri Jul 12 05:08:49 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.01              Driver Version: 555.99         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   39C    P8              1W / 115W  |       0MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

eugen-ajechiloae-clearml commented 2 months ago

nvidia-smi looks fine. I think there is a problem when converting the C string holding your GPU's name (which is returned to Python as a bytes object) into a regular Python string. What is the output of:

from ctypes import *
from clearml.utilities.gpu import pynvml as N

def nvmlDeviceGetName(handle):
    # Query the raw device-name buffer through NVML without clearml's
    # decode step, so the raw bytes can be inspected directly.
    c_name = N.create_string_buffer(N.NVML_DEVICE_NAME_V2_BUFFER_SIZE)
    fn = N._nvmlGetFunctionPointer("nvmlDeviceGetName")
    ret = fn(handle, c_name, c_uint(N.NVML_DEVICE_NAME_V2_BUFFER_SIZE))
    N._nvmlCheckReturn(ret)
    return c_name.value

N.nvmlInit()
handle = N.nvmlDeviceGetHandleByIndex(0)
name = nvmlDeviceGetName(handle)
print(name)

PriyanshuPansari commented 2 months ago

b'\xf8\x95\xa0\x81\x8e\xf8\x91\x80\x81\x89\xf8\x90\x90\x81\x89\xf8\x91\xb0\x80\xa0\xf8\x91\xa0\x81\xa5\xf8\x9c\xa0\x81\xaf\xf8\x99\x90\x81\xa3\xf8\x94\xa0\x80\xa0\xf8\x96\x80\x81\x94\xf8\x8d\x80\x80\xa0\xf8\x8d\xa0\x80\xb0\xf8\x88\x80\x80\xb0'
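
(Those bytes are not valid UTF-8, since 0xf8 is not a legal UTF-8 lead byte, which is exactly what res.decode() in pynvml.py trips over. A minimal sketch reproducing the failure offline and doing a lossy decode so the buffer can at least be inspected; the raw bytes are copied verbatim from the output above:)

# Raw buffer returned by nvmlDeviceGetName in this environment,
# copied from the output above.
raw = (
    b'\xf8\x95\xa0\x81\x8e\xf8\x91\x80\x81\x89\xf8\x90\x90\x81\x89'
    b'\xf8\x91\xb0\x80\xa0\xf8\x91\xa0\x81\xa5\xf8\x9c\xa0\x81\xaf'
    b'\xf8\x99\x90\x81\xa3\xf8\x94\xa0\x80\xa0\xf8\x96\x80\x81\x94'
    b'\xf8\x8d\x80\x80\xa0\xf8\x8d\xa0\x80\xb0\xf8\x88\x80\x80\xb0'
)

# This mirrors what clearml's pynvml wrapper does and raises the same error:
# 0xf8 can never start a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("decode failed:", exc)

# A lossy decode keeps the buffer printable for inspection.
print(raw.decode("utf-8", errors="backslashreplace"))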