allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Scalar :monitor:gpu gpu_0_mem_used_gb is always showing zero #1049

Open jax79sg opened 1 year ago

jax79sg commented 1 year ago

Describe the bug

Scalar :monitor:gpu gpu_0_mem_used_gb is always showing zero. However, gpu_0_mem_usage is showing 60. I'm assuming the latter is a percentage, while the former is supposed to show an absolute value in GB.

To reproduce

Running off K8SGlue with a queue to spawn a pod that has a single V100 with 32 GB of VRAM. Submit a training job to the ClearML queue.
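A rough sketch of the submission step (project and queue names are placeholders; this assumes the standard clearml SDK rather than any specific setup):

```python
# Hypothetical repro sketch: enqueue a training task so the K8SGlue agent
# spawns a pod on the single-V100 node. Names below are placeholders.
from clearml import Task

task = Task.init(project_name="gpu-monitor-repro", task_name="v100-training")
# Hand the task over to the agent-served queue; training then runs in the pod,
# where the resource monitor reports the :monitor:gpu scalars.
task.execute_remotely(queue_name="v100-queue", exit_process=True)
```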

Expected behaviour

If gpu_0_mem_usage is showing 60 (i.e. 60%), then gpu_0_mem_used_gb should show roughly 0.6 * 32 = 19.2 GB.

Environment

jkhenning commented 1 year ago

@jax79sg ClearML basically uses psutil to get these values. Can you try and see what psutil returns inside these pods?

jax79sg commented 1 year ago

Hi @jkhenning , psutil doesn't offer GPU stats (https://github.com/giampaolo/psutil). Do you mean another utility?

jkhenning commented 1 year ago

Oh, apologies, I meant gpustat 🙂
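For example, running something like this inside the spawned pod would show the raw values (a minimal sketch that only uses the gpustat CLI and the JSON fields it prints, not ClearML's actual monitor code):

```python
# Dump what gpustat reports inside the pod, via the CLI's JSON output.
import json
import subprocess

stats = json.loads(subprocess.check_output(["gpustat", "--json"]))
for gpu in stats["gpus"]:
    print(gpu["index"], gpu["name"])
    print("  memory.used (MiB): ", gpu["memory.used"])
    print("  memory.total (MiB):", gpu["memory.total"])
    print("  processes:         ", gpu["processes"])
```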

jax79sg commented 1 year ago


Hi @jkhenning , could you help indicate which of the keys here correspond to the keys in the :monitor:gpu scalars?

gpustat==1.0.0
nvidia-ml-py==11.495.46

```
gpustat --json
{
    "hostname": "clearml-id-xxx",
    "driver_version": "525.85.12",
    "query_time": "2023-06-20T08:16:09.570881",
    "gpus": [
        {
            "index": 0,
            "uuid": "GPU-xxx",
            "name": "Tesla V100-SXM2-32GB",
            "temperature.gpu": 35,
            "fan.speed": null,
            "utilization.gpu": 0,
            "utilization.enc": 0,
            "utilization.dec": 0,
            "power.draw": 57,
            "enforced.power.limit": 300,
            "memory.used": 2038,
            "memory.total": 32768,
            "processes": []
        }
    ]
}
```

jkhenning commented 1 year ago

@jax79sg ,

gpu_0_mem_usage is calculated as 100. * float(<memory.used>) / float(<memory.total>).
gpu_0_mem_used_gb is calculated as float(sum(<processes.gpu_memory_usage>)) / 1024 if there are any processes associated with the task; otherwise it should be float(<memory.used>) / 1024.
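Applied to the gpustat output above, that logic would look roughly like this (a sketch of the formulas as described, not the SDK's actual code):

```python
# One GPU entry as reported by `gpustat --json` inside the pod (MiB values).
gpu = {
    "memory.used": 2038,
    "memory.total": 32768,
    "processes": [],
}

# gpu_0_mem_usage: percentage of total memory in use.
mem_usage = 100. * float(gpu["memory.used"]) / float(gpu["memory.total"])

# gpu_0_mem_used_gb: sum of per-process usage if processes are associated
# with the task, otherwise fall back to the overall memory.used value.
processes = gpu["processes"]
if processes:
    mem_used_gb = float(sum(p["gpu_memory_usage"] for p in processes)) / 1024
else:
    mem_used_gb = float(gpu["memory.used"]) / 1024

print(mem_usage)    # ~6.2 for this sample
print(mem_used_gb)  # ~1.99 for this sample
```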

jax79sg commented 1 year ago

@jkhenning There are no processes detected by gpustat, even though the only thing running is the task spawned by the ClearML agent. gpu_0_mem_usage is showing the right values; gpu_0_mem_used_gb is not, it's showing zero.


jkhenning commented 1 year ago

Is there a sure way to reproduce it? It didn't happen when I tested it, and the only thing I can think of is some weird bug that would cause the SDK to think there's a process associated with the task when in fact there isn't...

jax79sg commented 1 year ago

Hi @jkhenning . Well, I reproduced this in a K8SGlue-spawned K8s pod. One thing I noticed: on the K8s pod, the last line of the JSON output from gpustat is "processes": [], but on bare metal the last line of the output is "processes": null. I'm not sure if this is the key to the bug.

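If the monitoring code distinguishes those two cases by checking for None rather than for an empty list, that difference alone would explain the zero, since summing over an empty process list gives 0. A hypothetical illustration (the actual SDK logic may differ):

```python
# Hypothetical: how "processes": [] vs "processes": null could diverge.
# This illustrates the suspected behaviour; it is not ClearML's code.
def mem_used_gb(gpu):
    processes = gpu["processes"]
    if processes is not None:  # an empty list still takes this branch...
        return float(sum(p["gpu_memory_usage"] for p in processes)) / 1024
    return float(gpu["memory.used"]) / 1024  # ...so this fallback never runs

pod = {"memory.used": 2038, "processes": []}            # K8s pod
bare_metal = {"memory.used": 2038, "processes": None}   # bare metal

print(mem_used_gb(pod))         # 0.0  -> matches the zero scalar
print(mem_used_gb(bare_metal))  # ~1.99 GB
```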