jax79sg opened this issue 1 year ago
@jax79sg ClearML basically uses psutil to get these values, can you try and see what psutil returns inside these pods?
Hi @jkhenning , psutil doesn't offer GPU stats (https://github.com/giampaolo/psutil). Did you mean another utility?
Oh, apologies, I meant gpustat 🙂
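(For anyone checking this from inside the pod: a minimal sketch, assuming only that the gpustat Python package is installed there; this is an illustrative probe, not ClearML's actual monitoring code.)

```python
# Illustrative probe: query the GPUs the same way `gpustat --json` does,
# to see what a monitor running inside the pod would have to work with.
import gpustat

stats = gpustat.GPUStatCollection.new_query()
for gpu in stats.gpus:
    # memory figures are reported in MiB; processes is the per-process list
    print(gpu.index, gpu.name, gpu.memory_used, gpu.memory_total, gpu.processes)
```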
Hi @jkhenning , could you help indicate which keys here correspond to the keys in the :monitor:gpu scalars?
gpustat==1.0.0 nvidia-ml-py==11.495.46
gpustat --json
{
"hostname": "clearml-id-xxx",
"driver_version": "525.85.12",
"query_time": "2023-06-20T08:16:09.570881",
"gpus": [
{
"index": 0,
"uuid": "GPU-xxx",
"name": "Tesla V100-SXM2-32GB",
"temperature.gpu": 35,
"fan.speed": null,
"utilization.gpu": 0,
"utilization.enc": 0,
"utilization.dec": 0,
"power.draw": 57,
"enforced.power.limit": 300,
"memory.used": 2038,
"memory.total": 32768,
"processes": []
}
]
}
@jax79sg ,
gpu_0_mem_usage is calculated as 100. * float(<memory.used>) / float(<memory.total>).
gpu_0_mem_used_gb is calculated as float(sum(<processes.gpu_memory_usage>)) / 1024 if there are any processes associated with the task; otherwise it should be float(<memory.used>) / 1024.
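(A minimal sketch of that logic, assuming a gpustat-style GPU entry like the JSON above; the function name and exact fallback condition here are illustrative, not the actual SDK code.)

```python
def gpu_mem_scalars(gpu):
    # gpu_0_mem_usage: percentage of total card memory in use
    mem_usage = 100. * float(gpu["memory.used"]) / float(gpu["memory.total"])

    # gpu_0_mem_used_gb: sum of the per-process usage when processes are
    # attributed to the task, otherwise fall back to the card's used memory
    processes = gpu.get("processes")
    if processes:
        mem_used_gb = float(sum(p["gpu_memory_usage"] for p in processes)) / 1024
    else:
        mem_used_gb = float(gpu["memory.used"]) / 1024
    return mem_usage, mem_used_gb
```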
@jkhenning There are no processes detected by gpustat, even though the only task running is the ClearML agent's spawned task. gpu_0_mem_usage is showing the right values; gpu_0_mem_used_gb is not, it's showing zero.
Is there a sure way to reproduce it? It didn't happen when I tested it, and the only thing I can think of is some weird bug that causes the SDK to think there's a process associated with the task when in fact there isn't...
Hi @jkhenning . Well, I reproduced this in a K8SGlue-spawned K8s pod. One thing I noticed: on a K8s pod, the last line of the JSON output from gpustat is "processes": [], but on bare metal the last line of the output is "processes": null. I'm not sure if this is the key to the bug.
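(To make that suspicion concrete: if the fallback only triggers when processes is null/None rather than when the list is merely empty, the two environments would diverge exactly as reported. A hypothetical illustration, not the actual SDK code, using the memory.used value from the JSON above.)

```python
memory_used_mib = 2038                    # "memory.used" from the gpustat output above

for processes in ([], None):              # K8s pod reports [], bare metal reports null
    if processes is None:                 # hypothetical fallback condition
        mem_used_gb = memory_used_mib / 1024
    else:
        mem_used_gb = sum(p["gpu_memory_usage"] for p in processes) / 1024
    print(processes, "->", round(mem_used_gb, 2))

# Output:
#   [] -> 0.0     (K8s pod: gpu_0_mem_used_gb stuck at zero)
#   None -> 1.99  (bare metal: falls back to memory.used)
```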
Describe the bug
Scalar :monitor:gpu gpu_0_mem_used_gb is always showing zero. However, gpu_0_mem_usage is showing 60. I'm assuming the latter is a percentage while the former is supposed to show absolute GB.
To reproduce
Running off K8SGlue with a queue that spawns a pod with a single V100 (32 GB VRAM). Submit a training job to the ClearML queue.
Expected behaviour
If gpu_0_mem_usage is showing 60%, then gpu_0_mem_used_gb should show roughly 0.6 * 32 = 19.2 GB.
Environment
Related Discussion
https://clearml.slack.com/archives/CTK20V944/p1687167090695239?thread_ts=1687162877.050179&cid=CTK20V944