Open lut777 opened 2 months ago
Which GPU memory usage is correct, 782 or 1660? If the latter, why is the limit exceeded?
Looks like 1660 should be the right value; not sure why 782 is showing up. My guess is you might be using `cudaMallocAsync` — see https://github.com/Project-HAMi/HAMi/issues/409.
I can't find LD_PRELOAD in the pod, is that correct?
Sounds like you're looking for /etc/ld.so.preload?
yes, 1660 is correct
Well, if the real GPU memory usage is 1660, then the problem is really serious, because the GPU memory limit is 1000. This looks like a bug; I guess some CUDA API entry points are not handled by HAMi-core. I will try to gather more information and post it later.
1. Issue or feature description
I deployed HAMi based on the doc.
I deployed a PyTorch pod with the following spec.
To clarify the usage of GPU memory, `hostPID: true` is set.
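The spec itself wasn't captured above. As a purely hypothetical illustration, a pod of the kind described (PyTorch image, `hostPID: true`, a 1000 MiB HAMi memory limit) would look roughly like this — the pod name, image, and resource keys are assumptions, so check your HAMi version's docs for the exact names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod              # hypothetical name
spec:
  hostPID: true              # to compare memory usage with the host view
  containers:
  - name: pytorch
    image: pytorch/pytorch   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
        nvidia.com/gpumem: 1000   # memory cap in MiB (assumed resource key)
```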
Now, in the pod, `nvidia-smi` shows this: [output not captured]. But outside the pod, the result is this: [output not captured].
Now the tricky part is: which GPU memory figure is right, 782 or 1660? And the HAMi-core log doesn't say much:
I thought I could get the right GPU memory usage in `cuMemAlloc_v2`, but I failed. So I have three questions:
1. Which GPU memory usage is correct, 782 or 1660? If the latter, why is the limit exceeded?
2. How can I determine the real GPU memory usage?
3. I can't find `LD_PRELOAD` in the pod; is that correct? I didn't find any error message in the webhook, though.

2. Steps to reproduce the issue
Apply the pod spec and compare the results inside and outside the pod.
3. Information to attach (optional if deemed irrelevant)
`nvidia-smi -a` in the pod: [output not captured]. And the result outside the pod: [output not captured].