Open tyleryzhu opened 8 months ago
This is super weird; I haven't seen this before. One thing: are all 10 GPUs you're running on a single node?
Have you solved the issue? Mine hangs after loading from the checkpoint. I am using 2xV100.
@show981111 - can you tell me where exactly in the code you're noticing the hanging? Can you also dump RAM/GPU Memory Utilization?
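One way to find where a process is stuck, as a minimal sketch: Python's standard `faulthandler` module can dump every thread's stack trace if a block of code runs past a timeout. The `hang_watchdog` and `slow_step` names below are hypothetical and are not part of this repository; `slow_step` only stands in for whichever call is suspected of hanging (for example, the checkpoint load).

```python
import faulthandler
import sys
import time
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, log_file=None):
    """Dump all thread stacks to log_file every timeout_s seconds while the block runs.

    Hypothetical helper for illustration. log_file must be a real file object
    (it needs a file descriptor) and must stay open until the block exits.
    """
    target = log_file if log_file is not None else sys.stderr
    # repeat=True keeps dumping at each interval, so a long hang leaves several traces.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        # Disarm the watchdog as soon as the wrapped step finishes.
        faulthandler.cancel_dump_traceback_later()


def slow_step(seconds):
    # Stand-in for the suspected hanging call; replace with the real one.
    time.sleep(seconds)
```

Wrapping the suspected call as `with hang_watchdog(60, log_file=log): ...` in each process, with one log file per process, should show which line each rank is sitting on when it stalls. For memory, `nvidia-smi` on the node reports per-GPU utilization while the job is hung.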
Thank you for the incredible set of repositories (this one and prismatic-vlms), it has been a great joy using them. Very well-designed, configurable, and easy to use for researchers.
I'm running into a problem where evaluation hangs when run over multiple GPUs, precisely at the step where I load the local model checkpoint. This doesn't happen with a single GPU, and as far as I can tell it isn't simply that the load is taking an abnormally long time.
Here is the command I'm using to evaluate my own trained SigLIP Prismatic VLM:
which hangs on the line
This is running on 10x RTX 3090s.