Evaluation hangs with accelerate over multiple gpus.

tyleryzhu commented 8 months ago

Thank you for the incredible set of repositories (this one and prismatic-vlms), it has been a great joy using them. Very well-designed, configurable, and easy to use for researchers.

I'm running into a problem where evaluation hangs when run over multiple GPUs, precisely at the step where I load the local model checkpoint. This doesn't happen with just one GPU, however, and as far as I can tell it isn't simply that loading is taking an abnormally long time.

Here is the command I'm using to evaluate my own trained SigLIP Prismatic VLM:

accelerate launch --num_processes=10 scripts/evaluate.py \
    --model_dir ../prismatic-vlms/runs/prism-siglip \
    --model_id prism-siglip \
    --dataset.type text-vqa-slim

which hangs on the line

| >> [*] Loading VLM prism-siglip-controlled+7b from Checkpoint; Freezing Weights 🥶       load.py:98

This is running on 10x RTX 3090s.

siddk commented 8 months ago

This is super weird; haven’t seen this before. One thing: are all 10 GPUs you’re running on on a single node?
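
A quick way to check the single-node question is to have every launched process report its hostname and rank. The sketch below uses only the standard library and assumes the launcher exports the usual `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables; `describe_process` is a hypothetical helper, not part of this repository.

```python
import os
import socket


def describe_process(env=os.environ) -> str:
    """Summarize where this process runs and which rank it was assigned.

    Assumption: the launcher sets RANK, LOCAL_RANK and WORLD_SIZE.
    Missing variables are reported as "?" instead of raising.
    """
    rank = env.get("RANK", "?")
    local_rank = env.get("LOCAL_RANK", "?")
    world_size = env.get("WORLD_SIZE", "?")
    visible = env.get("CUDA_VISIBLE_DEVICES", "unset")
    return (
        f"host={socket.gethostname()} rank={rank} local_rank={local_rank} "
        f"world_size={world_size} cuda_visible_devices={visible}"
    )


if __name__ == "__main__":
    # flush=True so the line appears even if the process hangs right after.
    print(describe_process(), flush=True)
```

If every line prints the same `host=` value, all processes are on one node. If fewer lines appear than processes requested, some ranks never started.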

show981111 commented 7 months ago

Have you solved the issue? Mine hangs after loading from the checkpoint. I am using 2xV100.

siddk commented 6 months ago

@show981111 - can you tell me where exactly in the code you're noticing the hanging? Can you also dump RAM/GPU Memory Utilization?
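
One possible way to gather both pieces of information, sketched with the Python standard library: `faulthandler` can print the stack of every thread if the process is still running after a timeout, which shows the line it is stuck on. The helper names below are hypothetical, and calling `arm_hang_watchdog()` near the top of the evaluation script is an assumption about where it would be useful. The GPU report relies on `nvidia-smi` being on the `PATH`, and the host RAM figure uses the Unix-only `resource` module.

```python
import faulthandler
import resource
import shutil
import subprocess
import sys


def arm_hang_watchdog(timeout_s: float = 300.0, log_file=sys.stderr) -> None:
    """Dump all thread tracebacks to log_file every timeout_s seconds.

    The process keeps running (exit=False); the dump only reveals where
    each thread is currently blocked.
    """
    faulthandler.dump_traceback_later(
        timeout_s, repeat=True, file=log_file, exit=False
    )


def disarm_hang_watchdog() -> None:
    """Cancel the pending dump once the suspicious section has finished."""
    faulthandler.cancel_dump_traceback_later()


def peak_host_memory() -> int:
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def gpu_memory_report() -> str:
    """Per-GPU used/total memory from nvidia-smi, or a note if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found"
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,memory.total",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip() or result.stderr.strip()
```

Arming the watchdog before the checkpoint load and disarming it afterwards keeps the output limited to the step that hangs. Each process writes its own dump, so the tracebacks from different ranks can be compared.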

TRI-ML / vlm-evaluation, issue #4