facebookresearch / LaViLa

Code release for "Learning Video Representations from Large Language Models"
MIT License

Run locally on multiple GPUs #29

Open maximotus opened 8 months ago

maximotus commented 8 months ago

Hello,

great work! What are the minimal adaptations I need to apply to the code so that I can run the narrator on multiple GPUs locally? nn.DataParallel is not optimal since I would need to adapt the model classes.

Cheers, Max

zhaoyue-zephyrus commented 8 months ago

Hi @maximotus

Could you make the question more clear? If you are referring to running "inference", then you don't need parallelism at all. If you mean running a "training" job, we use torch.nn.parallel.DistributedDataParallel.
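For reference, a minimal DistributedDataParallel setup looks roughly like the sketch below (illustrative only; the stand-in model, the file name, and the launch command are placeholders rather than the actual LaViLa training code):

```python
# ddp_sketch.py -- minimal DistributedDataParallel example, launched with e.g.
#   torchrun --nproc_per_node=4 ddp_sketch.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for every spawned process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # stand-in model; a real training script would build the actual model here
    model = nn.Linear(512, 512).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients are averaged across ranks

    # ... build a DistributedSampler-backed DataLoader and run the usual training loop ...

if __name__ == "__main__":
    main()
```

Each spawned process owns one GPU, so this parallelizes training across devices rather than splitting a single model.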

maximotus commented 8 months ago

Hi @zhaoyue-zephyrus,

sure. I was trying to run your demo script. My goal is to produce captions for short video clips of 1 second:

`python demo_narrator.py --video-path "../path/to/my/video"`

If I run it without the --cuda flag, it works, but it needs about 70 seconds of inference time per clip with nucleus k=10 on my device. So I wanted to speed it up using GPU(s).

But if I pass the --cuda flag, so the command is `python demo_narrator.py --cuda --video-path "../path/to/my/video"`, my GPU with 10 GB is not enough and I get a RuntimeError: CUDA error: out of memory.

However, I thought enabling parallelism could solve this since I have 4 GPUs with 10 GB each available. But I could not manage to make this work with your code easily.

So I now wonder how I can run the inference on more than one GPU so that I no longer run into the RuntimeError: CUDA error: out of memory.

I tried wrapping your model with torch.nn.DataParallel after line 57. In that case I can see the model weights being distributed across 2 GPUs, but when it comes to the specific function calls in line 74 and line 75 it fails, since a model wrapped with torch.nn.DataParallel can only be called through the default forward method (see https://discuss.pytorch.org/t/dataparallel-model-with-custom-functions/75053/10).

So I was thinking about adapting your code for this, e.g. routing the custom methods like encode_image and generate through the default forward method and doing a case selection inside forward (see the sketch below).
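Roughly what I have in mind is a thin wrapper like the following (just a sketch; encode_image and generate are the method names from the demo, but the real signatures may differ):

```python
import torch
import torch.nn as nn

class DataParallelFriendly(nn.Module):
    """Routes custom methods through forward() so nn.DataParallel can intercept them."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, mode, *args, **kwargs):
        # nn.DataParallel only parallelizes forward(), so we dispatch on a mode string
        if mode == "encode_image":
            return self.model.encode_image(*args, **kwargs)
        elif mode == "generate":
            return self.model.generate(*args, **kwargs)
        raise ValueError(f"unknown mode: {mode}")

# usage sketch:
# model = nn.DataParallel(DataParallelFriendly(narrator_model), device_ids=[0, 1]).cuda()
# image_features = model("encode_image", frames)
# tokens = model("generate", image_features)
```

Although, as far as I understand, nn.DataParallel still replicates the full model on every device and only splits the batch dimension, so it may not actually reduce the per-GPU memory needed for a single clip.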

However, I thought it would be good to ask you about this issue first since I may have overlooked a more trivial solution.

Cheers, Max

Anirudh257 commented 5 months ago

Hi @maximotus, did you figure this out?

You will need to use at least a 20 GB GPU.

If not, I think the issue is that you need model parallelism. Have a look at https://www.deepspeed.ai/tutorials/pipeline/ for pipeline (model) parallelism.
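As a rough illustration of the model-parallel idea (not tested against LaViLa; the submodule names visual and text_decoder below are hypothetical stand-ins for whatever the narrator model actually exposes):

```python
import torch

def split_across_gpus(model):
    # place the video encoder and the language decoder on different GPUs
    model.visual = model.visual.to("cuda:0")               # hypothetical attribute
    model.text_decoder = model.text_decoder.to("cuda:1")   # hypothetical attribute
    return model

def caption(model, frames):
    with torch.no_grad():
        feats = model.visual(frames.to("cuda:0"))
        # move the intermediate features to the decoder's device before generating
        return model.text_decoder.generate(feats.to("cuda:1"))
```

This keeps each half of the model within a single 10 GB card at the cost of one device-to-device transfer per clip; DeepSpeed's pipeline parallelism automates the same idea more generally.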