Open maximotus opened 8 months ago
Hi @maximotus
Could you make the question clearer? If you are referring to running "inference", then you don't need parallelism at all. If you mean running a "training" job, we use `torch.nn.parallel.DistributedDataParallel`.
Hi @zhaoyue-zephyrus,
sure. I was trying to run your demo script. My goal is to produce captions for short video clips of 1 second.

`python demo_narrator.py --video-path "../path/to/my/video"`

If I do so without the `--cuda` flag, it works, but it needs about 70 seconds of inference time per clip with nucleus sampling (k=10) on my device. So I wanted to speed things up using GPU(s). But if I pass the cuda flag, so the command is `python demo_narrator.py --cuda --video-path "../path/to/my/video"`, my GPU with 10 GB is not enough and I get a `RuntimeError: CUDA error: out of memory`.
However, I thought enabling parallelism could solve this, since I have 4 GPUs with 10 GB each available, but I could not easily make this work with your code. So I am wondering how I can run the inference on more than one GPU so that I no longer get the `RuntimeError: CUDA error: out of memory`.
I tried wrapping your model with `torch.nn.DataParallel` after line 57. In this case, I can observe that the model weights are distributed across 2 GPUs, but when it comes to the specific function calls in line 74 and line 75, it fails, since models wrapped with `torch.nn.DataParallel` can only be called through the default forward method (compare https://discuss.pytorch.org/t/dataparallel-model-with-custom-functions/75053/10).
So I was thinking about adapting your code for this (e.g. so that the custom methods like `encode_image` and `generate` are routed through the default forward method, with a case selection inside forward). However, I thought it would be good to ask you about this issue first, since I may have overlooked a more trivial solution.
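For reference, the case-selection idea could look roughly like this (a minimal sketch; the wrapper class and the `mode` argument are my own names, and `encode_image`/`generate` stand in for the narrator's actual methods):

```python
import torch
import torch.nn as nn


class DispatchWrapper(nn.Module):
    """Routes custom methods through forward(), which is the only
    entry point that torch.nn.DataParallel replicates across GPUs."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, mode, *args, **kwargs):
        # Case selection inside forward, as described above.
        if mode == "encode_image":
            return self.model.encode_image(*args, **kwargs)
        if mode == "generate":
            return self.model.generate(*args, **kwargs)
        raise ValueError(f"unknown mode: {mode!r}")


# Usage sketch (device ids assumed):
# model = nn.DataParallel(DispatchWrapper(model), device_ids=[0, 1, 2, 3]).cuda()
# image_features = model("encode_image", frames)
```

Note that `nn.DataParallel` splits the input batch across GPUs, so this raises throughput but does not shrink the memory footprint of a single model replica; it would not by itself fix a per-clip out-of-memory error.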
Cheers, Max
Hi @maximotus did you figure this out?
You will need to use at least a 20 GB GPU.
If not, I think you will need model parallelism. Look into https://www.deepspeed.ai/tutorials/pipeline/ for pipeline (model) parallelism.
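For a rough idea of what model parallelism means here, independent of DeepSpeed: the model is split into stages that live on different GPUs, and activations are moved between devices inside `forward()`, so each GPU only needs to hold part of the weights. A minimal hand-rolled sketch, assuming the model can be cut into two sequential stages (the stage split and device names are assumptions, not part of the repo):

```python
import torch
import torch.nn as nn


class TwoStagePipeline(nn.Module):
    """Places each stage on its own device and moves activations
    between them, so no single GPU holds all of the weights."""

    def __init__(self, stage0, stage1, dev0="cuda:0", dev1="cuda:1"):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.stage0 = stage0.to(dev0)
        self.stage1 = stage1.to(dev1)

    def forward(self, x):
        x = self.stage0(x.to(self.dev0))
        return self.stage1(x.to(self.dev1))


# e.g. pipeline = TwoStagePipeline(model.visual, model.text_decoder)
# (the submodule names here are hypothetical)
```

DeepSpeed's pipeline engine additionally overlaps the stages by micro-batching, but the device placement idea is the same.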
Hello,
great work! What are the minimal adaptations I need to apply to the code so I can run the narrator on multiple GPUs locally? `nn.DataParallel` is not optimal, since I would need to adapt the model classes.

Cheers, Max