QhelDIV / ShapeFormer

Official repository for the ShapeFormer Project
https://shapeformer.github.io/

error when using multi-gpu training #7

Closed linjing7 closed 1 year ago

linjing7 commented 1 year ago

Hi, thanks for your excellent work. I successfully trained VQDIF-16 with 1 GPU. However, training is slow (30 epochs/day), which means it would take ten days to train 300 epochs. So I tried multi-GPU training, but then the error No module named xgutils occurs. Do you have any idea about this issue? BTW, I noticed that the pretrained vqdif you provide is from epoch 31, but the default max_epoch is 300. So do we need to train for 300 epochs?

QhelDIV commented 1 year ago

I guess you cloned the repo without the flag --recursive, so the submodule xgutils was not cloned. You can fix this by running git submodule update --init --recursive at the repo root.

linjing7 commented 1 year ago

Hi, I cloned the repo with the flag --recursive, but the error still exists. Training with one GPU works fine, so I don't think the error is caused by the xgutils submodule not being cloned. Have you tried multi-GPU training with this code?

linjing7 commented 1 year ago

Besides, do we need to train for 300 epochs, as set by default?

kajalsanklecha commented 3 weeks ago

Did the code work on multi-GPU? How did the error ModuleNotFoundError: No module named 'xgutils' go away? Even after cloning with --recursive, it still persists for me.

If anyone has solved this, could you please help me?

QhelDIV commented 3 weeks ago

So first make sure xgutils is not empty. Then make sure you are in the root of the ShapeFormer folder when you run the command. Adding the argument '--gpu 0 1 2 4' will then enable multi-GPU training.
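The first check above (that xgutils is not empty) can be scripted. The following is a minimal sketch, not part of the ShapeFormer code: the helper name check_submodule is hypothetical, and it assumes the submodule lives in a folder named xgutils directly under the repository root, as the comments in this thread suggest.

```python
from pathlib import Path


def check_submodule(repo_root, name="xgutils"):
    """Report whether the submodule folder exists and holds any files.

    Hypothetical helper for illustration; the folder name "xgutils" at the
    repository root is an assumption taken from this thread.
    """
    path = Path(repo_root) / name
    exists = path.is_dir()
    # A submodule that was never initialized is typically an empty folder.
    # Hidden entries (such as a ".git" file) are ignored here.
    non_empty = exists and any(
        not entry.name.startswith(".") for entry in path.iterdir()
    )
    return {"path": str(path), "exists": exists, "non_empty": non_empty}


if __name__ == "__main__":
    import os

    # Run from the directory you launch training from; it should be the repo root.
    print(check_submodule(os.getcwd()))
```

If non_empty is reported as False, running git submodule update --init --recursive at the repo root, as suggested earlier in the thread, is the first thing to try.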

kajalsanklecha commented 3 weeks ago

Yes, the command is run from the root of the ShapeFormer repository and xgutils is not empty. It still gives the error that xgutils is not found.

And my GPU numbers are 0 1 2 3. Do I still need to add '--gpu 0 1 2 4', or should it be '--gpu 0 1 2 3'?
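Since the error appears only with multiple GPUs, one thing worth ruling out is that a freshly started Python process cannot find xgutils even though the launching process can. This thread does not confirm that as the cause, and how ShapeFormer starts its workers is not described here, so the sketch below is only a diagnostic. The helper name importable_in_child is hypothetical; it starts a child interpreter and asks whether a top-level module can be located, optionally with an extra directory prepended to PYTHONPATH.

```python
import os
import subprocess
import sys


def importable_in_child(module, extra_path=None, cwd=None):
    """Return True if a new Python process can locate the top-level `module`.

    Hypothetical diagnostic helper; it does not reproduce how any particular
    training launcher spawns its workers.
    """
    env = dict(os.environ)
    if extra_path is not None:
        existing = env.get("PYTHONPATH")
        # Prepend the extra directory to the child's module search path.
        env["PYTHONPATH"] = (
            extra_path if not existing else extra_path + os.pathsep + existing
        )
    # The child exits with 0 if the module is found and 1 otherwise.
    code = (
        "import importlib.util, sys; "
        "sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)"
    )
    result = subprocess.run(
        [sys.executable, "-c", code, module], env=env, cwd=cwd
    )
    return result.returncode == 0


if __name__ == "__main__":
    repo_root = os.getcwd()  # assumed to be the ShapeFormer root
    print("found by default:", importable_in_child("xgutils"))
    print("found with PYTHONPATH:", importable_in_child("xgutils", extra_path=repo_root))
```

If the module is found only when the repository root is added to PYTHONPATH, exporting PYTHONPATH to include the repo root before launching multi-GPU training may be worth trying.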