Vegetebird / StridedTransformer-Pose3D

[TMM 2022] Exploiting Temporal Contexts with Strided Transformer for 3D Human Pose Estimation
MIT License

multi-GPU for training #19

Closed SleepEarlyLiveLong closed 1 year ago

SleepEarlyLiveLong commented 2 years ago

Hello, thank you for your awesome work. I'm having trouble using multiple GPUs for training:

I added `model = nn.DataParallel(model)` before line 187 of main.py (`all_param = []`), but it doesn't work and gives an error:

```
Traceback (most recent call last):
  File "main.py", line 190, in <module>
    for i_model in model:
TypeError: 'DataParallel' object is not iterable
```

Can you please tell me how to solve this problem? Thank you!

Vegetebird commented 2 years ago

You can use `model['trans'] = nn.DataParallel(model['trans'])`.
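
A minimal sketch of that fix, with a stand-in module in place of the repo's actual network (the `'trans'` key follows main.py; the layer sizes and input below are purely illustrative):

```python
import torch
import torch.nn as nn

# The networks live in a plain dict, which nn.DataParallel cannot wrap
# directly (hence "'DataParallel' object is not iterable" when the whole
# dict was wrapped). Wrap each entry, which is an nn.Module, individually.
model = {'trans': nn.Linear(10, 10).cuda()}       # stand-in for the real model

model['trans'] = nn.DataParallel(model['trans'])  # splits batches across GPUs

x = torch.randn(8, 10).cuda()
y = model['trans'](x)  # forward pass is replicated on all visible GPUs
```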

SleepEarlyLiveLong commented 2 years ago

Thank you a lot! It works when I run `python main.py`. However, when I run with refine, it fails with the following error:

Run:

```
python main.py --refine --lr 1e-5 --reload --previous_dir checkpoint/1003_1041_53_351_no/
```

Errors:

```
INFO: Training on 3119616 frames
INFO: Testing on 543360 frames
checkpoint/1003_1041_53_351_no/no_refine_4_4668.pth
  0%|          | 0/24372 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 198, in <module>
    loss = train(opt, actions, train_dataloader, model, optimizer_all, epoch)
  File "main.py", line 23, in train
    return step('train', opt, actions, train_loader, model, optimizer, epoch)
  File "main.py", line 80, in step
    loss.backward()
  File "/home/cty/miniconda3/envs/pose2/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/cty/miniconda3/envs/pose2/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 1024]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```
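
The hint at the end of the traceback refers to PyTorch's anomaly-detection mode; a minimal sketch of turning it on (placing it near the top of main.py is an assumption) looks like this:

```python
import torch

# Record forward-pass stack traces so the backward-pass error points at
# the operation that performed the in-place modification. This slows
# training considerably, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)
```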

I tried adding code like this:

```
model['trans'] = nn.DataParallel(model['trans'])
model['refine'] = nn.DataParallel(model['refine'])
```

It still doesn't work. So, can you please tell me how to use multiple GPUs when adding the refine module? Thank you a lot!

Vegetebird commented 2 years ago

Maybe you can try torch==1.7.1, or you can modify https://github.com/Vegetebird/StridedTransformer-Pose3D/blob/9d988ac54234c5acc6a67ae746ce5bdbea204f8a/model/block/refine.py#L18 to `nn.ReLU(inplace=True)`.
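
For reference, a sketch of the second suggestion; the `RefineBlock` class below is a hypothetical stand-in, since only line 18 of refine.py is being changed:

```python
import torch.nn as nn

class RefineBlock(nn.Module):
    """Hypothetical stand-in for the module in model/block/refine.py."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1024, 1024)
        # The suggested edit at refine.py line 18: build the activation
        # with inplace=True (the surrounding layers here are assumed).
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.fc(x))
```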

SleepEarlyLiveLong commented 1 year ago

Thank you! Using torch==1.7.1 avoids that problem.