Jeff-sjtu / HybrIK

Official code of "HybrIK: A Hybrid Analytical-Neural Inverse Kinematics Solution for 3D Human Pose and Shape Estimation", CVPR 2021
MIT License

one of the variables needed for gradient computation has been modified by an inplace operation #53

Open · with-twilight opened this issue 3 years ago

with-twilight commented 3 years ago

Hello, I am very interested in your work. I now have the following problem:

Traceback (most recent call last):
  File "/home/ubuntu/data/nsga/HybrIK/scripts/train_smpl.py", line 375, in <module>
    main()
  File "/home/ubuntu/data/nsga/HybrIK/scripts/train_smpl.py", line 238, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(opt, cfg))
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/ubuntu/data/nsga/HybrIK/scripts/train_smpl.py", line 323, in main_worker
    loss, acc17 = train(opt, train_loader, m, criterion, optimizer, writer)
  File "/home/ubuntu/data/nsga/HybrIK/scripts/train_smpl.py", line 79, in train
    loss.backward()
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [768]], which is output 0 of IndexPutBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
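Following the hint at the end of the error, anomaly detection can be switched on to get a second stack trace that points at the forward operation whose output was later modified in place. The snippet below is a minimal sketch, not HybrIK code: where exactly to call set_detect_anomaly in scripts/train_smpl.py is an assumption, and the toy tensors merely reproduce the same class of error.

# Minimal sketch (not from the HybrIK repo): enable autograd anomaly detection
# so the backward error also prints the forward trace of the offending op.
import torch

torch.autograd.set_detect_anomaly(True)  # assumed to be placed before training starts

# Toy reproduction of the same class of error: sigmoid saves its output for
# backward, and the in-place index assignment then bumps its version counter.
x = torch.randn(4, requires_grad=True)
y = torch.sigmoid(x)
y[0] = 0.0               # in-place IndexPut on a tensor autograd still needs
try:
    y.sum().backward()   # raises the "modified by an inplace operation" error
except RuntimeError as err:
    print(err)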

Looking forward to your reply!

Jeff-sjtu commented 2 years ago

Hi, could you provide more details, such as which config file you ran?

with-twilight commented 2 years ago

System: Ubuntu 16.04
CUDA: 10.2
PyTorch: 1.2
torchvision: 0.4

This problem occurs while running:

./scripts/train_smpl.sh train_res34 ./configs/256x192_adam_lr1e-3-res34_smpl_3d_base_2x_mix.yaml

I don't know how to solve this problem, but the error did not appear again when I re-ran the code a few days later. The program had been interrupted for personal reasons, and the error above appeared after I restarted it.

with-twilight commented 2 years ago

Maybe the problem is the PyTorch version I installed. I didn't notice that I was running in a different environment yesterday: pytorch 1.9, torchvision 0.3.

Jeff-sjtu commented 2 years ago

Hi @with-twilight, it seems the problem is caused by the PyTorch environment. Our code is tested on PyTorch 1.2, so there may be some bugs with PyTorch 1.9.
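To rule out the environment, a minimal sketch like the one below (not part of the repo) prints the versions the training run is actually using; the expected versions are just the ones mentioned in this thread (PyTorch 1.2, torchvision 0.4, CUDA 10.2).

# Minimal sketch: print the versions loaded by the interpreter that runs train_smpl.py.
import torch
import torchvision

print("torch       :", torch.__version__)
print("torchvision :", torchvision.__version__)
print("CUDA (torch):", torch.version.cuda)
print("GPUs visible:", torch.cuda.device_count())

# The repo is reported to be tested on PyTorch 1.2; warn if something else is active.
if not torch.__version__.startswith("1.2"):
    print("Warning: this checkout was tested on PyTorch 1.2; other versions "
          "may behave differently (e.g. stricter in-place checks in autograd).")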

lulindeng commented 2 years ago

> Hi @with-twilight, it seems the problem is caused by the PyTorch environment. Our code is tested on PyTorch 1.2, so there may be some bugs with PyTorch 1.9.

Hi, thanks for the great work! I also encountered exactly the same problem.

pytorch: 1.2.0
torchvision: 0.4.0
cuda: 10.2
python: 3.6

The train_smpl.sh is:

EXPID=$1
CONFIG=$2

python ./scripts/train_smpl.py \
    --nThreads 10 \
    --launcher pytorch --rank 0 \
    --dist-url tcp://localhost:23456 \
    --exp-id ${EXPID} \
    --cfg ${CONFIG} --seed 123123

I tried to train with 2 GPUs, so I ran this in the terminal:

CUDA_VISIBLE_DEVICES=2,7 ./scripts/train_smpl.sh train_res34 ./configs/test_config.yaml

The test_config.yaml only changes the dataset paths and sets WORLD_SIZE to 2.

Looking forward to your reply!

biansy000 commented 2 years ago

@lulindeng I ran into similar problems before. After I switched to pytorch==1.6.0, the problem disappeared.

lulindeng commented 2 years ago

> @lulindeng I ran into similar problems before. After I switched to pytorch==1.6.0, the problem disappeared.

Thank you! I solved the problem by using the revised code in this issue: https://github.com/Jeff-sjtu/HybrIK/issues/35#issuecomment-887304816
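The exact patch from issue #35 is not reproduced here. As a generic illustration only (not the actual HybrIK fix), this class of error is usually resolved by replacing an in-place index assignment on a tensor that autograd has saved for backward with an out-of-place construction, for example via torch.where:

# Generic illustration (not the patch from issue #35): rewrite an in-place
# index assignment as an out-of-place operation so the tensor saved for
# backward is never mutated.
import torch

x = torch.randn(8, requires_grad=True)
y = torch.sigmoid(x)                 # sigmoid saves y for its backward pass

mask = torch.zeros_like(y, dtype=torch.bool)
mask[:4] = True

# In-place variant that can trigger the RuntimeError during backward:
#   y[mask] = 0.0

# Out-of-place alternative: build a new tensor instead of mutating y.
y_fixed = torch.where(mask, torch.zeros_like(y), y)
y_fixed.sum().backward()             # backward succeeds; y was never modified
print(x.grad.shape)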