training problem

Whiplash-18 commented 1 year ago

I ran into the following problem when training the model on the Panoptic dataset. I am using torch 1.13 and CUDA 11.8.

  File "/workspace/faster_voxel_pose/run/train.py", line 181, in <module>
    main()
  File "/workspace/faster_voxel_pose/run/train.py", line 151, in main
    train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
  File "/workspace/faster_voxel_pose/run/../lib/core/function.py", line 41, in train_3d
    final_poses, poses, proposal_centers, loss_dict, input_heatmap = model(views=inputs, meta=meta, targets=targets,\
  File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/faster_voxel_pose/run/../lib/models/voxelpose.py", line 38, in forward
    bbox_preds = self.pose_net(input_heatmaps, meta, cameras, resize_transform)
  File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/faster_voxel_pose/run/../lib/models/human_detection_net.py", line 94, in forward
    proposal_heatmaps_1d = self.c2c_net(torch.flatten(feature_1d, 0, 1)).view(batch_size, self.max_people, -1)
  File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/faster_voxel_pose/run/../lib/models/cnns_1d.py", line 131, in forward
    hm = self.output_hm(x)
  File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
  File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/fx/traceback.py", line 57, in format_stack
    return traceback.format_stack()
 (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "/workspace/faster_voxel_pose/run/train.py", line 181, in <module>
    main()
  File "/workspace/faster_voxel_pose/run/train.py", line 151, in main
    train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
  File "/workspace/faster_voxel_pose/run/../lib/core/function.py", line 71, in train_3d
    accu_loss.backward()
  File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 32, 1]] is at version 7; expected version 5 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
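
For readers unfamiliar with this error class, here is a minimal sketch of one common way to trigger it: an optimizer step updates weights in place while a second loss, built from the same forward pass, still needs the old weights for its backward pass. This is only an illustration on a toy model. The names `net`, `loss_a`, and `loss_b` are made up for the example, and it is an assumption, not something confirmed by the traceback, that the training loop above fails for the same reason.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer model (hypothetical, not the Faster-VoxelPose network).
net = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.1)

x = torch.randn(8, 4)
out = net(x)

# Two losses that share one forward graph.
loss_a = out.mean()
loss_b = (out ** 2).mean()

# Backward for the first loss, then update the weights in place.
loss_a.backward(retain_graph=True)
opt.step()

# The graph of loss_b saved the pre-update weights, so autograd's
# version check fails when it tries to use them.
try:
    loss_b.backward()
    error_message = None
except RuntimeError as err:
    error_message = str(err)

print(error_message)
```

Running both `backward()` calls before any `step()`, or giving each loss its own graph and optimizer, avoids the stale-weight situation in this toy case.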

cucdengjunli commented 1 year ago

same question

cucdengjunli commented 1 year ago

maybe you need to use V100

gpastal24 commented 1 year ago

@Whiplash-18 I had the same problem. You have to use torch 1.4 to train the models, so you will need a GPU that supports CUDA 10.x.

AlvinYH commented 1 year ago

Hi, @Whiplash-18. Thanks for your interest in our work. Yes, there was a bug in our earlier implementation. We fixed it by using two separate optimizers to train HDN and JLN. We've revised the code, so you can pull the latest release. It now supports newer PyTorch versions (>1.4).
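
As a rough illustration of the two-optimizer idea, the sketch below trains two toy stages with separate optimizers, so that stepping one never touches weights the other stage's backward pass still needs. It is not the repository's actual code: `hdn` and `jln` are hypothetical stand-in layers, the losses are placeholders, and detaching the first stage's output before feeding the second is an assumption made here to keep the two graphs independent.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the two stages (not the real HDN / JLN).
hdn = nn.Linear(4, 4)
jln = nn.Linear(4, 1)

# One optimizer per stage.
opt_hdn = torch.optim.Adam(hdn.parameters(), lr=1e-3)
opt_jln = torch.optim.Adam(jln.parameters(), lr=1e-3)

hdn_before = hdn.weight.detach().clone()
jln_before = jln.weight.detach().clone()

x = torch.randn(8, 4)

# Stage 1 forward and placeholder loss.
proposals = hdn(x)
loss_hdn = proposals.pow(2).mean()

# Stage 2 forward on a detached input (assumption), so its graph
# contains only jln's parameters.
poses = jln(proposals.detach())
loss_jln = poses.pow(2).mean()

# Update stage 1; this modifies only hdn's weights in place.
opt_hdn.zero_grad()
loss_hdn.backward()
opt_hdn.step()

# Stage 2 backward still succeeds: none of the tensors it saved
# were changed by opt_hdn.step().
opt_jln.zero_grad()
loss_jln.backward()
opt_jln.step()

updated_hdn = not torch.equal(hdn_before, hdn.weight)
updated_jln = not torch.equal(jln_before, jln.weight)
print(updated_hdn, updated_jln)
```

The design choice in this sketch is that each optimizer owns a disjoint set of parameters, so the order of `backward()` and `step()` calls across stages no longer matters for autograd's version checks.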

AlvinYH / Faster-VoxelPose

training problem #15