When I tried to use fp16 to speed up my model training, I got the following error:
File "train.py", line 107, in <module>
main(opt)
File "train.py", line 81, in main
log_dict_train, _ = trainer.train(epoch, train_loader)
File "/home/wx/hoi/PPDM-pt1/src/lib/trainers.py", line 143, in train
ret, results = self.run_epoch(model_with_loss, epoch, data_loader)
File "/home/wx/hoi/PPDM-pt1/src/lib/trainers.py", line 100, in run_epoch
output, loss, loss_states = model_with_loss(batch)
File "/home/wx/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wx/anaconda3/envs/torch/lib/python3.7/site-packages/apex/parallel/distributed.py", line 560, in forward
result = self.module(*inputs, **kwargs)
File "/home/wx/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wx/hoi/PPDM-pt1/src/lib/trainers.py", line 23, in forward
outputs = self.model(batch['input'])
File "/home/wx/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wx/hoi/PPDM-pt1/src/lib/models/networks/pose_dla_dcn.py", line 376, in forward
x = self.dla_up(x)
File "/home/wx/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wx/hoi/PPDM-pt1/src/lib/models/networks/pose_dla_dcn.py", line 305, in forward
ida(layers, len(layers) - i - 2, len(layers))
File "/home/wx/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wx/hoi/PPDM-pt1/src/lib/models/networks/pose_dla_dcn.py", line 279, in forward
layers[i] = upsample(project(layers[i]))
File "/home/wx/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wx/hoi/PPDM-pt1/src/lib/models/networks/pose_dla_dcn.py", line 251, in forward
x = self.conv(x)
File "/home/wx/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wx/hoi/PPDM-pt1/src/lib/models/networks/DCNv2/dcn_v2.py", line 170, in forward
self.deformable_groups,
File "/home/wx/hoi/PPDM-pt1/src/lib/models/networks/DCNv2/dcn_v2.py", line 37, in forward
ctx.deformable_groups,
RuntimeError: expected scalar type Float but found Half
So I tried to fix this bug. Adding the torch.cuda.amp decorators to the _DCNv2 forward and backward functions seems to work well on my machine:
Ubuntu 18.04
RTX 2080Ti
CUDA 10.1
PyTorch 1.7.1
This is my test script, and I think it needs more careful experimentation.
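
Separately from that script, here is a minimal sketch of the decorator pattern, assuming the decorators in question are torch.cuda.amp.custom_fwd and torch.cuda.amp.custom_bwd. ToyFloatOnlyOp is a hypothetical stand-in for an op whose kernel only accepts float32; it is not the actual _DCNv2 code. With cast_inputs=torch.float32, floating-point CUDA tensor inputs are cast to float32 when the op is called inside an autocast region, which should avoid the "expected scalar type Float but found Half" error for a float-only kernel.

```python
import torch
from torch.cuda.amp import custom_fwd, custom_bwd


class ToyFloatOnlyOp(torch.autograd.Function):
    """Hypothetical stand-in for an op backed by a float32-only kernel."""

    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)  # inside autocast: cast inputs to float32
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        return x * weight

    @staticmethod
    @custom_bwd  # backward runs with the same autocast state as forward
    def backward(ctx, grad_output):
        x, weight = ctx.saved_tensors
        # Gradients of x * weight with respect to x and weight.
        return grad_output * weight, grad_output * x


# Outside an autocast region the decorators are a no-op, so this also runs on CPU.
x = torch.randn(4, requires_grad=True)
w = torch.randn(4, requires_grad=True)
y = ToyFloatOnlyOp.apply(x, w)
y.sum().backward()
```

Outside autocast the decorated function behaves like the undecorated one, so the change should only affect mixed-precision runs. Newer PyTorch releases may emit a deprecation warning for these imports.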