Closed: JustWon closed this issue 5 years ago
Which version of pytorch are you using? It's a strange error, since the metric is not involved in the backward pass at all.
@tano297 I am using pytorch 1.3.0
I haven't tried 1.3 yet. I'm on 1.1 because of TensorRT requirements on the Jetson platform. I will try to set up a docker image with 1.3 and see what's up, but that will be in the coming weeks. In the meantime I suggest sticking to 1.1 for this repo.
Alright, I'll try this on pytorch 1.1. Thanks!
Yeah! The version of pytorch was the problem. It is working with pytorch 1.1.
Refer to this issue: https://github.com/pytorch/pytorch/issues/31672. This PR looks like it fixes it: https://github.com/pytorch/pytorch/pull/31692
If anyone wants to run this repo with pytorch 1.4, there are two places that need to be modified in /train/tasks/semantic/modules/ioueval.py:
```python
# self.conf_matrix = torch.zeros(
#     (self.n_classes, self.n_classes), device=self.device).long()
self.conf_matrix = torch.zeros(
    (self.n_classes, self.n_classes), device=self.device).float()
```

and:

```python
if self.ones is None or self.last_scan_size != idxs.shape[-1]:
    # self.ones = torch.ones((idxs.shape[-1]), device=self.device).long()
    self.ones = torch.ones((idxs.shape[-1]), device=self.device).float()
    self.last_scan_size = idxs.shape[-1]
```
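For context, here is a minimal, self-contained sketch of the kind of accumulation this matrix is used for. It is not the repo's exact code; the class and method names are illustrative, and it assumes the confusion matrix is filled with `index_put_(..., accumulate=True)`, which is where the `.long()` → `.float()` change applies:

```python
import torch

# Illustrative stand-alone accumulator (hypothetical names, not ioueval.py itself),
# assuming the confusion matrix is built with index_put_(..., accumulate=True).
class ConfMatrix:
    def __init__(self, n_classes, device="cpu"):
        self.device = device
        # .float() instead of .long(), as in the modification above
        self.conf_matrix = torch.zeros(
            (n_classes, n_classes), device=device).float()

    def add_batch(self, pred, target):
        # pred, target: 1-D long tensors of class indices, same length
        idxs = torch.stack([pred, target], dim=0)
        ones = torch.ones(idxs.shape[-1], device=self.device).float()
        # accumulate=True sums `ones` into repeated (pred, target) cells
        self.conf_matrix.index_put_(tuple(idxs), ones, accumulate=True)
        return self.conf_matrix

cm = ConfMatrix(n_classes=3)
print(cm.add_batch(torch.tensor([0, 1, 2, 1]), torch.tensor([0, 1, 1, 1])))
```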
This is a bit dangerous, which is why I haven't changed it. Once the mantissa of the float value uses all 24 bits, adding one gives you back the same number, so the confusion matrix element gets stuck.
More precisely, 16777216 vs 16777217 for the 24-bit mantissa in IEEE-754. See this link to view the internal binary representation. Once you add 1 to 16777216 and get 16777216 back, you can never leave that state by adding one.
If you change to double, the problem moves from 16777216 (2^24) to 2^53, which is probably enough, but I don't know if this function is happy with double. I have to try it out. @muzi2045 can you have a look and let me know if double works?
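For anyone who wants to see the saturation directly, this is standard IEEE-754 behavior and easy to reproduce in pytorch:

```python
import torch

# float32 has a 24-bit significand: integers above 2**24 are no longer
# exactly representable, so adding 1 is silently lost.
x = torch.tensor(2.0**24, dtype=torch.float32)  # 16777216.0
print(x + 1 == x)  # tensor(True): the counter is stuck

# float64 has a 53-bit significand, so the same effect only appears at 2**53.
y = torch.tensor(2.0**53, dtype=torch.float64)
print(y + 1 == y)  # tensor(True), but 2**53 is roughly 9e15 counts
```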
@tano297
I have tested it with float(); the training results look normal.
I'll try double(). Thanks for the reply.
Hi, is there any difference between using float and double?
Thanks for the reply.
I tried to train the network from scratch, but I encountered the following runtime error.
I searched for this error on Google but found nothing related to it.
After a lot of attempts to fix this, I figured out that training works when the following line in "ioueval.py" is removed,
but then the IoU value is shown as zero.
Do you know what the problem is? Thanks.