PRBonn / lidar-bonnetal

Semantic and Instance Segmentation of LiDAR point clouds for autonomous driving
http://semantic-kitti.org
MIT License

RuntimeError: "embedding_backward" not implemented for 'Long' #22

Closed: JustWon closed this issue 4 years ago

JustWon commented 4 years ago

I tried to train the network from scratch, but I encountered the following runtime error.

[screenshot: RuntimeError: "embedding_backward" not implemented for 'Long']

I searched for this error on Google but found nothing related to it.

After many attempts to fix this, I figured out that training works when the following line in "ioueval.py" is removed:

self.conf_matrix = self.conf_matrix.index_put_(tuple(idxs), self.ones, accumulate=True)

but then the IoU value is shown as zero. [screenshot: IoU reported as 0]

Do you know what the problem is? Thanks.
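
For anyone trying to reproduce this outside the repo, here is a minimal sketch of the failing call (class count, shapes, and variable names are assumptions, loosely following ioueval.py):

import torch

# build a long confusion matrix and scatter-add into it, as ioueval.py does
n_classes = 20
conf_matrix = torch.zeros((n_classes, n_classes)).long()
pred = torch.randint(0, n_classes, (1000,))      # predicted labels
gt = torch.randint(0, n_classes, (1000,))        # ground-truth labels
idxs = torch.stack([pred, gt], dim=0)            # 2 x N index pairs
ones = torch.ones((idxs.shape[-1],)).long()

# on pytorch 1.3 this line reportedly raises:
#   RuntimeError: "embedding_backward" not implemented for 'Long'
conf_matrix = conf_matrix.index_put_(tuple(idxs), ones, accumulate=True)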

tano297 commented 4 years ago

Which version of pytorch are you using? It's a strange error, since the metric is not involved in the backward pass at all.

JustWon commented 4 years ago

@tano297 I am using pytorch 1.3.0

[screenshot: installed pytorch version]

tano297 commented 4 years ago

I haven't tried 1.3 yet; I'm on 1.1 because of TensorRT requirements on the Jetson platform. I will try to set up a Docker image with 1.3 and see what's up, but that will be in the coming weeks. In the meantime, I suggest sticking to 1.1 for this repo.

JustWon commented 4 years ago

Alright, I'll try this on pytorch 1.1. Thanks!

JustWon commented 4 years ago

Yeah! The version of pytorch was the problem; it is working with pytorch 1.1.

[screenshot: training running normally on pytorch 1.1]

muzi2045 commented 4 years ago

Refer to this issue: https://github.com/pytorch/pytorch/issues/31672. This PR looks like it fixes it: https://github.com/pytorch/pytorch/pull/31692

muzi2045 commented 4 years ago

If anyone wants to run this repo on pytorch 1.4, there are two places that need to be modified, both in /train/tasks/semantic/modules/ioueval.py:

# first place: build the confusion matrix as float instead of long
# self.conf_matrix = torch.zeros(
#     (self.n_classes, self.n_classes), device=self.device).long()
self.conf_matrix = torch.zeros(
    (self.n_classes, self.n_classes), device=self.device).float()

# second place: the ones tensor must match the confusion matrix dtype
if self.ones is None or self.last_scan_size != idxs.shape[-1]:
    # self.ones = torch.ones((idxs.shape[-1]), device=self.device).long()
    self.ones = torch.ones((idxs.shape[-1]), device=self.device).float()
    self.last_scan_size = idxs.shape[-1]

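A sketch of the same two assignments with double() instead of float() (tano297 explains below why double is safer for large counts):

# alternative: double keeps integer counts exact up to 2**53 instead of 2**24
self.conf_matrix = torch.zeros(
    (self.n_classes, self.n_classes), device=self.device).double()
self.ones = torch.ones((idxs.shape[-1]), device=self.device).double()
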
tano297 commented 4 years ago

This is a bit dangerous, which is why I haven't changed it. Once the float value has used all 24 bits of its mantissa, adding one yields the same number, and that confusion matrix element gets stuck.

More precisely, 16777216 vs 16777217 for a 24-bit mantissa in IEEE-754. See this link to view the internal binary representation. When you add 1 to 16777216 you get 16777216 back, so you can never leave that state by adding one.

If you change to double, the problem moves from 16777216 (2^24) to 2^53, which is probably enough, but I don't know if this function is happy with double. I have to try it out. @muzi2045, can you have a look and let me know if double works?
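
A quick sketch of the saturation tano297 describes, using plain torch tensors:

import torch

# float32 has a 24-bit effective mantissa: once a counter reaches 2**24,
# adding one rounds back to the same value and the count is stuck
f = torch.tensor(2.0 ** 24, dtype=torch.float32)
print((f + 1) == f)   # tensor(True): the float32 counter no longer advances

# float64 has a 53-bit effective mantissa, so the same counter stays exact
# until 2**53, far beyond any realistic confusion-matrix count
d = torch.tensor(2.0 ** 24, dtype=torch.float64)
print((d + 1) == d)   # tensor(False): double still counts correctly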

muzi2045 commented 4 years ago

@tano297
I have tested it with float() and the training results look normal. I'll try double(). Thanks for the reply!

LZDSJTU commented 4 years ago

> @tano297 I have tested it with float() and the training results look normal. I'll try double(). Thanks for the reply!

Hi, is there any difference between using float and double?

Thanks for the reply.