amdegroot / ssd.pytorch

A PyTorch Implementation of Single Shot MultiBox Detector
MIT License
5.13k stars · 1.74k forks

RuntimeError: The shape of the mask [32, 8732] at index 0 does not match the shape of the indexed tensor [279424, 1] at index 0 #173

Open 17764591637 opened 6 years ago

17764591637 commented 6 years ago

rps@rps:~/桌面/ssd.pytorch$ python3 train.py
/home/rps/桌面/ssd.pytorch/ssd.py:34: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
  self.priors = Variable(self.priorbox.forward(), volatile=True)
/home/rps/桌面/ssd.pytorch/layers/modules/l2norm.py:17: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  init.constant(self.weight, self.gamma)
Loading base network...
Initializing weights...
train.py:214: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  init.xavier_uniform(param)
Loading the dataset...
Training SSD on: VOC0712
Using the specified args:
Namespace(basenet='vgg16_reducedfc.pth', batch_size=32, cuda=True, dataset='VOC', dataset_root='/home/rps/data/VOCdevkit/', gamma=0.1, lr=0.001, momentum=0.9, num_workers=4, resume=None, save_folder='weights/', start_iter=0, visdom=False, weight_decay=0.0005)
train.py:169: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
  targets = [Variable(ann.cuda(), volatile=True) for ann in targets]
Traceback (most recent call last):
  File "train.py", line 255, in <module>
    train()
  File "train.py", line 178, in train
    loss_l, loss_c = criterion(out, targets)
  File "/home/rps/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rps/桌面/ssd.pytorch/layers/modules/multibox_loss.py", line 97, in forward
    loss_c[pos] = 0  # filter out pos boxes for now
RuntimeError: The shape of the mask [32, 8732] at index 0 does not match the shape of the indexed tensor [279424, 1] at index 0

Can anyone help, please?
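For anyone debugging this: the failure can be reproduced in isolation. The shapes below come from the traceback; everything else is a hypothetical stand-in for what multibox_loss.py does. A boolean mask can only index a tensor whose leading dimensions match it, so loss_c has to be reshaped before the masked assignment:

```python
import torch

batch, num_priors = 32, 8732
pos = torch.zeros(batch, num_priors, dtype=torch.bool)  # mask of positive boxes
loss_c = torch.randn(batch * num_priors, 1)             # flattened conf loss

try:
    loss_c[pos] = 0  # mask [32, 8732] vs tensor [279424, 1]: shapes don't match
except (RuntimeError, IndexError) as e:                 # error class varies by version
    print(type(e).__name__)

loss_c = loss_c.view(batch, num_priors)  # reshape to [32, 8732] first...
loss_c[pos] = 0                          # ...and the masked assignment works
```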

isaactalx commented 6 years ago

I have the same error. Using PyTorch 0.4 + Python 3.5.

bobo0810 commented 6 years ago

Python 3.5 and PyTorch 0.3.0: no problem.

xscjun commented 6 years ago

I have the same error. If I switch lines 96 and 97 in multibox_loss.py,

    loss_c = loss_c.view(num, -1)
    loss_c[pos] = 0

this error disappears, but another one follows:

    File "/home/.../ssd.pytorch/layers/modules/multibox_loss.py", line 115, in forward
      loss_l /= N
    RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.cuda.LongTensor for argument #3 'other'

The tensor types do not match. How can I fix it?

slomrafgrav commented 6 years ago

@xscjun Change the line

    N = num_pos.data.sum()

to:

    N = num_pos.data.sum().double()
    loss_l = loss_l.double()
    loss_c = loss_c.double()

This should work.
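An equivalent fix (a sketch with made-up numbers, not the repository's code) keeps the losses in float and casts N instead. The point is only that numerator and denominator must share a dtype, because num_pos.sum() yields a LongTensor:

```python
import torch

loss_l = torch.tensor(12.5)          # float localization loss (made-up value)
num_pos = torch.tensor([[3], [5]])   # long tensor: positives per image
N = num_pos.sum()                    # still a LongTensor

loss_l = loss_l / N.float()          # cast N: float / float, not float / long
print(loss_l.item())                 # prints 1.5625
```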

gtwell commented 6 years ago

Has anyone solved this problem? Help me, thanks.

Lin-Zhipeng commented 6 years ago

> I have the same error. If I switch lines 96, 97 the error disappears, but another one follows: RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.cuda.LongTensor … How can I fix it? (quoting xscjun above)

The pos tensor is torch.Size([32, 8732]) and loss_c is torch.Size([279424, 1]). When I add one line:

        loss_c = loss_c.view(pos.size()[0], pos.size()[1])  # add line
        loss_c[pos] = 0  # filter out pos boxes for now
        loss_c = loss_c.view(num, -1)

Then it worked.

zxt-triumph commented 5 years ago

> I have the same error. If I switch lines 96, 97 the error disappears, but another one follows: RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.cuda.LongTensor … How can I fix it? (quoting xscjun above)

I have the same error. How did you solve it in the end?

zxt-triumph commented 5 years ago

> I have the same error. If I switch lines 96, 97 the error disappears, but another one follows: RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.cuda.LongTensor … How can I fix it? (quoting xscjun above)

I have the same error. How did you figure it out in the end?

matthewarthur commented 5 years ago

What file should be updated?

queryor commented 5 years ago

> I have the same error. If I switch lines 96, 97 the error disappears, but another one follows: RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.cuda.LongTensor … How can I fix it? (quoting xscjun above)

Change the data type of N to FloatTensor.

usherbob commented 5 years ago

> What file should be updated?

You may try updating your file /home/.../ssd.pytorch/layers/modules/multibox_loss.py and adding one line, as @LZP4GitHub said above.

subicWang commented 5 years ago

@usherbob Python 3.6 + PyTorch 0.4.1. I added "loss_c = loss_c.view(pos.size()[0], pos.size()[1])  # add line", but I have another issue: RuntimeError: copy_if failed to synchronize: device-side assert triggered

subicWang commented 5 years ago

Finally, I succeeded.
step 1: switch the two lines 97, 98:

    loss_c = loss_c.view(num, -1)
    loss_c[pos] = 0  # filter out pos boxes for now

step 2: change the line 144

    N = num_pos.data.sum()

to

    N = num_pos.data.sum().double()
    loss_l = loss_l.double()
    loss_c = loss_c.double()

CJJ-717 commented 5 years ago

> Finally, I succeeded. step 1: switch the two lines 97, 98; step 2: change line 144 N = num_pos.data.sum() to N = num_pos.data.sum().double(), loss_l = loss_l.double(), loss_c = loss_c.double(). (quoting subicWang above)

I changed it like this, but there is still a RuntimeError: device-side assert triggered. How can I fix it? Looking forward to your reply, thank you!

wisdomk commented 5 years ago

Changing the order of lines 97 and 98 throws a new error for me:

Traceback (most recent call last):
  File "train.py", line 254, in <module>
    train()
  File "train.py", line 182, in train
    loc_loss += loss_l.data[0]
IndexError: invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number

any suggestions?

PS: I also tried converting the loss to double as mentioned above, and still get the same error!


### Solved

Apparently loss_l.data[0] should be replaced with loss_l.item() instead. This replacement applies to every loss_x.data[0] in the file!
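The reason: since PyTorch 0.4, losses come back as 0-dim tensors, so indexing them with [0] is invalid and .item() is the way to extract the Python number. A minimal illustration:

```python
import torch

loss = torch.tensor(3.5)   # 0-dim tensor, like the value a loss module returns
assert loss.dim() == 0

# loss.data[0] raises IndexError on modern PyTorch; use .item() instead
value = loss.item()
print(value)               # 3.5, a plain Python float
```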

leaf918 commented 5 years ago

> Finally, I succeeded. step 1: switch the two lines 97, 98; step 2: change line 144 N = num_pos.data.sum() to N = num_pos.data.sum().double(), loss_l = loss_l.double(), loss_c = loss_c.double(). (quoting subicWang above)

Great, but there is a small bug: it is line 114, not line 144.

TianSong1991 commented 5 years ago

If your PyTorch version is 0.4.1, you can make the following change.
step 1: switch the two lines 97, 98:

    loss_c = loss_c.view(num, -1)
    loss_c[pos] = 0  # filter out pos boxes for now

step 2: change line 114

    N = num_pos.data.sum()

to

    N = num_pos.data.sum().double()
    loss_l = loss_l.double()
    loss_c = loss_c.double()

But if your PyTorch version is 1.0.1, that change is not enough.

TianSong1991 commented 5 years ago

I solved the problem for PyTorch 1.0.1. The solution is the following 3 steps; steps 1 and 2 change multibox_loss.py.
step 1: switch the two lines 97, 98:

    loss_c = loss_c.view(num, -1)
    loss_c[pos] = 0  # filter out pos boxes for now

step 2: change line 114

    N = num_pos.data.sum()

to

    N = num_pos.data.sum().double()
    loss_l = loss_l.double()
    loss_c = loss_c.double()

step 3 changes train.py: change lines 188, 189, 193, 196:

    loss_l.data[0] >> loss_l.data
    loss_c.data[0] >> loss_c.data
    loss.data[0] >> loss.data
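The two multibox_loss.py steps can be checked on toy tensors. Shapes and values here are made up; num, pos, loss_c, loss_l and num_pos only play the same roles as the variables in the repository's forward:

```python
import torch

num, num_priors = 2, 4                     # tiny stand-ins for 32 and 8732
pos = torch.tensor([[True, False, True, False],
                    [False, True, False, False]])
loss_c = torch.rand(num * num_priors, 1)   # flattened, as before the fix
loss_l = torch.tensor(7.0)

# step 1: reshape BEFORE masking, so the mask and tensor shapes agree
loss_c = loss_c.view(num, -1)
loss_c[pos] = 0  # filter out pos boxes for now

# step 2: promote to double so dividing by the Long count doesn't fail on 0.4.x
num_pos = pos.long().sum(1, keepdim=True)
N = num_pos.data.sum().double()            # 3 positives in total
loss_l = loss_l.double() / N
loss_c = loss_c.double().sum() / N
print(loss_l.item())
```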

charan1561 commented 5 years ago

The loss is increasing, as shown below:

    timer: 2.2050 sec.
    iter 0 || Loss: 153.4730 || timer: 1.8316 sec.
    iter 10 || Loss: 48.9679 || timer: 1.8920 sec.
    iter 20 || Loss: 191.8098 || timer: 2.0969 sec.
    iter 30 || Loss: 110.8081 || timer: 1.8849 sec.
    iter 40 || Loss: 106.9749 || timer: 1.9373 sec.
    iter 50 || Loss: 134.3674 || timer: 2.0012 sec.
    . . .

Help me solve this issue.

litianciucas commented 5 years ago

> I solve the problem if your torch version is 1.0.1: steps 1-2 change multibox_loss.py (swap lines 97, 98; make N, loss_l, loss_c double), step 3 changes train.py (loss_x.data[0] >> loss_x.data). (quoting TianSong1991 above)

Thanks, that is useful for me, but step 3 is lines 183, 184, 188, 191 (5 items): loss_x.data[0] >> loss_x.data or loss.data[0] >> loss.data.

blueardour commented 5 years ago

Would loss_x.data[0] >> loss_x.item() be better?

espectre commented 5 years ago

@TianSong1991 Thanks a lot. PyTorch 1.0 + Python 3.5: success!

zz10001 commented 5 years ago

> PS: I tried as well converting the loss to double as mentioned above and still the same error!

Much obliged!

mk123qwe commented 5 years ago

> I solve the problem if your torch version is 1.0.1: steps 1-2 change multibox_loss.py (swap lines 97, 98; make N, loss_l, loss_c double), step 3 changes train.py (loss_x.data[0] >> loss_x.data). (quoting TianSong1991 above)

But the loss is nan.

mk123qwe commented 5 years ago

@TianSong1991 Thanks a lot. PyTorch 1.0 + Python 3.5: success! But the loss is nan.

xafarranxera commented 5 years ago

> I solve the problem if your torch version is 1.0.1: steps 1-2 change multibox_loss.py (swap lines 97, 98; make N, loss_l, loss_c double), step 3 changes train.py (loss_x.data[0] >> loss_x.data). (quoting TianSong1991 above)

> but loss is nan

I have the same problem. Why is the loss nan?

OberstWB commented 5 years ago

> If your torch version is 0.4.1: swap lines 97, 98 and change line 114 to N = num_pos.data.sum().double(), loss_l = loss_l.double(), loss_c = loss_c.double(). But if your torch version is 1.0.1, that change is not enough. (quoting TianSong1991 above)

Hi, why isn't loss_l divided by N?

SalahAdDin commented 5 years ago

Same problem here.

I used @LZP4GitHub's solution and it is working fine, but I don't understand the difference between their solution and this one: https://github.com/amdegroot/ssd.pytorch/pull/322

mm1327 commented 5 years ago

I have the same error using PyTorch 1.1 + Python 3.6:

    loss_c[pos] = 0  # filter out pos boxes for now
    IndexError: The shape of the mask [32, 8732] at index 0 does not match the shape of the indexed tensor [279424, 1] at index 0

ashleylid commented 5 years ago

Pytorch version:

>>> import torch
>>> print(torch.__version__)
1.1.0

Python version:

Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux

multibox_loss.py:

Switch the two lines 97,98:
loss_c = loss_c.view(num, -1)
loss_c[pos] = 0 # filter out pos boxes for now
Change line 114 
N = num_pos.data.sum() -> N = num_pos.data.sum().double()
and change the following two lines to: 
loss_l = loss_l.double()
loss_c = loss_c.double()

train.py

loss_l.data[0] >> loss_l.data 
loss_c.data[0] >> loss_c.data 
loss.data[0] >> loss.data

And here is my output:

timer: 11.9583 sec.
iter 0 || Loss: 11728.9388 || timer: 0.2955 sec.
iter 10 || Loss: nan || timer: 0.2843 sec.
iter 20 || Loss: nan || timer: 0.2890 sec.
iter 30 || Loss: nan || timer: 0.2934 sec.
iter 40 || Loss: nan || timer: 0.2865 sec.
iter 50 || Loss: nan || timer: 0.2855 sec.
iter 60 || Loss: nan || timer: 0.2889 sec.
iter 70 || Loss: nan || timer: 0.2857 sec.
iter 80 || Loss: nan || timer: 0.2843 sec.
iter 90 || Loss: nan || timer: 0.2835 sec.
iter 100 || Loss: nan || timer: 0.2846 sec.
iter 110 || Loss: nan || timer: 0.2946 sec.
iter 120 || Loss: nan || timer: 0.2860 sec.
iter 130 || Loss: nan || timer: 0.2846 sec.
iter 140 || Loss: nan || timer: 0.2962 sec.
iter 150 || Loss: nan || timer: 0.2989 sec.
iter 160 || Loss: nan || timer: 0.2857 sec.

HaoWu1993 commented 5 years ago

> Pytorch 1.1.0, Python 3.6.7. multibox_loss.py: switch the two lines 97, 98; change line 114 to N = num_pos.data.sum().double() and make loss_l, loss_c double. train.py: loss_x.data[0] >> loss_x.data. Output: iter 0 || Loss: 11728.9388, then Loss: nan from iter 10 on. (quoting ashleylid above)

I've encountered the same one here. Have you solved this problem?

gtwell commented 5 years ago

The learning rate is too big.
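train.py exposes the learning rate through its --lr flag (default 0.001, per the Namespace dump in the first comment), so lowering it is the usual first response to an exploding loss. A sketch of the same change made directly on an optimizer; the parameter here is a dummy, and 1e-4 is just an example value:

```python
import torch

params = [torch.zeros(1, requires_grad=True)]  # dummy parameter
# same hyperparameters as train.py's defaults, but with lr cut from 1e-3 to 1e-4
optimizer = torch.optim.SGD(params, lr=1e-4, momentum=0.9, weight_decay=5e-4)
print(optimizer.param_groups[0]['lr'])
```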


Billnut commented 5 years ago

> Pytorch 1.1.0, Python 3.6.7. multibox_loss.py: switch the two lines 97, 98; change line 114 to N = num_pos.data.sum().double() and make loss_l, loss_c double. train.py: loss_x.data[0] >> loss_x.data. Output: iter 0 || Loss: 11728.9388, then Loss: nan from iter 10 on. (quoting ashleylid above)

I think the loss is enormous; you should add two lines:

    loss_l /= N
    loss_c /= N

mengxingkong commented 5 years ago

> Pytorch 1.1.0, Python 3.6.7. multibox_loss.py: switch the two lines 97, 98; change line 114 to N = num_pos.data.sum().double() and make loss_l, loss_c double. train.py: loss_x.data[0] >> loss_x.data. Output: Loss: nan from iter 10 on. (quoting ashleylid and HaoWu1993 above)

I didn't change line 114, and then the nan loss disappeared.

Billnut commented 4 years ago

The loss values loc_loss and conf_loss get huge and overflow. You can use the following code:

    N = num_pos.data.sum().double()
    loss_l = loss_l.double()
    loss_c = loss_c.double()
    loss_l /= N
    loss_c /= N

And in train.py, you should use the following two lines instead of your code:

    loc_loss += loss_l.item()
    conf_loss += loss_c.item()

With best wishes, better luck, good fortune.

 


haibochina commented 4 years ago

> The loss values loc_loss, conf_loss overflow; use N = num_pos.data.sum().double(), loss_l /= N, loss_c /= N, and in train.py loc_loss += loss_l.item(), conf_loss += loss_c.item(). (quoting Billnut above)

Great, it works very well! Thank you!

SalahAdDin commented 4 years ago

@haibochina What?

haibochina commented 4 years ago

> @haibochina What?

It means that the loss values loc_loss and conf_loss go out of range. So you can change the source code as follows: N = num_pos.data.sum(), loss_l /= N, loss_c /= N, loc_loss += loss_l.item(), conf_loss += loss_c.item().

SalahAdDin commented 4 years ago

I think PRs are welcome.

up2m commented 4 years ago

Thank you @haibochina! About the issue of loss=nan, your method works very well!

J0hannB commented 4 years ago

I also had a nan loss issue after fixing multibox_loss.py

In my case it was because I was trying to use custom annotations and loading them as [x_center, y_center, width, height]

If anyone else is trying to do the same thing, the correct format is [x1, y1, x2, y2]

Training works now
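If your custom annotations are center-based, a small conversion helper does the trick. The name and signature below are mine for illustration, not part of the repository:

```python
def center_to_corners(box):
    """Convert [x_center, y_center, width, height] to [x1, y1, x2, y2]."""
    xc, yc, w, h = box
    return [xc - w / 2.0, yc - h / 2.0, xc + w / 2.0, yc + h / 2.0]

# e.g. a box centered at (0.5, 0.5) with width 0.2 and height 0.4
print(center_to_corners([0.5, 0.5, 0.2, 0.4]))
```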

Json0926 commented 4 years ago

> Pytorch 1.1.0, Python 3.6.7. multibox_loss.py: switch the two lines 97, 98; change line 114 to N = num_pos.data.sum().double() and make loss_l, loss_c double. train.py: loss_x.data[0] >> loss_x.data. Output: iter 0 || Loss: 11728.9388, then Loss: nan from iter 10 on. (quoting ashleylid above)

Because the loss was too big, I changed line 115 to

   N = num_pos.data.sum().double()
   loss_l = loss_l.double()
   loss_c = loss_c.double()
   loss_l /= N
   loss_c /= N

and that solved the issue.

ynjiun commented 4 years ago

@TianSong1991, I followed your solution and got it running normally, but after a while (after iter 90) the loss exploded to nan. Did you experience the same thing?

    timer: 6.1760 sec.
    iter 0 || Loss: 31.7677 || timer: 0.3297 sec.
    iter 10 || Loss: 24.6710 || timer: 0.3164 sec.
    iter 20 || Loss: 24.0278 || timer: 0.3214 sec.
    iter 30 || Loss: 25.0901 || timer: 0.3184 sec.
    iter 40 || Loss: 16.9485 || timer: 0.3358 sec.
    iter 50 || Loss: 17.5748 || timer: 0.3850 sec.
    iter 60 || Loss: 26.2674 || timer: 0.3207 sec.
    iter 70 || Loss: 20.7441 || timer: 0.3213 sec.
    iter 80 || Loss: 16.5515 || timer: 0.3206 sec.
    iter 90 || Loss: 25808.9131 || timer: 0.3171 sec.
    iter 100 || Loss: nan || timer: 0.3274 sec.
    ... (Loss stays nan through iter 300)

ynjiun commented 4 years ago

With @TianSong1991's solution, except that step 3 is changed to the following. step 3: change train.py, lines 183, 184, 188, 191:

    loss_l.data[0] >> loss_l.item()
    loss_c.data[0] >> loss_c.item()
    loss.data[0] >> loss.item()

now loss is converging...

    timer: 6.1581 sec.
    iter 0 || Loss: 32.3338 || timer: 0.3283 sec.
    iter 10 || Loss: 24.8091 || timer: 0.3328 sec.
    iter 20 || Loss: 24.4980 || timer: 0.3275 sec.
    iter 30 || Loss: 21.3105 || timer: 0.3167 sec.
    iter 40 || Loss: 14.5682 || timer: 0.3223 sec.
    iter 50 || Loss: 13.0729 || timer: 0.3221 sec.
    iter 60 || Loss: 12.3032 || timer: 0.3383 sec.
    iter 70 || Loss: 10.5260 || timer: 0.3246 sec.
    iter 80 || Loss: 11.2028 || timer: 0.3380 sec.
    iter 90 || Loss: 10.1715 || timer: 0.3244 sec.
    iter 100 || Loss: 10.1702 || timer: 0.3342 sec.
    ... (steadily dropping to around 9 by iter 290)

yingjun-zhang commented 4 years ago

> With @TianSong1991's solution, except that step 3 is changed to use loss_x.item() … now the loss is converging. (quoting ynjiun above)

what's your torch version and python version?

He-zl8 commented 4 years ago

When I encountered:

timer: 10.2599 sec. iter 0 || Loss: 30.8010 ||
timer: 0.4961 sec. iter 10 || Loss: 19.9977 ||
timer: 1.1120 sec. iter 20 || Loss: 19.2539 ||
timer: 1.8164 sec. iter 30 || Loss: 16.7701 ||
timer: 0.9436 sec. iter 40 || Loss: 18.0430 ||
timer: 0.7898 sec. iter 50 || Loss: 25.5106 ||
timer: 1.0395 sec. iter 60 || Loss: 23.7020 ||
timer: 0.8617 sec. iter 70 || Loss: nan ||
timer: 1.0497 sec. iter 80 || Loss: nan ||
timer: 1.2802 sec.

Maybe you can change lr to 1e-4. When I did, I got:

timer: 10.1423 sec. iter 0 || Loss: 29.5713 ||
timer: 0.4259 sec. iter 10 || Loss: 22.9357 ||
timer: 1.2987 sec. iter 20 || Loss: 20.2871 ||
timer: 1.1511 sec. iter 30 || Loss: 20.0152 ||
timer: 0.9707 sec. iter 40 || Loss: 19.3170 ||
timer: 0.9684 sec. iter 50 || Loss: 19.0578 ||
timer: 1.0160 sec. iter 60 || Loss: 19.2979 ||
timer: 1.2673 sec. iter 70 || Loss: 18.9950 ||
timer: 1.1985 sec. iter 80 || Loss: 16.6445 ||
timer: 1.2570 sec.
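One way to act on that advice programmatically. A hedged sketch: `check_loss` is a hypothetical helper, not part of train.py, and the learning-rate value just mirrors the suggestion above.

```python
import math

# Assumed starting point: lr lowered from the repo default 1e-3 to 1e-4.
learning_rate = 1e-4

def check_loss(loss_value):
    """Stop early instead of logging 'Loss: nan' for hundreds of iterations."""
    if math.isnan(loss_value):
        raise RuntimeError('loss diverged to NaN; try a smaller learning rate')
    return loss_value

check_loss(29.5713)          # fine, training continues
# check_loss(float('nan'))   # would raise RuntimeError
```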

cotyyang commented 4 years ago

I solved the problem; this applies if your torch version is 1.0.1. The solution is the following three steps. Steps 1 and 2 change multibox_loss.py:

step 1: swap the two lines 97 and 98, so they read:
loss_c = loss_c.view(num, -1)
loss_c[pos] = 0  # filter out pos boxes for now

step 2: change line 114 from
N = num_pos.data.sum()
to
N = num_pos.data.sum().double()
loss_l = loss_l.double()
loss_c = loss_c.double()

Step 3 changes train.py:

step 3: on lines 188, 189, 193, 196 replace:
loss_l.data[0] -> loss_l.data
loss_c.data[0] -> loss_c.data
loss.data[0] -> loss.data

Thanks, this answer solved my problem.
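A minimal, self-contained sketch of the step-1 shape fix, using the dimensions from this issue's error message (batch size 32, 8732 priors). Variable names mirror multibox_loss.py, but the values are random stand-ins:

```python
import torch

num, num_priors = 32, 8732
# loss_c comes out of the confidence-loss step shaped [num*num_priors, 1],
# which is the [279424, 1] tensor in the error message.
loss_c = torch.rand(num * num_priors, 1)
# pos is the positive-box mask, shaped [num, num_priors] = [32, 8732].
pos = torch.zeros(num, num_priors, dtype=torch.bool)
pos[:, :10] = True

# Broken order (masking first): indexing a [279424, 1] tensor with a
# [32, 8732] mask raises the RuntimeError in this issue's title.
# loss_c[pos] = 0

# Fixed order: reshape to [32, 8732] first, then apply the mask.
loss_c = loss_c.view(num, -1)
loss_c[pos] = 0  # filter out pos boxes for now
```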

hjlee9182 commented 4 years ago

> I solved the problem (torch 1.0.1) with the three steps above: swap lines 97/98 in multibox_loss.py, cast N and the losses to double, and replace loss_l.data[0], loss_c.data[0], loss.data[0] with loss_l.data, loss_c.data, loss.data in train.py.

This answer solved my problem as well. More precisely, in step 2:
loss_l = loss_l.double()/N
loss_c = loss_c.double()/N :)
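That refinement, sketched with made-up stand-in values (in the real multibox_loss.py, `loss_l`, `loss_c`, and `num_pos` come from the matching step): normalize both losses by the number of matched positive boxes after casting to double.

```python
import torch

# Stand-in values: summed localization and confidence losses, plus the
# per-image positive-match counts for a batch of 3 images.
loss_l = torch.tensor(120.0)
loss_c = torch.tensor(300.0)
num_pos = torch.tensor([10, 12, 8])

N = num_pos.sum().double()        # 30 matched boxes in total
loss_l = loss_l.double() / N
loss_c = loss_c.double() / N
print(loss_l.item(), loss_c.item())  # 4.0 10.0
```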

He-zl8 commented 4 years ago

Thank you. I've solved the problem. Thank you again.


Certseeds commented 3 years ago

If the loss is NaN, the learning rate may be too large.