2017-fall-DL-training-program / ImageCaption

rnn runtime error question #9

Open ericbillwang opened 6 years ago

ericbillwang commented 6 years ago

Dear TA,

I got the runtime error below: RuntimeError: Expected hidden size (1, 1L, 512), got (1L, 50L, 512L)

Below is my source code, in which I confirmed the final rnn_input is 1 x batch x input_size. Can you give me some idea of a debugging direction?

    # V = reshaped image features (att_feats)
    V = att_feats.view(-1, self.att_feat_size)        # (batch * att_size) x att_feat_size
    # att = Wv * V
    att = self.ctx2att(V)                             # (batch * att_size) x att_hid_size
    att = att.view(-1, att_size, self.att_hid_size)   # batch x att_size x att_hid_size
    # att_h = Wh * hidden
    att_h = self.h2att(state[0])                      # 1 x batch x att_hid_size
    att_h = att_h.squeeze()                           # batch x att_hid_size
    att_h = att_h.unsqueeze(1)                        # batch x 1 x att_hid_size
    # note: batch size 50 is hard-coded here
    att_h = att_h.expand_as(torch.Tensor(50, att_size, self.att_hid_size))
    # e = w * tanh(att + att_h)
    tmp_e = att + att_h                               # batch x att_size x att_hid_size
    tmp_e = tmp_e.tanh()
    tmp_e = tmp_e.view(-1, self.att_hid_size)         # (batch * att_size) x att_hid_size
    e = self.alpha_net(tmp_e)                         # (batch * att_size) x 1
    e = e.squeeze()
    e = e.view(-1, att_size)                          # batch x att_size
    # alpha = softmax(e)
    alpha = F.softmax(e)                              # batch x att_size
    alpha = alpha.unsqueeze(1)                        # batch x 1 x att_size
    # V = reshaped image features (att_feats)
    V = V.view(-1, att_size, self.att_feat_size)      # batch x att_size x att_feat_size
    # C = alpha * V (attention-weighted context)
    C = torch.bmm(alpha, V)                           # batch x 1 x att_feat_size
    C = C.squeeze()                                   # batch x att_feat_size
    #print(C.size())

    # Use the rnn to generate the output
    # input: concatenation of (xt, C), size = (1 x batch_size x input_size)
    rnn_input = torch.cat((xt, C), 1)                 # batch x input_size
    rnn_input = rnn_input.unsqueeze(0)                # 1 x batch x input_size
    print(rnn_input.size())
    output, state = self.rnn(rnn_input, state)

Thanks, Ericbill

ericbillwang commented 6 years ago

The issue above was resolved. Root cause: a wrong parameter when instantiating the rnn <-- the batch_first parameter was missing.
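
For anyone hitting the same error, here is a minimal sketch (illustrative sizes and current PyTorch syntax, not the lab code) of how batch_first changes the shape contract between the input and the hidden state:

    import torch
    import torch.nn as nn

    batch, input_size, hidden_size = 50, 512, 512    # made-up sizes

    # Default batch_first=False: input is (seq_len, batch, input_size)
    # and the hidden state is (num_layers, batch, hidden_size).
    rnn = nn.LSTM(input_size, hidden_size, num_layers=1)
    x = torch.randn(1, batch, input_size)            # seq_len = 1
    h0 = torch.randn(1, batch, hidden_size)
    c0 = torch.randn(1, batch, hidden_size)
    out, state = rnn(x, (h0, c0))                    # ok

    # With batch_first=True, the same (1, batch, input_size) tensor is
    # read as (batch=1, seq_len=50, ...), so the LSTM expects a hidden
    # state of size (1, 1, 512) and raises exactly
    # "Expected hidden size (1, 1, 512), got (1, 50, 512)".
    rnn_bf = nn.LSTM(input_size, hidden_size, num_layers=1, batch_first=True)
    # out, state = rnn_bf(x, (h0, c0))               # RuntimeError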

After that, I got the error below in the common code <-- any known issue here?

File "train.py", line 235, in train(opt) File "train.py", line 143, in train loss = crit(model_output, labels[:,1:], masks[:,1:]) File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 224, in call result = self.forward(*input, *kwargs) File "/home/student41/lab2/cp_lab2_image_caption/misc/utils.py", line 87, in forward output = - input.gather(1, target) mask File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 684, in gather return Gather.apply(self, dim, index) File "/usr/local/lib/python2.7/dist-packages/torch/autograd/_functions/tensor.py", line 560, in forward return input.gather(dim, index) RuntimeError: invalid argument 2: Input tensor must have same size as output tensor apart from the specified dimension at /pytorch/torch/lib/THC/generic/THCTensorScatterGather.cu:29

Thanks, Ericbill

connie980149 commented 6 years ago

Based on this information, I think the size of your model_output is not correct. The size of the core's output should be (50L, 512L) in your case; if you didn't modify the code in class CaptionModel, there should be no error.

ericbillwang commented 6 years ago

Connie, thanks for the comment. I simply tried to hack my code as below so that the output is (50L, 512L), but I got other errors... Can you comment on whether I have a fundamental issue with w * tanh(Wv*V + Wh*H_{t-1}) in my original post? Thanks, Ericbill

    # input: concatenation of (xt, C), size = (1 x batch_size x input_size)
    output, state = self.rnn(rnn_input, state)
    output = output.squeeze()

    (50L, 512L)
    (50L, 512L)
    Traceback (most recent call last):
      File "train.py", line 236, in <module>
        train(opt)
      File "train.py", line 146, in train
        utils.clip_gradient(optimizer, opt.grad_clip)
      File "/home/student41/lab2/cp_lab2_image_caption/misc/utils.py", line 99, in clip_gradient
        param.grad.data.clamp_(-grad_clip, grad_clip)
    AttributeError: 'NoneType' object has no attribute 'data'

ericbillwang commented 6 years ago

The issue was resolved after implementing the MLP layer. But the weird thing is: why did the program fail without the MLP layer, given that I already supplied the required dimensions (50L, 512L)? Maybe the squeeze() function removes some important attribute?
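
(For the record, squeeze() itself keeps the autograd connection, so it should not be what removes the gradients; a quick sketch to check, assuming a current PyTorch:)

    import torch

    # squeeze() is a differentiable view op: gradients flow through it.
    x = torch.randn(1, 50, 512, requires_grad=True)
    y = x.squeeze()               # (50, 512)
    y.sum().backward()
    print(x.grad.shape)           # torch.Size([1, 50, 512]) -- grad reached x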

Peter-Chuang commented 6 years ago

Hello Ericbill, I have the same issue... What do you mean by "after implementing the MLP layer"? Could you give me a hint? Thank you.

Peter

fansia commented 6 years ago

Hi TA,

Would you please provide the sizes returned by ShowAttendTellCore's forward function? alpha.size() = ? output.size() = ?

That would make it easier for us to decide whether to use squeeze/unsqueeze.

Thank you, Eric

connie980149 commented 6 years ago

The return sizes of ShowAttendTellCore from my results are: alpha.size() = batch x 49, output.size() = batch x rnn_size.

I have no idea why you guys have this issue if the dimensions are the same...

fansia commented 6 years ago

I checked the alpha and output sizes from ShowAttendTellCore: alpha.size() = batch x 49, output.size() = 1 x batch x rnn_size.

After output = output.squeeze() ==> output.size() = batch x rnn_size.

However, I still get the same error as ericbill. (screenshot attached)

Puff-Wen commented 6 years ago

Hi Fansia, that means some items are defined in your __init__ but not used in forward.
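
A minimal sketch of that failure mode (hypothetical module, not the lab code): parameters that never take part in the forward pass get no gradient from backward(), so param.grad stays None and clip_gradient-style code crashes on .data:

    import torch
    import torch.nn as nn

    class Core(nn.Module):
        def __init__(self):
            super(Core, self).__init__()
            self.used = nn.Linear(8, 8)
            self.unused = nn.Linear(8, 8)   # defined but never called

        def forward(self, x):
            return self.used(x)

    model = Core()
    model(torch.randn(4, 8)).sum().backward()
    for name, p in model.named_parameters():
        print(name, p.grad is None)   # unused.* -> True, so p.grad.data fails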

fansia commented 6 years ago

Hi Puff,

Thank you. A classmate said the dataset may have been modified unintentionally. I finally solved the problem by getting data.zip from the /dataset folder again, and now it works.

doom8199 commented 6 years ago

Hi fansia: I have the same problem: 'NoneType' object has no attribute 'data'. But in my case the issue came from redundant code in the __init__ function (h2rnn & att2rnn). My training works well after removing it.