Element-Research / rnn

Recurrent Neural Network library for Torch7's nn
BSD 3-Clause "New" or "Revised" License

Implementation of DRAM model #180

Open vyouman opened 8 years ago

vyouman commented 8 years ago

Hi, I'm trying to implement the Deep Recurrent Attention Model described in the paper http://arxiv.org/pdf/1412.7755v2.pdf, applied to image caption generation instead of image classification. I should be able to reuse most of the modules from the RAM model implemented in the rnn package. In this case I don't need to modify the Reinforce.lua interface or ReinforceNormal.lua, since they can already handle a table of per-batch rewards at every time step. All I need to do there is write a new Criterion, and I've written one. I also think I need to modify the RecurrentAttention module.
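Roughly, the skeleton of the kind of criterion I mean looks like the following sketch (the name nn.WordReward and the toy word-matching reward are just placeholders for illustration, not the criterion I actually wrote):

require 'nn'
require 'dpnn' -- provides Module:reinforce

local WordReward, parent = torch.class('nn.WordReward', 'nn.Criterion')

function WordReward:__init(module, scale)
   parent.__init(self)
   self.module = module   -- the network containing the Reinforce modules
   self.scale = scale or 1
end

function WordReward:updateOutput(input, target)
   -- input: batchSize x vocabSize scores, target: batchSize word indices
   -- toy reward: 1 when the argmax prediction matches the target word
   local _, pred = input:max(2)
   self.reward = pred:view(-1):eq(target:long()):double():mul(self.scale)
   -- broadcast the per-example reward to every Reinforce module (dpnn)
   self.module:reinforce(self.reward)
   self.output = -self.reward:sum() / input:size(1)
   return self.output
end

function WordReward:updateGradInput(input, target)
   -- the REINFORCE path is not differentiated through; return zeros
   self.gradInput = input.new():resizeAs(input):zero()
   return self.gradInput
end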

There is a context network, a 3-layer convolutional network (or some other CNN) described in the paper, that extracts features from the low-resolution image and feeds them to the second recurrent layer as its initial state, from which the first location is produced. I've come up with two approaches:

  1. Maybe I should assemble the context network, the second recurrent layer and the location network into the locator expected by RecurrentAttention.
  2. Or I could run the context network over the low-resolution image independently and feed its features to the 2nd recurrent layer as the initial state at the first time step. The second approach seems more efficient and easier to implement than the first, since images are repeated across (image, caption) pairs (one image can have more than one caption). So I want to wrap the second recurrent layer and the location network in a Recursor to act as the locator expected by the RecurrentAttention module. Maybe I don't really need to modify the input at the first time step; the zero tensor would then go directly into the second recurrent layer: https://github.com/Element-Research/rnn/blob/master/RecurrentAttention.lua#L44-L48 But I do have to initialize the initial state of the second LSTM layer. How can I do that? I read through the LSTM code, and I think I should set userPrevOutput and userPrevCell, right? https://github.com/Element-Research/rnn/blob/master/LSTM.lua#L142-L144 For example, given an LSTM instance lstm, I would use something like

lstm.userPrevOutput = torch.Tensor(batchSize, outputSize):fill(1)
lstm.userPrevCell = torch.Tensor(batchSize, outputSize):fill(0.5)
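In other words, for approach 2 the idea would be to use the context network's features as that initial hidden state, something like the following sketch (contextNet and lowResImages are just placeholder names):

local context = contextNet:forward(lowResImages)            -- batchSize x outputSize features
lstm.userPrevOutput = context                                -- h[0] of the second LSTM
lstm.userPrevCell = context.new():resizeAs(context):zero()   -- c[0] starts at zero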

For the second question: I'll need to feed some additional input to the first recurrent layer; in my case, the word vector at every time step. Finally, instead of predicting the classes of multiple objects, I want to predict a caption describing the image. There is also some logic needed to handle captions of variable length, so I'll probably have to encapsulate an LSTM layer for that. We can think of it as a language model without the final LogSoftmax layer; call it lm for now. Then, wrapping it together with the glimpse network as an rnn, as in https://github.com/Element-Research/rnn/blob/master/examples/recurrent-visual-attention.lua#L130-L131, we use

rnn = nn.Recurrent(opt.hiddenSize, glimpse, lm, nn[opt.transfer](), 99999)

and at last wrap the rnn and the locator as in https://github.com/Element-Research/rnn/blob/master/examples/recurrent-visual-attention.lua#L145. So here comes the question: how can I pass in an additional input, like the word vector, that is not part of the direct input the RecurrentAttention module feeds to its rnn? The rnn here is composed of the glimpse network and a recurrent layer: the glimpse network expects an input of {image, location}, and in this case the recurrent layer expects not only the g_t vector produced by the glimpse network but also the word vector. Should I modify the RecurrentAttention module to accept more inputs? Or can I leave the input of RecurrentAttention alone and let the word vector come directly from the lm module I'm going to implement above? Do you think that is feasible, or do you have a more elegant way to implement it?

I'm new to the rnn package and Torch7, so I'd appreciate your suggestions. :p

nicholas-leonard commented 8 years ago

@vyouman Yeah, I think 2 is best. As for your question about how to initialize the initial state of the second LSTM layer (via userPrevOutput and userPrevCell), see #176.

For your last question: I don't think you need to modify RecurrentAttention. You should be able to replace the input layer of the Recurrent module (i.e. glimpse) with something that takes {word, image, location}. You can use special modules like NarrowTable, SelectTable and so on to hack your way to a valid glimpse. Another option is to use nngraph's nn.gModule to build your glimpse module; nngraph makes building these kinds of multi-input graphs easier.
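For example, a minimal nngraph sketch of a three-input glimpse could look like this (all sizes and names are placeholders, and I'm using a pre-extracted patch instead of the SpatialGlimpse output just to keep it short):

require 'nn'
require 'nngraph'

local wordSize, patchSize, hiddenSize = 100, 64, 256

local word  = nn.Identity()()   -- word vector for the current time step
local patch = nn.Identity()()   -- flattened glimpse patch
local loc   = nn.Identity()()   -- (x, y) location from the locator

local joined = nn.JoinTable(1, 1)({
   nn.Linear(wordSize, hiddenSize)(word),
   nn.Linear(patchSize, hiddenSize)(patch),
   nn.Linear(2, hiddenSize)(loc)
})
local out = nn.ReLU(true)(nn.Linear(3 * hiddenSize, hiddenSize)(joined))

local glimpse = nn.gModule({word, patch, loc}, {out})
-- glimpse:forward{wordVec, patchVec, locVec} gives a batchSize x hiddenSize output

You would then plug something like this into nn.Recurrent in place of the original two-input glimpse.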

You might be new to rnn, but you seem to know what you are doing :)

vyouman commented 8 years ago

@nicholas-leonard Thanks for your advice. I've decided to modify the RecurrentAttention module so that it takes an (image, caption) pair as input, to handle captions of variable length. I tried wrapping the variable-length logic outside of it, but that isn't as straightforward as encapsulating the stopping logic inside a RecurrentAttention module. I'll unit test the modified RecurrentAttention tomorrow to see if it works. :)

nicholas-leonard commented 8 years ago

@vyouman Glad you are moving forward with this :)

nicholas-leonard commented 8 years ago

@vyouman Any news on this?