vyouman opened this issue 8 years ago
@vyouman Yeah, I think 2 is best. But then I have to initialize the initial state of the second LSTM layer. How can I initialize it? I read through the LSTM code, and I think I should set userPrevOutput and userPrevCell, right? #176
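For what it's worth, here is a minimal sketch of what seeding the LSTM state via userPrevOutput/userPrevCell might look like (this mirrors how the rnn examples couple LSTMs). The sizes and the stand-in context tensor are placeholders, not part of the actual model:

```lua
-- Minimal sketch (not from this thread): seeding an LSTM's initial state.
-- batchSize/hiddenSize and the stand-in `context` tensor are placeholders;
-- in the real model, `context` would come from the context network.
require 'rnn'

local batchSize, hiddenSize = 4, 256
local lstm2 = nn.LSTM(hiddenSize, hiddenSize)        -- second recurrent layer

local context = torch.randn(batchSize, hiddenSize)   -- stand-in context features

-- seed hidden and cell state before the first forward step
lstm2.userPrevOutput = nn.rnn.recursiveCopy(lstm2.userPrevOutput, context)
lstm2.userPrevCell   = nn.rnn.recursiveCopy(lstm2.userPrevCell, context:clone():zero())

local h1 = lstm2:forward(torch.randn(batchSize, hiddenSize))  -- first time step

-- after backward, lstm2.userGradPrevOutput / lstm2.userGradPrevCell hold the
-- gradients w.r.t. the seeded state, which can be pushed back into the context network
```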
As for your last question: I don't think you need to modify RecurrentAttention. You should be able to replace the input layer to the Recurrent module (i.e. the glimpse) with something that takes an input like {word, image, location}. You can use modules like NarrowTable, SelectTable and such to hack your way to a valid glimpse. Another option is to use nngraph's nn.gModule to build your glimpse module; nngraph makes building these kinds of multi-input graphs easier. A rough sketch of that second option follows.
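To make the nngraph option concrete, here is a rough illustration of a three-input glimpse; every name and size below is an assumption, and a real glimpse would presumably still include a patch-extraction stage such as SpatialGlimpse:

```lua
-- Rough sketch of a {word, patch, location} glimpse built with nngraph.
-- Sizes (wordSize, patchSize, hiddenSize) are placeholders, not from the thread.
require 'nn'
require 'nngraph'

local wordSize, patchSize, hiddenSize = 100, 3*8*8, 256

local word     = nn.Identity()()   -- word embedding for the current time step
local patch    = nn.Identity()()   -- flattened glimpse patch
local location = nn.Identity()()   -- (x, y) location

local wordFeat  = nn.Linear(wordSize, hiddenSize)(word)
local patchFeat = nn.Linear(patchSize, hiddenSize)(patch)
local locFeat   = nn.Linear(2, hiddenSize)(location)

-- merge the three streams into a single glimpse feature
local g = nn.ReLU()(nn.CAddTable()({wordFeat, patchFeat, locFeat}))

local glimpse = nn.gModule({word, patch, location}, {g})

-- usage:
local out = glimpse:forward{torch.randn(wordSize), torch.randn(patchSize), torch.randn(2)}
```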
You might be new to rnn, but you seem to know what you are doing :)
@nicholas-leonard Thanks for your advice. I've now decided to modify the RecurrentAttention module so that it takes an (image, caption) pair as input, in order to handle captions of variable length. I tried to wrap the variable-length logic elsewhere, but it isn't as straightforward as encapsulating the stopping logic inside a RecurrentAttention module. I'll unit test the modified RecurrentAttention tomorrow to see if it works. :)
@vyouman Glad you are moving forward with this :)
@vyouman Any news on this?
Hi, I'm trying to implement the Deep Recurrent Attention Model described in the paper http://arxiv.org/pdf/1412.7755v2.pdf, but applied to image caption generation instead of image classification. I will probably be able to reuse most of the modules from the RAM model implemented in the rnn package. In this case, I don't need to modify the Reinforce.lua interface or ReinforceNormal.lua, since they can now deal with a table of rewards at every time step per batch. All I need to do there is write a new Criterion, and I've written one. I also think I need to modify the RecurrentAttention module.
There is a context network, a 3-layer convolutional network (or some other CNN) described in the paper, which extracts features from a low-resolution version of the image; these features are fed to the second recurrent layer as its initial state, from which the first location is produced. I came up with two approaches:
For the second question: I'll need to feed an additional input to the first recurrent layer; in my case it's the word vector at every time step. In the end, instead of predicting the classes of multiple objects, I expect to predict a caption describing the image. There's also some logic needed to handle captions of variable length, so I'll probably have to encapsulate an LSTM layer for that; we can think of it as a language model without the final LogSoftmax layer, call it lm for now. I would then wrap it, together with the glimpse network, in a Recursor to form the rnn, like https://github.com/Element-Research/rnn/blob/master/examples/recurrent-visual-attention.lua#L130-L131, and at last wrap the rnn and the locator as in https://github.com/Element-Research/rnn/blob/master/examples/recurrent-visual-attention.lua#L145 (a rough sketch of that standard wiring follows after this paragraph). So here comes the question: how can I pass in an additional input like the word vector, which is not part of the direct input the RecurrentAttention module expects to give to the rnn? The rnn here is composed of the glimpse network and a recurrent layer, where the glimpse network expects an input of {image, location}, and in this case the recurrent layer expects not only the g_t vector produced by the glimpse network but also the word vector. Should I modify the RecurrentAttention module to accept more inputs? Or could the word vector instead flow directly from the lm module I'm going to implement, without changing the input of the RecurrentAttention module? Do you think that's feasible, or do you have a more elegant way to implement it?
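For reference, here is roughly what the standard wiring from the linked example looks like, as a simplified sketch; the sizes and the stand-in glimpse are placeholders, and it only handles the usual {image, location} input, so routing the per-step word vector in is exactly the part that would need either a modified RecurrentAttention or extra plumbing:

```lua
-- Simplified sketch in the spirit of examples/recurrent-visual-attention.lua;
-- sizes and the stand-in glimpse are placeholders, and the word vector is not handled here.
require 'rnn'

local hiddenSize, nSteps, locatorStd = 256, 6, 0.11

-- stand-in glimpse: maps {flattened patch, location} to a hiddenSize feature
local glimpse = nn.Sequential()
   :add(nn.ParallelTable()
      :add(nn.Linear(3*8*8, hiddenSize))   -- flattened glimpse patch (assumed size)
      :add(nn.Linear(2, hiddenSize)))      -- (x, y) location
   :add(nn.CAddTable())
   :add(nn.ReLU())

-- recurrent layer: merges the glimpse feature with the previous hidden state
local rnn = nn.Recurrent(hiddenSize, glimpse,
   nn.Linear(hiddenSize, hiddenSize), nn.ReLU(), 99999)

-- locator: emits an (x, y) location, trained with REINFORCE
local locator = nn.Sequential()
   :add(nn.Linear(hiddenSize, 2))
   :add(nn.HardTanh())                    -- bound the mean
   :add(nn.ReinforceNormal(2 * locatorStd))
   :add(nn.HardTanh())                    -- bound the sampled location

local attention = nn.RecurrentAttention(rnn, locator, nSteps, {hiddenSize})
-- attention:forward(input) feeds {input, location} to the rnn at each step
-- and returns a table of nSteps hidden states, one per glimpse
```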
I'm new to the rnn package and Torch7, so I'd appreciate your suggestions. :p