You can always use MaskZero and MaskZeroCriterion, but since your sequence lengths vary so greatly, it will probably be much faster if you just group the videos by number of frames and feed in batches of uniform length (or use TrimZero, though that adds a bit of computational and cognitive overhead).
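For reference, a minimal sketch of the masking route (assuming per-step feature vectors and the standard rnn-package pattern; inputSize and hiddenSize are placeholders):
require 'rnn'

-- zero-padded steps neither update the hidden state nor contribute gradients
local lstm = nn.MaskZero(nn.FastLSTM(inputSize, hiddenSize), 1)  -- 1 = dims per step
local rnn = nn.Sequencer(lstm)

-- mask the loss at the padded steps as well
local criterion = nn.SequencerCriterion(nn.MaskZeroCriterion(nn.ClassNLLCriterion(), 1))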
@nhynes Thanks very much for replying. I think I still need to go with MaskZero, since almost none of the videos have the same number of frames.
I've read some posts about variable-length sequence training, and I think I understand most of it except the structure of the batch input.
For example, say I have 5 videos of size 32x32: V1, V2, V3, V4, V5, with 3, 4, 5, 6, and 7 frames respectively.
Should the table be arranged like this:
{ [V1_1, V1_2, V1_3], [V2_1, V2_2, V2_3, V2_4], [V3_1, V3_2, V3_3, V3_4, V3_5], [V4_1, V4_2, V4_3, V4_4, V4_5, V4_6], [V5_1, V5_2, V5_3, V5_4, V5_5, V5_6, V5_7] }
or should it be arranged like this:
{ [V1_1, V2_1, V3_1, V4_1, V5_1], [V1_2, V2_2, V3_2, V4_2, V5_2], [V1_3, V2_3, V3_3, V4_3, V5_3], [V2_4, V3_4, V4_4, V5_4], [V3_5, V4_5, V5_5], [V4_6, V5_6], [V5_7], }
The latter, in which each table entry is a time step. Don't forget the zero padding, though!
{
[V1_1, V2_1, V3_1, V4_1, V5_1],
[V1_2, V2_2, V3_2, V4_2, V5_2],
[V1_3, V2_3, V3_3, V4_3, V5_3],
[0, V2_4, V3_4, V4_4, V5_4],
[0, 0, V3_5, V4_5, V5_5],
[0, 0, 0, V4_6, V5_6],
[0, 0, 0, 0, V5_7],
}
(assuming that [...] is a tensor).
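A minimal sketch of building that padded input, assuming each video i is already a (seqlen_i x 32 x 32) tensor stored in a table called videos (the names are made up):
local batchSize = #videos
local maxSeqlen = 0
for i = 1, batchSize do
  maxSeqlen = math.max(maxSeqlen, videos[i]:size(1))
end

-- (maxSeqlen x batchSize x 32 x 32); steps past a video's length stay zero
local batch = torch.zeros(maxSeqlen, batchSize, 32, 32)
for i = 1, batchSize do
  batch:select(2, i):sub(1, videos[i]:size(1)):copy(videos[i])
end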
It might help to check out the diagrams for the Sequencer input/output format.
You might also want to try to mash your videos into a contiguous tensor and take advantage of the speedup offered by SeqLSTM. For even more speed, you could also try using the cuDNN RNNs without bothering to mask the outputs as long as you mask the downstream components like the criterion.
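Roughly, the SeqLSTM route would look like this, given a padded (maxSeqlen x batchSize x inputSize) tensor as above (inputSize and hiddenSize are placeholders):
-- consumes the whole (seqlen x batchsize x inputsize) tensor in one call
local seqlstm = nn.SeqLSTM(inputSize, hiddenSize)
seqlstm.maskzero = true   -- treat all-zero rows of the input as padding

-- or, for the cuDNN variant, skip masking inside the RNN entirely and only
-- mask downstream, e.g. with nn.MaskZeroCriterion around your criterion:
-- local seqlstm = cudnn.LSTM(inputSize, hiddenSize, 1)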
@nhynes Thanks again! This post is very helpful! But instead of putting the videos into the LSTM immediately, I first want to extract spatial information using a CNN. I wonder what the best way to do that is: feed a batch of inputs into an nn.Sequential():add(CNN) followed by a Sequencer, or just put the CNN and LSTM together inside a Sequencer?
I don't understand the second part. I actually don't quite understand the zero mask. I only know that the zero mask lets the sequencer reset its state when it detects a zero tensor in a sequence. It would be great if you could refer me to some post.
"extract spatial information using a CNN"
Joint training might be a bit tricky, considering that your sequence lengths vary so greatly. You'll waste a huge amount of computation forwarding padding through a CNN. Thus, I'd recommend "off-roading" a bit and doing something like the following (akin to a sequencer of sequencers):
local seqCNN = {nn.Sequencer(cnn)}
for i = 2, batchSize do
  -- one Sequencer-wrapped copy of the CNN per video in the batch
  seqCNN[i] = seqCNN[1]:clone()
end

local embs = torch.zeros(maxSeqlen, batchSize, cnn.outputDim)
for i = 1, batchSize do
  -- batchFrames is (maxSeqlen x batchSize x nChannels x width x height)
  -- batchSeqlens is a vector of the actual sequence lengths
  local frames = batchFrames:select(2, i):sub(1, batchSeqlens[i])
  embs:select(2, i):sub(1, batchSeqlens[i]):copy(seqCNN[i]:forward(frames))
end
rnn:forward(embs)
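The backward pass would then mirror this, slicing the gradient from the RNN back out per video (a sketch only, using the same assumed names plus a gradOutput coming from your criterion):
local gradEmbs = rnn:backward(embs, gradOutput)
for i = 1, batchSize do
  local frames = batchFrames:select(2, i):sub(1, batchSeqlens[i])
  seqCNN[i]:backward(frames, gradEmbs:select(2, i):sub(1, batchSeqlens[i]))
end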
Unfortunately, each Sequencer must save its outputs to compute the gradient. I strongly suspect that so many copies of your CNN will not fit on a GPU (or two or three).
Perhaps as a quick proof-of-concept, you might try pre-extracting your embeddings and feeding those in directly.
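Pre-extraction could be as simple as something like this (sketch only; cnn, videos, and the file name are placeholders):
-- run every video through the CNN once, offline, and cache the embeddings
local allEmbs = {}
for i = 1, #videos do
  -- videos[i] is (seqlen_i x nChannels x width x height); the CNN just sees
  -- the frames as an ordinary batch
  allEmbs[i] = cnn:forward(videos[i]):clone()
end
torch.save('embeddings.t7', allEmbs)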
Just to make sure, you've already determined that the "more standard" approach of using volumetric convolutions isn't appropriate for your task, right? One benefit of using a convolutional architecture is that it can preserve low-level information that a CNN embedding might discard; besides, with enough depth, the receptive field will cover most of the sequence anyway. Another, probably more practical benefit, is that your model is more likely to fit on fewer than four GPUs.
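For comparison, a bare-bones volumetric front end might start like this (the channel counts and kernel sizes are arbitrary; input is (nChannels x nFrames x height x width)):
local net = nn.Sequential()
-- 3D convolution over (channels, time, height, width); the kernel spans 3 frames
net:add(nn.VolumetricConvolution(3, 16, 3, 3, 3))   -- nIn, nOut, kT, kW, kH
net:add(nn.ReLU())
net:add(nn.VolumetricMaxPooling(2, 2, 2))           -- pool over time and space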
I'm doing video classification with videos of different lengths, varying from 20 to 500 frames. I can feed a single video with any number of frames through the network without any problem, since my underlying network structure does not depend on the number of frames. But how do I put the videos into a batch for training? If I convert them to tensors of uniform length and feed them to an RNN, I will have to shrink the 500-frame videos down to 20 frames, which loses a lot of information. Is there any way I can train the network on these videos without making their lengths uniform? Thanks very much in advance!!