You can always use MaskZero and MaskZeroCriterion, but since your sequence lengths vary so greatly, it will probably be much faster if you just group the videos by number of frames and feed in batches of uniform length (or use TrimZero, though that adds a bit of computational and cognitive overhead).
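For reference, a minimal sketch of the masking route (assuming per-step feature vectors and the standard rnn-package pattern; inputSize and hiddenSize are placeholders):
require 'rnn'

-- zero-padded steps neither update the hidden state nor contribute gradients
local lstm = nn.MaskZero(nn.FastLSTM(inputSize, hiddenSize), 1)  -- 1 = dims per step
local rnn = nn.Sequencer(lstm)

-- mask the loss at the padded steps as well
local criterion = nn.SequencerCriterion(nn.MaskZeroCriterion(nn.ClassNLLCriterion(), 1))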
@nhynes Thanks very much for replying. I think I still need to go with MaskZero, since almost none of the videos have the same number of frames.
I've read some posts about variable-length sequence training, and I think I understand most of it except the structure of the batch input.
For example, say I have 5 videos of size 32x32: V1, V2, V3, V4, V5, with 3, 4, 5, 6, and 7 frames respectively.
Should the table be arranged like this:
{ [V1_1, V1_2, V1_3], [V2_1, V2_2, V2_3, V2_4], [V3_1, V3_2, V3_3, V3_4, V3_5], [V4_1, V4_2, V4_3, V4_4, V4_5, V4_6], [V5_1, V5_2, V5_3, V5_4, V5_5, V5_6, V5_7] }
or should it be arranged like this:
{ [V1_1, V2_1, V3_1, V4_1, V5_1], [V1_2, V2_2, V3_2, V4_2, V5_2], [V1_3, V2_3, V3_3, V4_3, V5_3], [V2_4, V3_4, V4_4, V5_4], [V3_5, V4_5, V5_5], [V4_6, V5_6], [V5_7], }
The latter, in which each table entry is a time step. Don't forget the zero padding, though!
{
[V1_1, V2_1, V3_1, V4_1, V5_1],
[V1_2, V2_2, V3_2, V4_2, V5_2],
[V1_3, V2_3, V3_3, V4_3, V5_3],
[0, V2_4, V3_4, V4_4, V5_4],
[0, 0, V3_5, V4_5, V5_5],
[0, 0, 0, V4_6, V5_6],
[0, 0, 0, 0, V5_7],
}
(assuming that [...] is a tensor).
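A minimal sketch of building that padded input, assuming each video i is already a (seqlen_i x 32 x 32) tensor stored in a table called videos (the names are made up):
local batchSize = #videos
local maxSeqlen = 0
for i = 1, batchSize do
  maxSeqlen = math.max(maxSeqlen, videos[i]:size(1))
end

-- (maxSeqlen x batchSize x 32 x 32); steps past a video's length stay zero
local batch = torch.zeros(maxSeqlen, batchSize, 32, 32)
for i = 1, batchSize do
  batch:select(2, i):sub(1, videos[i]:size(1)):copy(videos[i])
end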
It might help to check out the diagrams for the Sequencer input/output format.
You might also want to try to mash your videos into a contiguous tensor and take advantage of the speedup offered by SeqLSTM. For even more speed, you could also try using the cuDNN RNNs without bothering to mask the outputs as long as you mask the downstream components like the criterion.
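Roughly, the SeqLSTM route would look like this, given a padded (maxSeqlen x batchSize x inputSize) tensor as above (inputSize and hiddenSize are placeholders):
-- consumes the whole (seqlen x batchsize x inputsize) tensor in one call
local seqlstm = nn.SeqLSTM(inputSize, hiddenSize)
seqlstm.maskzero = true   -- treat all-zero rows of the input as padding

-- or, for the cuDNN variant, skip masking inside the RNN entirely and only
-- mask downstream, e.g. with nn.MaskZeroCriterion around your criterion:
-- local seqlstm = cudnn.LSTM(inputSize, hiddenSize, 1)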
@nhynes Thanks again! This post is very helpful! But instead of putting the videos into the LSTM immediately, I first want to extract spatial information using a CNN. I wonder what the best way to do that is: feed a batch of inputs into an nn.Sequential():add(CNN) followed by a Sequencer, or just put the CNN and LSTM together inside a Sequencer?
I don't understand the second part. I actually don't quite understand the zero mask. I only know that the zero mask lets the sequencer reset its state when it detects a zero tensor in a sequence. It would be great if you could refer me to some post.
"extract spatial information using a CNN"
Joint training might be a bit tricky, considering that your sequence lengths vary so greatly. You'll waste a huge amount of computation forwarding padding through a CNN. Thus, I'd recommend "off-roading" a bit and doing something like the following (akin to a sequencer of sequencers):
local seqCNN = {nn.Sequencer(cnn)}
for i = 2, batchSize do
  -- one Sequencer-wrapped copy of the CNN per video in the batch
  seqCNN[i] = seqCNN[1]:clone()
end

local embs = torch.zeros(maxSeqlen, batchSize, cnn.outputDim)
for i = 1, batchSize do
  -- batchFrames is (maxSeqlen x batchSize x nChannels x width x height)
  -- batchSeqlens is a vector of the actual sequence lengths
  local frames = batchFrames:select(2, i):sub(1, batchSeqlens[i])
  embs:select(2, i):sub(1, batchSeqlens[i]):copy(seqCNN[i]:forward(frames))
end
rnn:forward(embs)
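The backward pass would then mirror this, slicing the gradient from the RNN back out per video (a sketch only, using the same assumed names plus a gradOutput coming from your criterion):
local gradEmbs = rnn:backward(embs, gradOutput)
for i = 1, batchSize do
  local frames = batchFrames:select(2, i):sub(1, batchSeqlens[i])
  seqCNN[i]:backward(frames, gradEmbs:select(2, i):sub(1, batchSeqlens[i]))
end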
Unfortunately, each Sequencer must save its outputs to compute the gradient. I strongly suspect that so many copies of your CNN will not fit on a GPU (or two or three).
Perhaps as a quick proof-of-concept, you might try pre-extracting your embeddings and feeding those in directly.
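Pre-extraction could be as simple as something like this (sketch only; cnn, videos, and the file name are placeholders):
-- run every video through the CNN once, offline, and cache the embeddings
local allEmbs = {}
for i = 1, #videos do
  -- videos[i] is (seqlen_i x nChannels x width x height); the CNN just sees
  -- the frames as an ordinary batch
  allEmbs[i] = cnn:forward(videos[i]):clone()
end
torch.save('embeddings.t7', allEmbs)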
Just to make sure, you've already determined that the "more standard" approach of using volumetric convolutions isn't appropriate for your task, right? One benefit of using a convolutional architecture is that it can preserve low-level information that a CNN embedding might discard; besides, with enough depth, the receptive field will cover most of the sequence anyway. Another, probably more practical benefit, is that your model is more likely to fit on fewer than four GPUs.
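For comparison, a bare-bones volumetric front end might start like this (the channel counts and kernel sizes are arbitrary; input is (nChannels x nFrames x height x width)):
local net = nn.Sequential()
-- 3D convolution over (channels, time, height, width); the kernel spans 3 frames
net:add(nn.VolumetricConvolution(3, 16, 3, 3, 3))   -- nIn, nOut, kT, kW, kH
net:add(nn.ReLU())
net:add(nn.VolumetricMaxPooling(2, 2, 2))           -- pool over time and space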
I'm doing video classification with videos of different lengths, varying from 20 to 500 frames. I can feed a single video with any number of frames through the network without any problem, since my underlying network structure does not depend on the number of frames. But how do I put the videos into a batch for training? If I convert them to tensors of uniform length and feed them to an RNN, I will have to shrink the 500-frame videos down to 20 frames, which loses a lot of information. Is there any way I can train the network on these videos without making their lengths uniform? Thanks very much in advance!!