I have tried to modify the code for arbitrary-length videos but failed, because the code has to build the ground-truth confidence map, which has size B×T×T, and a sample mask for generating the output confidence map, which is even bigger (T×(N·T·T)); this quickly blows up memory and destroys the training process.
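For intuition, here is a rough back-of-the-envelope check (my own, not from the repo) of how that T×(N·T·T) mask grows, assuming float32 storage and N = 32 sample points per anchor:

```python
# Rough memory estimate for the sample mask of shape T x (N*T*T),
# assuming float32 (4 bytes) and N = 32 sample points per anchor.
N = 32
for T in (100, 200, 400):
    elems = T * (N * T * T)   # = N * T**3, so memory grows cubically in T
    print(f"T={T}: {elems * 4 / 1e9:.2f} GB")
# -> T=100: 0.13 GB, T=200: 1.02 GB, T=400: 8.19 GB
```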
You are right. One solution is to rescale the feature's temporal dimension to 100 for inference, but I don't think it is the best way.
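For concreteness, that rescaling is just 1D interpolation along the temporal axis; a minimal sketch (assuming linear interpolation, though the exact resampling used to build the features may differ):

```python
import torch
import torch.nn.functional as F

# Squeeze a (C, T_i) feature sequence to a fixed temporal length of 100
# by linear interpolation along the temporal axis.
def rescale_feature(feat, target_len=100):
    feat = feat.unsqueeze(0)  # (C, T_i) -> (1, C, T_i) for F.interpolate
    feat = F.interpolate(feat, size=target_len, mode="linear",
                         align_corners=False)
    return feat.squeeze(0)    # (C, target_len)
```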
It's not convenient to train with an unfixed feature temporal dimension, because the temporal dimension of all videos in one batch should be the same. You can try setting the batch size to 1.
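A minimal sketch of what that looks like, using a toy stand-in for the real feature dataset (ToyDataset and the channel count here are hypothetical):

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Stand-in dataset returning (C, T_i) features with T_i varying per video.
class ToyDataset(Dataset):
    def __init__(self, lengths, channels=400):
        self.lengths, self.channels = lengths, channels
    def __len__(self):
        return len(self.lengths)
    def __getitem__(self, i):
        return torch.randn(self.channels, self.lengths[i])

loader = DataLoader(ToyDataset([80, 120, 250]), batch_size=1, shuffle=True)
for feats in loader:
    # feats: (1, C, T_i); T_i differs per video, which only works
    # because each batch contains a single video.
    pass
```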
I see. It is indeed an issue when training with an unfixed length. But can the length be unfixed for inference? It seems still inconvenient, since the mask is generated in BMN's `__init__`; if the mask were generated in `forward()`, it would cost a lot of computation time.
Yes, you are right. It's possible to use an unfixed length, but it will be inconvenient: for every length, you need to generate a new mask.
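One way to soften that cost is to cache one mask per length, so each mask is built only once across the whole inference run. A minimal sketch, where `build_sample_mask` is a hypothetical stand-in for the repo's mask construction routine:

```python
# Cache one sample mask per temporal length so variable-length inference
# builds each mask at most once. build_sample_mask(tscale) is a
# hypothetical stand-in for the repo's mask construction code.
_mask_cache = {}

def get_sample_mask(tscale, device):
    if tscale not in _mask_cache:
        _mask_cache[tscale] = build_sample_mask(tscale).to(device)
    return _mask_cache[tscale]
```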
After reading through the sample_mask generation code, I think it is an inefficient approach: we only care about the 32 selectively chosen points in each anchor, yet we have to generate a whole weighting mask over every feature along the video's temporal dimension. There should be a better way using PyTorch's select/gather functions, as in the sketch below; I hope there will be an implementation of this part that makes it more efficient.
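To illustrate, a minimal sketch (my own code, not from this repo) of sampling N points per proposal by index arithmetic instead of the dense mask, assuming features of shape (B, C, T) and fractional start/end positions for one proposal:

```python
import torch

# Sample n_sample points inside [start, end] by linear interpolation,
# using gather-style indexing instead of a dense T x (N*T*T) mask.
# feats: (B, C, T); start/end: fractional positions in [0, T-1].
def sample_proposal(feats, start, end, n_sample=32):
    B, C, T = feats.shape
    # n_sample fractional positions evenly spaced inside the proposal
    pos = torch.linspace(0, 1, n_sample, device=feats.device)
    pos = start + pos * (end - start)          # (n_sample,)
    lo = pos.floor().long().clamp(0, T - 1)    # left neighbor index
    hi = (lo + 1).clamp(0, T - 1)              # right neighbor index
    w = pos - lo.float()                       # interpolation weight
    # Blend the two neighbors: the same linear-interpolation weighting
    # the dense mask encodes, but computed only where it is needed.
    f_lo = feats[..., lo]                      # (B, C, n_sample)
    f_hi = feats[..., hi]
    return f_lo * (1 - w) + f_hi * w
```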
Currently I have dropped the PEM part of the code to test on my own problem using only the start and end scores, and it saves me a lot of time, memory, and effort in editing the code. According to the paper, dropping the PEM part doesn't decrease the result too much, so I hope it will work fine for me.
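For reference, this is roughly what I mean by using only the start and end scores; a minimal sketch (names are my own), pairing candidate boundaries and scoring each pair by the product of its boundary probabilities, similar to BSN-style proposal generation:

```python
# Proposal generation from start/end scores only (no PEM): pair every
# candidate start with every later candidate end, score the pair by the
# product of its boundary probabilities.
def proposals_from_boundaries(start_scores, end_scores, thresh=0.5):
    T = len(start_scores)
    starts = [t for t in range(T) if start_scores[t] > thresh]
    ends = [t for t in range(T) if end_scores[t] > thresh]
    props = [(s, e, start_scores[s] * end_scores[e])
             for s in starts for e in ends if e > s]
    # highest-scoring first; NMS / soft-NMS would normally follow
    return sorted(props, key=lambda p: -p[2])
```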
Actually, without PEM the recall will drop a lot. As for the mask: we indeed only need 32 points, but if we generate a whole weighting mask once, we can reuse it rather than generating it on the fly.
Sorry, is that a result you got from your own experiments? Could you tell me more about it? From Table 4 of the original paper, the result drops only about 2% on AR@100 on the validation set.
Is each video sampled at a different interval? How does the author turn unequal-length videos into equal length, i.e., rescale the feature length of all videos to the same length of 100?
> Sorry, is that a result you got from your own experiments? Could you tell me more about it?
In my experiments, using only TEM gets AR@100 = 72.29, and using only PEM gets AR@100 = 75.08. It seems PEM is more important than TEM.
Since training uses a fixed feature temporal length (~100), can the temporal dimension be unfixed at inference time, and if so, how? Thanks a lot.