facebookresearch / AVT

Code release for ICCV 2021 paper "Anticipative Video Transformer"

Couple questions about classification loss #27

Closed · zerodecoder1 closed this 2 years ago

zerodecoder1 commented 2 years ago

Hi @rohitgirdhar,

Thanks for your great work -- I found it very interesting and plan to use it in my own work! I was hoping to clear up exactly how the loss functions work with the feature decoding, since I was a little confused:

The decoder outputs features at each timestep from 1..t (in a causal manner), which are passed through a linear layer to obtain predicted frame features. Another linear layer on top of this then predicts a distribution over action classes, so we have t action predictions. Do the predictions for timestep 1 use the action label from timestep 2? From my understanding, the prediction for timestep t represents the action at timestep t+1 (the next action we want to anticipate). Based on the implementation, I was wondering whether the classification loss is also computed against the next frame's labels, with the first frame's label left unused? Sorry if this is confusing, hope you can help clear my understanding!
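To make sure I'm reading this right, here is a minimal sketch of the structure I have in mind (module names are mine, not the actual AVT code):

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the head structure as I understand it, not the
# actual AVT modules: a causal decoder yields one feature per timestep,
# a linear layer maps it to a predicted next-frame feature, and another
# linear layer maps that to action logits.
class AnticipationHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.future_feat = nn.Linear(feat_dim, feat_dim)    # predicted frame feature
        self.classifier = nn.Linear(feat_dim, num_classes)  # action distribution

    def forward(self, decoder_feats: torch.Tensor):
        # decoder_feats: (batch, t, feat_dim), causal over timesteps 1..t
        pred_feats = self.future_feat(decoder_feats)  # feature for step i+1
        logits = self.classifier(pred_feats)          # t action predictions
        return pred_feats, logits
```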

rohitgirdhar commented 2 years ago

Hi @zerodecoder1 Thanks for your interest and kind words! Yes, the model at time "t" tries to predict the action at "t+1".

The feature regression loss is set up here: https://github.com/facebookresearch/AVT/blob/2d6781d5315a4c53bd059b1cd11ee46bd4427648/models/future_prediction.py#L212-L214

Regarding the classification loss, the model returns the past features here, which contain the initial features from the model and the predicted future: https://github.com/facebookresearch/AVT/blob/2d6781d5315a4c53bd059b1cd11ee46bd4427648/models/future_prediction.py#L249-L250

These are marked as "past" and passed through a classifier: https://github.com/facebookresearch/AVT/blob/2d6781d5315a4c53bd059b1cd11ee46bd4427648/models/base_model.py#L201

I then incur the loss with the true labels for these frames.

So I don't use the actual future intermediate predictions to predict class labels; however, since I incur a feature regression loss on the predicted future and classify the actual intermediate features, it effectively forces the predicted future features to also be classifiable into the intermediate action classes.
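Schematically, the wiring is something like the sketch below (variable names are illustrative and MSE stands in for the configured regression loss; this is not the repo's actual code):

```python
import torch.nn.functional as F

# Illustrative sketch of the two losses described above.
# feats:      (B, T, D) actual ("past") features from the backbone/decoder
# pred_feats: (B, T, D) predicted next-step features
# labels:     (B, T)    per-timestep action labels
def combined_losses(feats, pred_feats, labels, classifier):
    # Feature regression: the prediction at step t should match the actual
    # feature at step t+1, so drop the last prediction and the first target
    # (detaching the target is one common choice).
    reg_loss = F.mse_loss(pred_feats[:, :-1], feats[:, 1:].detach())
    # Classification: the *actual* ("past") features, not the predicted
    # ones, are decoded into logits and matched against their own labels.
    logits = classifier(feats)  # (B, T, num_classes)
    cls_loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    return reg_loss + cls_loss
```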

zerodecoder1 commented 2 years ago

Thanks so much for the detailed response @rohitgirdhar. This clears it up -- just to confirm my understanding, the loss is taken on the features output by AVT-B (which are also 'past')?

Also, I had another quick question regarding some experiments in your paper. Are all models with AVT-B trained end-to-end (e.g., in Table 4)? Do these models also use the loss function with the additional terms (feature regression loss/recognition loss)? From your code, I'm guessing the other results with backbones such as TSN/irCSN are not trained end-to-end?

Thanks so much!

rohitgirdhar commented 2 years ago

Yes, that is correct. The "past" features are used to predict the "past" action classes. The future ones could also have been used to predict the future action classes (the past action classes right-shifted by 1 for correspondence), though I don't explore that in this work; see the sketch below.
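If one did want to try that, the pairing would look something like this (illustrative only, not code from this repo):

```python
import torch.nn.functional as F

# pred_feats: (B, T, D) predicted future features; labels: (B, T) past
# action labels. The prediction at step t corresponds to the label at
# step t+1, i.e. the labels shifted by one position.
future_logits = classifier(pred_feats[:, :-1])  # predictions for steps 2..T
future_labels = labels[:, 1:]                   # ground truth at steps 2..T
future_cls_loss = F.cross_entropy(
    future_logits.flatten(0, 1), future_labels.flatten())
```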

Yes, correct; AVT-B is trained end-to-end with all the losses (except in ablations where I evaluate the effect of individual losses). The TSN/irCSN backbones are fixed and only the head is trained (similar to prior work like RULSTM etc.).

zerodecoder1 commented 2 years ago

Got it!

Are the AVT-H models trained with TSN/irCSN also trained with the other losses, and are all models trained with a 10-second past horizon? Thanks!

rohitgirdhar commented 2 years ago

Yes, correct.

zerodecoder1 commented 2 years ago

Got it -- taking the recognition loss for TSN as an example: if it is computed on the past features provided by the TSN backbone, and TSN is not trained end-to-end, does this loss have any effect on the model weights?

rohitgirdhar commented 2 years ago

Right, it doesn't change the backbone, but the classifier is applied on the backbone features to get the distribution over the classes, and those classifier weights will get updated by that loss.
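Concretely, a frozen-backbone setup behaves like this minimal sketch (placeholder modules, not the actual training code):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a frozen feature extractor (e.g. precomputed
# TSN features) and a trainable classifier on top of it.
backbone = nn.Linear(2048, 2048)   # placeholder for TSN/irCSN
classifier = nn.Linear(2048, 100)  # placeholder action classifier

for p in backbone.parameters():
    p.requires_grad = False  # backbone weights receive no gradient

# Only parameters that still require grad (the classifier) get updated,
# so the recognition loss shapes the classifier but not the backbone.
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.001)
```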

zerodecoder1 commented 2 years ago

Would this update the recognition classifier weights? Is this classifier different from the anticipation classification head, or is it the same classifier head (MLP) that is used for both recognition and anticipation? Thanks!

rohitgirdhar commented 2 years ago

It is the same classification layer that decodes any feature (past or future) into classification logits: https://github.com/facebookresearch/AVT/blob/2d6781d5315a4c53bd059b1cd11ee46bd4427648/models/base_model.py#L206
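For illustration (dimensions and names are made up, not from the repo), the sharing amounts to applying one layer to both kinds of features:

```python
import torch
import torch.nn as nn

classifier = nn.Linear(512, 97)       # one shared decoding layer
past_feats = torch.randn(2, 10, 512)  # observed-step features
future_feat = torch.randn(2, 512)     # predicted next-step feature

past_logits = classifier(past_feats)   # recognition over observed steps
future_logits = classifier(future_feat)  # anticipation of the next action
```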