facebookresearch / hiera

Hiera: A fast, powerful, and simple hierarchical vision transformer.
Apache License 2.0

hiera_mae returns the feature map from the last layer. #31

Closed · tapohongchen closed this issue 4 months ago

tapohongchen commented 4 months ago

I want to use hiera_mae to extract features, but the output x from the last layer of hiera_mae has shape 2x1568x512. If I reshape it into 14x14 patches, the dimensions become 2x8x14x14x512 (I am using a frame sampling strategy of 16 frames, and the input size is 2x16x224x224x3). I would like to ask if this is correct?

dbolya commented 4 months ago

Yes, this should be correct if you're using the MAE model directly. Note that the MAE version of the model doesn't have the pooling layer in the last stage (which would take the features from 8x14x14 -> 8x7x7). If you use the classification pretrained models, the correct dimension would be 2x8x7x7x512.
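For reference, a rough sketch of the token bookkeeping that makes 1568 tokens consistent with an 8x14x14 grid. The patch stride of (2, 4, 4) over (T, H, W) and the two 2x2 pooling stages before the last stage are assumptions on my part, not stated in this thread:

```python
# Rough bookkeeping: why 1568 tokens reshape to 8 x 14 x 14
# (assumes a patch stride of (2, 4, 4) over (T, H, W) and two 2x2 spatial
#  pooling stages before the final stage).
T, H, W = 16, 224, 224
t_tokens = T // 2                      # 16 / 2 = 8 temporal positions
s_tokens = (H // 4) // 2 // 2          # 224 / 4 = 56, pooled twice: 56 -> 28 -> 14
print(t_tokens, s_tokens, t_tokens * s_tokens * s_tokens)  # 8 14 1568
```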

Ah, I forgot to mention: the token order for the MAE model is not Height x Width. The tokens are unrolled at the start so that max-pooling and masking are faster. Instead, pass return_intermediates=True to the forward pass, and it will do all the unrolling and reshaping for you. It will return a list whose last element is a tensor of shape [2x8x14x14xC], as you wanted.
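In case a concrete call helps, here is a minimal sketch of that pattern; `model` stands in for a video Hiera model, and the (B, C, T, H, W) input layout plus the tuple return are assumptions on my part:

```python
import torch

x = torch.randn(2, 3, 16, 224, 224)   # dummy clip in (B, C, T, H, W) layout

# Assumed call pattern: return_intermediates=True also hands back the per-stage
# feature maps, already rerolled into (B, T, H, W, C) order.
out, intermediates = model(x, return_intermediates=True)

feat = intermediates[-1]               # last stage, expected [2, 8, 14, 14, C]
feat = feat.permute(0, 4, 1, 2, 3)     # -> (B, C, T, H, W) if you need a conv-style layout
```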

tapohongchen commented 4 months ago

Thank you very much, I see it now. The intermediates I am getting are shown in the screenshot below. However, the shape of the last element is not [2x8x14x14xC]. Do I need to reshape it to obtain the final feature map?

dbolya commented 4 months ago

Ah, I see where the confusion is. Is there a reason you're using hiera.mae_hiera_base_16x224 instead of hiera.hiera_base_16x224 with checkpoint=mae_k400?

The model you're currently using is meant for MAE training and thus automatically masks the input (hence why return_intermediates gives fewer tokens than needed to reconstruct the whole image: 60% of the mask units have been deleted, per the MAE model's masking ratio, leaving 8*7*7*0.4 ≈ 156).

If you want to use Hiera just to extract features (and not perform MAE), then you should use the non-MAE Hiera model with the MAE weights: hiera.hiera_base_16x224(pretrained=True, checkpoint="mae_k400"). Is that what you wanted to do instead? If you run that model with return_intermediates=True, you'll get back the feature maps in the shape you want.
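For concreteness, a small sketch of that suggestion (assuming the hiera package is importable and that the forward returns the head output together with the list of intermediates):

```python
import torch
import hiera

# Non-MAE Hiera video model loaded with the Kinetics-400 MAE weights.
model = hiera.hiera_base_16x224(pretrained=True, checkpoint="mae_k400")
model.eval()

x = torch.randn(2, 3, 16, 224, 224)    # dummy clip in (B, C, T, H, W) layout

with torch.no_grad():
    out, intermediates = model(x, return_intermediates=True)

# The last element should now be the full (unmasked) feature map, e.g. [2, 8, 14, 14, C].
print(intermediates[-1].shape)
```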

If you still want to use MAE, then you can pass in None for the mask, but then the MAE objective wouldn't make sense.

tapohongchen commented 4 months ago

Ah, I see. I will try to use the non-MAE Hiera model with the MAE weights. Thank you very much for your help.

tapohongchen commented 4 months ago

I will close this issue. Thank you again for your help.