Closed tapohongchen closed 4 months ago
Yes, this should be correct if you're using the MAE model directly. Note that the MAE version of the model doesn't have the pooling layer in the last stage (which would take the features from 8x14x14 -> 8x7x7). If you use the classification pretrained models, the correct dimension would be 2x8x7x7x512.
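For reference, that last-stage pooling is just a 2x2 spatial downsampling of the token grid. A minimal numpy sketch of the shape change (the real model pools inside attention, but the resulting shapes match):

```python
import numpy as np

# Per-clip feature map entering the last stage of the classification
# model: (T, H, W, C) = (8, 14, 14, 512).
x = np.random.rand(8, 14, 14, 512)

# 2x2 spatial max-pool: view each 14x14 grid as a 7x7 grid of 2x2
# blocks and take the max within each block -> (8, 7, 7, 512).
t, h, w, c = x.shape
pooled = x.reshape(t, h // 2, 2, w // 2, 2, c).max(axis=(2, 4))

print(pooled.shape)  # (8, 7, 7, 512)
```

With a batch of 2 clips, that is exactly the 2x8x7x7x512 shape mentioned above.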
Ah, I forgot to mention: the token order for the MAE model is not Height x Width; it's unrolled at the start so that max-pools and masking are faster. Instead, pass the argument `return_intermediates=True` into the forward pass, and it will do all the unrolling and reshaping for you. This will return a list, where the last element is a tensor of shape `[2, 8, 14, 14, C]`, as you desired.
Thank you very much, I see it now. The values of the intermediates I am getting are shown in the figure below. The shape of the last element is not [2x8x14x14xC]. Do I need to reshape it to obtain the final feature map?
Ah, I see where the confusion is. Is there a reason you're using `hiera.mae_hiera_base_16x224` instead of `hiera.hiera_base_16x224` with `checkpoint="mae_k400"`?
The model you're currently using is meant for MAE training and thus automatically masks the input (hence why `return_intermediates` gives fewer tokens than needed to reconstruct the whole image: 60% of mask units have been deleted per the MAE model's masking ratio, leaving `int(8*7*7*0.4) = 156`).
If you want to use Hiera to just forward features (and not perform MAE), then you should use the non-MAE Hiera model with the MAE weights: `hiera.hiera_base_16x224(pretrained=True, checkpoint="mae_k400")`. Is that what you wanted to do instead? If you run that model with `return_intermediates=True`, you'll get back the feature maps in the shape that you want.
If you still want to use MAE, then you can pass in `None` for the mask, but then the MAE objective wouldn't make sense.
Ah, I see. I will try to use the non-MAE Hiera model with the MAE weights. Thank you very much for your help.
I will close this issue. Thank you again for your help.
I want to use `hiera_mae` to extract features, but the shape of the output `x` from the last layer of `hiera_mae` is 2x1568x512. If I reshape it into 14x14 patches, the dimensions become 2x8x14x14x512 (I am using a frame-sampling strategy of 16 frames, and the input size is 2x16x224x224x3). I would like to ask: is this correct?
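The shape bookkeeping here does check out: 8 * 14 * 14 = 1568, so the flat token axis can be viewed as T x H x W. A minimal numpy sketch of that reshape (note the caveat after the code):

```python
import numpy as np

# Flat token output as described: (batch, tokens, channels) = (2, 1568, 512),
# from 16 input frames -> 8 temporal tokens, each a 14x14 spatial grid.
x = np.random.rand(2, 1568, 512)

assert 8 * 14 * 14 == 1568  # the flat token axis is T*H*W
feat = x.reshape(2, 8, 14, 14, 512)

print(feat.shape)  # (2, 8, 14, 14, 512)
```

Caveat: this plain reshape is only correct if the tokens are stored in row-major T, H, W order. As discussed above, the MAE model unrolls its tokens into a different order internally, which is exactly why using `return_intermediates=True` (or the non-MAE model) is the safer way to get the spatial feature maps back.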