YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Positional embedding #45

Closed · lmaxwell closed this 2 years ago

lmaxwell commented 2 years ago

The paper https://arxiv.org/pdf/2012.12877v2.pdf says: "We therefore cut the first dimension and interpolate the second dimension of the 24 × 24 ViT positional embedding to 12 × 100 and use it as the positional embedding for the AST."

Do the "cut" means take the first 12-dimension ? In my understanding, nn.functional.interpolate always "interpolate".


```
import torch
from torch import nn

h, w = 4, 3
pos_embed = torch.randn((1, 1, h, w))

# bilinear interpolation from (h, w) = (4, 3) down to (2, 3)
a = nn.functional.interpolate(pos_embed, scale_factor=(2 / h, 3 / w), mode='bilinear')
print("position embedding:\n", pos_embed)
print("{},{}->{},{}:\n".format(h, w, 2, 3), a)
```

Output:

```
position embedding:
 tensor([[[[-0.5638,  0.0127, -2.4190],
          [ 0.2434,  0.3804, -0.2128],
          [ 0.2813, -0.7966, -0.3580],
          [-1.2754, -0.2837,  1.6149]]]])
4,3->2,3:
 tensor([[[[-0.1602,  0.1966, -1.3159],
          [-0.4971, -0.5402,  0.6284]]]])
```
YuanGongND commented 2 years ago

Hi there,

The cutting code is here (when ImageNet pretraining is used):

https://github.com/YuanGongND/ast/blob/7252ecec6baf4cc81c7fecd0b6bfabc8e484b685/src/models/ast_models.py#L100-L103

We do the slicing explicitly (and we take it from the middle, though I think taking the first 12 dimensions should give similar performance) rather than using nn.functional.interpolate.
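
For illustration, here is a minimal sketch of that middle-slice-then-interpolate procedure (all names and shapes below are illustrative, not the repo's; see the linked lines for the actual implementation):

```
import torch

# stand-in for the pretrained 24 x 24 ViT positional embedding
embed_dim = 768
vit_pos_embed = torch.randn(1, embed_dim, 24, 24)

f_dim, t_dim = 12, 100  # target AST grid from the paper

# "cut" the frequency dimension: keep the middle f_dim of the 24 rows
f_start = (24 - f_dim) // 2
new_pos_embed = vit_pos_embed[:, :, f_start:f_start + f_dim, :]

# interpolate the time dimension from 24 up to t_dim
new_pos_embed = torch.nn.functional.interpolate(
    new_pos_embed, size=(f_dim, t_dim), mode='bilinear')

print(new_pos_embed.shape)  # torch.Size([1, 768, 12, 100])
```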

When AudioSet pretraining is used, we assume the pretraining and downstream tasks have the same f_dim, so we only adjust the t_dim.
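
A hedged sketch of that case as well (shapes and names are illustrative, e.g. a 12 × 101 grid for 10-second AudioSet input adapted to a shorter downstream task):

```
import torch

# illustrative AudioSet-pretrained positional embedding: f_dim = 12, t_dim = 101
pos_embed = torch.randn(1, 768, 12, 101)
target_t = 50  # downstream task uses shorter audio

if target_t <= pos_embed.shape[-1]:
    # shorter input: slice the middle target_t time frames
    start = (pos_embed.shape[-1] - target_t) // 2
    pos_embed = pos_embed[:, :, :, start:start + target_t]
else:
    # longer input: interpolate along the time dimension only
    pos_embed = torch.nn.functional.interpolate(
        pos_embed, size=(pos_embed.shape[2], target_t), mode='bilinear')

print(pos_embed.shape)  # torch.Size([1, 768, 12, 50])
```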

Yuan

lmaxwell commented 2 years ago

I got it, thank you for the explanation. Closing.