Hi there,
The cut code is here (when ImageNet pretraining is used):
We explicitly do the slicing rather than using nn.functional.interpolate. The slice is taken from the middle, but I think taking the first 12 dimensions should give similar performance.
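For readers following along, here is a minimal sketch of the "cut one axis, interpolate the other" idea described above. It is not the repository's code: the function name adapt_pos_embed, the (1, C, H, W) tensor layout, and the bilinear mode are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def adapt_pos_embed(pos_embed, f_dim, t_dim):
    """Resize a 2D grid of positional embeddings (hypothetical helper).

    pos_embed: tensor of shape (1, C, H, W); this layout is an assumption.
    An axis that must shrink is cut (sliced) from the middle;
    an axis that must grow is interpolated.
    """
    _, _, h, w = pos_embed.shape

    # Frequency axis: slice a centered window, or interpolate if it must grow.
    if f_dim <= h:
        start = h // 2 - f_dim // 2
        pos_embed = pos_embed[:, :, start:start + f_dim, :]
    else:
        pos_embed = F.interpolate(
            pos_embed, size=(f_dim, w), mode="bilinear", align_corners=False
        )

    # Time axis: same rule.
    if t_dim <= w:
        start = w // 2 - t_dim // 2
        pos_embed = pos_embed[:, :, :, start:start + t_dim]
    else:
        pos_embed = F.interpolate(
            pos_embed, size=(f_dim, t_dim), mode="bilinear", align_corners=False
        )
    return pos_embed


# Example: a 24 x 24 grid adapted to 12 x 100, as in the quoted paper sentence.
vit_grid = torch.randn(1, 768, 24, 24)
ast_grid = adapt_pos_embed(vit_grid, f_dim=12, t_dim=100)
print(tuple(ast_grid.shape))
```

With a 24-row grid and f_dim=12, the slice in this sketch keeps rows 6 to 17, so the retained embeddings are original values rather than resampled ones.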
When AudioSet pretraining is used, we assume the pretraining and downstream tasks have the same f_dim, so we only adjust the t_dim.
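A rough sketch of the AudioSet case, where only the time axis changes. The helper name adjust_t_dim and the (1, C, f_dim, t_dim) layout are hypothetical, and the rule "cut when shorter, interpolate when longer" is an assumption carried over from the ImageNet description, not something confirmed for this branch.

```python
import torch
import torch.nn.functional as F


def adjust_t_dim(pos_embed, t_dim):
    """Adjust only the time axis of a positional-embedding grid (hypothetical helper).

    pos_embed: tensor of shape (1, C, f_dim, old_t_dim); layout is an assumption.
    f_dim is left untouched because pretraining and downstream are assumed to share it.
    """
    f_dim, old_t = pos_embed.shape[-2], pos_embed.shape[-1]
    if t_dim < old_t:
        # Shorter target: cut a centered window along time.
        start = old_t // 2 - t_dim // 2
        return pos_embed[..., start:start + t_dim]
    if t_dim > old_t:
        # Longer target: interpolate along time only (f_dim is unchanged).
        return F.interpolate(
            pos_embed, size=(f_dim, t_dim), mode="bilinear", align_corners=False
        )
    return pos_embed
```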
Yuan
I got it, thank you for your explanation. Closing.
The paper https://arxiv.org/pdf/2012.12877v2.pdf says "We therefore cut the first dimension and interpolate the second dimension of the 24 × 24 ViT positional embedding to 12 × 100 and use it as the positional embedding for the AST."
Does "cut" mean taking the first 12 dimensions? In my understanding, nn.functional.interpolate always interpolates.
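To make the distinction in the question concrete, here is a small comparison of cutting (plain slicing) against nn.functional.interpolate on a toy 24 × 24 grid. The toy tensor and variable names are made up for illustration, and bilinear mode is an assumption.

```python
import torch
import torch.nn.functional as F

# Toy 24 x 24 grid with a single channel; each entry encodes its own position.
grid = torch.arange(24 * 24, dtype=torch.float32).reshape(1, 1, 24, 24)

# "Cut": indexing keeps 12 of the original rows with their values untouched.
first_12 = grid[:, :, :12, :]
middle_12 = grid[:, :, 6:18, :]

# "Interpolate": all 24 rows are resampled into 12 new rows.
resampled = F.interpolate(
    grid, size=(12, 24), mode="bilinear", align_corners=False
)

# Same shapes, different contents.
print(tuple(first_12.shape), tuple(middle_12.shape), tuple(resampled.shape))
print(torch.equal(resampled, first_12), torch.equal(resampled, middle_12))
```

All three results have 12 rows, but only the sliced ones contain rows that exist in the original grid; the interpolated rows are blends of neighboring rows.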