YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Question about how DeiT's two [CLS] tokens are processed #77

Open liyunlongaaa opened 1 year ago

liyunlongaaa commented 1 year ago

Hi, sorry to bother you. The paper says that the two special [CLS] tokens in DeiT are averaged into a single [CLS] token, but in the code I see that they are concatenated together. What am I missing?

cls_tokens = self.v.cls_token.expand(B, -1, -1)
dist_token = self.v.dist_token.expand(B, -1, -1)
x = torch.cat((cls_tokens, dist_token, x), dim=1)

liyunlongaaa commented 1 year ago

Oh, I see it now:

x = (x[:, 0] + x[:, 1]) / 2

Sorry to bother you, and thank you for your good work. I am new to the speech area and working toward my master's degree; I have to publish a dissertation before I can graduate. Thank you for helping me along the way, although I haven't published one yet, haha~

YuanGongND commented 1 year ago

To use DeiT initialization, we have to set up the two tokens in the same way as DeiT, but as you point out, we average their outputs in the forward pass.

Good luck with your dissertation.

-Yuan
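
The two steps discussed in this thread (concatenating both tokens onto the input sequence, then averaging their outputs) can be sketched as follows. This is a minimal illustration and not the repository's code: NumPy stands in for PyTorch so the shapes can be checked without the model, the transformer blocks are skipped, and the helper names `prepend_tokens` and `pool_tokens` and the toy sizes are hypothetical.

```python
import numpy as np

def prepend_tokens(patches, cls_token, dist_token):
    """Prepend the [CLS] and distillation tokens to every sequence in the batch."""
    batch = patches.shape[0]
    # Broadcast the (1, 1, D) tokens to (B, 1, D); mirrors `.expand(B, -1, -1)` in PyTorch.
    cls_tokens = np.broadcast_to(cls_token, (batch, 1, cls_token.shape[-1]))
    dist_tokens = np.broadcast_to(dist_token, (batch, 1, dist_token.shape[-1]))
    # Concatenate along the sequence axis; mirrors `torch.cat(..., dim=1)`.
    return np.concatenate((cls_tokens, dist_tokens, patches), axis=1)

def pool_tokens(x):
    """Average the outputs at positions 0 and 1 into one vector per example."""
    return (x[:, 0] + x[:, 1]) / 2

# Toy sizes (hypothetical): batch of 2, 4 patch embeddings, embedding width 3.
B, N, D = 2, 4, 3
patches = np.zeros((B, N, D))
cls_token = np.full((1, 1, D), 1.0)
dist_token = np.full((1, 1, D), 3.0)

x = prepend_tokens(patches, cls_token, dist_token)
print(x.shape)        # (2, 6, 3): two extra positions at the front of the sequence

# The transformer blocks would run here; they are skipped in this sketch.
pooled = pool_tokens(x)
print(pooled.shape)   # (2, 3): one vector per example
print(pooled[0])      # [2. 2. 2.], the mean of the two token values 1.0 and 3.0
```

Both tokens occupy their own sequence positions on the way in, so the layout matches what DeiT weights expect, and only the final pooled representation merges them.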