Open Young973 opened 1 year ago
hi there,
It is just ImageNet pretraining,
i.e., the ImageNet-pretrained DeiT weights are used to initialize AST.
-Yuan
Some modifications are needed. See https://github.com/YuanGongND/ast/blob/master/src/models/ast_models.py.
If you mean audio-domain pretraining, that is just training AST on AudioSet (starting from the ImageNet initialization) with a BCE loss for the classification task. You can then use the resulting model for other audio tasks (e.g., ESC-50).
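To make that objective concrete, here is a minimal sketch of a BCE loss for multi-label classification, written in plain Python rather than taken from the repository's training code. AudioSet clips can carry several labels at once, so each class is scored with its own sigmoid and the per-class losses are averaged. The function name `bce_with_logits` and the toy logits and targets are hypothetical, for illustration only.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over classes for one clip (multi-label).

    logits:  raw model outputs, one per class (before the sigmoid)
    targets: multi-hot labels, 1.0 if the class is present, else 0.0
    """
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of
        # -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Toy example with 4 classes (assumed values, not real AudioSet data)
logits = [2.0, -1.0, 0.0, -3.0]
targets = [1.0, 0.0, 0.0, 0.0]
loss = bce_with_logits(logits, targets)
print(f"BCE loss: {loss:.4f}")
```

In a PyTorch training loop the same quantity is usually computed with `torch.nn.BCEWithLogitsLoss` (or `torch.nn.BCELoss` applied to sigmoid outputs).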
TBH, I'm a little confused about what the objective is when pretraining AST. It does not seem to be stated in the paper. BTW, when pretraining SSAST, the discriminative objective is classification with InfoNCE and the generative objective is reconstruction. But what is it for AST?