YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

How to use the model for a downstream task? #53

Closed: devesh-k closed this issue 2 years ago

devesh-k commented 2 years ago

Hi Yuan, thanks so much for open-sourcing the code and sharing the recipes. I am trying to use the model in my own training pipeline, following your suggestions in the README, and I am running into the following error:

`RuntimeError: The size of tensor a (1070) must match the size of tensor b (7070) at non-singleton dimension 1`. The full stack trace is below:

```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_318/1697194335.py in <module>
      1 model.cuda()
----> 2 y = model(spec)

/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

/opt/conda/lib/python3.8/site-packages/torch/autocast_mode.py in decorate_autocast(*args, **kwargs)
    196     def decorate_autocast(*args, **kwargs):
    197         with self:
--> 198             return func(*args, **kwargs)
    199     return decorate_autocast

/tmp/ipykernel_318/3758517193.py in forward(self, x)
    213         dist_token = self.v.dist_token.expand(B, -1, -1)
    214         x = torch.cat((cls_tokens, dist_token, x), dim=1)
--> 215         x = x + self.v.pos_embed
    216         x = self.v.pos_drop(x)
    217         for blk in self.v.blocks:

RuntimeError: The size of tensor a (1070) must match the size of tensor b (7070) at non-singleton dimension 1
```

Do you have any suggestions for me? I'd appreciate your input. Thanks, Devesh

YuanGongND commented 2 years ago

Hi Devesh,

Can you paste the code showing how you instantiate the AST model? It seems there's a mismatch between the actual input shape and the shape you declared when you created the AST model.

When you use your own pipeline, there are two things you need to take care of:

1) Input normalization. If you don't use AudioSet pretraining, you need to normalize the entire dataset to roughly 0 mean and 0.5 std; it does not need to be exact. If you use AudioSet pretraining, it is better to use the same feature extractor and normalization as we do (see the sketch below).

2) Learning rate. AST generally needs a smaller learning rate, so you need to search for the lr again. In my experiments, with the batch size fixed, AST typically works best with an lr about 10x smaller than an EfficientNet model trained with the Adam optimizer.
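For reference, here is a minimal sketch of the feature extraction, trim/pad, and normalization, assuming torchaudio's Kaldi-compatible fbank as in this repo's recipes; `make_features` is a hypothetical helper, and the default mean/std are the AudioSet statistics from the recipes, which you should replace with your own dataset's statistics if you are not using AudioSet pretraining:

```python
import torch
import torchaudio

def make_features(wav_path, target_len=1024,
                  dataset_mean=-4.2677393, dataset_std=4.5689974):
    # make_features is a hypothetical helper; the mean/std defaults are the
    # AudioSet statistics used in the recipes -- swap in your own dataset's
    # statistics if you are not using AudioSet pretraining.
    waveform, sr = torchaudio.load(wav_path)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
        window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)

    # trim/pad every clip to the same number of frames (= input_tdim)
    p = target_len - fbank.shape[0]
    if p > 0:
        fbank = torch.nn.functional.pad(fbank, (0, 0, 0, p))
    else:
        fbank = fbank[:target_len, :]

    # normalize so the dataset ends up at roughly 0 mean / 0.5 std
    fbank = (fbank - dataset_mean) / (dataset_std * 2)
    return fbank  # shape: (target_len, 128)
```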

-Yuan

devesh-k commented 2 years ago

Thanks so much for your prompt response. Here is the instantiation code:

```python
model = ASTModel(input_tdim=5900, input_fdim=128, label_dim=1,
                 imagenet_pretrain=True, audioset_pretrain=False)
```

and the forward call:

```python
model.cuda()
y = model(spec)
```

where `spec.shape` is `torch.Size([1, 128, 901])`.

YuanGongND commented 2 years ago

Yes, I think input_tdim is the problem. We require that, for each task, all inputs have the same time length, and that length must match the input_tdim you pass when you instantiate the AST model. For your example, setting input_tdim=901 will solve the problem for this specific audio sample. (That is also where the numbers in the error come from: with the default 16x16 patches and stride 10, input_tdim=5900 yields 12x589 + 2 = 7070 positional embeddings, while your 901-frame input only produces 12x89 + 2 = 1070 tokens.)
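In code, a minimal sketch of that fix (the transpose follows the README's [batch, time, freq] input convention and is only needed if your spec is laid out as [batch, freq, time], as the shape above suggests):

```python
# match input_tdim to the spectrogram's time dimension (901 frames here)
model = ASTModel(input_tdim=901, input_fdim=128, label_dim=1,
                 imagenet_pretrain=True, audioset_pretrain=False)
model.cuda()

# the README's forward convention is (batch, time, freq), so a
# (1, 128, 901) spectrogram may need transposing first
y = model(spec.transpose(1, 2).cuda())  # (1, 901, 128) -> (1, label_dim)
```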

In our recipe, the dataloader trims/pads all audio to input_tdim; I think it should be easy to implement. You can use the mean audio length as input_tdim instead of the max and do the trim/pad. The caveat is that if your audio lengths vary a lot, you should consider a strategy beyond simple trim/pad. Another suggestion: AST's attention is O(n^2) in the number of patches, and 5900 frames is generally too large for AST, so you can enlarge fstride and tstride to lower the time complexity (see the sketch below). Finally, if your downstream task is also audio classification, you can consider using audioset_pretrain=True.
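To illustrate the stride point, a hedged sketch (argument names follow ASTModel's signature in this repo; with stride 16 the 16x16 patches no longer overlap, roughly halving the token count per axis compared to stride 10):

```python
from models import ASTModel  # assumes this repo's src/models is on your path

# larger strides -> fewer patches -> a shorter token sequence, which
# matters because self-attention cost grows as O(n^2) in the tokens
model = ASTModel(input_tdim=1024, input_fdim=128, label_dim=1,
                 fstride=16, tstride=16,
                 imagenet_pretrain=True, audioset_pretrain=False)
```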

-Yuan

devesh-k commented 2 years ago

Many thanks for looking into it. I am working on a bio-acoustics dataset, classifying mosquito wing-beats. The lengths of the audio samples in the dataset vary a lot. Do you recommend padding them to a specific size in the dataloader?
I somehow misunderstood the README and thought the AST code takes care of varying-length audio. I can work on padding the audio to the same duration as input_tdim.

YuanGongND commented 2 years ago

I think you can consider using audioset_pretrain=True for this application. Padding to the max length is OK, but AST is O(n^2), so the computational overhead grows quickly with the input length. Another solution is to crop the long audio into fixed-length clips (e.g., 10s) and do some kind of decision fusion (see the sketch below); or you can use a sliding-window method; or you can first try the short clips you have with AST as a preliminary experiment.
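A minimal sketch of the fixed-length-clip + decision fusion idea; the chunking and the mean fusion rule here are assumptions (one of several reasonable fusion strategies), and the model is assumed to be instantiated with input_tdim=clip_len:

```python
import torch

def predict_long_audio(model, spec, clip_len=1024):
    # spec: (time, freq) log-mel spectrogram, already normalized
    clips = list(torch.split(spec, clip_len, dim=0))

    # zero-pad the last clip so every clip has exactly clip_len frames
    short = clip_len - clips[-1].shape[0]
    if short > 0:
        clips[-1] = torch.nn.functional.pad(clips[-1], (0, 0, 0, short))

    batch = torch.stack(clips)          # (n_clips, clip_len, freq)
    with torch.no_grad():
        logits = model(batch)           # (n_clips, label_dim)
    return logits.mean(dim=0)           # simple mean fusion over clips
```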

devesh-k commented 2 years ago

Another quick question about using audioset_pretrain=True: given that my bio-acoustics use case is very different from the AudioSet data, do you still recommend training with audioset_pretrain=True?

devesh-k commented 2 years ago

Yeah - I agree with your observation about the computational overhead. I did try the sliding-window approach to create spectrograms from files of varying length, but that also led to downstream issues, since each .wav file produced a lot of spectrograms. Thanks so much for your suggestions; I am going to tweak my strategy a bit and will try audioset_pretrain=True. Thanks once again for your time, suggestions, and patience!

YuanGongND commented 2 years ago

It will be a quick test (just changing one argument), so I think there's no harm. My guess is that if you normalize the input appropriately, it might lead to some performance improvement. It also depends on how large your dataset is. AudioSet contains many animal sounds, which is why I think it might be helpful.
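i.e., the quick test would just be the earlier call with one flag flipped (if I recall correctly, the repo fetches the AudioSet-pretrained checkpoint automatically the first time this runs):

```python
# same setup as before, but initialized from AudioSet-pretrained weights
model = ASTModel(input_tdim=901, input_fdim=128, label_dim=1,
                 imagenet_pretrain=True, audioset_pretrain=True)
```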

devesh-k commented 2 years ago

That's a great idea. I will give it a shot!