RetroCirce / HTS-Audio-Transformer

The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"
https://arxiv.org/abs/2202.00874
MIT License

masking and difference in mix-up strategy #8

Closed: jasonppy closed this issue 2 years ago

jasonppy commented 2 years ago

Hi Ke,

Thanks for the great work and for open-sourcing the code!

I'd like to build on your excellent codebase, and I have a few questions about the details:

  1. I couldn't find any handling of a padding mask. Is one not used in the model?
  2. The mix-up procedure seems to differ from AST's in a few ways (see the sketch after this list):
     2.1 AST mixes the raw waveforms before applying any transforms, while HTS-AT computes the fbanks first and then mixes the fbanks.
     2.2 In AST, the mix-up partner is randomly sampled from the entire dataset, while you sample within the current batch.
     2.3 AST also mixes the labels as lambda * label1 + (1 - lambda) * label2, while HTS-AT does not mix labels.
     I'm not sure whether these three points matter much for performance, but I'm curious about your thoughts.
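
For concreteness, here is a rough sketch of the AST-style recipe I am describing (the `dataset` indexing and the Beta(10, 10) lambda are my own assumptions for illustration, not code from either repo):

```python
import numpy as np

def ast_style_mixup(dataset, idx, rng=np.random.default_rng()):
    # Hypothetical sketch of the AST-style recipe in 2.1-2.3, not real AST code.
    wav1, label1 = dataset[idx]                  # fixed-length waveform, multi-hot label
    partner = rng.integers(len(dataset))         # 2.2: partner from the whole dataset
    wav2, label2 = dataset[partner]
    lam = rng.beta(10.0, 10.0)                   # assumed mixing weight ~ Beta(10, 10)
    wav = lam * wav1 + (1.0 - lam) * wav2        # 2.1: mix raw waveforms, pre-transform
    label = lam * label1 + (1.0 - lam) * label2  # 2.3: mix labels with the same lambda
    return wav, label
```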

Thanks, Puyuan

jasonppy commented 2 years ago

I realize that difference 2.1 probably doesn't matter much, due to the linearity of the Fourier transform.
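
That is, mixing before or after the linear transform step gives the same result (though the magnitude and log-mel steps that follow are nonlinear, so the equivalence only covers the linear part of the pipeline):

```latex
\mathcal{F}\big[\lambda x_1 + (1-\lambda)\,x_2\big]
  = \lambda\,\mathcal{F}[x_1] + (1-\lambda)\,\mathcal{F}[x_2]
```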

RetroCirce commented 2 years ago

Hi,

  1. What do you mean by a padding mask? Can you explain more?
  2. Since the dataset is also shuffled for every mini-batch, you can imagine that 2.2 does not matter much, whether you sample the mix-up partner from the current batch or from the entire dataset.

For 2.3, this is actually a very interesting question, because we did also try mixing up the labels. We found that label mix-up did not perform better than mixing up only the inputs (about a 1-3% drop). We think this is because AudioSet is a special dataset in that not all labels are correct (i.e., annotated with 100% confidence). When you apply the so-called mix-up to AudioSet, it really acts more like "noise perturbation" than a true "mix-up". This increases the model's tolerance of noise and its sensitivity to some very rare events, and thus it gets a better result.
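
Concretely, the input-only, in-batch variant we are discussing amounts to something like the following sketch (my paraphrase with a hypothetical helper, not necessarily the exact code in this repo):

```python
import torch

def inbatch_input_mixup(fbank: torch.Tensor, lam: float) -> torch.Tensor:
    # Mix each fbank with another example drawn from the same (shuffled) batch.
    # The labels are deliberately left untouched, so on AudioSet this acts
    # more like a noise perturbation of the input than a true mix-up.
    perm = torch.randperm(fbank.size(0), device=fbank.device)
    return lam * fbank + (1.0 - lam) * fbank[perm]
```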

jasonppy commented 2 years ago

  1. About 20% of the audio clips in AudioSet are shorter than 10 s, but they are all padded to 1024 frames (10 s). Does the model use a padding mask to keep those padded frames from being attended to? (A toy example of what I mean is below.)
  2. Re 2.3: thanks for your insight. It's interesting that mix-up behaves differently for a CNN (in PSLA it improves mAP by 3% absolute) than for HTS-AT.
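
By "padding mask" I mean something like the standard `key_padding_mask` in PyTorch attention, which keeps queries from attending to padded time steps; a toy example of the idea (not HTS-AT code):

```python
import torch
import torch.nn as nn

# Two clips padded to 6 time steps; clip 0 only has 4 real steps.
lengths = torch.tensor([4, 6])
steps = torch.arange(6)
key_padding_mask = steps[None, :] >= lengths[:, None]  # True = padded, ignore

attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(2, 6, 8)
out, _ = attn(x, x, x, key_padding_mask=key_padding_mask)
```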

RetroCirce commented 2 years ago

Hi,

For your question (1): if AudioSet "internally" pads these audios to 10 seconds before releasing them, then we do not have padding masks for them. If an audio clip is shorter than 10 seconds before being fed into our model, we have several ways to bring it to 10 seconds: https://github.com/RetroCirce/HTS-Audio-Transformer/blob/356521f5dbb1893082c449a4993977fd624905f0/model/htsat.py#L728-L761
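
For illustration, one simple strategy in that spirit is to tile a short clip until it fills the 10-second window (a sketch of the idea only; the linked lines are the authoritative implementation):

```python
import torch

def repeat_pad(wav: torch.Tensor, target_len: int) -> torch.Tensor:
    # Tile a 1-D waveform until it covers target_len samples, then truncate.
    # Illustrative only; see the linked htsat.py lines for the real options.
    if wav.numel() >= target_len:
        return wav[:target_len]
    n = -(-target_len // wav.numel())  # ceil(target_len / len(wav))
    return wav.repeat(n)[:target_len]

# e.g. bring a 7 s clip at an assumed 32 kHz sample rate up to 10 s:
padded = repeat_pad(torch.randn(7 * 32000), 10 * 32000)
```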