Closed: jasonppy closed this issue 2 years ago
I realize that 2.1 probably doesn't make sense due to the linearity of the Fourier transform.
Hi,
For 2.3, this is actually a very interesting question, because we also tried mix-up with labels. We found that mixing up the labels did not perform better than mixing up only the inputs (about a 1-3% drop). We think this is because AudioSet is a special dataset in that not all labels are correct (i.e., held with 100% confidence). When you apply the so-called mix-up to AudioSet, it acts more like "noise perturbation" than real "mix-up". This increases the model's tolerance and its sensitivity for recognizing some very rare events, and thus it gets a better result.
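For readers comparing the two variants, here is a minimal sketch of input-only mix-up versus mixing the labels as well, assuming multi-hot AudioSet-style targets. The function name and the "keep the dominant clip's label" rule are illustrative assumptions, not taken from the HTS-AT code:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.5, mix_labels=False):
    """Mix two (spectrogram, multi-hot label) training pairs.

    mix_labels=False: inputs are blended but only one clip's label is
    kept, so mix-up acts like noise perturbation of that clip.
    mix_labels=True : labels are blended with the same coefficient.
    """
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    if mix_labels:
        y = lam * y1 + (1.0 - lam) * y2
    else:
        y = y1 if lam >= 0.5 else y2  # keep the dominant clip's label (an assumed rule)
    return x, y
```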
1. About 20% of the audios in AudioSet are shorter than 10 s, but they are all padded to 1024 frames (10 s). Does the model use a padding mask to prevent those padded frames from being attended to?
2.3 Thanks for your insight. It's interesting to see that mix-up behaves differently for CNNs (in PSLA, it improves mAP by 3% absolute) compared to HTS-AT.
Hi,
For your question (1): if AudioSet "internally" pads these audios to 10 seconds before releasing them, we do not have padding masks for them. If the audios are originally less than 10 seconds before being fed into our model, we have several ways to make them 10 seconds: https://github.com/RetroCirce/HTS-Audio-Transformer/blob/356521f5dbb1893082c449a4993977fd624905f0/model/htsat.py#L728-L761
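The linked code brings short clips to a fixed length rather than masking attention; a simplified sketch of two common strategies (repeat-and-truncate vs. zero-pad), with a hypothetical function name, not the exact repo implementation:

```python
import numpy as np

def to_fixed_length(wav, target_len, mode="repeat"):
    """Bring a 1-D waveform to exactly target_len samples."""
    if len(wav) >= target_len:
        return wav[:target_len]           # long clips: truncate
    if mode == "repeat":
        # tile the clip until it covers target_len, then truncate
        reps = int(np.ceil(target_len / len(wav)))
        return np.tile(wav, reps)[:target_len]
    # mode == "pad": append zeros (silence) at the end
    pad = np.zeros(target_len - len(wav), dtype=wav.dtype)
    return np.concatenate([wav, pad])
```

With either strategy every frame carries real (or silent) signal, so no attention mask is strictly required downstream.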
Hi Ke,
Thanks for the great work and for open-sourcing the code!
I'd like to build from your excellent codebase, and I have a few questions regarding the details:
Thanks, Puyuan