RicherMans / PSL

Source code for ICASSP2022 "Pseudo Strong labels for large scale weakly supervised audio tagging"
GNU General Public License v3.0

How can I get the pretrained teacher model by myself? #2

Open SteveTanggithub opened 1 year ago

SteveTanggithub commented 1 year ago

How can I get the pretrained teacher model by myself instead of using the ones you provided?

RicherMans commented 1 year ago

Hey there, just train a classifier on the standard 10 s scale on Audioset. You might have noticed that I generally avoid publishing the code for 10 s training on Audioset, mainly because of the issues with downloading the dataset, which would lead to a lot of questions.

There are just some caveats that are not really discussed in the paper, but are vital to the success of a good teacher:

  1. For CNNs/CRNNs, do not use global average (or any other global) pooling methods. These generally do not generalize to the smaller-sized inputs that are necessary for PSL. Instead, always use decision-mean or decision-max pooling, which can accurately predict audio tags at the frame level. Here is an issue where we discussed the pitfalls of global pooling methods.
  2. For Transformers, the positional embedding is crucial. Many implementations that I have seen use a wrong positional embedding. For example, AST uses a joint time-frequency positional embedding, whose size is fixed and cannot adapt when the input length changes. Thus that repo uses the extreme solution of always padding to 10 s, which is completely unbearable in my opinion. The easiest way to avoid any problems is to have independent time and frequency positional embeddings, as proposed in PaSST, which also scale to smaller input lengths. These embeddings actually make more sense than the traditional positional embedding of a transformer: in a spectrogram, the position of a frequency bin is always fixed, so you only need to encode that position once across all time-frames rather than give each (time, frequency) position a unique representation.
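To make the first point concrete, here is a minimal PyTorch sketch (not code from this repo; the class name, feature dimension, and class count are made up for illustration) of decision-mean pooling: the classifier produces a tag probability per frame, and the clip-level prediction is just the mean over time, so the same model works on inputs of any length.

```python
import torch
import torch.nn as nn

class DecisionMeanPooling(nn.Module):
    """Frame-level classifier followed by decision-mean pooling.

    Unlike globally average-pooling the features and classifying the
    pooled vector, this predicts a tag probability for every frame and
    averages the *decisions*, so shorter inputs (as needed for PSL)
    remain well-behaved.
    """

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x: torch.Tensor):
        # x: (batch, time, feat_dim) frame-level features from some backbone
        frame_probs = torch.sigmoid(self.classifier(x))  # (batch, time, classes)
        clip_probs = frame_probs.mean(dim=1)             # decision-mean over time
        return clip_probs, frame_probs
```

Because the temporal dimension only enters through the mean of per-frame decisions, evaluating on 1 s crops instead of 10 s clips changes nothing structurally, whereas a classifier trained on globally pooled 10 s features sees a very different feature statistic at shorter lengths.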
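For the second point, here is a minimal sketch of PaSST-style decoupled positional embeddings (again illustrative, not PaSST's actual code; names and shapes are assumptions): one learnable embedding per time step and one per frequency bin, broadcast over the patch grid. Shorter inputs simply slice the time embedding instead of padding to 10 s.

```python
import torch
import torch.nn as nn

class DecoupledPosEmbed(nn.Module):
    """Independent time and frequency positional embeddings.

    The frequency axis of a spectrogram has a fixed meaning, so it is
    encoded once and shared across all time-frames; the time embedding
    is sliced to the actual input length, so no padding to a fixed
    duration is needed.
    """

    def __init__(self, dim: int, max_time: int, n_freq: int):
        super().__init__()
        self.time_embed = nn.Parameter(torch.zeros(1, max_time, 1, dim))
        self.freq_embed = nn.Parameter(torch.zeros(1, 1, n_freq, dim))
        nn.init.trunc_normal_(self.time_embed, std=0.02)
        nn.init.trunc_normal_(self.freq_embed, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, freq, dim) grid of patch embeddings
        t = x.shape[1]
        # broadcasting adds the time embedding along freq and vice versa
        return x + self.time_embed[:, :t] + self.freq_embed
```

Compare this with a joint (time x frequency) embedding table: that table's time extent is baked in at training, so a shorter input has no matching embedding and must be padded, which is exactly the problem described above.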