RetroCirce / HTS-Audio-Transformer

The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"
https://arxiv.org/abs/2202.00874
MIT License

Learning rate #2

Closed · kimsojeong1225 closed this issue 2 years ago

kimsojeong1225 commented 2 years ago

Hi, I think the learning rate in the code differs from the paper. The paper says the learning rate is 0.05, 0.1, 0.2 in the first three epochs, but the learning rate in the code is 2e-5, 5e-5, 1e-4 (config.py: lr_rate = [0.02, 0.05, 0.1]). I changed lr_rate to [50, 100, 200] to match the paper, but then training shows bad results. I would like to know which setting is right to get the same results reported in the paper.

RetroCirce commented 2 years ago

Hi,

Thank you for pointing this out! I made a writing mistake in the paper and I will correct it. The "learning rate" I mentioned in the paper is actually the learning_rate_scale. The true learning rate = lr * lr_scale = 1e-3 * [0.02, 0.05, 0.1].

In the experiments, we tried two settings, [0.02, 0.05, 0.1] and [0.05, 0.1, 0.2], so the actual learning rates are [2e-5, 5e-5, 1e-4] and [5e-5, 1e-4, 2e-4], respectively.

Either of them can reproduce the result (or a very similar result) in the paper; our public checkpoint is trained under [5e-5, 1e-4, 2e-4] (lr_scale = [0.05, 0.1, 0.2]).
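For concreteness, here is a minimal sketch of how the warm-up values translate into effective learning rates, assuming a base learning rate of 1e-3 as discussed above (variable names are illustrative, not the repo's exact code):

```python
# Hedged sketch: effective learning rate during the 3-epoch warm-up,
# assuming base_lr = 1e-3 and the lr_scale values discussed above.
base_lr = 1e-3
lr_scales = [0.05, 0.1, 0.2]          # lr_rate in config.py (the setting used for the public checkpoint)

warmup_lrs = [base_lr * s for s in lr_scales]
print(warmup_lrs)                      # approximately [5e-05, 1e-04, 2e-04]
```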

Best, Ke

RetroCirce commented 2 years ago

As another note, we use weight averaging for the model, so you will keep the top-10 checkpoints during training. Usually their mAP will be around 0.459-0.467; after you weight-average them, you will get about 0.469-0.473. We chose one averaged model with 0.471 as the public checkpoint.
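Weight averaging here simply means averaging the parameters of the saved checkpoints. Below is a minimal sketch, assuming plain PyTorch checkpoints that each contain a "state_dict" entry with identical keys and shapes; the file names are hypothetical and this is not the repo's exact averaging script:

```python
# Minimal checkpoint-averaging sketch (assumption: each .ckpt holds a "state_dict").
import torch

def average_checkpoints(paths):
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["state_dict"]
        if avg_state is None:
            # Start the running sum with a float copy of the first checkpoint.
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    # Divide the accumulated sum by the number of checkpoints.
    for k in avg_state:
        avg_state[k] /= len(paths)
    return avg_state

# Example usage with hypothetical top-10 checkpoint files:
# paths = [f"ckpt_top{i}.ckpt" for i in range(10)]
# torch.save({"state_dict": average_checkpoints(paths)}, "avg.ckpt")
```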

kimsojeong1225 commented 2 years ago

Thanks for the kind reply :D!

mhamzaerol commented 1 year ago

Hello,

I am having difficulty with understanding a part of the learning rate scheduler.

I suppose line 248 in sed_model.py is based on the learning rates reported in the paper:

lr_scale = max(self.config.lr_rate[0] * (0.98 ** epoch), 0.03)

Here, it seems like 0.03 would fit the learning rates in the paper, since 0.03 is greater than 0.02 (the first value of lr_rate in config.py). Could you confirm this, or am I misunderstanding something?

If this is the case, what alternative value could I use here while training the model with the learning rates in config.py? Would keeping the ratio (0.03 / 0.05) be a good approximation?

Also, after the 30th epoch, wouldn't self.config.lr_rate[0] * (0.98 ** epoch) become ~0.027 if self.config.lr_rate[0] is set to 0.05, and thus always be less than 0.03? If so, what is the use of this expression?

Thanks a lot!

RetroCirce commented 1 year ago

Hi, thank you for your comment! Let me walk through it for you. Let us take lr_rate = [0.05, 0.1, 0.2]. First, in the first three epochs, the model uses the learning rate in a warm-up fashion --> the first epoch uses 0.05 * 1e-3 | the second epoch uses 0.1 * 1e-3 | the third epoch uses 0.2 * 1e-3.

Second, after three epochs, the model uses a learning rate that depends on which epoch it is in: https://github.com/RetroCirce/HTS-Audio-Transformer/blob/main/sed_model.py#L246-L250

  1. epoch < 10 | use 0.2 * 1e-3
  2. 10 < epoch < 20 | use 0.1 * 1e-3
  3. 20 < epoch < 30 | use 0.05 * 1e-3
  4. 30 < epoch | use max(0.05 * 0.98^30, 0.03) * 1e-3 = 0.03 * 1e-3

So the decayed value 0.027 * 1e-3 would only appear after 30 epochs --> this is why the decay term seems useless: after 30 epochs it is always less than 0.03 and gets clamped. If you train the model on the AudioSet dataset, you will find that it converges before 30 epochs, so we think it doesn't matter much what the schedule does after 30 epochs.
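Putting the warm-up and the staged decay together, here is a rough per-epoch sketch of the schedule described above (my own reading of the explanation, not a copy of sed_model.py; as noted below, the real code works at the granularity of training steps rather than epochs):

```python
# Rough sketch of the per-epoch lr_scale for lr_rate = [0.05, 0.1, 0.2].
# Illustration only; the exact logic lives in sed_model.py and is step-based.
def lr_scale(epoch, lr_rate=(0.05, 0.1, 0.2)):
    if epoch < 3:
        return lr_rate[epoch]                      # warm-up: 0.05, 0.1, 0.2
    if epoch < 10:
        return lr_rate[2]                          # 0.2
    if epoch < 20:
        return lr_rate[1]                          # 0.1
    if epoch < 30:
        return lr_rate[0]                          # 0.05
    # beyond 30 epochs the exponential decay is clamped at 0.03
    return max(lr_rate[0] * (0.98 ** epoch), 0.03)

# effective learning rate at a given epoch = 1e-3 * lr_scale(epoch)
```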

If you need to train your model for more than 30 epochs --> you can revise this yourself. Note that the epoch here is just a nominal definition --> what really matters is the number of training steps (in each step you train one batch of data). So depending on your data, you might need to change this yourself. For more detail on this, you can refer to this.

mhamzaerol commented 1 year ago

Thank you very much for the detailed response!