RetroCirce / HTS-Audio-Transformer

The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"
https://arxiv.org/abs/2202.00874
MIT License

Learning rate for small datasets #18

Closed · EmreOzkose closed this issue 2 years ago

EmreOzkose commented 2 years ago

Hi, thank you for your great work.

Here, you mentioned that the warm-up part should be adjusted according to the dataset. Could you give some advice for small datasets? For example, I have approximately 15K samples; how should I set lr_scheduler_epoch and lr_rate?

I compared the AudioSet config and the ESC config, but the parameters mentioned above are the same in both.

RetroCirce commented 2 years ago

Hi,

Sorry for the late reply, I was busy these days. Let me explain how to think about the learning rate schedule.

Say we train on the AudioSet full set, which has 2 million samples. If you check my dataset loader, you will see that each epoch covers 2 million samples in total (i.e. the same as the full set).

When I do the warm-up, each epoch covers 2 million samples. So you can count it like this: the 1st, 2nd, and 3rd warm-up epochs each process 2 million samples. Then epochs 4-10 run at their learning rate for 7 × 2 million samples, epochs 10-20 for 10 × 2 million samples, and so on.

Now you have 15K samples. Suppose each epoch likewise covers all 15K samples; then the original 1st warm-up epoch becomes 2M / 15K ≈ 133 epochs. The same holds for the 2nd and 3rd warm-up epochs, and likewise for the original 4-10, 10-20, 20-30, and post-30 epochs of training.

So according to this calculation, the warm-up schedule would be roughly: 133 epochs for each of the three warm-up stages, then about 933 epochs at the epoch-4-10 rate, about 1333 epochs at the epoch-10-20 rate, and so on.
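For concreteness, here is that scaling arithmetic as a minimal Python sketch (the stage lengths are my reading of the schedule described above, not values taken from the repo):

```python
# Sketch of the scaling arithmetic, assuming one epoch covers the whole dataset.
full_set = 2_000_000          # AudioSet full-set samples per epoch
small_set = 15_000            # your dataset size

scale = full_set / small_set  # ~133: small-set epochs per original AudioSet epoch
audioset_stages = [1, 1, 1, 7, 10]  # warm-up epochs 1-3, then epochs 4-10 and 10-20

print([round(n * scale) for n in audioset_stages])
# -> [133, 133, 133, 933, 1333]
```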

Of course, you probably think this is too much warm-up. In my experience, you don't have to use that many epochs. Perhaps 50 epochs for the first stage, 50 for the second, 50 for the third, then 300 and 500 epochs for the later stages are enough. The model then needs about 1000 epochs of training in total, since a smaller dataset usually requires fewer epochs to get near convergence.

When I trained the model on the balanced set of AudioSet (20K samples), I used a 100 / 100 / 100 / 300 / 500 epoch warm-up schedule, and it gave me the SOTA mAP.
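As a rough illustration, a staged warm-up like this can be written as a plain `LambdaLR`. This is only a sketch of the idea, not the repo's training code; the variable names mirror the lr_scheduler_epoch / lr_rate entries from the question, and the LR multipliers are placeholder values I chose for the example:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Cumulative stage boundaries follow the balanced-set recipe above
# (100 / 100 / 100 / 300 / 500 epochs); the multipliers are illustrative.
lr_scheduler_epoch = [100, 200, 300, 600, 1100]
lr_rate = [0.02, 0.05, 0.1, 0.5, 1.0]

def warmup_scale(epoch):
    # Multiplier applied to the base learning rate for this epoch.
    for boundary, rate in zip(lr_scheduler_epoch, lr_rate):
        if epoch < boundary:
            return rate
    return 1.0  # warm-up finished: train at the full base learning rate

model = torch.nn.Linear(64, 527)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = LambdaLR(optimizer, lr_lambda=warmup_scale)

for epoch in range(1100):
    # ... run one training epoch here ...
    scheduler.step()
```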

Hope this calculation helps you figure out how to set up the warm-up.

RetroCirce commented 2 years ago

Oh, sorry, I forgot to mention: this is based on a single GPU. If you train on multiple GPUs, you should recalculate accordingly.

EmreOzkose commented 2 years ago

I am using a single GPU, thank you so much.