RetroCirce / HTS-Audio-Transformer

The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"
https://arxiv.org/abs/2202.00874
MIT License

Stopped during Audioset training #1

Closed kimsojeong1225 closed 2 years ago

kimsojeong1225 commented 2 years ago

Hi, thanks for your good study. I tried to train on AudioSet using your code. After about 10% of epoch 2, training stops making progress (it always hangs at this same point). The process doesn't shut down; it just stops working. Can you tell me why this happens? I've attached a screenshot of the log monitor.

kimsojeong1225 commented 2 years ago

Screenshot from 2022-02-23 11-32-00

RetroCirce commented 2 years ago

Hi,

I remember a case where, when training in multi-GPU (DDP) mode, the model would get stuck at the checkpoint-saving step. You can check `test_epoch_end` and `validation_epoch_end` in the file sed_model.py (the SED wrapper class).

It is a strange problem with PyTorch Lightning. I once hit it when the trainer tried to save a checkpoint that exceeded the maximum number of kept checkpoints. For example, the code wants to save the 11th model, which beats one of the saved top-10 checkpoints, and something goes wrong when replacing the old ckpt file. You can try changing the top-k number in the pl.Trainer setup in main.py and see whether the step where it gets stuck changes (I guess it will also change).
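As a quick way to test this hypothesis, the top-k limit is usually controlled through Lightning's `ModelCheckpoint` callback. A minimal sketch (the monitored metric name `"mAP"` here is an assumption, not necessarily what the repo uses):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Raise save_top_k so the "replace an old top-k checkpoint" code path
# is hit later (or never, with save_top_k=-1 to keep everything).
checkpoint_callback = ModelCheckpoint(
    monitor="mAP",      # assumed metric name; match whatever main.py logs
    mode="max",
    save_top_k=20,      # default in many setups is much smaller
)

trainer = pl.Trainer(callbacks=[checkpoint_callback])
```

If the hang moves from epoch 2 to a later epoch after raising `save_top_k`, that points to the checkpoint-replacement step as the culprit.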

If so, here are my suggestions: (1) You can revise my code in test_epoch_end/validation_epoch_end so it can be trained on a single card (just add a branch for `device_num == 1` that computes the metric directly without using `dist.all_gather`), and see whether this problem happens again.
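The single-card branch could look like the following sketch. `gather_predictions` and `device_num` are hypothetical names for illustration; the real code lives in the epoch-end hooks of the SED wrapper:

```python
import torch


def gather_predictions(pred: torch.Tensor, device_num: int = 1) -> torch.Tensor:
    """Collect per-device predictions before computing the epoch metric.

    On a single card there is nothing to gather, so we skip
    dist.all_gather entirely -- that is the suggested workaround.
    """
    if device_num == 1:
        # Single-GPU: use the local predictions directly.
        return pred

    import torch.distributed as dist
    # Multi-GPU (DDP): gather each rank's predictions, then concatenate.
    gathered = [torch.zeros_like(pred) for _ in range(device_num)]
    dist.all_gather(gathered, pred)
    return torch.cat(gathered, dim=0)
```

With this branch in place, the metric computation never touches the distributed collectives on a single card, so the hang (if it is collective-related) cannot occur there.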

(2) You can try installing torch 1.7.0, or changing the naming of the saved checkpoints, to see if it happens again. I remember I fixed this bug using dist.barrier, but I'm not sure it works in other, slightly different environments.
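The dist.barrier fix amounts to letting only rank 0 write the checkpoint while every rank waits at a barrier, so no process races ahead while the file is being replaced. A hedged sketch (the function name and arguments are illustrative, not the repo's actual API):

```python
import torch
import torch.distributed as dist


def save_checkpoint_synced(is_global_zero: bool, state: dict, path: str) -> str:
    """Write a checkpoint from rank 0 only, then synchronize all ranks.

    Without the barrier, non-zero ranks can move on to the next step
    while rank 0 is still replacing the file, which is one way DDP
    training can deadlock at the saving step.
    """
    if is_global_zero:
        # Only rank 0 touches the filesystem.
        torch.save(state, path)
    if dist.is_available() and dist.is_initialized():
        # All ranks wait here until the checkpoint is fully written.
        dist.barrier()
    return path
```

Outside a DDP run (`dist.is_initialized()` is False) the barrier is skipped, so the same helper works on a single card.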