Alexander-H-Liu / dinosr

DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

What is the rough order of magnitude of the loss during pretraining? #2

Open · kssmmm opened 5 months ago

kssmmm commented 5 months ago

I pretrained the model on LibriSpeech 960h and got a loss of about 0.2. However, when I used that checkpoint for fine-tuning on LibriSpeech 100h, I got a dev WER of about 100. Did I make a mistake during the pretraining phase or the fine-tuning phase?

Alexander-H-Liu commented 5 months ago

Hi, your training loss seems too low; it should be around 1.4 after 200k steps and around 1.1 after 400k steps. A very low loss in self-distillation usually means the teacher model has collapsed (constant output regardless of the input), so training degenerates into a trivial task.
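
If it helps with diagnosis, one rough sanity check is to look at how spread out the teacher-side cluster assignments are over a batch: a collapsed teacher maps almost every frame to the same cluster. The snippet below is only a minimal sketch, not code from this repo; it assumes you can extract per-frame cluster IDs (e.g. the argmax over the codebook) from the model, and it just reports the entropy of their empirical distribution.

```python
import torch

def cluster_usage_entropy(assignments: torch.Tensor, num_clusters: int) -> float:
    """Entropy (in nats) of the empirical distribution over cluster IDs.

    Near-zero entropy means almost every frame is assigned to the same
    cluster, i.e. the teacher/codebook has collapsed to a constant output.
    """
    counts = torch.bincount(assignments.flatten(), minlength=num_clusters).float()
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]
    return float(-(nonzero * nonzero.log()).sum())

# Toy example with fake cluster IDs (8 utterances x 500 frames):
healthy = torch.randint(0, 256, (8, 500))          # frames spread over many clusters
collapsed = torch.zeros(8, 500, dtype=torch.long)  # every frame mapped to cluster 0
print(cluster_usage_entropy(healthy, 256))    # close to log(256) ~ 5.5
print(cluster_usage_entropy(collapsed, 256))  # 0.0
```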

kssmmm commented 5 months ago

> Hi, your training loss seems too low; it should be around 1.4 after 200k steps and around 1.1 after 400k steps. A very low loss in self-distillation usually means the teacher model has collapsed (constant output regardless of the input), so training degenerates into a trivial task.

Previously, I had changed the config file from fp16 to bf16 and also reduced the max-token value from 3.8 million to 2.4 million. Now I have changed them back, and the pretraining loss is consistent with what you mentioned. I didn't expect these two parameters to have such a significant impact.
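
In case it is useful to others, both settings live in the pretraining Hydra/OmegaConf config. The sketch below is only an illustration: the file path and the common.fp16 / dataset.max_tokens key layout are assumptions based on the usual fairseq config structure, so check them against the YAML actually shipped in this repo.

```python
from omegaconf import OmegaConf

# Hypothetical path; use whichever pretraining YAML this repo actually ships.
cfg = OmegaConf.load("config/pretraining/dinosr_base_librispeech.yaml")

# Restore the defaults discussed above before relaunching pretraining
# (key names assume the standard fairseq Hydra layout).
cfg.common.fp16 = True            # default precision, instead of bf16
cfg.common.bf16 = False
cfg.dataset.max_tokens = 3_800_000  # default max tokens per batch

OmegaConf.save(cfg, "config/pretraining/dinosr_base_librispeech.yaml")
```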

hadas commented 5 months ago

Hi, I ran into a similar issue with a very low loss and cluster collapse. Except for the batch size (4), I haven't changed anything in the base configuration, and the collapse also happened with the default batch size. What can I do to prevent it?