YiwenShaoStephen / pychain_example


Error while training TDNN in stage 5 #13

Open shuuennokage opened 3 years ago

shuuennokage commented 3 years ago

Hello, I encountered a problem these days while running the training scripts. After initializing the dataset and model, the model started to train and a lot of errors popped up, as shown below:

[screenshot: error messages during training]

(I am using almost the same settings as the mini_librispeech example, just with a different dataset.)

edit: I read the closed issue "Loss nan", which describes the same problem, but in my case it happens all over my training set. I also checked my data and it seems alright; what could I do to remove these errors?

shuuennokage commented 3 years ago

The "Loss nan" issue says that samples having too few #frames will cause this error (similar to the case in CTC where input_length < target_length). This problem occurs when I use my own dataset, no matter the dataset is monolingual or multilingual. How could I check where the error is occurring by myself? I'm not sure if the data or the settings are causing these problems.

shuuennokage commented 3 years ago

Hello, I dug deeper into the code and found that the tensors become NaN after passing through the TDNN (1D dilated convolution) layers. The input tensors still have finite values before going into the TDNN layers:

[screenshot: tensor values before the TDNN layer]

but after passing through the TDNN layer, the tensors become NaN:

[screenshot: tensor values after the TDNN layer]

I modified the code inside model/tdnn.py to see when the tensors turn into NaN:

[screenshot: debug output inside the TDNN layer]

Another odd point is that this problem only appears from step 1 onward, not on step 0. However, the log-prob-deriv sum is already NaN and the loss is inf on step 0:

[screenshot: log-prob-deriv sum and loss at step 0]
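Instead of editing model/tdnn.py by hand, one way to localize this kind of failure is to register forward hooks that report the first module whose output contains NaN or Inf. A minimal sketch, assuming `model` is the network being trained (not something the repo provides out of the box):

```python
import torch

def add_nan_hooks(model):
    """Print the name of any module whose output contains NaN/Inf during a forward pass."""
    def make_hook(name):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            if torch.is_tensor(out) and not torch.isfinite(out).all():
                print(f"non-finite output detected in module: {name}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# usage (assumption: `model` is the TDNN built by the training script)
# add_nan_hooks(model)
```

The first module reported during a forward pass is the earliest point where non-finite values appear, which usually distinguishes "bad weights going in" from "bad activations coming out".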

I tried modifying the learning rate and batch size, but nothing changed. What should I do to train properly on my own dataset? Maybe there is something wrong with my data?

YiwenShaoStephen commented 3 years ago

Hi, thanks for the detailed information. It looks like you get NaN gradients at step 0 (your loss for step 0 is inf), which then puts NaN into your network's parameters and makes all activations NaN in the following steps. I suggest you look into the values at step 0 to see how this happens. Since you didn't get any error information from step 0, I suspect there might be an issue in computing the loss.
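One way to act on this suggestion is to inspect the loss and the parameter gradients right after `backward()` at step 0, and skip the optimizer update when anything is non-finite so the NaNs never reach the weights. A rough sketch; the names `model`, `optimizer`, `loss`, and `step` are placeholders for whatever the training loop in this repo actually uses:

```python
import math
import torch

# ... inside the training loop, after loss.backward() ...
if not math.isfinite(loss.item()):
    print(f"non-finite loss at step {step}: {loss.item()}")

bad = [name for name, p in model.named_parameters()
       if p.grad is not None and not torch.isfinite(p.grad).all()]
if bad:
    print(f"non-finite gradients at step {step} in: {bad}")
    optimizer.zero_grad()   # drop this update so NaNs don't enter the weights
else:
    optimizer.step()
```

This doesn't fix the underlying loss problem, but it isolates step 0 as the source and keeps later steps interpretable.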

shuuennokage commented 3 years ago

Hello, I actually did get an error message during step 0 (although the loss is inf, not NaN):

[screenshot: error message at step 0]

I got this error message right after the model structure (layers and # of parameters) is printed.

Anyway, thank you very much for the reply! I'll try looking into the values and the loss computation to see where the problem is.
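For tracing where the inf originates inside the loss computation, PyTorch's built-in anomaly detection can also help: it makes `backward()` raise an error with a traceback of the forward operation that produced a non-finite gradient, at the cost of much slower steps. A minimal sketch, intended for debugging runs only:

```python
import torch

# Debugging only: anomaly detection slows training down considerably.
torch.autograd.set_detect_anomaly(True)

# ... run one or a few training steps as usual; if backward() encounters
# NaN/Inf gradients, PyTorch raises a RuntimeError that points at the
# forward operation responsible.
```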

shuuennokage commented 3 years ago

Hello, after several rounds of trial and error, I decided to decrease the size of the dataset to speed up my debugging process (from 111656 to 19874 utterances). And... the bug just disappeared by itself; I have no idea what happened.

[screenshot: training output with no errors]

No more error messages. I've deleted the debug output, so nothing extra is printed, and everything is fine now.

But now I have a new question: why was this happening, and why did decreasing the size of the training set fix it? Is this related to the batch size? (I did some research and found that this kind of problem can be solved by decreasing the lr or increasing the batch size, so what I did was actually increase the batch size, not just decrease the size of the training set.)
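If the instability really is tied to the learning rate and batch size, a common guard is to clip the gradient norm before each optimizer step so that a few extreme utterances cannot blow up the weights. A sketch under the same assumptions as above (generic `model`/`optimizer` training-loop variables); the threshold of 5.0 is only an example value, not a recommendation from this repo:

```python
import torch

# ... after loss.backward(), before the parameter update ...
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```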

shuuennokage commented 3 years ago

Hello, I did some experiments and found that the system crashes when the training data reaches about 80k~90k utterances. Once the amount of data reaches that size, the system crashes no matter how I adjust the lr and batch size. Is this example only suitable for small training sets? Also, the same settings work on my GTX 1080 Ti but crash on a Titan RTX; I don't know if that is related.