Closed cpyang123 closed 1 year ago
In light of this problem, we'll try some of the following options:
After a lengthy investigation, we found that the NaNs were produced by zero weights and biases in the embedding layer. After adding gradient clipping and decreasing the optimizer's weight decay to 0.00001, the problem was mitigated, and the model was able to run for 100 epochs and converge:
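For reference, here is a minimal, framework-free sketch of the two mitigations described above: clipping gradients by their global L2 norm and applying a small weight decay of 1e-5. The function names, the plain SGD update, the learning rate, and the `max_norm` value are illustrative assumptions, not the project's actual training code. If the project uses PyTorch, `torch.nn.utils.clip_grad_norm_` and the optimizer's `weight_decay` argument cover the same ground.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale gradients so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if not math.isfinite(total_norm):
        # A NaN/inf norm means the gradients are already corrupted;
        # clipping cannot repair that, so fail loudly instead.
        raise ValueError("non-finite gradient norm")
    # Small epsilon avoids division by zero when all gradients are zero.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


def sgd_step(params, grads, lr=0.01, weight_decay=1e-5, max_norm=1.0):
    """One SGD update with gradient clipping and L2-style weight decay.

    Hypothetical helper: the decay term is added to the clipped gradient,
    which is one common convention, not necessarily the one used here.
    """
    clipped, _ = clip_by_global_norm(grads, max_norm)
    return [p - lr * (g + weight_decay * p) for p, g in zip(params, clipped)]
```

Logging the pre-clipping norm returned by `clip_by_global_norm` at each step is a cheap way to see whether gradients blow up shortly before the loss turns into NaN.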
For the record, multiple other changes were made to the models:
During training of the CRNN for beat prediction, the training loss gradually decreased from a very large value (around 300) and then became NaN.
The output predictions are also dubious:
This may be caused by a normalization problem or by something in the loss function. We'll need to experiment with both to isolate the issue.
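One way to narrow this down is to check, after each step, which quantities already contain NaN or infinite values (inputs after normalization, the loss, individual layer parameters). The sketch below is a framework-free illustration; `find_non_finite` and the entry names in the usage example are hypothetical and not taken from the project.

```python
import math


def find_non_finite(named_values):
    """Return the names of all entries that contain a NaN or infinite value.

    named_values maps a label (e.g. a layer or tensor name) to a flat
    iterable of floats.
    """
    bad = []
    for name, values in named_values.items():
        if any(not math.isfinite(v) for v in values):
            bad.append(name)
    return bad


# Usage with made-up values: only the entry holding a NaN is reported.
snapshot = {
    "normalized_input": [0.1, -0.3, 0.7],
    "embedding.weight": [0.0, float("nan")],
    "loss": [300.0],
}
print(find_non_finite(snapshot))
```

Running a check like this on the first batch where the loss becomes NaN shows whether the bad values enter through the data, the parameters, or the loss computation itself.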