Equationliu / Kangaroo

Implementation of Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
https://arxiv.org/abs/2404.18911

Encountering NaN output at a specific batch ID every run, and no change observed upon adjusting the learning rate #5

Closed · Zerohclmax closed this issue 3 months ago

Zerohclmax commented 3 months ago

Subject

Encountering NaN output at a specific batch ID every run, and no change observed upon adjusting the learning rate

Detailed Description

I downloaded this deep learning training code from GitHub and attempted to train my model with it. Unfortunately, the loss becomes NaN at the same batch ID on every run of the training. This happens regardless of how I change the input data or the initialization state.
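
For context, this is roughly how I catch the failing step. It is only a minimal, self-contained sketch of my training loop; the model, optimizer, and data here are dummies standing in for my actual setup, not code from this repository:

```python
import torch
import torch.nn as nn

# Dummy model/optimizer standing in for my actual training objects.
model = nn.Linear(8, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Fake "dataloader": batch 3 deliberately contains an inf to show how
# the checks below catch a bad batch.
batches = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(6)]
batches[3][0][0, 0] = float("inf")

for batch_id, (inputs, targets) in enumerate(batches):
    # Check the raw batch before it enters the model, to rule out
    # corrupt input data at this batch ID.
    if not torch.isfinite(inputs).all():
        print(f"non-finite values in inputs at batch {batch_id}")

    loss = criterion(model(inputs), targets)

    # Stop at the first step where the loss goes non-finite so the
    # offending batch can be inspected.
    if not torch.isfinite(loss):
        print(f"non-finite loss at batch {batch_id}")
        break

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```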

Additionally, I tried adjusting the learning rate to work around the issue, but curiously, there was no observable effect: the loss values were identical. I have confirmed that the new learning rate is correctly accepted and set in the code, yet the problem persists.
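
For completeness, this is how I checked that the optimizer really uses the new learning rate at step time. It is a generic `torch.optim` check with a dummy optimizer, not code from this repository:

```python
import torch

# Dummy optimizer standing in for the one in my training script.
params = [torch.nn.Parameter(torch.randn(2))]
optimizer = torch.optim.AdamW(params, lr=5e-4)

# Print the learning rate each parameter group actually uses; if a
# scheduler or a config override silently reset it, it would show here.
for i, group in enumerate(optimizer.param_groups):
    print(f"param group {i}: lr = {group['lr']}")
```

In my runs this prints the adjusted value, which is why I believe the setting itself is being accepted.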

Request for Help

I would like to understand what else could cause this issue and whether there are recommended debugging strategies or fixes. If other developers have encountered similar problems and found solutions, I would greatly appreciate it if you could share them.
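
For reference, the next thing I plan to try is PyTorch's built-in anomaly detection; this is just a generic sketch, not code from this repository:

```python
import torch

# Makes backward() raise an error pointing at the forward op that
# produced the NaN, instead of failing silently. It slows training
# considerably, so it is only worth enabling while debugging.
torch.autograd.set_detect_anomaly(True)
```

If the culprit turns out to be an exploding gradient, `torch.nn.utils.clip_grad_norm_` would be a candidate mitigation.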

Thank you for your time and assistance!