GouravWadhwa / Hypergraphs-Image-Inpainting

(WACV 2021) Hyperrealistic Image Inpainting with Hypergraphs

Loss is NaN When Training on New Dataset #12

Closed aleksarias closed 2 years ago

aleksarias commented 3 years ago

I tried to train this model using new data and I'm seeing NaN being returned for loss. I added some debugging prints and it looks like the generator is returning NaNs. I added the following lines to check for this:

```python
import tensorflow as tf  # needed for the checks below

# Raise as soon as either generator output contains a NaN or inf.
tf.debugging.check_numerics(prediction_coarse, message=f'{prediction_coarse=}')
tf.debugging.check_numerics(prediction_refine, message=f'{prediction_refine=}')
```
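For context, tf.debugging.check_numerics raises a tf.errors.InvalidArgumentError (tagged with the given message) as soon as the tensor contains a NaN or inf. A minimal, hypothetical sketch of where such checks might sit around the generator call; `generator`, `masked_images`, and `masks` are placeholder names, not the repository's actual variables:

```python
import tensorflow as tf

def check_generator_outputs(generator, masked_images, masks):
    # Hypothetical forward pass: the generator returns a coarse and a refined prediction.
    prediction_coarse, prediction_refine = generator([masked_images, masks])
    # Each check raises tf.errors.InvalidArgumentError at the first NaN/inf value.
    tf.debugging.check_numerics(prediction_coarse, message='prediction_coarse')
    tf.debugging.check_numerics(prediction_refine, message='prediction_refine')
    return prediction_coarse, prediction_refine
```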

Are there any additional steps required to train this model on a new dataset? The command I'm using is:

```bash
python training.py --random_mask 1 --batch_size 4 --train_dir /home/alex/JupterNotebooks/MyImages/test_folder/test/ --dataset my-scans
```
GouravWadhwa commented 2 years ago

Hi, can you please explain this in more detail? Is the generator returning NaN from the very first iteration, or does it diverge after a few iterations? Can you also try with batch_size=1? Please let me know whether it works.

abbhinavvenkat commented 2 years ago

Hi Gourav,

I am facing the same issue while training on a custom dataset. I've tried a batch size of 1 (screenshot attached below): for the first sample the coarse network has a valid reconstruction loss, but the refinement network's loss goes to NaN, and from the second sample onwards everything is NaN. I've also tried swapping in a custom discriminator I often use, but the behaviour remains the same, which suggests the problem is somewhere inside the generator.

Any insight into this would be of great help.

Thanks!

[screenshot: training log with batch size 1, showing the refinement loss going to NaN]

GouravWadhwa commented 2 years ago

Hi @abbhinavvenkat,

Can you share a few samples from the dataset, and also let me know the command you used, so that I can try to replicate the problem?

Thank you

abbhinavvenkat commented 2 years ago

Just an update,

I used tf.debugging.enable_check_numerics(), which gave the stack trace below:

[screenshot: stack trace from the numerics check, pointing to the math.pow operations in the hypergraph layer]
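For reference, tf.debugging.enable_check_numerics() only needs to be called once before the model is built or run; a minimal sketch (placing it at the top of training.py is an assumption, not the repository's actual layout):

```python
import tensorflow as tf

# Enable op-level checking: any op that produces an inf or NaN afterwards
# raises an error whose stack trace points at the offending operation.
tf.debugging.enable_check_numerics()
```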

So, if D has a 0 in it, the pow operation computes 1/sqrt(0), which blows up and ends up as NaN. I've added a 10^-8 to both of the math.pow operations in the hypergraph layer, and the model is training correctly now.
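A minimal sketch of the kind of change described above, assuming the layer raises a degree-like tensor D to a negative fractional power; the names here are illustrative, not the repository's exact code:

```python
import tensorflow as tf

EPS = 1e-8  # small offset, as described above

def safe_inv_sqrt(degree):
    # Before: tf.math.pow(degree, -0.5) -> inf when degree == 0,
    # which then propagates as NaN through the rest of the network.
    return tf.math.pow(degree + EPS, -0.5)
```

The same offset goes into both math.pow calls in the hypergraph layer.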

GouravWadhwa commented 2 years ago

Hi @abbhinavvenkat,

Thanks for figuring it out.

I have made the same change in this repository. I hope it now works correctly for custom datasets as well.