Closed: aleksarias closed this issue 2 years ago.
Hi, can you please explain this in more detail? Is the generator returning NaN from the very first iteration, or does it diverge after a few iterations? Can you also try with batch_size=1? Please let me know whether that works.
Hi Gourav,
I am facing the same issue while training on a custom dataset. I've tried a batch size of 1 (image attached below): for the first sample, the coarse network has a valid reconstruction loss, but the refinement network's loss goes to NaN. From the second sample onwards, everything is NaN. I've also tried swapping the discriminator for a custom one I often use, but the behaviour remains the same, which suggests the problem is inside the generator.
Any insight into this would be of great help.
Thanks!
Hi @abbhinavvenkat,
Can you share a few samples from the dataset, and also let me know the command you used, so that I can try to replicate the problem?
Thank you
Just an update,
I used `tf.debugging.enable_check_numerics()`, which produced the stack trace below:
So, if D contains a 0, the pow operation computes 1/sqrt(0), which is Inf and propagates as NaN through the rest of the network. I've added an epsilon of 1e-8 inside both `math.pow` operations in the hypergraph layer, and the model now trains correctly.
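The failure mode and the epsilon fix can be illustrated with a small NumPy sketch. The function and variable names here are illustrative, not the repository's actual code; the point is only that raising a zero degree to the power -0.5 produces Inf, while adding a tiny epsilon keeps the result finite:

```python
import numpy as np

def degree_inv_sqrt(degrees, eps=0.0):
    # Computes D^{-1/2} elementwise, as done when normalizing a
    # (hyper)graph degree matrix. A zero degree gives 0 ** -0.5 = inf,
    # which turns into NaN once multiplied by other terms downstream.
    return np.power(degrees + eps, -0.5)

degrees = np.array([3.0, 0.0, 5.0])  # one node/hyperedge has degree 0

bad = degree_inv_sqrt(degrees)           # contains inf at index 1
good = degree_inv_sqrt(degrees, 1e-8)    # finite everywhere
```

The same one-line change (adding `eps` inside the power) is what resolves the NaN in the hypergraph layer.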
Hi @abbhinavvenkat,
Thanks for figuring it out.
I have also made this change in the repository, so it should now work correctly on custom datasets too.
I tried to train this model on new data and I'm seeing NaN returned for the loss. I added some debugging prints, and it looks like the generator is returning NaNs. I added the following lines to check for this:
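(The exact lines were not included in the extracted thread. A check of this kind is typically a short helper like the hypothetical sketch below, which tests whether any element of a generator output is NaN; it is not the poster's actual code.)

```python
import numpy as np

def contains_nan(array):
    # Returns True if any element of the array is NaN.
    # For a TensorFlow tensor, call .numpy() first (or use
    # tf.reduce_any(tf.math.is_nan(t)) directly in-graph).
    return bool(np.isnan(np.asarray(array)).any())

fake_output = np.array([[0.1, float("nan")], [0.3, 0.4]])
clean_output = np.array([[0.1, 0.2], [0.3, 0.4]])
```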
Are there any additional steps required to train this model on a new dataset? The command I'm using is: