Loss_Nan during training celebA dataset

GouravWadhwa / Hypergraphs-Image-Inpainting

(WACV 2021) Hyperrealistic Image Inpainting with Hypergraphs


Loss_Nan during training celebA dataset #17

Closed by inchulnim123 9 months ago

inchulnim123 commented 1 year ago

Hi Gourav! I trained on the Facades and CelebA datasets using two RTX 3060 GPUs. Training works well on Facades. For CelebA, I randomly selected 28,000 images and trained on those. Epochs 0 to 10 go well, but after that the loss either jumps or becomes NaN. I used:

    D_H = tf.multiply(tf.expand_dims(tf.math.pow(D + 1e-8, -0.5), axis=-1), H)
    B = tf.linalg.diag(tf.math.pow(B + 1e-8, -1))

Why does this happen, and how can I fix it?
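To make the role of the added constant concrete, here is a minimal plain-Python sketch of the two power terms quoted above. It assumes that entries of D and B can reach zero, which the thread does not confirm; the helper names `inv_sqrt` and `inv` are hypothetical and are not part of the repository.

```python
import math

def inv_sqrt(value, eps):
    """Scalar analogue of tf.math.pow(D + eps, -0.5)."""
    return math.pow(value + eps, -0.5)

def inv(value, eps):
    """Scalar analogue of tf.math.pow(B + eps, -1)."""
    return math.pow(value + eps, -1)

# Assumption: an entry of D or B is exactly zero.
for eps in (1e-8, 1e-10):
    print(f"eps={eps:g}  inv_sqrt(0)={inv_sqrt(0.0, eps):.3e}  inv(0)={inv(0.0, eps):.3e}")

# For comparison, an entry of 4.0 is barely affected by eps.
print(f"inv_sqrt(4)={inv_sqrt(4.0, 1e-8):.3e}")
```

Under that assumption, a zero entry is mapped to a value on the order of 1e4 to 1e10 depending on the constant and the exponent, so the constant prevents an outright division by zero but can still produce very large factors. Whether this is what destabilizes the CelebA run is not established in the thread.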

GouravWadhwa commented 1 year ago

Hi, are you saying this only happens when you change the constant from 1e-10 to 1e-8? Also, if you changed anything in the training scripts, can you share them with me? I can take a look and let you know what the problem is.

GouravWadhwa commented 1 year ago

The loss is most probably jumping to NaN because something is being divided by zero somewhere. Can you also call tf.debugging.enable_check_numerics() to find out where?
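A minimal sketch of how that check behaves, assuming TensorFlow 2.x. The function `inverse_sqrt` and the input values are hypothetical and are not taken from the repository; the constant is deliberately left out so that a zero entry produces an infinity.

```python
import tensorflow as tf

# Enable once, before the model runs. From then on, any op whose
# floating-point output contains Inf or NaN raises an error.
tf.debugging.enable_check_numerics()

@tf.function
def inverse_sqrt(values):
    # Hypothetical stand-in for the normalization step, without the
    # added constant, so a zero entry yields an infinite result.
    return tf.math.pow(values, -0.5)

caught = None
try:
    inverse_sqrt(tf.constant([4.0, 0.0]))
except tf.errors.InvalidArgumentError as err:
    caught = err
    # The message names the offending op and includes a stack trace.
    print(str(err)[:500])

# Turn the check off again; it slows down execution noticeably.
tf.debugging.disable_check_numerics()
```

In a training script the call would go near the top, before the training loop starts. The error then points at the first op that produced a non-finite value, rather than at the loss where the NaN is eventually noticed.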

inchulnim123 commented 1 year ago

The NaN loss happened only on the CelebA dataset, both when I trained with D + 1e-10 and when I trained with D + 1e-8. In train_options.py I only changed gpu_options and set gpu_ids -> '0,1', random_mask -> 1, random_mask_type -> irregular_mask, and incremental_training -> 1. batch_size is also 1. On the Facades and Paris StreetView datasets it works really well; only CelebA fails.

GouravWadhwa commented 1 year ago

Can you please try using this: tf.debugging.enable_check_numerics()

inchulnim123 commented 1 year ago

Thanks, I'll try it. Can I use tf.debugging.enable_check_numerics() to check where the NaN appears and then ask again?

GouravWadhwa commented 1 year ago

Sure.