MeteoSwiss / ldcast

Latent diffusion for generative precipitation nowcasting
Apache License 2.0

NaN or Inf found in input tensor. #7

Closed: bugsuse closed this issue 1 year ago

bugsuse commented 1 year ago

Hi, I was running train_autoenc.py with the default hyperparameters and encountered the error below, after which training stopped. Would you mind helping with this?

Epoch 11: : 1200it [21:02,  1.05s/it, loss=0.0638, v_num=0, val_loss=0.054, val_rec_loss=0.0454, val_kl_loss=0.819]
Metric val_rec_loss improved by 0.010 >= min_delta = 0.0. New best score: 0.045
Epoch 12: : 600it [11:53,  1.19s/it, loss=nan, v_num=0, val_loss=0.054, val_rec_loss=0.0454, val_kl_loss=0.819]
NaN or Inf found in input tensor.
...
Epoch 12: : 1200it [20:38,  1.03s/it, loss=nan, v_num=0, val_loss=0.054, val_rec_loss=0.0454, val_kl_loss=0.819]
NaN or Inf found in input tensor.
Epoch 12: : 1200it [20:40,  1.03s/it, loss=nan, v_num=0, val_loss=nan.0, val_rec_loss=nan.0, val_kl_loss=nan.0]
Monitored metric val_rec_loss = nan is not finite. Previous best value was 0.045. Signaling Trainer to stop.
Epoch 12: : 1200it [20:40,  1.03s/it, loss=nan, v_num=0, val_loss=nan.0, val_rec_loss=nan.0, val_kl_loss=nan.0]
jleinonen commented 1 year ago

This seems a bit strange. I'm trying to see on my system if I can reproduce it. Meanwhile, are you able to restart from a checkpoint and see if the problem occurs again?

jleinonen commented 1 year ago

Meanwhile, in the above commit I added an easy option to continue training the autoencoder from a checkpoint (this option already existed for the diffusion model training).

bugsuse commented 1 year ago

Thanks for the suggestions! @jleinonen

I restarted from a checkpoint using the command below, and the problem occurred again:

time python train_autoenc.py --ckpt_path lightning_logs/version_0/checkpoints/epoch\=11-val_rec_loss\=0.0454.ckpt

I added a print statement to help debug it:

    (raw, _) = split.train_valid_test_split(raw, var, chunks=chunks)

    # debug print: inspect the scale array used to build the rain-rate transform
    print('RZC scale: ', raw["train"][var]["scale"])
    variables[var]["transform"] = transform.default_rainrate_transform(
        raw["train"][var]["scale"]
    )

The output is as follows:

Loading data...
RZC scale: [0.00000000e+00 3.52649689e-02 7.17734098e-02 1.09569430e-01
 1.48698330e-01 1.89207077e-01 2.31144428e-01 2.74560571e-01
 3.19507957e-01 3.66040230e-01 4.14213538e-01 4.64085698e-01
 5.15716553e-01 5.69168210e-01 6.24504805e-01 6.81792855e-01
 7.41101146e-01 8.02500963e-01 8.66065979e-01 9.31872606e-01
 1.00000000e+00 1.07052994e+00 1.14354682e+00 1.21913886e+00
 1.29739666e+00 1.37841415e+00 1.46228886e+00 1.54912114e+00
 1.63901591e+00 1.73208046e+00 1.82842708e+00 1.92817140e+00
 2.03143311e+00 2.13833642e+00 2.24900961e+00 2.36358571e+00
 2.48220229e+00 2.60500193e+00 2.73213196e+00 2.86374521e+00
 3.00000000e+00 3.14105988e+00 3.28709364e+00 3.43827772e+00
 3.59479332e+00 3.75682831e+00 3.92457771e+00 4.09824228e+00
 4.27803183e+00 4.46416092e+00 4.65685415e+00 4.85634279e+00
 5.06286621e+00 5.27667284e+00 5.49801922e+00 5.72717142e+00
 5.96440458e+00 6.21000385e+00 6.46426392e+00 6.72749043e+00
 7.00000000e+00 7.28211975e+00 7.57418728e+00 7.87655544e+00
 8.18958664e+00 8.51365662e+00 8.84915543e+00 9.19648457e+00
 9.55606365e+00 9.92832184e+00 1.03137083e+01 1.07126856e+01
 1.11257324e+01 1.15533457e+01 1.19960384e+01 1.24543428e+01
 1.29288092e+01 1.34200077e+01 1.39285278e+01 1.44549809e+01
 1.50000000e+01 1.55642395e+01 1.61483746e+01 1.67531109e+01
 1.73791733e+01 1.80273132e+01 1.86983109e+01 1.93929691e+01
 2.01121273e+01 2.08566437e+01 2.16274166e+01 2.24253712e+01
 2.32514648e+01 2.41066914e+01 2.49920769e+01 2.59086857e+01
 2.68576183e+01 2.78400154e+01 2.88570557e+01 2.99099617e+01
 3.10000000e+01 3.21284790e+01 3.32967491e+01 3.45062218e+01
 3.57583466e+01 3.70546265e+01 3.83966217e+01 3.97859383e+01
 4.12242546e+01 4.27132874e+01 4.42548332e+01 4.58507423e+01
 4.75029297e+01 4.92133827e+01 5.09841537e+01 5.28173714e+01
 5.47152367e+01 5.66800308e+01 5.87141113e+01 6.08199234e+01
 6.30000000e+01 6.52569580e+01 6.75934982e+01 7.00124435e+01
 7.25166931e+01 7.51092529e+01 7.77932434e+01 8.05718765e+01
 8.34485092e+01 8.64265747e+01 8.95096664e+01 9.27014847e+01
 9.60058594e+01 9.94267654e+01 1.02968307e+02 1.06634743e+02
 1.10430473e+02 1.14360062e+02 1.18428223e+02 1.22639847e+02
 1.27000000e+02 1.31513916e+02 1.36186996e+02 1.41024887e+02
 1.46033386e+02 1.51218506e+02 1.56586487e+02 1.62143753e+02
 1.67897018e+02 1.73853149e+02 1.80019333e+02 1.86402969e+02
 1.93011719e+02 1.99853531e+02 2.06936615e+02 2.14269485e+02
 2.21860947e+02 2.29720123e+02 2.37856445e+02 2.46279694e+02
 2.55000000e+02 2.64027832e+02 2.73373993e+02 2.83049774e+02
 2.93066772e+02 3.03437012e+02 3.14172974e+02 3.25287506e+02
 3.36794037e+02 3.48706299e+02 3.61038666e+02 3.73805939e+02
 3.87023438e+02 4.00707062e+02 4.14873230e+02 4.29538971e+02
 4.44721893e+02 4.60440247e+02 4.76712891e+02 4.93559387e+02
 5.11000000e+02 5.29055664e+02 5.47747986e+02 5.67099548e+02
 5.87133545e+02 6.07874023e+02 6.29345947e+02 6.51575012e+02
 6.74588074e+02 6.98412598e+02 7.23077332e+02 7.48611877e+02
 7.75046875e+02 8.02414124e+02 8.30746460e+02 8.60077942e+02
 8.90443787e+02 9.21880493e+02 9.54425781e+02 9.88118774e+02
 1.02300000e+03 1.05911133e+03 1.09649597e+03 1.13519910e+03
 1.17526709e+03 1.21674805e+03 1.25969189e+03 1.30415002e+03
 1.35017615e+03 1.39782520e+03 1.44715466e+03 1.49822375e+03
 1.55109375e+03 1.60582825e+03 1.66249292e+03 1.72115588e+03
 1.78188757e+03 1.84476099e+03 1.90985156e+03 1.97723755e+03
 2.04700000e+03 2.11922266e+03 2.19399194e+03 2.27139819e+03
 2.35153418e+03 2.43449609e+03 2.52038379e+03 2.60930005e+03
 2.70135229e+03 2.79665039e+03 2.89530933e+03 2.99744751e+03
 3.10318750e+03 3.21265649e+03 3.32598584e+03 3.44331177e+03
 3.56477515e+03 3.69052197e+03 3.82070312e+03 3.95547510e+03
 4.09500000e+03 4.23944531e+03 4.38898389e+03 4.54379639e+03
 4.70406836e+03 4.86999219e+03 5.04176758e+03 5.21960010e+03
 5.40370459e+03 5.59430078e+03 5.79161865e+03            nan
            nan            nan            nan            nan]
/public/home/ldcast/features/transform.py:80: RuntimeWarning: divide by zero encountered in log10
  log_scale = np.log10(scale).astype(np.float32)
Loading cached sampler from ../cache/sampler_autoenc_valid.pkl.
Loading cached sampler from ../cache/sampler_autoenc_test.pkl.
Loading cached sampler from ../cache/sampler_autoenc_train.pkl.

I found that the RZC scale contains NaN values. Could this be the cause?

jleinonen commented 1 year ago

The rain rates are stored as 8-bit unsigned int values that are then translated to physical values in mm/h using the scale array. It is true that the last elements of scale are left at nan but this is because these values should never occur in the 8-bit data. I have never seen a problem that the actual inputs to the training would contain nan, so I'm a bit puzzled by this. Could you verify by drawing samples from the datamodule and checking with e.g. np.isfinite(x).all()?
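
A minimal sketch of such a check (assuming the datamodule exposes a standard PyTorch Lightning train_dataloader() and that a batch is a tensor or a nested list/tuple of tensors; the helper names below are illustrative, not part of ldcast):

    import numpy as np
    import torch

    def all_finite(obj):
        # Recursively check every tensor/array in a (possibly nested) batch.
        if isinstance(obj, torch.Tensor):
            return bool(torch.isfinite(obj).all())
        if isinstance(obj, np.ndarray):
            return bool(np.isfinite(obj).all())
        if isinstance(obj, (list, tuple)):
            return all(all_finite(o) for o in obj)
        return True  # ignore non-array items such as metadata

    def check_batches(datamodule, num_batches=50):
        # Draw a few batches and report the first one containing NaN/Inf.
        for i, batch in enumerate(datamodule.train_dataloader()):
            if i >= num_batches:
                break
            if not all_finite(batch):
                print(f"Non-finite values found in batch {i}")
                return False
        print("All checked batches are finite.")
        return True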

Meanwhile I re-ran the autoencoder training and I saw that around epoch 10 the training loss spikes. In the worst cases this can cause the loss to go to nan, while in other cases it recovers quickly. And it seems that after this happens once, it does not occur again. It's as if the network somehow reorganizes itself. I recall now that I found the same thing happening back in October-November when I was first training the autoencoder.

bugsuse commented 1 year ago

> The rain rates are stored as 8-bit unsigned int values that are then translated to physical values in mm/h using the scale array. It is true that the last elements of scale are left at nan but this is because these values should never occur in the 8-bit data. I have never seen a problem that the actual inputs to the training would contain nan, so I'm a bit puzzled by this. Could you verify by drawing samples from the datamodule and checking with e.g. np.isfinite(x).all()?

Thanks for the explanation and suggestion. I will try to check it.

> Meanwhile I re-ran the autoencoder training and I saw that around epoch 10 the training loss spikes. In the worst cases this can cause the loss to go to nan, while in other cases it recovers quickly. And it seems that after this happens once, it does not occur again. It's as if the network somehow reorganizes itself. I recall now that I found the same thing happening back in October-November when I was first training the autoencoder.

I restarted from the checkpoint again using the command below, and now it works fine. It's really strange.

time python train_autoenc.py --ckpt_path lightning_logs/version_0/checkpoints/epoch\=11-val_rec_loss\=0.0454.ckpt

In addition, I guess this may be because np.log10(scale) is not finite where the scale is 0, so I replaced log_scale = np.log10(scale).astype(np.float32) in ldcast/features/transform.py with log_scale = np.log10(scale + 1).astype(np.float32). I ran train_autoenc.py again and it also works fine.
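
For reference, a quick NumPy check with a few illustrative entries from the printed scale (not the full array) shows what both versions do with the zero and NaN entries:

    import numpy as np

    # A few illustrative entries from the printed RZC scale: a zero,
    # two ordinary rain rates, the largest finite value, and the NaN padding.
    scale = np.array([0.0, 0.0352649689, 1.0, 5791.61865, np.nan])

    log_scale = np.log10(scale)         # -inf for 0 (the divide-by-zero warning); NaN stays NaN
    log_scale_p1 = np.log10(scale + 1)  # 0 for the zero entry; the NaN entries are still NaN

    print(log_scale)     # approx. [-inf, -1.453, 0.0, 3.763, nan]
    print(log_scale_p1)  # approx. [0.0, 0.015, 0.301, 3.763, nan]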

jleinonen commented 1 year ago

Note that a couple of lines below

log_scale = np.log10(scale).astype(np.float32)

we have

log_scale[~np.isfinite(log_scale)] = np.log10(fill_value)

which should ensure that all values in log_scale are finite. (The default is fill_value=0, but default_rainrate_transform sets it to 0.02, which has a finite logarithm.)
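
A minimal NumPy sketch reproducing that logic (the scale entries here are illustrative; fill_value=0.02 is the value set by default_rainrate_transform):

    import numpy as np

    fill_value = 0.02  # as set by default_rainrate_transform
    scale = np.array([0.0, 0.0352649689, 1.0, 5791.61865, np.nan])  # illustrative entries

    log_scale = np.log10(scale).astype(np.float32)
    # Replace both the -inf from log10(0) and the NaN from the NaN padding.
    log_scale[~np.isfinite(log_scale)] = np.log10(fill_value)

    assert np.isfinite(log_scale).all()
    print(log_scale)  # approx. [-1.699, -1.453, 0.0, 3.763, -1.699]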

bugsuse commented 1 year ago

> Note that a couple of lines below
>
> log_scale = np.log10(scale).astype(np.float32)
>
> we have
>
> log_scale[~np.isfinite(log_scale)] = np.log10(fill_value)
>
> which should ensure that all values in log_scale are finite. (The default is fill_value=0, but default_rainrate_transform sets it to 0.02, which has a finite logarithm.)

Thanks for the explanation! @jleinonen