Closed: bugsuse closed this issue 1 year ago
This seems a bit strange. I'm trying to see on my system if I can reproduce it. Meanwhile, are you able to restart from a checkpoint and see if the problem occurs again?
Meanwhile, in the above commit I added an easy option to continue training the autoencoder from a checkpoint (this option already existed for the diffusion model training).
Thanks for the suggestions! @jleinonen
I restarted from a checkpoint and the problem occurred again, using the command below:
time python train_autoenc.py --ckpt_path lightning_logs/version_0/checkpoints/epoch\=11-val_rec_loss\=0.0454.ckpt
I'm printing some messages to debug it:
(raw, _) = split.train_valid_test_split(raw, var, chunks=chunks)
print('RZC scale: ', raw["train"][var]["scale"])
variables[var]["transform"] = transform.default_rainrate_transform(
    raw["train"][var]["scale"]
)
The output is as follows:
Loading data...
RZC scale: [0.00000000e+00 3.52649689e-02 7.17734098e-02 1.09569430e-01
1.48698330e-01 1.89207077e-01 2.31144428e-01 2.74560571e-01
3.19507957e-01 3.66040230e-01 4.14213538e-01 4.64085698e-01
5.15716553e-01 5.69168210e-01 6.24504805e-01 6.81792855e-01
7.41101146e-01 8.02500963e-01 8.66065979e-01 9.31872606e-01
1.00000000e+00 1.07052994e+00 1.14354682e+00 1.21913886e+00
1.29739666e+00 1.37841415e+00 1.46228886e+00 1.54912114e+00
1.63901591e+00 1.73208046e+00 1.82842708e+00 1.92817140e+00
2.03143311e+00 2.13833642e+00 2.24900961e+00 2.36358571e+00
2.48220229e+00 2.60500193e+00 2.73213196e+00 2.86374521e+00
3.00000000e+00 3.14105988e+00 3.28709364e+00 3.43827772e+00
3.59479332e+00 3.75682831e+00 3.92457771e+00 4.09824228e+00
4.27803183e+00 4.46416092e+00 4.65685415e+00 4.85634279e+00
5.06286621e+00 5.27667284e+00 5.49801922e+00 5.72717142e+00
5.96440458e+00 6.21000385e+00 6.46426392e+00 6.72749043e+00
7.00000000e+00 7.28211975e+00 7.57418728e+00 7.87655544e+00
8.18958664e+00 8.51365662e+00 8.84915543e+00 9.19648457e+00
9.55606365e+00 9.92832184e+00 1.03137083e+01 1.07126856e+01
1.11257324e+01 1.15533457e+01 1.19960384e+01 1.24543428e+01
1.29288092e+01 1.34200077e+01 1.39285278e+01 1.44549809e+01
1.50000000e+01 1.55642395e+01 1.61483746e+01 1.67531109e+01
1.73791733e+01 1.80273132e+01 1.86983109e+01 1.93929691e+01
2.01121273e+01 2.08566437e+01 2.16274166e+01 2.24253712e+01
2.32514648e+01 2.41066914e+01 2.49920769e+01 2.59086857e+01
2.68576183e+01 2.78400154e+01 2.88570557e+01 2.99099617e+01
3.10000000e+01 3.21284790e+01 3.32967491e+01 3.45062218e+01
3.57583466e+01 3.70546265e+01 3.83966217e+01 3.97859383e+01
4.12242546e+01 4.27132874e+01 4.42548332e+01 4.58507423e+01
4.75029297e+01 4.92133827e+01 5.09841537e+01 5.28173714e+01
5.47152367e+01 5.66800308e+01 5.87141113e+01 6.08199234e+01
6.30000000e+01 6.52569580e+01 6.75934982e+01 7.00124435e+01
7.25166931e+01 7.51092529e+01 7.77932434e+01 8.05718765e+01
8.34485092e+01 8.64265747e+01 8.95096664e+01 9.27014847e+01
9.60058594e+01 9.94267654e+01 1.02968307e+02 1.06634743e+02
1.10430473e+02 1.14360062e+02 1.18428223e+02 1.22639847e+02
1.27000000e+02 1.31513916e+02 1.36186996e+02 1.41024887e+02
1.46033386e+02 1.51218506e+02 1.56586487e+02 1.62143753e+02
1.67897018e+02 1.73853149e+02 1.80019333e+02 1.86402969e+02
1.93011719e+02 1.99853531e+02 2.06936615e+02 2.14269485e+02
2.21860947e+02 2.29720123e+02 2.37856445e+02 2.46279694e+02
2.55000000e+02 2.64027832e+02 2.73373993e+02 2.83049774e+02
2.93066772e+02 3.03437012e+02 3.14172974e+02 3.25287506e+02
3.36794037e+02 3.48706299e+02 3.61038666e+02 3.73805939e+02
3.87023438e+02 4.00707062e+02 4.14873230e+02 4.29538971e+02
4.44721893e+02 4.60440247e+02 4.76712891e+02 4.93559387e+02
5.11000000e+02 5.29055664e+02 5.47747986e+02 5.67099548e+02
5.87133545e+02 6.07874023e+02 6.29345947e+02 6.51575012e+02
6.74588074e+02 6.98412598e+02 7.23077332e+02 7.48611877e+02
7.75046875e+02 8.02414124e+02 8.30746460e+02 8.60077942e+02
8.90443787e+02 9.21880493e+02 9.54425781e+02 9.88118774e+02
1.02300000e+03 1.05911133e+03 1.09649597e+03 1.13519910e+03
1.17526709e+03 1.21674805e+03 1.25969189e+03 1.30415002e+03
1.35017615e+03 1.39782520e+03 1.44715466e+03 1.49822375e+03
1.55109375e+03 1.60582825e+03 1.66249292e+03 1.72115588e+03
1.78188757e+03 1.84476099e+03 1.90985156e+03 1.97723755e+03
2.04700000e+03 2.11922266e+03 2.19399194e+03 2.27139819e+03
2.35153418e+03 2.43449609e+03 2.52038379e+03 2.60930005e+03
2.70135229e+03 2.79665039e+03 2.89530933e+03 2.99744751e+03
3.10318750e+03 3.21265649e+03 3.32598584e+03 3.44331177e+03
3.56477515e+03 3.69052197e+03 3.82070312e+03 3.95547510e+03
4.09500000e+03 4.23944531e+03 4.38898389e+03 4.54379639e+03
4.70406836e+03 4.86999219e+03 5.04176758e+03 5.21960010e+03
5.40370459e+03 5.59430078e+03 5.79161865e+03 nan
nan nan nan nan]
/public/home/ldcast/features/transform.py:80: RuntimeWarning: divide by zero encountered in log10
log_scale = np.log10(scale).astype(np.float32)
Loading cached sampler from ../cache/sampler_autoenc_valid.pkl.
Loading cached sampler from ../cache/sampler_autoenc_test.pkl.
Loading cached sampler from ../cache/sampler_autoenc_train.pkl.
I found that the RZC scale contains NaN values. Is the problem caused by this?
The rain rates are stored as 8-bit unsigned int values that are then translated to physical values in mm/h using the scale array. It is true that the last elements of scale are left at nan, but this is because these values should never occur in the 8-bit data. I have never seen a problem where the actual inputs to the training contain nan, so I'm a bit puzzled by this. Could you verify by drawing samples from the datamodule and checking with e.g. np.isfinite(x).all()?
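Something along these lines should work for that check (just a sketch; it assumes the datamodule follows the usual PyTorch Lightning interface and that the input tensor is the first element of each batch):

import torch

def check_batches_finite(loader, max_batches=200):
    # Spot-check that the tensors drawn from a dataloader contain only finite values.
    for i, batch in enumerate(loader):
        x = batch[0] if isinstance(batch, (tuple, list)) else batch
        if not torch.isfinite(x).all():
            print(f"Non-finite values found in batch {i}")
            return False
        if i + 1 >= max_batches:
            break
    return True

# e.g. check_batches_finite(datamodule.train_dataloader())
# ("datamodule" stands for whatever train_autoenc.py builds for training; the name is assumed here)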
Meanwhile I re-ran the autoencoder training and I saw that around epoch 10 the training loss spikes. In the worst cases this can cause the loss to go to nan, while in other cases it recovers quickly. And it seems that after this happens once, it does not occur again. It's as if the network somehow reorganizes itself. I recall now that I found the same thing happening back in October-November when I was first training the autoencoder.
Thanks for the explanation and suggestion. I will try to check it.
I tried restarting from the checkpoint again using the command below, and it works fine now. It's really strange.
time python train_autoenc.py --ckpt_path lightning_logs/version_0/checkpoints/epoch\=11-val_rec_loss\=0.0454.ckpt
In addition, I guess this may be because log10(scale) is not finite when scale is 0, so I replaced log_scale = np.log10(scale).astype(np.float32) in ldcast/features/transform.py with log_scale = np.log10(scale+1).astype(np.float32). I ran train_autoenc.py again and it also works fine.
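For reference, a quick numpy check of the two versions (just an illustrative snippet, not code from the repository):

import numpy as np

scale = np.array([0.0, 3.5e-2, 2.55e2, np.nan])
print(np.log10(scale))      # -inf for the 0 entry (this is what triggers the divide-by-zero RuntimeWarning); nan stays nan
print(np.log10(scale + 1))  # finite for the 0 entry; the nan entry is still nan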
Note that a couple of lines below
log_scale = np.log10(scale).astype(np.float32)
we have
log_scale[~np.isfinite(log_scale)] = np.log10(fill_value)
which should ensure that all values in log_scale are non-NaN. (The default is fill_value=0, but in default_rainrate_transform it is set to 0.02, so it has a finite logarithm.)
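As a standalone illustration of those two lines (a toy sketch, not the actual transform.py code):

import numpy as np

scale = np.array([0.0, 3.5e-2, 2.55e2, np.nan])  # zeros and nans, as in the RZC scale array
fill_value = 0.02                                # the value used by default_rainrate_transform
log_scale = np.log10(scale).astype(np.float32)   # -inf at 0, nan at nan (hence the RuntimeWarning)
log_scale[~np.isfinite(log_scale)] = np.log10(fill_value)
print(np.isfinite(log_scale).all())              # True: every entry is finite after the replacement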
Thanks for the explanation! @jleinonen
Hi, I was running train_autoenc.py with default hyperparameters and I encountered this error, after which training stopped. Would you mind helping with this?