Hi @CameronBodine,
My hunch is that this is mixed precision (which can cause underflow/overflow and therefore `nan` or `inf` loss). Can you try to train a model, but with these lines in train_model.py commented out:
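(For reference, mixed precision in the Keras API is typically enabled with a global policy like the sketch below; the exact lines in train_model.py may differ, so treat this as an illustration rather than the script's actual code.)

```python
# Illustrative sketch only; the actual lines in train_model.py may differ.
# This is the standard Keras way to enable mixed precision. Commenting it
# out leaves the model training in full float32 precision.
from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy('mixed_float16')
```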
Right on the money @ebgoldstein! Running now with `cat` loss. Let me know if I can report back any info, or try out anything else.
Good call @ebgoldstein and thanks @CameronBodine for reporting
It would also be useful if you can confirm if you can train using 'kld' and/or 'hinge' without mixed precision, thanks
great news @CameronBodine ..
I have run into this scenario several times.. and have always been able to train with any loss by falling back to full precision..
for now I am going to close this issue. but please reopen if there are any other problems..
@dbuscombe-usgs - feel free to reopen this.. i just saw your comment above...
> It would also be useful if you can confirm if you can train using 'kld' and/or 'hinge' without mixed precision, thanks
I can confirm that loss is reported for both `kld` and `hinge` after disabling mixed precision.
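(As a quick sanity check, assuming the standard Keras mixed-precision API, you can confirm the active policy after commenting out those lines; this snippet is an illustration, not part of train_model.py.)

```python
from tensorflow.keras import mixed_precision

# Should print 'float32' once mixed precision is disabled,
# and 'mixed_float16' while it is still active.
print(mixed_precision.global_policy().name)
```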
I'm adding more info related to using mixed precision, FYI. Not sure if it's helpful, but figured I would document it. If I don't comment out the lines @ebgoldstein referenced above, I get the following error with `LOSS='dice'`:
```
$ python train_model.py
/mnt/md0/SynologyDrive/Modeling/99_ForTesting/Test_ExecScript/datasets
/mnt/md0/SynologyDrive/Modeling/99_ForTesting/Test_ExecScript/config/Test_ExecScript.json
Using GPU
Using single GPU device
2023-02-13 12:46:12.951058: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Version: 2.11.0
Eager mode: True
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Making new directory for example model outputs: /mnt/md0/SynologyDrive/Modeling/99_ForTesting/Test_ExecScript/modelOut
MODE "all": using all augmented and non-augmented files
2023-02-13 12:46:15.089657: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-13 12:46:15.815354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14606 MB memory: -> device: 0, name: Quadro RTX 5000, pci bus id: 0000:65:00.0, compute capability: 7.5
3
1
.....................................
Creating and compiling model ...
INITIAL_EPOCH not specified in the config file. Setting to default of 0 ...
.....................................
Training model ...
Epoch 1: LearningRateScheduler setting learning rate to 1e-07.
Epoch 1/5
2023-02-13 12:46:28.331262: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8401
2023-02-13 12:46:29.121728: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-02-13 12:46:52.331351: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x7f6d74003af0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-02-13 12:46:52.331451: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (0): Quadro RTX 5000, Compute Capability 7.5
2023-02-13 12:46:52.345416: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-02-13 12:46:52.564638: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-02-13 12:46:52.656992: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
3/3 [==============================] - 43s 2s/step - loss: 0.8905 - mean_iou: 0.0391 - dice_coef: 0.1095 - val_loss: 0.8784 - val_mean_iou: 0.0356 - val_dice_coef: 0.1216 - lr: 1.0000e-07
Epoch 2: LearningRateScheduler setting learning rate to 1.0090000000000002e-05.
Epoch 2/5
3/3 [==============================] - 3s 1s/step - loss: 0.8870 - mean_iou: 0.0424 - dice_coef: 0.1130 - val_loss: 0.8772 - val_mean_iou: 0.0329 - val_dice_coef: 0.1228 - lr: 1.0090e-05
Epoch 3: LearningRateScheduler setting learning rate to 2.008e-05.
Epoch 3/5
3/3 [==============================] - 3s 1s/step - loss: 0.8706 - mean_iou: 0.0560 - dice_coef: 0.1294 - val_loss: 0.8745 - val_mean_iou: 0.0332 - val_dice_coef: 0.1255 - lr: 2.0080e-05
Epoch 4: LearningRateScheduler setting learning rate to 3.0070000000000002e-05.
Epoch 4/5
3/3 [==============================] - 3s 1s/step - loss: 0.8517 - mean_iou: 0.0740 - dice_coef: 0.1483 - val_loss: 0.8705 - val_mean_iou: 0.0387 - val_dice_coef: 0.1295 - lr: 3.0070e-05
Epoch 5: LearningRateScheduler setting learning rate to 4.0060000000000006e-05.
Epoch 5/5
3/3 [==============================] - 3s 1s/step - loss: 0.8346 - mean_iou: 0.1016 - dice_coef: 0.1654 - val_loss: 0.8659 - val_mean_iou: 0.0577 - val_dice_coef: 0.1341 - lr: 4.0060e-05
Traceback (most recent call last):
  File "train_model.py", line 920, in <module>
    model.save(weights.replace('.h5','_fullmodel.h5'))
  File "/home/cbodine/miniconda3/envs/gym/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/cbodine/miniconda3/envs/gym/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 775, in variables
    return self._variables
AttributeError: 'LossScaleOptimizerV3' object has no attribute '_variables'
```
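(A possible workaround, not from this thread: the traceback fires while Keras serializes the `LossScaleOptimizerV3` wrapper that mixed precision adds around the base optimizer, so skipping optimizer state at save time should sidestep it. A minimal sketch, assuming the `model` and `weights` names from the save call in the traceback above:)

```python
# Hypothetical workaround sketch: save the model without optimizer state,
# avoiding serialization of the LossScaleOptimizerV3 wrapper that raises
# the AttributeError above. 'model' and 'weights' are the names used in
# train_model.py's own save call at line 920.
model.save(weights.replace('.h5', '_fullmodel.h5'), include_optimizer=False)
```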
Thanks @CameronBodine. We should modify the code so that, unless Dice is the loss, mixed precision is disabled with a warning.
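(A minimal sketch of what that guard could look like, assuming `LOSS` is read from the config JSON; all names here are illustrative, not the actual train_model.py code:)

```python
import warnings
from tensorflow.keras import mixed_precision

LOSS = 'kld'  # e.g., parsed from the config JSON; illustrative only

# Keep mixed precision only for Dice, the one loss confirmed in this thread
# to train stably with it; otherwise warn and fall back to full precision.
if LOSS == 'dice':
    mixed_precision.set_global_policy('mixed_float16')
else:
    warnings.warn(
        f"LOSS='{LOSS}' can produce nan losses under mixed precision; "
        "training in full float32 precision instead."
    )
    mixed_precision.set_global_policy('float32')
```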
On `nan` losses with Dice, switching mixed precision off is the quick and easy way to get finite losses. However, I still have good luck with modifying the LR scheduler: so far I've managed to get most models to converge that way, but it is obviously a much more time-consuming process, involving trial and error.
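(For context, the ramp in the training log above, 1e-07 at epoch 1 and then increments of roughly 1e-05 per epoch, is consistent with a linear warm-up schedule like the sketch below. `START_LR`, `MAX_LR`, and `RAMPUP_EPOCHS` are illustrative names, not necessarily those in train_model.py; these are the kinds of knobs one would tune when adjusting the LR scheduler.)

```python
import tensorflow as tf

# Illustrative linear warm-up schedule reproducing the LR values printed in
# the log above (1e-07 at epoch 1, then ~1e-05 increments per epoch).
START_LR = 1e-7
MAX_LR = 1e-4
RAMPUP_EPOCHS = 10

def lr_schedule(epoch):
    # Keras passes a 0-based epoch index; ramp linearly to MAX_LR, then hold.
    if epoch < RAMPUP_EPOCHS:
        return (MAX_LR - START_LR) / RAMPUP_EPOCHS * epoch + START_LR
    return MAX_LR

lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule, verbose=1)
```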
**Describe the bug**
I am exploring differences in model performance with different hyper-parameter settings. I have successfully trained models with `dice` as the loss function. However, when attempting to train with `cat`, `hinge`, or `kld`, the reported loss during training is `nan`, despite using a range of learning rate values (1e-1 to 1e-7). See screenshot below for console output.

**To Reproduce**
Steps to reproduce the behavior: train a model with the `shadowpick_0.json` config file: `python train_model.py`.

**Expected behavior**
I expected to see a value other than `nan` while training.

**Screenshots**
Console output:

**Desktop (please complete the following information):**

**Additional context**
As I mentioned, I was able to train multiple models using `dice` with the following hyper-parameters with the same dataset linked above. I also tried other versions of Tensorflow-gpu (2.4, 2.6, 2.7, 2.8) with `kld`, but loss was reported as `nan`.