Doodleverse / doodleverse_utils

A set of common Doodleverse tools and utilities
MIT License

Troubleshooting Dice loss returning nan with a binary problem #13

Closed dbuscombe-usgs closed 2 years ago

dbuscombe-usgs commented 2 years ago

I am training a binary (2 class) model on a new machine with Ubuntu OS and an RTX 3080 Ti. I immediately get 'nan' loss when using Dice loss and mixed_precision.set_global_policy('mixed_float16'). These are my stream-of-consciousness troubleshooting notes:

First, I have verified in a couple of different ways that the data going into the model are fine: examining the output of make datasets, as well as using doviz=True in do_train to see the data going into the model.

So my attention turns to model training. From previous conversations with @ebgoldstein, this problem may be because:

  1. activations have the wrong dtype for mixed precision (see the sanity-check sketch after this list), or
  2. the learning rate is too small (or too large?), or
  3. mixed precision is not supported/stable on this hardware, or
  4. models are failing when using loss weights (even if class weights are all 1.)
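
As a quick sanity check on cause 1., a Dice-style loss can be evaluated directly on dummy one-hot tensors in both float32 and float16, outside of training. Below is a minimal sketch using a generic soft-Dice (an assumption for illustration, not the doodleverse_utils implementation):

    import numpy as np
    import tensorflow as tf

    def dice_loss(y_true, y_pred, smooth=1e-6):
        # generic soft-Dice; the smooth term guards against 0/0 for empty classes
        intersection = tf.reduce_sum(y_true * y_pred, axis=[1, 2])
        union = tf.reduce_sum(y_true + y_pred, axis=[1, 2])
        return 1.0 - tf.reduce_mean((2.0 * intersection + smooth) / (union + smooth))

    y_true = tf.one_hot(np.random.randint(0, 2, (1, 768, 768)), depth=2)
    y_pred = tf.nn.softmax(tf.random.normal((1, 768, 768, 2)), axis=-1)

    for dtype in (tf.float32, tf.float16):
        loss = dice_loss(tf.cast(y_true, dtype), tf.cast(y_pred, dtype))
        # float32 should be finite; in float16 the spatial sums can overflow
        # (max ~65504) to inf, which is one way a Dice loss ends up 'nan'
        print(dtype.name, loss.numpy())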

Troubleshooting 1., I noticed that the changes added in this PR were missing - I have no idea how! https://github.com/Doodleverse/doodleverse_utils/pull/12

When I change the dtype to float32, as should have been implemented in the above PR, i.e. in all model definitions:

    outputs = tf.keras.layers.Conv2D(
        num_classes, (1, 1), padding="same", activation="softmax", dtype='float32'
    )(x)

then I get finite losses for the first ~400 steps of the first epoch, before losses are 'nan' again.
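
For context, this matches the pattern in the Keras mixed-precision guide: run the body of the model in float16 but keep the final softmax in float32. A minimal, illustrative sketch (a toy model, not one of the Gym architectures):

    import tensorflow as tf
    from tensorflow.keras import mixed_precision

    mixed_precision.set_global_policy('mixed_float16')

    inputs = tf.keras.Input(shape=(768, 768, 3))
    x = tf.keras.layers.Conv2D(6, 3, padding="same", activation="relu")(inputs)  # float16 under the policy
    outputs = tf.keras.layers.Conv2D(
        2, (1, 1), padding="same", activation="softmax", dtype="float32"  # keep the softmax in float32
    )(x)
    model = tf.keras.Model(inputs, outputs)

    print(x.dtype, outputs.dtype)  # float16, float32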

I therefore reasoned that perhaps the LR was too small (I had used a min of 1e-7 and a max of 1e-3). I retrained using a min of 1e-5 and a max of 1e-2. This time, I almost immediately saw 'nan' losses.

I therefore reasoned that perhaps the LR was, in fact, too large. I retrained using a min of 1e-8 and a max of 1e-2. This time, once again, I almost immediately saw 'nan' losses.

Next, I concluded that because changing the LR didn't change the behaviour, I would train without mixed precision by commenting out the line mixed_precision.set_global_policy('mixed_float16') in do_train. Same result...

I noticed I was getting the following warning (with tf version 2.4.1):

WARNING:tensorflow:AutoGraph could not transform <function weighted_dice_coef_loss.<locals>.weighted_MC_dice_coef_loss at 0x7fb4e58af820> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Index'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert

I therefore added the decorator @tf.autograph.experimental.do_not_convert to the weighted_dice_coef_loss, dice_coef_loss, dice_multi and basic_dice_coef functions. However, that doesn't make the warning go away, and the 'nan' losses remain...
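
For what it's worth, the AutoGraph warning names the inner closure (weighted_MC_dice_coef_loss), so silencing it would probably require decorating the returned function rather than only the outer factory. An illustrative sketch (assumed structure, not the actual doodleverse_utils code):

    import tensorflow as tf

    def weighted_dice_coef_loss(NCLASSES, class_weights):
        w = tf.constant(class_weights, dtype=tf.float32)

        # the warning names this closure, so the decorator likely needs to sit here
        @tf.autograph.experimental.do_not_convert
        def weighted_MC_dice_coef_loss(y_true, y_pred):
            y_true = tf.cast(y_true, tf.float32)
            y_pred = tf.cast(y_pred, tf.float32)
            intersection = tf.reduce_sum(w * y_true * y_pred, axis=[1, 2])
            union = tf.reduce_sum(w * (y_true + y_pred), axis=[1, 2])
            return 1.0 - tf.reduce_mean((2.0 * intersection + 1e-6) / (union + 1e-6))

        return weighted_MC_dice_coef_loss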

This may be relevant if the loss weights themselves are failing, so I compiled the model using dice_coef_loss instead of passing a vector of ones to weighted_dice_coef_loss. Result: no difference. Losses are still 'nan'!

During the above process, I noticed that weighted_dice_coef_loss was always called instead of dice_coef_loss. This code:

if LOSS=='dice':
    if 'LOSS_WEIGHTS' in locals():
        if LOSS_WEIGHTS is True:

        ...

        else:
            if NCLASSES==1:
                class_weights = np.ones(NCLASSES+1)
            else:
                class_weights = np.ones(NCLASSES)
                model.compile(optimizer = 'adam', loss =weighted_dice_coef_loss(NCLASSES, class_weights), metrics = [iou_multi(NCLASSES), dice_multi(NCLASSES)])

    else:

        model.compile(optimizer = 'adam', loss =dice_coef_loss(NCLASSES), metrics = [iou_multi(NCLASSES), dice_multi(NCLASSES)])

should in fact be

if LOSS=='dice':
    if 'LOSS_WEIGHTS' in locals():
        if LOSS_WEIGHTS is True:

        ...

        else:
            model.compile(optimizer = 'adam', loss =dice_coef_loss(NCLASSES), metrics = [iou_multi(NCLASSES), dice_multi(NCLASSES)])

    else:

        model.compile(optimizer = 'adam', loss =dice_coef_loss(NCLASSES), metrics = [iou_multi(NCLASSES), dice_multi(NCLASSES)])

So, I removed the old code from when NCLASSES could be 1 (the old days), and now default to dice_coef_loss unless LOSS_WEIGHTS is specifically True (I will push the mod back to main).

Finally, I reverted back to mixed precision and trained a model using dice_coef_loss instead of passing a vector of ones to weighted_dice_coef_loss. No change: after a few steps, losses are 'nan'.

So it seems that model losses are 'nan' irrespective of mixed precision, loss weights, and activation dtype? Next, I tried mixed precision with an LR of 1e-2 to 1e-6. The model trains for longer, ~289 epochs, before losses are 'nan'.

Now I'm caught in a loop. I feel like I have tried a lot of things, and now I want to go back to the original 'fix', removing dtype='float32' from the conv2d layer activations, although I already suspect this will do nothing and I may be permanently reverting a change I made... OK, now training with mixed precision and without dtype='float32' activations, i.e. back to the original state, except using dice_coef_loss instead of passing a vector of ones to weighted_dice_coef_loss, and with @tf.autograph.experimental.do_not_convert.

My troubleshooting process is to change the doodleverse_utils/model_imports.py code, reinstall doodleverse_utils locally into the gym conda env (using pip install -e .), then rerun train_model.py. Losses are 'nan' early this time, around step 40! The model in general converges very minimally:

2022-10-30 12:35:33.708375: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
  1/478 [..............................] - ETA: 36:25:15 - loss: 0.6067 - mean_iou: 0.3163 - dice_coef: 0.3933
2022-10-30 12:36:58.750752: I tensorflow/core/common_runtime/eager/execute.cc:760] Executing op __inference_train_function_14372 in device /job:localhost/replica:0/task:0/device:GPU:0
  2/478 [..............................] - ETA: 17:34 - loss: 0.6132 - mean_iou: 0.3061 - dice_coef: 0.3868
  3/478 [..............................] - ETA: 17:30 - loss: 0.6119 - mean_iou: 0.3032 - dice_coef: 0.3881
  4/478 [..............................] - ETA: 17:27 - loss: 0.6110 - mean_iou: 0.3020 - dice_coef: 0.3890
  5/478 [..............................] - ETA: 17:25 - loss: 0.6095 - mean_iou: 0.3013 - dice_coef: 0.3905
...
 40/478 [=>............................] - ETA: 16:05 - loss: 0.6047 - mean_iou: 0.2995 - dice_coef: 0.3953
 41/478 [=>............................] - ETA: 16:03 - loss: 0.6046 - mean_iou: 0.2996 - dice_coef: 0.3954
 42/478 [=>............................] - ETA: 16:01 - loss: nan - mean_iou: 0.2997 - dice_coef: nan

I'm noticing a lot of garbage output from tensorflow, so I'm commenting out all the @tf.autograph.experimental.do_not_convert decorators in model_imports.py, which I'm now less convinced I actually need. I've now reverted back to where I was at the beginning, undoing the changes to mixed precision, activation dtype, and do_not_convert decorators.

Now training a model with 'cat' loss... the above changes made the tensorflow garbage go away, but losses very quickly went to 'nan'.

I'm still using this LR scheduler:

    "RAMPUP_EPOCHS": 20,
    "SUSTAIN_EPOCHS": 1.0,
    "EXP_DECAY": 0.9,
    "START_LR":  1e-6,
    "MIN_LR": 1e-6,
    "MAX_LR": 1e-2,

This is my imagery and model config:

    "TARGET_SIZE": [768,768],
    "MODEL": "resunet",
    "NCLASSES": 2,
    "KERNEL":9,
    "STRIDE":2,
    "BATCH_SIZE": 6,
    "FILTERS":6,
    "N_DATA_BANDS": 3,

Going to redo with 'cat' loss and a larger LR:

    "START_LR":  1e-4,
    "MIN_LR": 1e-4,
    "MAX_LR": 1e-1,

This time it goes to 'nan' after epoch 20. I'm starting to get a little flummoxed... perhaps it is my hardware or drivers? This is the first time I've used this GPU. The code detects it, and I verified it is working with nvidia-smi.

I will download the test dataset https://zenodo.org/record/7232051#.Y17andLMLRY and train a model using that data... I have noticed that it takes a disturbingly long time to fill the GPU with data and start model training... perhaps my nvidia drivers are at fault? Yeah, it's just hanging indefinitely on the tiny test dataset...

dbuscombe-usgs commented 2 years ago

On this hardware, I can only run a model without mixed precision on the test dataset, and the losses are 'nan'...

ebgoldstein commented 2 years ago

:thinking:

dbuscombe-usgs commented 2 years ago

I moved to a different computer known to work with Gym... however, I ran into an issue with my gym conda env and had to reinstall Anaconda from scratch and remake the gym environment.

My tf.__version__ is '2.3.0'; this is older, but it is the only version that would install in my gym environment.

With this tf version mixed_precision.set_global_policy('mixed_float16') results in an error:

AttributeError: module 'tensorflow.keras.mixed_precision' has no attribute 'set_global_policy'

The apparent solution is

try:
    mixed_precision.set_global_policy('mixed_float16')
except AttributeError:
    # older TF (e.g. 2.3) only exposes the experimental API
    mixed_precision.experimental.set_policy('mixed_float16')

which gives me the following terminal output

2022-10-31 14:39:37.813037: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:Mixed precision compatibility check (mixed_float16): WARNING
The dtype policy mixed_float16 may run slowly because this machine does not have a GPU. Only Nvidia GPUs with compute capability of at least 7.0 run quickly with mixed_float16.
If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. This message will only be logged once

I think I can safely ignore this warning... training a model now and will report back.

One thing of note is that I still get this warning, which I'm not sure I should be ignoring:

WARNING:tensorflow:AutoGraph could not transform <function dice_multi.<locals>.dice_coef at 0x0000025FD5622280> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Index'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
dbuscombe-usgs commented 2 years ago

It hangs forever before getting to epoch 1. I have NO IDEA what is going on... I have changed Anaconda versions and made a new gym conda env, but this computer's hardware worked just fine before.

dbuscombe-usgs commented 2 years ago

I aborted training. After >30 minutes it hadn't started to train. Apparently it never found any of my GPUs .... back to conda! I had used conda install -c conda-forge tensorflow-gpu and it installed without error, so I had assumed it was correct ...

dbuscombe-usgs commented 2 years ago

Next, I uninstalled the conda-forge version of TF, and installed from pip. Again, it didn't find my GPUs ...

Then I went here and followed the advice, installing into my existing gym env:

conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
python -m pip install tensorflow

which installs tensorflow-2.10.0-cp38-cp38-win_amd64.whl

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

which is successful!

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU')]

Perhaps this should be added to the gym README and wiki
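
Relatedly, a quick way to confirm TF is actually executing ops on the GPU (not just detecting it) is a small device-placement test; purely illustrative, using standard TF utilities:

    import tensorflow as tf

    print(tf.config.list_physical_devices('GPU'))   # should list the GPU(s)
    tf.debugging.set_log_device_placement(True)     # log which device each op runs on
    with tf.device('/GPU:0'):
        a = tf.random.normal((1000, 1000))
        b = tf.linalg.matmul(a, a)
    print(b.device)   # expect something like .../device:GPU:0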

dbuscombe-usgs commented 2 years ago

Model is now training with mixed precision on Windows, and losses are finite and decreasing. Phew!

I will add some conda troubleshooting info to the Gym/README

By the way, the Gym yml once again did not work for me, and I had to install using the 'recipe' approach... I think it is time to retire the yml, like we did for Doodler.

Next, I will troubleshoot my Linux box... perhaps the issues there are similar, and conda is to blame?

dbuscombe-usgs commented 2 years ago

Troubleshooting my Linux box, which had recognized my GPU but was not performing well (extremely slow), I downgraded the nvidia driver from 510 to 470.

I made a conda env without specifying any version numbers, and installed the cuda stuff from conda-forge

conda create -n gym python
conda install -c conda-forge cudatoolkit cudnn pip

(installs cudatoolkit-10.2.89 and cudnn-7.6.5.32 alongside python 3.10.6)

conda install -c conda-forge tensorflow-gpu

which installs version 2.10.0. Then:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

and it now sees a GPU. Next, I install the rest of the dependencies:

conda install -c conda-forge scipy numpy scikit-image cython ipython joblib tqdm pandas plotly natsort pydensecrf matplotlib

I ran the gym\utils\test_gpu script, which worked very fast on that smaller dataset. Now I'm testing with the 'hatteras' test dataset. Again, I notice the very long time before model training starts, and how little of the GPU memory the process is using. After 10 minutes, model training has not started.

dbuscombe-usgs commented 2 years ago

After purging my nvidia drivers and rebooting, I managed to get nvidia-driver 515 installed. With the previous gym env, it would not pick up the GPU.

I removed the gym conda env and attempted to install again using the yml file. Now it is, finally, training a model based on the test dataset!

I guess the problem was the nvidia driver. At least with my RTX 3080Ti, I needed 515-open and the conda-forge version of tensorflow.

Next, I revisit the original problem: 'nan' loss using a new binary dataset. This time, the data load fast, and losses are finite until (at least) epoch 2.