On this hardware, I can only run a model without mixed precision on the test dataset, and the losses are nan ...
:thinking:
I moved to a different computer known to work with Gym .... however, I ran into an issue with my gym conda env and had to reinstall Anaconda again from scratch, and remake the gym environment.
My tf.__version__ is '2.3.0' - this is older, but the only one that would install in my gym environment.
With this tf version, mixed_precision.set_global_policy('mixed_float16') results in an error:
AttributeError: module 'tensorflow.keras.mixed_precision' has no attribute 'set_global_policy'
The apparent solution is:
from tensorflow.keras import mixed_precision

try:
    mixed_precision.set_global_policy('mixed_float16')  # TF >= 2.4 API
except AttributeError:
    mixed_precision.experimental.set_policy('mixed_float16')  # older TF, e.g. 2.3
which gives me the following terminal output
2022-10-31 14:39:37.813037: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:Mixed precision compatibility check (mixed_float16): WARNING
The dtype policy mixed_float16 may run slowly because this machine does not have a GPU. Only Nvidia GPUs with compute capability of at least 7.0 run quickly with mixed_float16.
If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. This message will only be logged once
I think I can safely ignore this warning .... training a model now and will report back
One thing of note is that I still get this warning, which I'm not sure whether I should be ignoring ...
WARNING:tensorflow:AutoGraph could not transform <function dice_multi.<locals>.dice_coef at 0x0000025FD5622280> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Index'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
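For reference, applying the suggested decorator to a nested dice function looks roughly like this (a simplified stand-in, not Gym's exact dice_multi/dice_coef):

import tensorflow as tf

def dice_multi():
    # simplified stand-in for the nested dice_coef closure the warning refers to
    @tf.autograph.experimental.do_not_convert
    def dice_coef(y_true, y_pred, smooth=1e-7):
        # flatten and compute a soft Dice coefficient
        y_true_f = tf.reshape(tf.cast(y_true, tf.float32), [-1])
        y_pred_f = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
        intersection = tf.reduce_sum(y_true_f * y_pred_f)
        return (2.0 * intersection + smooth) / (
            tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth)
    return dice_coef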
Training hangs forever before getting to epoch 1. I have NO IDEA what is going on ..... I have changed anaconda versions and made a new gym conda env, but this computer's hardware worked just fine before.
I aborted training. After >30 minutes it hadn't started to train. Apparently it never found any of my GPUs .... back to conda! I had used conda install -c conda-forge tensorflow-gpu and it installed without error, so I had assumed it was correct ...
Next, I uninstalled the conda-forge version of TF, and installed from pip. Again, it didn't find my GPUs ...
Then I went here and followed the advice, installing into my existing gym env:
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
python -m pip install tensorflow
which installs tensorflow-2.10.0-cp38-cp38-win_amd64.whl
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
which is successful!
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU')]
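As an extra sanity check (not part of the recipe above), something like this confirms whether the detected GPUs meet the compute capability >= 7.0 threshold that the earlier mixed_float16 warning refers to:

import tensorflow as tf

# check each visible GPU's compute capability (an RTX 3080 Ti reports (8, 6))
for gpu in tf.config.list_physical_devices('GPU'):
    details = tf.config.experimental.get_device_details(gpu)
    cc = details.get('compute_capability')
    print(gpu.name, details.get('device_name'), cc,
          'ok for mixed_float16' if cc and cc >= (7, 0) else 'will be slow')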
Perhaps this install recipe should be added to the gym README and wiki.
Model is now training with mixed precision on Windows, and losses are finite and decreasing. Phew!
I will add some conda troubleshooting info to the Gym/README
By the way, the Gym yml once again did not work for me and I had to install using the 'recipe' approach ... I think it is time to retire the yml, like we did for Doodler.
Next, I will troubleshoot my Linux box ... perhaps the issues there are similar - conda is to blame?
Troubleshooting my Linux box, which had recognized my GPU but was performing extremely slowly, I downgraded the nvidia driver from 510 to 470.
I made a conda env without specifying any version numbers, and installed the cuda stuff from conda-forge
conda create -n gym python
conda install -c conda-forge cudatoolkit cudnn pip
(installs cudatoolkit-10.2.89 and cudnn-7.6.5.32 alongside python 3.10.6)
conda install -c conda-forge tensorflow-gpu
which installs version 2.10.0. Then:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
and it now sees a GPU. Next, I install the rest of the dependencies:
conda install -c conda-forge scipy numpy scikit-image cython ipython joblib tqdm pandas plotly natsort pydensecrf matplotlib
I ran the gym\utils\test_gpu script; it worked very fast on that smaller dataset. Now I'm testing with the 'hatteras' test dataset. Again, I notice the very long time before it starts model training, and how little of the GPU memory the process is using. After 10 mins, model training has not started.
After purging my nvidia drivers and rebooting, I managed to get nvidia-driver 515 installed. With the previous gym env, it would not pick up the GPU
I removed the gym conda env and attempted to install again using the yml file. Now it is - finally - training a model based on the test dataset!
I guess the problem was the nvidia driver. At least with my RTX 3080Ti, I needed 515-open and the conda-forge version of tensorflow.
Next, I revisit the original problem, 'nan' loss using a new binary dataset. This time, data loads fast, and losses are finite until (at least) epoch 2.
I am training a binary (2 class) model on a new machine with Ubuntu OS and an RTX 3080 Ti. I immediately get 'nan' loss when using Dice and mixed_precision.set_global_policy('mixed_float16'). These are my stream-of-consciousness troubleshooting notes:
First, I have verified a couple of different ways that the data going into the model are fine ... examining the output of make datasets, as well as using doviz=True in do_train to see the data going into the model (the sort of check I mean is sketched below).
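Here, ds stands in for whatever tf.data.Dataset of (image, label) batches the training script builds; this is just a sketch of the kind of inspection I mean:

import numpy as np

def check_batches(ds, n=3):
    # print shape, dtype, value range and NaN status for a few batches
    for im, lab in ds.take(n):
        im, lab = im.numpy(), lab.numpy()
        print('image:', im.shape, im.dtype, im.min(), im.max(),
              'NaNs:', np.isnan(im).any())
        print('label:', lab.shape, lab.dtype, 'unique:', np.unique(lab))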
So my attention turns to the model training .... From previous conversations with @ebgoldstein, this problem may be because ...
Troubleshooting 1: I noticed that the changes added in this PR were missing - I have no idea how! https://github.com/Doodleverse/doodleverse_utils/pull/12
When I change the dtype to float32, as should have been implemented in the above PR (i.e. in all model definitions), then I get finite losses .... for the first ~400 steps of the first epoch, before losses are 'nan' again.
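For context, my understanding is that the PR follows the standard Keras mixed-precision pattern of keeping the final activation in float32; a generic sketch of that pattern (not the actual Gym/doodleverse_utils model code):

import tensorflow as tf
from tensorflow.keras import layers, mixed_precision

mixed_precision.set_global_policy('mixed_float16')

# toy model: the conv layers run in float16 under the policy, but the final
# activation is forced to float32 so the softmax and loss use full precision
inputs = tf.keras.Input((512, 512, 3))
x = layers.Conv2D(16, 3, padding='same', activation='relu')(inputs)
x = layers.Conv2D(2, 1, padding='same')(x)
outputs = layers.Activation('softmax', dtype='float32')(x)
model = tf.keras.Model(inputs, outputs)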
I therefore reasoned that perhaps the LR was too small (I had used a min of 1e-7 and a max of 1e-3). I retrained using a min of 1e-5 and a max of 1e-2. This time, I almost immediately saw 'nan' losses.
I therefore reasoned that perhaps the LR was, in fact, too large. I retrained using a min of 1e-8 and a max of 1e-2. This time, once again, I almost immediately saw 'nan' losses.
Next, I concluded that because changing the LR didn't change the behaviour, I would train without mixed precision by commenting out the line mixed_precision.set_global_policy('mixed_float16') in do_train. Same result ... I noticed I was getting the following warning (with tf version 2.4.1):
I therefore added the decorator @tf.autograph.experimental.do_not_convert to the weighted_dice_coef_loss, dice_coef_loss, dice_multi and basic_dice_coef functions. However, that doesn't make the warning go away, and the 'nan' losses remain ... This may be relevant because it is possible that the loss weights are failing? I compiled the model using dice_coef_loss instead of passing a vector of ones to weighted_dice_coef_loss. Result: no difference. Losses are still 'nan'!
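To spell out why that swap should be a no-op: with a weight vector of ones, a weighted Dice loss reduces to the plain mean per-class Dice. A generic sketch of the idea (not the doodleverse_utils implementation):

import tensorflow as tf

def weighted_dice_loss(y_true, y_pred, w, smooth=1e-7):
    # per-class soft Dice combined with class weights w (shape: [nclasses]);
    # y_true/y_pred assumed one-hot, shape (batch, H, W, nclasses)
    inter = tf.reduce_sum(y_true * y_pred, axis=[0, 1, 2])
    denom = tf.reduce_sum(y_true, axis=[0, 1, 2]) + tf.reduce_sum(y_pred, axis=[0, 1, 2])
    dice_per_class = (2.0 * inter + smooth) / (denom + smooth)
    # with w = a vector of ones this is just the mean per-class Dice,
    # i.e. equivalent to the unweighted loss
    return 1.0 - tf.reduce_sum(w * dice_per_class) / tf.reduce_sum(w)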
During the above process, I noticed that weighted_dice_coef_loss was always called rather than dice_coef_loss, when the latter should in fact be the default. So I removed the old code from when NCLASSES could be 1 (the old days), and now default to dice_coef_loss unless LOSS_WEIGHTS is specifically True (I will push the mod back to main).
Finally, I reverted back to mixed precision and trained a model using dice_coef_loss instead of passing a vector of ones to weighted_dice_coef_loss. No change: after a few steps, losses are 'nan'. So it seems that model losses are 'nan' irrespective of mixed precision, loss weights, and activation dtype? Next, I tried mixed precision with an LR of 1e-2 to 1e-6. The model trains for longer (~289 epochs) before losses are 'nan'.
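A couple of standard guards I could bolt on while chasing this (not a fix, just instrumentation; neither is currently part of the Gym training code as far as I know):

import tensorflow as tf

# stop training as soon as the loss goes non-finite, and clip gradient norms
# in case the 'nan' comes from a gradient blow-up
callbacks = [tf.keras.callbacks.TerminateOnNaN()]
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)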
Now I'm caught in a loop. I feel like I have tried a lot of things, and now I want to go back to the original 'fix': removing , dtype='float32' from the conv2d layer activations, although I already suspect this will do nothing and I may be permanently reverting a change I made ... OK, now training with mixed precision and without the dtype='float32' activations, i.e. back to the original state, except using dice_coef_loss instead of passing a vector of ones to weighted_dice_coef_loss, and with the @tf.autograph.experimental.do_not_convert decorators.
My troubleshooting process is to change the doodleverse_utils/model_imports.py code, then reinstall doodleverse_utils locally into the gym conda env (using pip install -e .), then rerun train_model.py. Losses are 'nan' early this time, at step 40, and the model in general converges very minimally.
I'm noticing a lot of garbage output from tensorflow, so I'm commenting out all the @tf.autograph.experimental.do_not_convert statements in model_imports.py that I'm now less convinced I actually need. I've now reverted back to where I was in the beginning, undoing the changes to mixed precision, activation dtype, and do_not_convert statements.
Now training a model with cat loss ... the above changes made the tensorflow garbage go away, but losses very quickly went to nan. I'm still using this LR scheduler (roughly sketched below).
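Roughly, the schedule ramps up from a minimum to a maximum LR, holds, then decays exponentially; the numbers below are placeholders rather than my actual config:

import tensorflow as tf

MIN_LR, MAX_LR = 1e-7, 1e-3
RAMPUP_EPOCHS, SUSTAIN_EPOCHS, EXP_DECAY = 10, 5, 0.9

def lrfn(epoch):
    # linear ramp-up, then hold at MAX_LR, then exponential decay toward MIN_LR
    if epoch < RAMPUP_EPOCHS:
        return (MAX_LR - MIN_LR) / RAMPUP_EPOCHS * epoch + MIN_LR
    elif epoch < RAMPUP_EPOCHS + SUSTAIN_EPOCHS:
        return MAX_LR
    return (MAX_LR - MIN_LR) * EXP_DECAY ** (epoch - RAMPUP_EPOCHS - SUSTAIN_EPOCHS) + MIN_LR

lr_callback = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose=True)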
this is my imagery
Going to redo with cat loss and a larger LR. This time it goes to nan after epoch 20. I'm starting to get a little flummoxed ... perhaps it is my hardware or drivers? This is the first time I've used this GPU; the code detects it, and I verify it is working with nvidia-smi.
I will download the test dataset https://zenodo.org/record/7232051#.Y17andLMLRY and train a model using that data .... I have noticed that it takes a disturbingly long amount of time to fill the GPU with data and start model training ... perhaps my nvidia drivers are at fault? Yeah, it's just hanging indefinitely on the tiny test dataset ...