Hi @CameronBodine,
My hunch is that this is mixed precision (which can cause underflow/overflow and therefore `nan` or `inf` loss). Can you try to train a model, but with these lines in train_model.py commented out:
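(For reference, mixed precision in the Keras API is typically enabled with a global policy like the sketch below; the exact lines in train_model.py may differ, so treat this as an illustration rather than the script's actual code.)

```python
# Illustrative sketch only; the actual lines in train_model.py may differ.
# This is the standard Keras way to enable mixed precision. Commenting it
# out leaves the model training in full float32 precision.
from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy('mixed_float16')
```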
Right on the money @ebgoldstein! Running now with `cat` loss. Let me know if I can report back any info, or try out anything else.
Good call @ebgoldstein and thanks @CameronBodine for reporting
It would also be useful if you can confirm if you can train using 'kld' and/or 'hinge' without mixed precision, thanks
great news @CameronBodine ..
I have run into this scenario several times.. and have always been able to train with any loss by falling back to full precision..
for now I am going to close this issue. but please reopen if there are any other problems..
@dbuscombe-usgs - feel free to reopen this.. i just saw your comment above...
> It would also be useful if you can confirm if you can train using 'kld' and/or 'hinge' without mixed precision, thanks
I can confirm that loss is reported for both `kld` and `hinge` after disabling mixed precision.
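(As a quick sanity check, assuming the standard Keras mixed-precision API, you can confirm the active policy after commenting out those lines; this snippet is an illustration, not part of train_model.py.)

```python
from tensorflow.keras import mixed_precision

# Should print 'float32' once mixed precision is disabled,
# and 'mixed_float16' while it is still active.
print(mixed_precision.global_policy().name)
```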
I'm adding more info related to using mixed precision, FYI. Not sure if it's helpful, but figured I would document it. If I don't comment out the lines @ebgoldstein referenced above, I get the following error with `LOSS='dice'`:
```
$ python train_model.py
/mnt/md0/SynologyDrive/Modeling/99_ForTesting/Test_ExecScript/datasets
/mnt/md0/SynologyDrive/Modeling/99_ForTesting/Test_ExecScript/config/Test_ExecScript.json
Using GPU
Using single GPU device
2023-02-13 12:46:12.951058: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Version: 2.11.0
Eager mode: True
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Making new directory for example model outputs: /mnt/md0/SynologyDrive/Modeling/99_ForTesting/Test_ExecScript/modelOut
MODE "all": using all augmented and non-augmented files
2023-02-13 12:46:15.089657: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-13 12:46:15.815354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14606 MB memory: -> device: 0, name: Quadro RTX 5000, pci bus id: 0000:65:00.0, compute capability: 7.5
3
1
.....................................
Creating and compiling model ...
INITIAL_EPOCH not specified in the config file. Setting to default of 0 ...
.....................................
Training model ...
Epoch 1: LearningRateScheduler setting learning rate to 1e-07.
Epoch 1/5
2023-02-13 12:46:28.331262: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8401
2023-02-13 12:46:29.121728: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-02-13 12:46:52.331351: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x7f6d74003af0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-02-13 12:46:52.331451: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (0): Quadro RTX 5000, Compute Capability 7.5
2023-02-13 12:46:52.345416: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-02-13 12:46:52.564638: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-02-13 12:46:52.656992: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
3/3 [==============================] - 43s 2s/step - loss: 0.8905 - mean_iou: 0.0391 - dice_coef: 0.1095 - val_loss: 0.8784 - val_mean_iou: 0.0356 - val_dice_coef: 0.1216 - lr: 1.0000e-07
Epoch 2: LearningRateScheduler setting learning rate to 1.0090000000000002e-05.
Epoch 2/5
3/3 [==============================] - 3s 1s/step - loss: 0.8870 - mean_iou: 0.0424 - dice_coef: 0.1130 - val_loss: 0.8772 - val_mean_iou: 0.0329 - val_dice_coef: 0.1228 - lr: 1.0090e-05
Epoch 3: LearningRateScheduler setting learning rate to 2.008e-05.
Epoch 3/5
3/3 [==============================] - 3s 1s/step - loss: 0.8706 - mean_iou: 0.0560 - dice_coef: 0.1294 - val_loss: 0.8745 - val_mean_iou: 0.0332 - val_dice_coef: 0.1255 - lr: 2.0080e-05
Epoch 4: LearningRateScheduler setting learning rate to 3.0070000000000002e-05.
Epoch 4/5
3/3 [==============================] - 3s 1s/step - loss: 0.8517 - mean_iou: 0.0740 - dice_coef: 0.1483 - val_loss: 0.8705 - val_mean_iou: 0.0387 - val_dice_coef: 0.1295 - lr: 3.0070e-05
Epoch 5: LearningRateScheduler setting learning rate to 4.0060000000000006e-05.
Epoch 5/5
3/3 [==============================] - 3s 1s/step - loss: 0.8346 - mean_iou: 0.1016 - dice_coef: 0.1654 - val_loss: 0.8659 - val_mean_iou: 0.0577 - val_dice_coef: 0.1341 - lr: 4.0060e-05
Traceback (most recent call last):
  File "train_model.py", line 920, in <module>
    model.save(weights.replace('.h5','_fullmodel.h5'))
  File "/home/cbodine/miniconda3/envs/gym/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/cbodine/miniconda3/envs/gym/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 775, in variables
    return self._variables
AttributeError: 'LossScaleOptimizerV3' object has no attribute '_variables'
```
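(A possible workaround, not from this thread: the traceback fires while Keras serializes the `LossScaleOptimizerV3` wrapper that mixed precision adds around the base optimizer, so skipping optimizer state at save time should sidestep it. A minimal sketch, assuming the `model` and `weights` names from the save call in the traceback above:)

```python
# Hypothetical workaround sketch: save the model without optimizer state,
# avoiding serialization of the LossScaleOptimizerV3 wrapper that raises
# the AttributeError above. 'model' and 'weights' are the names used in
# train_model.py's own save call at line 920.
model.save(weights.replace('.h5', '_fullmodel.h5'), include_optimizer=False)
```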
Thanks @CameronBodine. We should modify the code so that, unless Dice is the loss, mixed precision is disabled with a warning.
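(A minimal sketch of what that guard could look like, assuming `LOSS` is read from the config JSON; all names here are illustrative, not the actual train_model.py code:)

```python
import warnings
from tensorflow.keras import mixed_precision

LOSS = 'kld'  # e.g., parsed from the config JSON; illustrative only

# Keep mixed precision only for Dice, the one loss confirmed in this thread
# to train stably with it; otherwise warn and fall back to full precision.
if LOSS == 'dice':
    mixed_precision.set_global_policy('mixed_float16')
else:
    warnings.warn(
        f"LOSS='{LOSS}' can produce nan losses under mixed precision; "
        "training in full float32 precision instead."
    )
    mixed_precision.set_global_policy('float32')
```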
On `nan` losses with Dice, switching mixed precision off is the quick and easy way to get finite losses. However, I still have good luck with modifying the LR scheduler: so far I've managed to get most models to converge that way, but it is obviously a much more time-consuming process, involving trial and error.
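(For context, the ramp in the training log above, 1e-07 at epoch 1 and then increments of roughly 1e-05 per epoch, is consistent with a linear warm-up schedule like the sketch below. `START_LR`, `MAX_LR`, and `RAMPUP_EPOCHS` are illustrative names, not necessarily those in train_model.py; these are the kinds of knobs one would tune when adjusting the LR scheduler.)

```python
import tensorflow as tf

# Illustrative linear warm-up schedule reproducing the LR values printed in
# the log above (1e-07 at epoch 1, then ~1e-05 increments per epoch).
START_LR = 1e-7
MAX_LR = 1e-4
RAMPUP_EPOCHS = 10

def lr_schedule(epoch):
    # Keras passes a 0-based epoch index; ramp linearly to MAX_LR, then hold.
    if epoch < RAMPUP_EPOCHS:
        return (MAX_LR - START_LR) / RAMPUP_EPOCHS * epoch + START_LR
    return MAX_LR

lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule, verbose=1)
```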
**Describe the bug**
I am exploring differences in model performance with different hyper-parameter settings. I have successfully trained models with `dice` as the loss function. However, when attempting to train with `cat`, `hinge`, or `kld`, the reported loss during training is `nan`, despite using a range of learning rate values (1e-1 to 1e-7). See screenshot below for console output.

**To Reproduce**
Steps to reproduce the behavior: train a model with the `shadowpick_0.json` config file: `python train_model.py`.

**Expected behavior**
I expected to see a value other than `nan` while training.

**Screenshots**
Console output:

**Desktop (please complete the following information):**

**Additional context**
As I mentioned, I was able to train multiple models using `dice` with the following hyper-parameters with the same dataset linked above. I also tried other versions of Tensorflow-gpu (2.4, 2.6, 2.7, 2.8) with `kld`, but loss was reported as `nan`.