Doodleverse / segmentation_gym

A neural gym for training deep learning models to carry out geoscientific image segmentation. Works best with labels generated using https://github.com/Doodleverse/dash_doodler

Train_model script never finishes on wsl 2 #137

Closed 2320sharon closed 6 months ago

2320sharon commented 1 year ago

Bug Description: The bug causes the train_model.py script to slow down significantly and display error messages after training the model up to epoch 20. This behavior has been observed multiple times, with no other tasks running on the computer.

Steps to Reproduce:

  1. Install the necessary dependencies: tensorflow version 2.12.0.
  2. Clone the repository from https://github.com/Doodleverse/segmentation_gym.git.
  3. Navigate to the cloned repository.
  4. Run the train_model.py script with the provided example in https://github.com/Doodleverse/segmentation_gym/wiki/02_Case-Study-Demo

Expected Behavior: The train_model.py script should run smoothly without any significant slowdowns or error messages. The model should continue training beyond epoch 20 without any issues.

Desktop:

OS: Windows 11
WSL2: Ubuntu 22.04.2 LTS
CUDA: nvcc 12.2.91
cuDNN: 11.8.0

The behavior has been observed with the specified versions of tensorflow, cuda-nvcc, and cudatoolkit. The issue occurs after epoch 20, causing a significant slowdown and error messages to be displayed. The bug has been reproduced three times with no other tasks running on the computer.
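For reference, a quick way to confirm that TensorFlow can see the GPU inside WSL2 (a minimal check, separate from train_model.py; the same version, eager-mode, and device information appears in the log below):

    import tensorflow as tf

    # These mirror the "Version", "Eager mode" and physical_devices lines in the training log
    print("Version: ", tf.__version__)
    print("Eager mode: ", tf.executing_eagerly())
    print(tf.config.list_physical_devices("GPU"))  # should list one GPU on this machine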

(gym3) sharon@Sharonator:~/gym/segmentation_gym$ python train_model.py
/home/sharon/gym/segmentation_gym/model_from_scratch_test/train_data/train_npzs
/home/sharon/gym/segmentation_gym/model_from_scratch_test/val_data/val_npzs
/home/sharon/gym/segmentation_gym/my_segmentation_gym_datasets/config/hatteras_l8_resunet.json
Using GPU
Using single GPU device
2023-07-17 14:28:00.122127: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-07-17 14:28:00.154163: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-17 14:28:00.555008: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Version:  2.12.1
Eager mode:  True
2023-07-17 14:28:01.916639: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 14:28:01.933201: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 14:28:01.933376: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
physical_devices : [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2023-07-17 14:28:01.933878: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Making new directory for example model outputs: /home/sharon/gym/segmentation_gym/my_segmentation_gym_datasets/modelOut
2023-07-17 14:28:01.944968: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 14:28:01.945077: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 14:28:01.945122: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 14:28:02.731445: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 14:28:02.731579: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 14:28:02.731604: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1722] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-07-17 14:28:02.731659: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 14:28:02.731709: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13485 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3080 Ti Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
2023-07-17 14:28:02.777211: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [122]
         [[{{node Placeholder/_0}}]]
2023-07-17 14:28:02.792764: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [21]
         [[{{node Placeholder/_0}}]]
LOAD_DATA_WITH_CPU not specified in config file. Setting to "False"
.....................................
Creating and compiling model ...
Garbage collection will NOT be perfomed. To change this behaviour, set CLEAR_MEMORY=True in the config file
INITIAL_EPOCH not specified in the config file. Setting to default of 0 ...
.....................................
Training model ...
2023-07-17 14:28:03.370166: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [122]
         [[{{node Placeholder/_0}}]]
2023-07-17 14:28:03.370384: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [122]
         [[{{node Placeholder/_0}}]]

Epoch 1: LearningRateScheduler setting learning rate to 1e-07.
Epoch 1/100
2023-07-17 14:28:04.763008: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8600
2023-07-17 14:29:01.361668: I tensorflow/compiler/xla/service/service.cc:169] XLA service 0xd504e170 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-07-17 14:29:01.361734: I tensorflow/compiler/xla/service/service.cc:177]   StreamExecutor device (0): NVIDIA GeForce RTX 3080 Ti Laptop GPU, Compute Capability 8.6
2023-07-17 14:29:01.366817: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-07-17 14:29:01.457447: I ./tensorflow/compiler/jit/device_compiler.h:180] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
WARNING:tensorflow:5 out of the last 5 calls to <function _BaseOptimizer._update_step_xla at 0x7ff4b012ec20> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
WARNING:tensorflow:6 out of the last 6 calls to <function _BaseOptimizer._update_step_xla at 0x7ff4b012ec20> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
7/7 [==============================] - ETA: 0s - loss: 0.8283 - mean_iou: 0.0933 - dice_coef: 0.1717
2023-07-17 14:29:46.285407: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [21]
         [[{{node Placeholder/_0}}]]
2023-07-17 14:29:46.285595: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [21]
         [[{{node Placeholder/_0}}]]
7/7 [==============================] - 104s 6s/step - loss: 0.8283 - mean_iou: 0.0933 - dice_coef: 0.1717 - val_loss: 0.8250 - val_mean_iou: 0.1117 - val_dice_coef: 0.1750 - lr: 1.0000e-07

Epoch 2: LearningRateScheduler setting learning rate to 5.095e-06.
Epoch 2/100
7/7 [==============================] - 48s 7s/step - loss: 0.8143 - mean_iou: 0.1368 - dice_coef: 0.1857 - val_loss: 0.8243 - val_mean_iou: 0.0994 - val_dice_coef: 0.1757 - lr: 5.0950e-06

Epoch 3: LearningRateScheduler setting learning rate to 1.0090000000000002e-05.
Epoch 3/100
7/7 [==============================] - 58s 8s/step - loss: 0.7744 - mean_iou: 0.2575 - dice_coef: 0.2256 - val_loss: 0.8187 - val_mean_iou: 0.1025 - val_dice_coef: 0.1813 - lr: 1.0090e-05

Epoch 4: LearningRateScheduler setting learning rate to 1.5085000000000002e-05.
Epoch 4/100
7/7 [==============================] - 70s 10s/step - loss: 0.7255 - mean_iou: 0.3595 - dice_coef: 0.2745 - val_loss: 0.8025 - val_mean_iou: 0.2076 - val_dice_coef: 0.1975 - lr: 1.5085e-05

Epoch 5: LearningRateScheduler setting learning rate to 2.008e-05.
Epoch 5/100
7/7 [==============================] - 89s 13s/step - loss: 0.6649 - mean_iou: 0.4474 - dice_coef: 0.3351 - val_loss: 0.7727 - val_mean_iou: 0.3397 - val_dice_coef: 0.2273 - lr: 2.0080e-05

Epoch 6: LearningRateScheduler setting learning rate to 2.5075000000000003e-05.
Epoch 6/100
7/7 [==============================] - 111s 16s/step - loss: 0.6000 - mean_iou: 0.5170 - dice_coef: 0.4000 - val_loss: 0.7223 - val_mean_iou: 0.4322 - val_dice_coef: 0.2777 - lr: 2.5075e-05

Epoch 7: LearningRateScheduler setting learning rate to 3.0070000000000002e-05.
Epoch 7/100
7/7 [==============================] - 133s 19s/step - loss: 0.5377 - mean_iou: 0.5634 - dice_coef: 0.4623 - val_loss: 0.6571 - val_mean_iou: 0.4958 - val_dice_coef: 0.3429 - lr: 3.0070e-05

Epoch 8: LearningRateScheduler setting learning rate to 3.5065000000000004e-05.
Epoch 8/100
7/7 [==============================] - 166s 24s/step - loss: 0.4695 - mean_iou: 0.6085 - dice_coef: 0.5305 - val_loss: 0.6231 - val_mean_iou: 0.5256 - val_dice_coef: 0.3769 - lr: 3.5065e-05

Epoch 9: LearningRateScheduler setting learning rate to 4.0060000000000006e-05.
Epoch 9/100
7/7 [==============================] - 193s 28s/step - loss: 0.4056 - mean_iou: 0.6486 - dice_coef: 0.5944 - val_loss: 0.5998 - val_mean_iou: 0.5692 - val_dice_coef: 0.4002 - lr: 4.0060e-05

Epoch 10: LearningRateScheduler setting learning rate to 4.505500000000001e-05.
Epoch 10/100
7/7 [==============================] - 223s 32s/step - loss: 0.3394 - mean_iou: 0.6948 - dice_coef: 0.6606 - val_loss: 0.5798 - val_mean_iou: 0.5873 - val_dice_coef: 0.4202 - lr: 4.5055e-05

Epoch 11: LearningRateScheduler setting learning rate to 5.005000000000001e-05.
Epoch 11/100
7/7 [==============================] - 270s 39s/step - loss: 0.2782 - mean_iou: 0.7330 - dice_coef: 0.7218 - val_loss: 0.5642 - val_mean_iou: 0.5749 - val_dice_coef: 0.4358 - lr: 5.0050e-05

Epoch 12: LearningRateScheduler setting learning rate to 5.5045000000000006e-05.
Epoch 12/100

3/7 [===========>..................] - ETA: 3:06 - loss: 0.2409 - mean_iou: 0.7553 - dice_coef: 0.7591
7/7 [==============================] - 338s 49s/step - loss: 0.2297 - mean_iou: 0.7596 - dice_coef: 0.7703 - val_loss: 0.5389 - val_mean_iou: 0.5528 - val_dice_coef: 0.4611 - lr: 5.5045e-05

Epoch 13: LearningRateScheduler setting learning rate to 6.004000000000001e-05.
Epoch 13/100
6/7 [========================>.....] - ETA: 1:42 - loss: 0.1955 - mean_iou: 0.7775 - dice_coef: 0.8045
7/7 [==============================] - 934s 147s/step - loss: 0.1911 - mean_iou: 0.7819 - dice_coef: 0.8089 - val_loss: 0.4898 - val_mean_iou: 0.5810 - val_dice_coef: 0.5102 - lr: 6.0040e-05

Epoch 14: LearningRateScheduler setting learning rate to 6.5035e-05.
Epoch 14/100
7/7 [==============================] - 1250s 178s/step - loss: 0.1716 - mean_iou: 0.7888 - dice_coef: 0.8284 - val_loss: 0.4677 - val_mean_iou: 0.5767 - val_dice_coef: 0.5323 - lr: 6.5035e-05

Epoch 15: LearningRateScheduler setting learning rate to 7.003e-05.
Epoch 15/100
7/7 [==============================] - 1666s 244s/step - loss: 0.1555 - mean_iou: 0.7975 - dice_coef: 0.8445 - val_loss: 0.4266 - val_mean_iou: 0.5950 - val_dice_coef: 0.5734 - lr: 7.0030e-05

Epoch 16: LearningRateScheduler setting learning rate to 7.502500000000001e-05.
Epoch 16/100
7/7 [==============================] - 882s 111s/step - loss: 0.1456 - mean_iou: 0.8037 - dice_coef: 0.8544 - val_loss: 0.4005 - val_mean_iou: 0.6009 - val_dice_coef: 0.5995 - lr: 7.5025e-05

Epoch 17: LearningRateScheduler setting learning rate to 8.002000000000001e-05.
Epoch 17/100
7/7 [==============================] - 603s 87s/step - loss: 0.1348 - mean_iou: 0.8135 - dice_coef: 0.8652 - val_loss: 0.4165 - val_mean_iou: 0.5787 - val_dice_coef: 0.5835 - lr: 8.0020e-05

Epoch 18: LearningRateScheduler setting learning rate to 8.501500000000001e-05.
Epoch 18/100
7/7 [==============================] - 681s 98s/step - loss: 0.1260 - mean_iou: 0.8218 - dice_coef: 0.8740 - val_loss: 0.3927 - val_mean_iou: 0.5875 - val_dice_coef: 0.6073 - lr: 8.5015e-05

Epoch 19: LearningRateScheduler setting learning rate to 9.001000000000001e-05.
Epoch 19/100
7/7 [==============================] - 793s 115s/step - loss: 0.1219 - mean_iou: 0.8242 - dice_coef: 0.8781 - val_loss: 0.3944 - val_mean_iou: 0.5733 - val_dice_coef: 0.6056 - lr: 9.0010e-05

Epoch 20: LearningRateScheduler setting learning rate to 9.500500000000002e-05.
Epoch 20/100
3/7 [===========>..................] - ETA: 8:27 - loss: 0.1175 - mean_iou: 0.8287 - dice_coef: 0.8825
2023-07-17 17:04:30.813059: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:65]
********************************
[Compiling module a_inference__update_step_xla_700842__XlaMustCompile_true_config_proto_8589078909834744431_executor_type_11160318154034397263_.56] Very slow compile?  If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************
2023-07-17 17:05:35.222870: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:133] The operation took 3m4.393305286s

********************************
[Compiling module a_inference__update_step_xla_700842__XlaMustCompile_true_config_proto_8589078909834744431_executor_type_11160318154034397263_.56] Very slow compile?  If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************
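The retracing warnings earlier in this log come from Keras's own optimizer update function rather than from anything in train_model.py, but the reduce_retracing option they mention can be illustrated in isolation (a generic sketch, not a suggested change to Gym):

    import tensorflow as tf

    # With reduce_retracing=True, TensorFlow relaxes input shapes instead of
    # tracing a brand-new graph for every distinct shape it sees.
    @tf.function(reduce_retracing=True)
    def step(x):
        return tf.reduce_sum(x)

    for n in (2, 3, 4, 5, 6):
        step(tf.zeros([n]))  # would otherwise retrace once per new shape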
2320sharon commented 1 year ago

I'm going to try updating the config file hatteras_l8_resunet.json with some new parameters recommended by Dan. I will run three different tests to see whether these parameters make a difference (see the sketch after the list below):

  1. CLEAR_MEMORY=True - this should allow the Python garbage collector to clear memory
  2. LOAD_DATA_WITH_CPU=True - this should allow the CPU to take care of data loading while the GPU runs the model
  3. Both CLEAR_MEMORY=True and LOAD_DATA_WITH_CPU=True
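For illustration, the changes amount to adding two keys to the config JSON; a minimal sketch (the path below is just an example):

    import json

    cfg_path = "my_segmentation_gym_datasets/config/hatteras_l8_resunet.json"  # example path

    with open(cfg_path) as f:
        cfg = json.load(f)

    cfg["CLEAR_MEMORY"] = True        # run the Python garbage collector during training
    cfg["LOAD_DATA_WITH_CPU"] = True  # let the CPU handle data loading while the GPU runs the model

    with open(cfg_path, "w") as f:
        json.dump(cfg, f, indent=2)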
2320sharon commented 1 year ago

Unfortunately, my attempt with both CLEAR_MEMORY=True and LOAD_DATA_WITH_CPU=True yielded the same result as the previous tests: the model stops training at epoch 20.

Here are the resulting outputs

(gym3) sharon@Sharonator:~/gym/segmentation_gym$ python train_model.py
/home/sharon/gym/segmentation_gym/model_from_scratch_test/train_data/train_npzs
/home/sharon/gym/segmentation_gym/model_from_scratch_test/val_data/val_npzs
/home/sharon/gym/segmentation_gym/my_segmentation_gym_datasets/config/hatteras_l8_resunet_both.json
Using GPU
Using single GPU device
2023-07-20 15:48:53.841794: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-07-20 15:48:54.049341: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-20 15:48:54.899020: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Version:  2.12.1
Eager mode:  True
2023-07-20 15:48:57.563559: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-20 15:48:57.721846: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-20 15:48:57.721946: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
physical_devices : [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2023-07-20 15:48:57.722289: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Making new directory for example model outputs: /home/sharon/gym/segmentation_gym/my_segmentation_gym_datasets/modelOut
2023-07-20 15:48:57.758358: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-20 15:48:57.758477: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-20 15:48:57.758524: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-20 15:48:59.018407: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-20 15:48:59.018568: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-20 15:48:59.018593: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1722] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-07-20 15:48:59.018744: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-20 15:48:59.018794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13485 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3080 Ti Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
2023-07-20 15:48:59.102695: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [122]
         [[{{node Placeholder/_0}}]]
2023-07-20 15:48:59.139984: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [21]
         [[{{node Placeholder/_0}}]]
.....................................
Creating and compiling model ...
Garbage collection will be perfomed
INITIAL_EPOCH not specified in the config file. Setting to default of 0 ...
.....................................
Training model ...
2023-07-20 15:48:59.851327: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [122]
         [[{{node Placeholder/_0}}]]
2023-07-20 15:48:59.851681: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [122]
         [[{{node Placeholder/_0}}]]

Epoch 1: LearningRateScheduler setting learning rate to 1e-07.
Epoch 1/100
2023-07-20 15:49:02.332422: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8600
2023-07-20 15:50:11.873315: I tensorflow/compiler/xla/service/service.cc:169] XLA service 0xd7db91b0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-07-20 15:50:11.873369: I tensorflow/compiler/xla/service/service.cc:177]   StreamExecutor device (0): NVIDIA GeForce RTX 3080 Ti Laptop GPU, Compute Capability 8.6
2023-07-20 15:50:11.901304: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-07-20 15:50:12.154897: I ./tensorflow/compiler/jit/device_compiler.h:180] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
WARNING:tensorflow:5 out of the last 5 calls to <function _BaseOptimizer._update_step_xla at 0x7f3514f297e0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
WARNING:tensorflow:6 out of the last 6 calls to <function _BaseOptimizer._update_step_xla at 0x7f3514f297e0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
7/7 [==============================] - ETA: 0s - loss: 0.8334 - mean_iou: 0.0842 - dice_coef: 0.1666
2023-07-20 15:51:19.955234: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [21]
         [[{{node Placeholder/_0}}]]
2023-07-20 15:51:19.955451: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [21]
         [[{{node Placeholder/_0}}]]
7/7 [==============================] - 142s 10s/step - loss: 0.8334 - mean_iou: 0.0842 - dice_coef: 0.1666 - val_loss: 0.8078 - val_mean_iou: 0.2061 - val_dice_coef: 0.1922 - lr: 1.0000e-07

Epoch 2: LearningRateScheduler setting learning rate to 5.095e-06.
Epoch 2/100
7/7 [==============================] - 83s 12s/step - loss: 0.8176 - mean_iou: 0.1057 - dice_coef: 0.1824 - val_loss: 0.8096 - val_mean_iou: 0.2172 - val_dice_coef: 0.1904 - lr: 5.0950e-06

Epoch 3: LearningRateScheduler setting learning rate to 1.0090000000000002e-05.
Epoch 3/100
7/7 [==============================] - 97s 14s/step - loss: 0.7678 - mean_iou: 0.1978 - dice_coef: 0.2322 - val_loss: 0.8046 - val_mean_iou: 0.2345 - val_dice_coef: 0.1954 - lr: 1.0090e-05

Epoch 4: LearningRateScheduler setting learning rate to 1.5085000000000002e-05.
Epoch 4/100
7/7 [==============================] - 120s 17s/step - loss: 0.6896 - mean_iou: 0.3296 - dice_coef: 0.3104 - val_loss: 0.7895 - val_mean_iou: 0.2572 - val_dice_coef: 0.2105 - lr: 1.5085e-05

Epoch 5: LearningRateScheduler setting learning rate to 2.008e-05.
Epoch 5/100
7/7 [==============================] - 126s 18s/step - loss: 0.6041 - mean_iou: 0.4281 - dice_coef: 0.3959 - val_loss: 0.7618 - val_mean_iou: 0.2702 - val_dice_coef: 0.2382 - lr: 2.0080e-05

Epoch 6: LearningRateScheduler setting learning rate to 2.5075000000000003e-05.
Epoch 6/100
7/7 [==============================] - 138s 20s/step - loss: 0.5186 - mean_iou: 0.5084 - dice_coef: 0.4814 - val_loss: 0.7244 - val_mean_iou: 0.3070 - val_dice_coef: 0.2756 - lr: 2.5075e-05

Epoch 7: LearningRateScheduler setting learning rate to 3.0070000000000002e-05.
Epoch 7/100
7/7 [==============================] - 161s 23s/step - loss: 0.4392 - mean_iou: 0.5696 - dice_coef: 0.5608 - val_loss: 0.6492 - val_mean_iou: 0.4186 - val_dice_coef: 0.3508 - lr: 3.0070e-05

Epoch 8: LearningRateScheduler setting learning rate to 3.5065000000000004e-05.
Epoch 8/100
7/7 [==============================] - 174s 25s/step - loss: 0.3599 - mean_iou: 0.6314 - dice_coef: 0.6401 - val_loss: 0.5844 - val_mean_iou: 0.4781 - val_dice_coef: 0.4156 - lr: 3.5065e-05

Epoch 9: LearningRateScheduler setting learning rate to 4.0060000000000006e-05.
Epoch 9/100
7/7 [==============================] - 229s 33s/step - loss: 0.2932 - mean_iou: 0.6805 - dice_coef: 0.7068 - val_loss: 0.5317 - val_mean_iou: 0.5452 - val_dice_coef: 0.4683 - lr: 4.0060e-05

Epoch 10: LearningRateScheduler setting learning rate to 4.505500000000001e-05.
Epoch 10/100
7/7 [==============================] - 262s 38s/step - loss: 0.2414 - mean_iou: 0.7171 - dice_coef: 0.7586 - val_loss: 0.5010 - val_mean_iou: 0.5658 - val_dice_coef: 0.4990 - lr: 4.5055e-05

Epoch 11: LearningRateScheduler setting learning rate to 5.005000000000001e-05.
Epoch 11/100
7/7 [==============================] - 306s 44s/step - loss: 0.2035 - mean_iou: 0.7462 - dice_coef: 0.7965 - val_loss: 0.4853 - val_mean_iou: 0.5787 - val_dice_coef: 0.5147 - lr: 5.0050e-05

Epoch 12: LearningRateScheduler setting learning rate to 5.5045000000000006e-05.
Epoch 12/100
7/7 [==============================] - 346s 50s/step - loss: 0.1803 - mean_iou: 0.7613 - dice_coef: 0.8197 - val_loss: 0.4865 - val_mean_iou: 0.5612 - val_dice_coef: 0.5135 - lr: 5.5045e-05

Epoch 13: LearningRateScheduler setting learning rate to 6.004000000000001e-05.
Epoch 13/100
7/7 [==============================] - 405s 59s/step - loss: 0.1603 - mean_iou: 0.7782 - dice_coef: 0.8397 - val_loss: 0.4856 - val_mean_iou: 0.5436 - val_dice_coef: 0.5144 - lr: 6.0040e-05

Epoch 14: LearningRateScheduler setting learning rate to 6.5035e-05.
Epoch 14/100
7/7 [==============================] - 468s 67s/step - loss: 0.1512 - mean_iou: 0.7852 - dice_coef: 0.8488 - val_loss: 0.5067 - val_mean_iou: 0.4777 - val_dice_coef: 0.4933 - lr: 6.5035e-05

Epoch 15: LearningRateScheduler setting learning rate to 7.003e-05.
Epoch 15/100
7/7 [==============================] - 532s 77s/step - loss: 0.1393 - mean_iou: 0.7966 - dice_coef: 0.8607 - val_loss: 0.4956 - val_mean_iou: 0.4781 - val_dice_coef: 0.5044 - lr: 7.0030e-05

Epoch 16: LearningRateScheduler setting learning rate to 7.502500000000001e-05.
Epoch 16/100
7/7 [==============================] - 607s 87s/step - loss: 0.1331 - mean_iou: 0.8023 - dice_coef: 0.8669 - val_loss: 0.5341 - val_mean_iou: 0.4146 - val_dice_coef: 0.4659 - lr: 7.5025e-05

Epoch 17: LearningRateScheduler setting learning rate to 8.002000000000001e-05.
Epoch 17/100
7/7 [==============================] - 696s 100s/step - loss: 0.1257 - mean_iou: 0.8103 - dice_coef: 0.8743 - val_loss: 0.5032 - val_mean_iou: 0.4373 - val_dice_coef: 0.4968 - lr: 8.0020e-05

Epoch 18: LearningRateScheduler setting learning rate to 8.501500000000001e-05.
Epoch 18/100
7/7 [==============================] - 830s 121s/step - loss: 0.1201 - mean_iou: 0.8168 - dice_coef: 0.8799 - val_loss: 0.5171 - val_mean_iou: 0.4142 - val_dice_coef: 0.4829 - lr: 8.5015e-05

Epoch 19: LearningRateScheduler setting learning rate to 9.001000000000001e-05.
Epoch 19/100
7/7 [==============================] - 876s 125s/step - loss: 0.1181 - mean_iou: 0.8180 - dice_coef: 0.8819 - val_loss: 0.4853 - val_mean_iou: 0.4409 - val_dice_coef: 0.5147 - lr: 9.0010e-05

Epoch 20: LearningRateScheduler setting learning rate to 9.500500000000002e-05.
Epoch 20/100
2/7 [=======>......................] - ETA: 14:11 - loss: 0.1077 - mean_iou: 0.8311 - dice_coef: 0.8923
2023-07-20 17:48:05.807679: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:65]
********************************
[Compiling module a_inference__update_step_xla_692309__XlaMustCompile_true_config_proto_8589078909834744431_executor_type_11160318154034397263_.56] Very slow compile?  If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************
2023-07-20 17:50:57.945018: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:133] The operation took 4m52.151932823s

********************************
[Compiling module a_inference__update_step_xla_692309__XlaMustCompile_true_config_proto_8589078909834744431_executor_type_11160318154034397263_.56] Very slow compile?  If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************
2023-07-20 17:59:14.967959: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:133] The operation took 2m17.16016214s

********************************
[Compiling module a_inference__update_step_xla_692789__XlaMustCompile_true_config_proto_8589078909834744431_executor_type_11160318154034397263_.56] Very slow compile?  If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************
2023-07-20 18:03:59.307078: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:133] The operation took 3m36.810617217s

********************************
[Compiling module a_inference__update_step_xla_692829__XlaMustCompile_true_config_proto_8589078909834744431_executor_type_11160318154034397263_.56] Very slow compile?  If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************
2023-07-20 18:08:20.012139: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:65]
********************************
[Compiling module a_inference__update_step_xla_692869__XlaMustCompile_true_config_proto_8589078909834744431_executor_type_11160318154034397263_.56] Very slow compile?  If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************
2023-07-20 18:11:24.812844: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:133] The operation took 5m4.793492202s

********************************
[Compiling module a_inference__update_step_xla_692869__XlaMustCompile_true_config_proto_8589078909834744431_executor_type_11160318154034397263_.56] Very slow compile?  If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************
2023-07-20 18:21:52.676335: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:133] The operation took 2m10.338053854s

********************************
[Compiling module a_inference__update_step_xla_693069__XlaMustCompile_true_config_proto_8589078909834744431_executor_type_11160318154034397263_.56] Very slow compile?  If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************
2023-07-20 18:25:18.646574: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:65]
********************************
[Compiling module a_inference__update_step_xla_693109__XlaMustCompile_true_config_proto_8589078909834744431_executor_type_11160318154034397263_.56] Very slow compile?  If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************
2023-07-20 18:25:19.760141: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:133] The operation took 2m1.111587202s

********************************
[Compiling module a_inference__update_step_xla_693109__XlaMustCompile_true_config_proto_8589078909834744431_executor_type_11160318154034397263_.56] Very slow compile?  If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************
2023-07-20 18:35:39.276132: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:65]
********************************
[Compiling module a_inference__update_step_xla_693149__XlaMustCompile_true_config_proto_8589078909834744431_executor_type_11160318154034397263_.56] Very slow compile?  If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************
2023-07-20 18:40:58.350673: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:133] The operation took 7m19.06974527s

********************************
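The slow-compile alarms above suggest re-running with the environment variable XLA_FLAGS=--xla_dump_to=/tmp/foo so the compiled modules can be attached to a bug report. The variable can be prefixed to the shell command, or set from Python before TensorFlow is imported; a minimal sketch:

    import os

    # Set before importing TensorFlow so XLA picks the flag up when it initializes
    os.environ["XLA_FLAGS"] = "--xla_dump_to=/tmp/foo"

    import tensorflow as tf  # subsequent XLA compiles will dump their modules to /tmp/foo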
dbuscombe-usgs commented 1 year ago

Hmmmm... Did you make the dataset yourself, by running make_datasets first? Or did you use the one provided, splitting it into train and validation parts?

What BATCH_SIZE are you using?

To help troubleshoot, I just made a dataset and trained a model using that test dataset. The model finished training at epoch 100 (the LR should probably be adjusted, but this was a test to make sure that model training can spin out to large epochs).

I am not using WSL, but native Windows (I can't get my WSL2 installation to train models yet).

Ok, next thing to try ... HOT_START the model. You're going to tell the model to start training at epoch 20, i.e. pick up where it left off, and give it the model weights that you already have from the previous (failed) training run. In the config file, add:

HOT_START: "/mnt/path/to/your/model_weights.h5"
INITIAL_EPOCH: 20
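Roughly speaking, a hot start just loads the saved weights and resumes model.fit from a later epoch; a sketch of the idea in plain Keras terms (not the actual train_model.py code; model, train_ds, val_ds, and callbacks are placeholders for the objects the script builds from the config):

    # Placeholders: model, train_ds, val_ds, callbacks come from the usual Gym setup
    model.load_weights("/mnt/path/to/your/model_weights.h5")  # weights saved by the earlier (failed) run

    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=100,        # total epochs, as in the config
        initial_epoch=20,  # resume counting at "Epoch 21/100"
        callbacks=callbacks,
    )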

dbuscombe-usgs commented 1 year ago

I am using this as a good reminder and opportunity to update the test dataset, model files, and related files. In particular, we now have a new way to make the train and val datasets. Plus, we have segformers. I'm preparing a new Zenodo release.

2320sharon commented 1 year ago

Thanks for the troubleshooting tips.

I did my best to follow https://github.com/Doodleverse/segmentation_gym/wiki/02_Case-Study-Demo, so I used the dataset available there and ran make_dataset.py to generate the npz files.

In my config file, BATCH_SIZE=16.

As for hot starting the model, I will use the h5 file in gym\my_segmentation_datasets\weights\hatteras_l8_resunet_both.h5, modify the config file to specify a hot start, and let you know how it goes.

2320sharon commented 1 year ago

Okay, here are the changes I made to the config file. I'll be training the model soon.

  "HOT_START": "/home/sharon/gym/segmentation_gym/my_segmentation_gym_datasets/weights/hatteras_l8_resunet_both.h5",
  "INITIAL_EPOCH": 20,
  "LOAD_DATA_WITH_CPU" : true,
  "CLEAR_MEMORY":true,
2320sharon commented 1 year ago

I just started training with the hot started model. Here are the outputs I have so far


(gym3) sharon@Sharonator:~/gym/segmentation_gym$ python train_model.py
/home/sharon/gym/segmentation_gym/model_from_scratch_test/train_data/train_npzs
/home/sharon/gym/segmentation_gym/model_from_scratch_test/val_data/val_npzs
/home/sharon/gym/segmentation_gym/my_segmentation_gym_datasets/config/hatteras_l8_resunet_both.json
Using GPU
Using single GPU device
2023-07-21 09:41:50.111914: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-07-21 09:41:50.262230: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-21 09:41:50.862693: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Version:  2.12.1
Eager mode:  True
2023-07-21 09:41:52.450167: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-21 09:41:52.475573: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-21 09:41:52.475671: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
physical_devices : [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2023-07-21 09:41:52.476005: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Making new directory for example model outputs: /home/sharon/gym/segmentation_gym/my_segmentation_gym_datasets/modelOut
2023-07-21 09:41:52.485050: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-21 09:41:52.485124: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-21 09:41:52.485163: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-21 09:41:53.423081: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-21 09:41:53.423216: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-21 09:41:53.423238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1722] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-07-21 09:41:53.423286: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-21 09:41:53.423326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13485 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3080 Ti Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
2023-07-21 09:41:53.478163: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [122]
         [[{{node Placeholder/_0}}]]
2023-07-21 09:41:53.498486: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [21]
         [[{{node Placeholder/_0}}]]
.....................................
Creating and compiling model ...
Garbage collection will be perfomed
transfering model weights for hot start ...
.....................................
Training model ...
2023-07-21 09:41:54.121269: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [122]
         [[{{node Placeholder/_0}}]]
2023-07-21 09:41:54.121430: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [122]
         [[{{node Placeholder/_0}}]]

Epoch 21: LearningRateScheduler setting learning rate to 0.0001.
Epoch 21/100
2023-07-21 09:41:55.426038: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8600
2023-07-21 09:42:53.150248: I tensorflow/compiler/xla/service/service.cc:169] XLA service 0xd600b990 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-07-21 09:42:53.150293: I tensorflow/compiler/xla/service/service.cc:177]   StreamExecutor device (0): NVIDIA GeForce RTX 3080 Ti Laptop GPU, Compute Capability 8.6
2023-07-21 09:42:53.170597: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-07-21 09:42:53.351309: I ./tensorflow/compiler/jit/device_compiler.h:180] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
WARNING:tensorflow:5 out of the last 5 calls to <function _BaseOptimizer._update_step_xla at 0x7f5ce81229e0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
WARNING:tensorflow:6 out of the last 6 calls to <function _BaseOptimizer._update_step_xla at 0x7f5ce81229e0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
7/7 [==============================] - ETA: 0s - loss: 0.1603 - mean_iou: 0.7579 - dice_coef: 0.8397
2023-07-21 09:43:36.245506: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [21]
         [[{{node Placeholder/_0}}]]
2023-07-21 09:43:36.245688: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [21]
         [[{{node Placeholder/_0}}]]
7/7 [==============================] - 103s 6s/step - loss: 0.1603 - mean_iou: 0.7579 - dice_coef: 0.8397 - val_loss: 0.3136 - val_mean_iou: 0.6199 - val_dice_coef: 0.6864 - lr: 1.0000e-04
Epoch 22: LearningRateScheduler setting learning rate to 9.001e-05.
Epoch 22/100
7/7 [==============================] - 47s 7s/step - loss: 0.1308 - mean_iou: 0.7948 - dice_coef: 0.8692 - val_loss: 0.3097 - val_mean_iou: 0.6153 - val_dice_coef: 0.6903 - lr: 9.0010e-05

Epoch 23: LearningRateScheduler setting learning rate to 8.1019e-05.
Epoch 23/100
7/7 [==============================] - 56s 8s/step - loss: 0.1152 - mean_iou: 0.8158 - dice_coef: 0.8848 - val_loss: 0.2824 - val_mean_iou: 0.6270 - val_dice_coef: 0.7176 - lr: 8.1019e-05
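As an aside, the learning rates printed by the LearningRateScheduler callback are consistent with a linear ramp from 1e-07 to 1e-04 over the first 20 epochs, followed by exponential decay by a factor of 0.9 per epoch. A reconstruction of that schedule, inferred from the log (the constant names are illustrative, not necessarily Gym's config keys):

    import tensorflow as tf

    START_LR, MAX_LR, RAMPUP_EPOCHS, DECAY = 1e-7, 1e-4, 20, 0.9  # inferred from the log

    def lr_schedule(epoch):
        # Keras passes a 0-based epoch index; "Epoch 21" in the log corresponds to epoch=20
        if epoch < RAMPUP_EPOCHS:
            return START_LR + (MAX_LR - START_LR) * epoch / RAMPUP_EPOCHS
        return (MAX_LR - START_LR) * DECAY ** (epoch - RAMPUP_EPOCHS) + START_LR

    lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule, verbose=1)
    # e.g. lr_schedule(22) -> 8.1019e-05, matching the "Epoch 23" line above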
dbuscombe-usgs commented 1 year ago

Okay, so that tells me there is probably nothing wrong with the data. This is a problem I have run into in the past - I have sometimes used this HOT START trick to get models to train. There are obvious downsides, one of which is the inability to track the full loss curve.

I think it's a hardware thing somehow ... or a driver issue.

dbuscombe-usgs commented 1 year ago

I have updated the zenodo release https://zenodo.org/record/8170543

and edited the wiki page with the new link https://github.com/Doodleverse/segmentation_gym/wiki/02_Case-Study-Demo

2320sharon commented 1 year ago

So the train_model script stopped at epoch 33, ran the validation set, and then finished. Here is the output of the model; I cut out a lot of the validation set outputs because they all look the same.

Epoch 33: LearningRateScheduler setting learning rate to 2.831471069445191e-05.
Epoch 33/100
7/7 [==============================] - 375s 54s/step - loss: 0.0875 - mean_iou: 0.8533 - dice_coef: 0.9125 - val_loss: 0.3131 - val_mean_iou: 0.5766 - val_dice_coef: 0.6869 - lr: 2.8315e-05
.....................................
Evaluating model on entire validation set ...
1/1 [==============================] - 1s 1s/step - loss: 0.3131 - mean_iou: 0.5766 - dice_coef: 0.6869
loss=0.3131, Mean IOU=0.5766, Mean Dice=0.6869
2023-07-21 10:17:47.844947: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [21]
         [[{{node Placeholder/_0}}]]
2023-07-21 10:17:47.845601: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [21]
         [[{{node Placeholder/_0}}]]
1/1 [==============================] - 1s 1s/step
1/1 [==============================] - 0s 112ms/step
1/1 [==============================] - 0s 115ms/step
/home/sharon/miniconda3/envs/gym3/lib/python3.10/site-packages/doodleverse_utils/model_metrics.py:46: RuntimeWarning: invalid value encountered in divide
  recall = np.diag(confusionMatrix) / confusionMatrix.sum(axis = 0)
1/1 [==============================] - 0s 120ms/step
1/1 [==============================] - 0s 96ms/step
1/1 [==============================] - 0s 90ms/step
Mean of mean IoUs (validation subset)=0.577
Mean of mean IoUs, confusion matrix (validation subset)=0.577
Mean of mean frequency weighted IoUs, confusion matrix (validation subset)=0.828
Mean of Matthews Correlation Coefficients (validation subset)=0.744
Mean of mean Dice scores (validation subset)=0.659
Mean of mean KLD scores (validation subset)=1.515
2023-07-21 10:21:18.233105: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [122]
         [[{{node Placeholder/_0}}]]
2023-07-21 10:21:18.233408: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [122]
         [[{{node Placeholder/_0}}]]
1/1 [==============================] - 0s 209ms/step
1/1 [==============================] - 0s 86ms/step
1/1 [==============================] - 0s 95ms/step
/home/sharon/miniconda3/envs/gym3/lib/python3.10/site-packages/doodleverse_utils/model_metrics.py:46: RuntimeWarning: invalid value encountered in divide
  recall = np.diag(confusionMatrix) / confusionMatrix.sum(axis = 0)
1/1 [==============================] - 0s 96ms/step
1/1 [==============================] - 0s 94ms/step
1/1 [==============================] - 0s 84ms/step
1/1 [==============================] - 0s 85ms/step
1/1 [==============================] - 0s 85ms/step
1/1 [==============================] - 0s 95ms/step

Mean of mean IoUs (train subset)=0.590
Mean of mean IoUs, confusion matrix (train subset)=0.590
Mean of mean frequency weighted IoUs, confusion matrix (train subset)=0.841
Mean of Matthews Correlation Coefficients (train subset)=0.779
Mean of mean Dice scores (train subset)=0.674
Mean of mean KLD scores (train subset)=1.383
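For reference, the RuntimeWarning from model_metrics.py above is NumPy's normal behaviour when a class is absent from a sample: one column of the confusion matrix sums to zero and the element-wise division produces NaN. A toy illustration (the matrix values below are made up):

    import numpy as np

    # Hypothetical 3-class confusion matrix in which class 2 never occurs in the predictions
    cm = np.array([[10, 0, 0],
                   [ 2, 8, 0],
                   [ 1, 3, 0]])

    scores = np.diag(cm) / cm.sum(axis=0)  # 0/0 for class 2 -> "invalid value encountered in divide", NaN
    print(scores)                          # [0.769..., 0.727..., nan]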
dbuscombe-usgs commented 1 year ago

Ok, well that's something. You trained a model. It's not ideal that you had to do the HOT START ... can you train models using the other config files? There is another resunet, a unet, and a segformer.

I would also use the new dataset I posted on Zenodo, which already comes pre-split.

Note that this is just a toy dataset - the images look similar, yes, but that's ok for testing purposes. It's a very small dataset so the number of validation samples is already quite small ...

2320sharon commented 1 year ago

I'll download the new datasets and model files and give them a try

2320sharon commented 1 year ago

Alright, I ran make_dataset.py on the new dataset and it worked. Now I'm running train_model.py for the segformer model and it's running; it just finished epoch 2.

2320sharon commented 1 year ago

Good news: it's only been 5 minutes and it's already on epoch 34.

2320sharon commented 1 year ago

The segformer model ran much faster than last time, but it stopped at epoch 38 and then evaluated the validation dataset.

Epoch 36: LearningRateScheduler setting learning rate to 2.066852409625544e-05.
Epoch 36/100
15/15 [==============================] - 10s 670ms/step - loss: 0.0609 - val_loss: 0.0921 - lr: 2.0669e-05

Epoch 37: LearningRateScheduler setting learning rate to 1.86116716866299e-05.
Epoch 37/100
15/15 [==============================] - 10s 669ms/step - loss: 0.0608 - val_loss: 0.0925 - lr: 1.8612e-05

Epoch 38: LearningRateScheduler setting learning rate to 1.676050451796691e-05.
Epoch 38/100
15/15 [==============================] - 10s 648ms/step - loss: 0.0622 - val_loss: 0.0898 - lr: 1.6761e-05
.....................................
Evaluating model on entire validation set ...
2/2 [==============================] - 1s 243ms/step - loss: 0.0898
loss=0.0898
2023-07-21 12:37:47.871138: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [21]
         [[{{node Placeholder/_0}}]]
2023-07-21 12:37:47.871371: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [21]
         [[{{node Placeholder/_0}}]]
1/1 [==============================] - 3s 3s/step
1/1 [==============================] - 1s 755ms/step
1/1 [==============================] - 1s 593ms/step
1/1 [==============================] - 1s 730ms/step
Mean of mean IoUs (validation subset)=0.736
Mean of mean IoUs, confusion matrix (validation subset)=0.736
Mean of mean frequency weighted IoUs, confusion matrix (validation subset)=0.952
Mean of Matthews Correlation Coefficients (validation subset)=0.934
Mean of mean Dice scores (validation subset)=0.808
Mean of mean KLD scores (validation subset)=0.431
2023-07-21 12:41:38.741859: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [122]
         [[{{node Placeholder/_0}}]]
2023-07-21 12:41:38.742087: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [122]
         [[{{node Placeholder/_0}}]]
1/1 [==============================] - 1s 607ms/step
1/1 [==============================] - 1s 770ms/step
1/1 [==============================] - 1s 730ms/step
1/1 [==============================] - 1s 758ms/step

Mean of mean IoUs (train subset)=0.782
Mean of mean IoUs, confusion matrix (train subset)=0.782
Mean of mean frequency weighted IoUs, confusion matrix (train subset)=0.968
Mean of Matthews Correlation Coefficients (train subset)=0.957
Mean of mean Dice scores (train subset)=0.847
Mean of mean KLD scores (train subset)=0.278