Doodleverse / segmentation_gym

A neural gym for training deep learning models to carry out geoscientific image segmentation. Works best with labels generated using https://github.com/Doodleverse/dash_doodler
MIT License

TroubleShooting WSL GPU Installation Tips #136

Closed 2320sharon closed 6 months ago

2320sharon commented 1 year ago

A few users on Windows 11 have found they can't use their GPU to train models. They're getting segmentation faults, the environment claims it can't find the libdevice files, and there are many more error messages.

I have found a few different solutions that when used together actually work. I propose we make a troubleshooting wiki that helps users troubleshoot their GPU issues. We can use this troubleshooting guide for other applications that use the GPU across the doodleverse and other tools.

I'm still in the process of figuring out the exact order these need to be applied in, but here is what works so far for my Windows 11, WSL2, NVIDIA 3080 Ti GPU laptop. Users on other Windows machines may or may not run into these errors.

Instructions for Windows Users

⚠️ Run this command each time you boot up the env ⚠️
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib

Install TensorFlow and set up conda following the latest TensorFlow instructions. These are the commands I used. Run all of them in WSL 2.

1. Install miniconda

curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

2. Create a conda environment for gym

⚠️ NEVER INSTALL TENSORFLOW WITH CONDA ON WINDOWS, IT'S NOT UP TO DATE

conda create -n gym python=3.10 -y
conda activate gym
conda install -c conda-forge cudatoolkit=11.8.0
python3 -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.*

3. Then configure the system paths for the CUDA and NVIDIA libraries

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

4. Install the nvidia cuda-nvcc library and set the LD_LIBRARY_PATH

6. Verify the installation by running the following commands

  1. ptxas --version

    • this should give output similar to the following
      ptxas: NVIDIA (R) Ptx optimizing assembler
      Copyright (c) 2005-2023 NVIDIA Corporation
      Built on Fri_Jan__6_16:43:29_PST_2023
      Cuda compilation tools, release 12.0, V12.0.140
      Build cuda_12.0.r12.0/compiler.32267302_0
  2. python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

    • this should give output similar to the following
      2023-01-27 14:04:05.322921: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
      To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
      2023-01-27 14:04:05.855462: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/wheatley/miniconda3/envs/tf_2-11_gpu/lib/
      2023-01-27 14:04:05.855515: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/wheatley/miniconda3/envs/tf_2-11_gpu/lib/
      2023-01-27 14:04:05.855522: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
      2023-01-27 14:04:06.270401: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
      2023-01-27 14:04:06.274057: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
      2023-01-27 14:04:06.274215: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
      [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

    ⚠️ Run this command each time you boot up the env ⚠️
    export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib

2320sharon commented 1 year ago

On Friday I ran the train_model.py script, successfully trained the model for roughly 10 epochs, and then stopped training. Today I came back to continue training the model, and after running export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib I noticed two things when I ran the train_model.py script:

  1. My GPU is not being recognized
  2. I'm getting a Segmentation fault exception after Epoch 1/100

This is actually good news because it gives me the opportunity to debug the segmentation fault error Dan kept running into. I'll update this issue with my findings and troubleshooting.

2320sharon commented 1 year ago

Troubleshooting

  1. Run ptxas --version Output: looks good
    ptxas: NVIDIA (R) Ptx optimizing assembler
    Copyright (c) 2005-2023 NVIDIA Corporation
    Built on Tue_Jun_13_19:13:58_PDT_2023
    Cuda compilation tools, release 12.2, V12.2.91
    Build cuda_12.2.r12.2/compiler.32965470_0
  2. Run python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" Output: Looks like the GPU is not being found again. Maybe I need to set LD_LIBRARY_PATH again?
    2023-07-17 09:19:38.318761: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
    2023-07-17 09:19:38.350391: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
    To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2023-07-17 09:19:38.837469: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
    2023-07-17 09:19:39.619013: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
    Your kernel may have been built without NUMA support.
    2023-07-17 09:19:39.638724: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
    Skipping registering GPU devices...
    []
  3. Sadly, no GPU found after this too :(

    (gym) sharon@Sharonator:~/gym/segmentation_gym$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
    (gym) sharon@Sharonator:~/gym/segmentation_gym$ echo $LD_LIBRARY_PATH
    :/home/sharon/miniconda3/envs/gym/lib/:/home/sharon/miniconda3/envs/gym/lib/
  4. I found something that worked
2320sharon commented 1 year ago

After troubleshooting I ran python train_model.py and got the following output:

(gym) sharon@Sharonator:~/gym/segmentation_gym$ python train_model.py
/home/sharon/gym/segmentation_gym/model_from_scratch_test/train_data/train_npzs
/home/sharon/gym/segmentation_gym/model_from_scratch_test/val_data/val_npzs
/home/sharon/gym/segmentation_gym/my_segmentation_gym_datasets/config/hatteras_l8_resunet.json
Using GPU
Using single GPU device
2023-07-17 09:28:53.035583: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-07-17 09:28:53.067546: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-17 09:28:53.538306: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Version:  2.12.0
Eager mode:  True
2023-07-17 09:28:54.507412: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 09:28:54.524106: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 09:28:54.524209: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
physical_devices : [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2023-07-17 09:28:54.524543: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Making new directory for example model outputs: /home/sharon/gym/segmentation_gym/my_segmentation_gym_datasets/modelOut
2023-07-17 09:28:54.534886: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 09:28:54.534967: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 09:28:54.534990: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 09:28:56.261248: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 09:28:56.261389: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 09:28:56.261396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1722] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-07-17 09:28:56.261420: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 09:28:56.261470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13485 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3080 Ti Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
2023-07-17 09:28:56.321278: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [122]
         [[{{node Placeholder/_0}}]]
2023-07-17 09:28:56.336157: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [21]
         [[{{node Placeholder/_0}}]]
LOAD_DATA_WITH_CPU not specified in config file. Setting to "False"
.....................................
Creating and compiling model ...
Garbage collection will NOT be perfomed. To change this behaviour, set CLEAR_MEMORY=True in the config file
INITIAL_EPOCH not specified in the config file. Setting to default of 0 ...
.....................................
Training model ...
2023-07-17 09:28:56.906975: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [122]
         [[{{node Placeholder/_0}}]]
2023-07-17 09:28:56.907210: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [122]
         [[{{node Placeholder/_0}}]]

Epoch 1: LearningRateScheduler setting learning rate to 1e-07.
Epoch 1/100
2023-07-17 09:28:58.596377: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:417] Loaded runtime CuDNN library: 8.1.0 but source was compiled with: 8.6.0.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2023-07-17 09:28:58.598123: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at conv_ops.cc:1068 : UNIMPLEMENTED: DNN library is not found.
Traceback (most recent call last):
  File "/home/sharon/gym/segmentation_gym/train_model.py", line 868, in <module>
    history = model.fit(train_ds, steps_per_epoch=steps_per_epoch, epochs=MAX_EPOCHS,
  File "/home/sharon/miniconda3/envs/gym/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/sharon/miniconda3/envs/gym/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnimplementedError: Exception encountered when calling layer 'conv2d' (type Conv2D).

{{function_node __wrapped__Conv2D_device_/job:localhost/replica:0/task:0/device:GPU:0}} DNN library is not found. [Op:Conv2D]

Call arguments received by layer 'conv2d' (type Conv2D):
  • inputs=tf.Tensor(shape=(16, 768, 768, 3), dtype=float16)
2320sharon commented 1 year ago
(gym) sharon@Sharonator:~/gym/segmentation_gym$ python3 -m pip install nvidia-cudnn-cu11==8.6.0.163
Requirement already satisfied: nvidia-cudnn-cu11==8.6.0.163 in /home/sharon/miniconda3/envs/gym/lib/python3.10/site-packages (8.6.0.163)
Requirement already satisfied: nvidia-cublas-cu11 in /home/sharon/miniconda3/envs/gym/lib/python3.10/site-packages (from nvidia-cudnn-cu11==8.6.0.163) (11.11.3.6)
(gym) sharon@Sharonator:~/gym/segmentation_gym$ pip install  nvidia-cudnn-cu11==8.6.0.163
Requirement already satisfied: nvidia-cudnn-cu11==8.6.0.163 in /home/sharon/miniconda3/envs/gym/lib/python3.10/site-packages (8.6.0.163)
Requirement already satisfied: nvidia-cublas-cu11 in /home/sharon/miniconda3/envs/gym/lib/python3.10/site-packages (from nvidia-cudnn-cu11==8.6.0.163) (11.11.3.6)

I found out what was causing the error. When I reactivated my tf11 environment I re-ran the export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib command, but when I reactivated gym I didn't run it again, which left XLA_FLAGS stuck at the old location in the tf11 env.

(gym) sharon@Sharonator:~/gym/segmentation_gym$ echo $XLA_FLAGS
--xla_gpu_cuda_data_dir=/home/sharon/miniconda3/envs/tf11/lib

To fix this I ran export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib while the gym env was active, so that XLA_FLAGS points at /home/sharon/miniconda3/envs/gym/lib.
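
A quick way to catch this kind of stale setting is to compare the directory in XLA_FLAGS with the active environment's prefix. The sketch below is illustrative only: the function names xla_cuda_data_dir and xla_flags_match_env are made up for this example, and it assumes XLA_FLAGS is a whitespace-separated list of flags and that conda sets CONDA_PREFIX on activation.

```python
import os
import posixpath


def xla_cuda_data_dir(xla_flags):
    """Return the value of --xla_gpu_cuda_data_dir in an XLA_FLAGS string, or None."""
    prefix = "--xla_gpu_cuda_data_dir="
    for flag in xla_flags.split():
        if flag.startswith(prefix):
            return flag[len(prefix):]
    return None


def xla_flags_match_env(xla_flags, conda_prefix):
    """True if the CUDA data dir lives inside the given conda environment."""
    data_dir = xla_cuda_data_dir(xla_flags)
    if data_dir is None:
        return False
    env_root = posixpath.normpath(conda_prefix)
    # Append "/" so that ".../envs/gym" does not match ".../envs/gym3".
    return posixpath.normpath(data_dir).startswith(env_root + "/")


if __name__ == "__main__":
    flags = os.environ.get("XLA_FLAGS", "")
    active = os.environ.get("CONDA_PREFIX", "")
    ok = bool(active) and xla_flags_match_env(flags, active)
    print("XLA_FLAGS points into the active env:", ok)
```

With the values shown above, the tf11 path checked against the gym prefix reports a mismatch, which is the situation that needed the re-export.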

2320sharon commented 1 year ago

Interesting, it appears LD_LIBRARY_PATH contains both the path for my tf11 environment and the path for my gym environment. I bet this is causing some confusion when TensorFlow is attempting to load the cuDNN library, because my tf11 env has cuDNN 8.1 while gym has cuDNN 8.6. I guess I need a way to modify LD_LIBRARY_PATH so it's cleared out and replaced each time a different env is activated.

(gym) sharon@Sharonator:~/gym/segmentation_gym$ echo $LD_LIBRARY_PATH
:/home/sharon/miniconda3/envs/gym/lib/:/home/sharon/miniconda3/envs/gym/lib/:/home/sharon/miniconda3/envs/tf11/lib/:/home/sharon/miniconda3/envs/tf11/lib/python3.10/site-packages/nvidia/cudnn/lib:/home/sharon/miniconda3/envs/gym/lib/:/home/sharon/miniconda3/envs/gym/lib/
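
To make a mixed-up LD_LIBRARY_PATH like this easier to read, here is a rough sketch that splits the value, drops empty and duplicate entries, and separates out entries that belong to a different conda environment. The function name audit_ld_library_path is hypothetical, and the check assumes the active environment lives in an envs/ directory next to its sibling environments (as in the paths above). It would not behave sensibly for the base environment.

```python
import posixpath


def audit_ld_library_path(value, conda_prefix):
    """Split an LD_LIBRARY_PATH value into (kept, foreign) entry lists.

    Empty entries and duplicates are dropped (first occurrence wins).
    Entries under a sibling conda env of conda_prefix are reported as foreign.
    """
    active = posixpath.normpath(conda_prefix)
    envs_root = posixpath.dirname(active)  # e.g. .../miniconda3/envs
    kept, foreign, seen = [], [], set()
    for raw in value.split(":"):
        if not raw:
            continue  # produced by a leading ':' or a '::'
        entry = posixpath.normpath(raw)  # also strips a trailing '/'
        if entry in seen:
            continue
        seen.add(entry)
        in_some_env = entry.startswith(envs_root + "/")
        in_active_env = entry == active or entry.startswith(active + "/")
        if in_some_env and not in_active_env:
            foreign.append(entry)
        else:
            kept.append(entry)
    return kept, foreign
```

Calling it with the value echoed above and /home/sharon/miniconda3/envs/gym as the prefix keeps a single gym/lib entry and flags the two tf11 entries as foreign.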

2320sharon commented 1 year ago

Okay so here is my idea: clear the LD_LIBRARY_PATH variable and then set it to the currently loaded environment's library

(gym) sharon@Sharonator:~/gym/segmentation_gym$ export LD_LIBRARY_PATH=
(gym) sharon@Sharonator:~/gym/segmentation_gym$ echo $ LD_LIBRARY_PATH
$ LD_LIBRARY_PATH
(gym) sharon@Sharonator:~/gym/segmentation_gym$ echo $LD_LIBRARY_PATH

(gym) sharon@Sharonator:~/gym/segmentation_gym$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib
(gym) sharon@Sharonator:~/gym/segmentation_gym$ echo $LD_LIBRARY_PATH
:/home/sharon/miniconda3/envs/gym/lib
(gym) sharon@Sharonator:~/gym/segmentation_gym$ echo $XLA_FLAGS
--xla_gpu_cuda_data_dir=/home/sharon/miniconda3/envs/gym/lib

Now that $LD_LIBRARY_PATH = :/home/sharon/miniconda3/envs/gym/lib, I ran python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" to see if my GPU would be discovered, and it wasn't. This must mean something else is wrong.


(gym) sharon@Sharonator:~/gym/segmentation_gym$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2023-07-17 10:11:34.339992: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-07-17 10:11:34.366936: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-17 10:11:34.757321: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-07-17 10:11:35.485352: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-17 10:11:35.502513: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]
2320sharon commented 1 year ago

After learning more about how WSL works and what each of these commands does, I think I have a better understanding of what was going wrong and what install instructions we should give users.

Here are some things to understand

  1. The env_vars.sh script at $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh is used to set environment variables whenever the conda environment is activated
  2. /usr/lib/wsl/lib contains files like libcudnn_cnn_infer.so.8, so the model will only run if /usr/lib/wsl/lib is in $LD_LIBRARY_PATH
  3. LD_LIBRARY_PATH is modified by each export LD_LIBRARY_PATH command, so it's important to clear it when activating a conda environment so that it only contains the relevant paths

Here is what I think needs to be changed in order for the models to run: the content of env_vars.sh needs to be modified so that LD_LIBRARY_PATH, CUDNN_PATH, and XLA_FLAGS are set correctly. Here is what I have so far, but I'll be modifying this and the commands to correctly create a gym env in WSL.

export LD_LIBRARY_PATH=
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib

Then export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
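
One side effect of clearing the variable and then extending it with $LD_LIBRARY_PATH:... is an empty entry in the final value (an extra ':'), which the dynamic linker treats as the current working directory. As a rough sketch of the intended search order only, the helper below joins the three directories without that empty entry. The name build_ld_library_path is hypothetical, and the wsl_lib default is simply the path mentioned above.

```python
def build_ld_library_path(conda_prefix, cudnn_path, wsl_lib="/usr/lib/wsl/lib"):
    """Join the WSL driver libs, the env's lib dir, and the cuDNN lib dir, in that order."""
    parts = [wsl_lib, f"{conda_prefix}/lib", f"{cudnn_path}/lib"]
    # Skip anything empty so the result never contains '::' or a leading ':'.
    return ":".join(part for part in parts if part)


if __name__ == "__main__":
    prefix = "/home/sharon/miniconda3/envs/gym"
    cudnn = prefix + "/lib/python3.10/site-packages/nvidia/cudnn"
    print(build_ld_library_path(prefix, cudnn))
```

The order matters because the linker searches the directories left to right, so whichever cuDNN copy appears first is the one that gets loaded.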

2320sharon commented 1 year ago

Something strange I noticed is that with each epoch that completes, the model is getting slower and slower to train.

7/7 [==============================] - 104s 6s/step - loss: 0.8242 - mean_iou: 0.0675 - dice_coef: 0.1758 - val_loss: 0.8274 - val_mean_iou: 0.0618 - val_dice_coef: 0.1726 - lr: 1.0000e-07

Epoch 2: LearningRateScheduler setting learning rate to 5.095e-06.
Epoch 2/100
7/7 [==============================] - 47s 7s/step - loss: 0.8091 - mean_iou: 0.0829 - dice_coef: 0.1909 - val_loss: 0.8179 - val_mean_iou: 0.0769 - val_dice_coef: 0.1821 - lr: 5.0950e-06

Epoch 3: LearningRateScheduler setting learning rate to 1.0090000000000002e-05.
Epoch 3/100
7/7 [==============================] - 57s 8s/step - loss: 0.7651 - mean_iou: 0.1538 - dice_coef: 0.2349 - val_loss: 0.7997 - val_mean_iou: 0.0876 - val_dice_coef: 0.2003 - lr: 1.0090e-05

Epoch 4: LearningRateScheduler setting learning rate to 1.5085000000000002e-05.
Epoch 4/100
7/7 [==============================] - 83s 12s/step - loss: 0.7162 - mean_iou: 0.2567 - dice_coef: 0.2838 - val_loss: 0.7794 - val_mean_iou: 0.1456 - val_dice_coef: 0.2206 - lr: 1.5085e-05

Epoch 5: LearningRateScheduler setting learning rate to 2.008e-05.
Epoch 5/100
7/7 [==============================] - 104s 15s/step - loss: 0.6682 - mean_iou: 0.3527 - dice_coef: 0.3318 - val_loss: 0.7558 - val_mean_iou: 0.2589 - val_dice_coef: 0.2442 - lr: 2.0080e-05

Epoch 6: LearningRateScheduler setting learning rate to 2.5075000000000003e-05.
Epoch 6/100
7/7 [==============================] - 108s 16s/step - loss: 0.6181 - mean_iou: 0.4349 - dice_coef: 0.3819 - val_loss: 0.7265 - val_mean_iou: 0.3294 - val_dice_coef: 0.2735 - lr: 2.5075e-05

Epoch 7: LearningRateScheduler setting learning rate to 3.0070000000000002e-05.
Epoch 7/100
7/7 [==============================] - 135s 20s/step - loss: 0.5702 - mean_iou: 0.5035 - dice_coef: 0.4298 - val_loss: 0.6928 - val_mean_iou: 0.3725 - val_dice_coef: 0.3072 - lr: 3.0070e-05

Epoch 8: LearningRateScheduler setting learning rate to 3.5065000000000004e-05.
Epoch 8/100
7/7 [==============================] - 166s 24s/step - loss: 0.5161 - mean_iou: 0.5398 - dice_coef: 0.4839 - val_loss: 0.6759 - val_mean_iou: 0.3845 - val_dice_coef: 0.3241 - lr: 3.5065e-05

Epoch 9: LearningRateScheduler setting learning rate to 4.0060000000000006e-05.
Epoch 9/100
7/7 [==============================] - 199s 29s/step - loss: 0.4622 - mean_iou: 0.5562 - dice_coef: 0.5378 - val_loss: 0.6454 - val_mean_iou: 0.4053 - val_dice_coef: 0.3546 - lr: 4.0060e-05

Epoch 10: LearningRateScheduler setting learning rate to 4.505500000000001e-05.
Epoch 10/100
7/7 [==============================] - 243s 35s/step - loss: 0.4075 - mean_iou: 0.6055 - dice_coef: 0.5925 - val_loss: 0.5980 - val_mean_iou: 0.4915 - val_dice_coef: 0.4020 - lr: 4.5055e-05

Epoch 11: LearningRateScheduler setting learning rate to 5.005000000000001e-05.
Epoch 11/100
7/7 [==============================] - 279s 40s/step - loss: 0.3529 - mean_iou: 0.6701 - dice_coef: 0.6471 - val_loss: 0.5568 - val_mean_iou: 0.5332 - val_dice_coef: 0.4432 - lr: 5.0050e-05

Epoch 12: LearningRateScheduler setting learning rate to 5.5045000000000006e-05.
Epoch 12/100
7/7 [==============================] - 309s 45s/step - loss: 0.3038 - mean_iou: 0.7141 - dice_coef: 0.6962 - val_loss: 0.5020 - val_mean_iou: 0.5834 - val_dice_coef: 0.4980 - lr: 5.5045e-05

Epoch 13: LearningRateScheduler setting learning rate to 6.004000000000001e-05.
Epoch 13/100
7/7 [==============================] - 378s 55s/step - loss: 0.2523 - mean_iou: 0.7485 - dice_coef: 0.7477 - val_loss: 0.4510 - val_mean_iou: 0.6049 - val_dice_coef: 0.5490 - lr: 6.0040e-05

Epoch 14: LearningRateScheduler setting learning rate to 6.5035e-05.
Epoch 14/100
7/7 [==============================] - 445s 65s/step - loss: 0.2153 - mean_iou: 0.7634 - dice_coef: 0.7847 - val_loss: 0.4214 - val_mean_iou: 0.6122 - val_dice_coef: 0.5786 - lr: 6.5035e-05

Epoch 15: LearningRateScheduler setting learning rate to 7.003e-05.
Epoch 15/100
7/7 [==============================] - 522s 75s/step - loss: 0.1790 - mean_iou: 0.7828 - dice_coef: 0.8210 - val_loss: 0.3876 - val_mean_iou: 0.6245 - val_dice_coef: 0.6124 - lr: 7.0030e-05

Epoch 16: LearningRateScheduler setting learning rate to 7.502500000000001e-05.
Epoch 16/100
7/7 [==============================] - 608s 88s/step - loss: 0.1608 - mean_iou: 0.7893 - dice_coef: 0.8392 - val_loss: 0.3796 - val_mean_iou: 0.6103 - val_dice_coef: 0.6204 - lr: 7.5025e-05

Epoch 17: LearningRateScheduler setting learning rate to 8.002000000000001e-05.
Epoch 17/100
7/7 [==============================] - 677s 98s/step - loss: 0.1454 - mean_iou: 0.7990 - dice_coef: 0.8546 - val_loss: 0.3747 - val_mean_iou: 0.6016 - val_dice_coef: 0.6253 - lr: 8.0020e-05

Epoch 18: LearningRateScheduler setting learning rate to 8.501500000000001e-05.
Epoch 18/100
7/7 [==============================] - 754s 109s/step - loss: 0.1332 - mean_iou: 0.8095 - dice_coef: 0.8668 - val_loss: 0.3783 - val_mean_iou: 0.5886 - val_dice_coef: 0.6217 - lr: 8.5015e-05

Epoch 19: LearningRateScheduler setting learning rate to 9.001000000000001e-05.
Epoch 19/100
4/7 [================>.............] - ETA: 5:55 - loss: 0.1296 - mean_iou: 0.8113 - dice_coef: 0.8704
7/7 [==============================] - 842s 121s/step - loss: 0.1298 - mean_iou: 0.8103 - dice_coef: 0.8702 - val_loss: 0.3611 - val_mean_iou: 0.5946 - val_dice_coef: 0.6389 - lr: 9.0010e-05

Epoch 20: LearningRateScheduler setting learning rate to 9.500500000000002e-05.
Epoch 20/100
6/7 [========================>.....] - ETA: 2:05 - loss: 0.1213 - mean_iou: 0.8189 - dice_coef: 0.8787
2023-07-17 12:49:09.654075: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:65]
********************************
[Compiling module a_inference__update_step_xla_715621__XlaMustCompile_true_config_proto_8589078909834744431_executor_type_11160318154034397263_.56] Very slow compile?  If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************
2023-07-17 12:49:09.884397: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:133] The operation took 2m0.240132865s
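
To put a number on the slowdown, here is a small sketch that pulls the per-epoch wall-clock seconds out of Keras progress lines like the ones above. The names EPOCH_LINE, epoch_seconds, and slowdown_ratio are made up for this example, and the regular expression only assumes the "N/N [=====] - 104s ..." layout seen in this log. It measures the trend and says nothing about its cause.

```python
import re

# Matches completed-epoch lines such as "7/7 [=====] - 104s 6s/step - loss: ...".
# In-progress lines (with '>' and '.' in the bar) are deliberately not matched.
EPOCH_LINE = re.compile(r"^\d+/\d+ \[=+\] - (\d+)s\b")


def epoch_seconds(log_text):
    """Return the wall-clock seconds reported for each completed epoch, in order."""
    times = []
    for line in log_text.splitlines():
        match = EPOCH_LINE.match(line.strip())
        if match:
            times.append(int(match.group(1)))
    return times


def slowdown_ratio(times):
    """Ratio of the last epoch's duration to the fastest epoch's duration."""
    return times[-1] / min(times)
```

Feeding the whole training log to epoch_seconds gives one duration per finished epoch, which can then be plotted or passed to slowdown_ratio.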
2320sharon commented 1 year ago

To get everything to work in 1 go here are the commands I ran


conda create -n gym3 python=3.10 -y
conda activate gym3
conda install -c conda-forge cudatoolkit=11.8.0 -y
python3 -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.*
conda install -c nvidia cuda-nvcc --yes

conda install -c conda-forge scikit-image ipython tqdm pandas natsort matplotlib transformers -y
python -m pip install doodleverse_utils chardet

pip uninstall h5py --yes
conda install -c conda-forge h5py -y

mkdir -p $CONDA_PREFIX/lib/nvvm/libdevice/
cp -p $CONDA_PREFIX/lib/libdevice.10.bc $CONDA_PREFIX/lib/nvvm/libdevice/

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'export LD_LIBRARY_PATH=' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
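
After running a recipe like this, it can help to confirm that the two files it creates by hand are really there before starting a training run. This is only a sketch: missing_gym_files is a hypothetical helper, and the two paths it checks are taken directly from the mkdir, cp, and echo commands above.

```python
import os
from pathlib import Path


def missing_gym_files(conda_prefix):
    """Return the recipe's hand-made files that are absent under conda_prefix."""
    root = Path(conda_prefix)
    expected = [
        # copied by: cp -p $CONDA_PREFIX/lib/libdevice.10.bc $CONDA_PREFIX/lib/nvvm/libdevice/
        root / "lib" / "nvvm" / "libdevice" / "libdevice.10.bc",
        # written by the echo ... env_vars.sh lines
        root / "etc" / "conda" / "activate.d" / "env_vars.sh",
    ]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    prefix = os.environ.get("CONDA_PREFIX")
    if prefix:
        print("missing:", missing_gym_files(prefix))
    else:
        print("CONDA_PREFIX is not set; activate the env first")
```

An empty list means both files exist. It does not prove the GPU will be found, only that these two setup steps took effect.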
dbuscombe-usgs commented 1 year ago

I managed to use this recipe to successfully install a gym conda env on wsl2. I verified that it is able to see my GPUs

I was able to create a new dataset using make_dataset.py and the Cape Hatteras test dataset

dbuscombe-usgs commented 1 year ago

When I run train_model.py, I get this error:

Epoch 1/100
2023-07-20 14:11:10.389694: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:417] Loaded runtime CuDNN library: 8.5.0 but source was compiled with: 8.6.0.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2023-07-20 14:11:10.391317: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at conv_ops.cc:1068 : UNIMPLEMENTED: DNN library is not found.
Traceback (most recent call last):
  File "/mnt/f/dbuscombe_github/doodleverse/segmentation_gym/train_model.py", line 856, in <module>
    history = model.fit(train_ds, steps_per_epoch=steps_per_epoch, epochs=MAX_EPOCHS,
  File "/home/elwha/miniconda3/envs/gym/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/elwha/miniconda3/envs/gym/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnimplementedError: Exception encountered when calling layer 'conv2d' (type Conv2D).

{{function_node __wrapped__Conv2D_device_/job:localhost/replica:0/task:0/device:GPU:0}} DNN library is not found. [Op:Conv2D]
dbuscombe-usgs commented 1 year ago

But it finds my GPUs

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU')]

2320sharon commented 1 year ago

Hmm, this line leads me to believe the incorrect version of cuDNN is being loaded: Loaded runtime CuDNN library: 8.5.0 but source was compiled with: 8.6.0. I would use echo to check the $CUDNN_PATH, $XLA_FLAGS, and $LD_LIBRARY_PATH variables. Maybe a path to another environment is in these paths and it's reading the incorrect cuDNN version from there.

2320sharon commented 1 year ago

Here are the outputs for the environment variables I set

(base) sharon@Sharonator:~/gym/segmentation_gym$ conda activate gym3
(gym3) sharon@Sharonator:~/gym/segmentation_gym$ echo $LD_LIBRARY_PATH
/usr/lib/wsl/lib::/home/sharon/miniconda3/envs/gym3/lib/:/home/sharon/miniconda3/envs/gym3/lib/python3.10/site-packages/nvidia/cudnn/lib
(gym3) sharon@Sharonator:~/gym/segmentation_gym$ echo $CUDNN_PATH
/home/sharon/miniconda3/envs/gym3/lib/python3.10/site-packages/nvidia/cudnn
(gym3) sharon@Sharonator:~/gym/segmentation_gym$ echo $XLA_FLAGS
--xla_gpu_cuda_data_dir=/home/sharon/miniconda3/envs/gym3/lib
dbuscombe-usgs commented 1 year ago
(base) elwha@DESKTOP-LKORC3I:/mnt/c/Users/Elwha$ conda activate gym
(gym) elwha@DESKTOP-LKORC3I:/mnt/c/Users/Elwha$ echo $LD_LIBRARY_PATH
/usr/lib/wsl/lib::/home/elwha/miniconda3/envs/gym/lib/:/home/elwha/.local/lib/python3.10/site-packages/nvidia/cudnn/lib
(gym) elwha@DESKTOP-LKORC3I:/mnt/c/Users/Elwha$ echo $CUDNN_PATH
/home/elwha/.local/lib/python3.10/site-packages/nvidia/cudnn
(gym) elwha@DESKTOP-LKORC3I:/mnt/c/Users/Elwha$ echo $XLA_FLAGS
--xla_gpu_cuda_data_dir=/home/elwha/miniconda3/envs/gym/lib

Then I installed

python3 -m pip install nvidia-cudnn-cu11==8.5.0.96

Now it can't find my GPU

2320sharon commented 1 year ago

Does the file $CONDA_PREFIX/lib/libdevice.10.bc exist?
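
If it helps, here is a small sketch that looks for the file anywhere under the environment rather than only in lib. `find_libdevice` is a hypothetical helper name, and the only assumption is that the file is called libdevice.10.bc as in the path above.

```python
import os
from pathlib import Path


def find_libdevice(prefix):
    """Return every libdevice.10.bc found under `prefix`, searched recursively."""
    root = Path(prefix)
    if not root.is_dir():
        return []
    return sorted(root.rglob("libdevice.10.bc"))


if __name__ == "__main__":
    active = os.environ.get("CONDA_PREFIX", "")
    matches = find_libdevice(active) if active else []
    if matches:
        for match in matches:
            print(match)
    else:
        print("libdevice.10.bc not found under the active conda environment")
```

An empty result would suggest the cudatoolkit install did not place the file in this environment.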

Wait, shouldn't you be using nvidia-cudnn-cu11==8.6.0.163? The code expects cuDNN 8.6.0: Loaded runtime CuDNN library: 8.5.0 but source was compiled with: 8.6.0. CuDNN library needs to have matching major version and equal or higher minor version.
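
The rule in that message can be written out as a short check. This is only a sketch of the rule as the log states it (same major version, equal or higher minor version), not TensorFlow's own code, and the function names are made up.

```python
def parse_version(text):
    """Turn a dotted version string such as '8.6.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def cudnn_runtime_ok(loaded, compiled):
    """True if `loaded` has the same major and an equal or higher minor than `compiled`."""
    loaded_major, loaded_minor = parse_version(loaded)[:2]
    compiled_major, compiled_minor = parse_version(compiled)[:2]
    return loaded_major == compiled_major and loaded_minor >= compiled_minor


print(cudnn_runtime_ok("8.5.0", "8.6.0"))  # False: the case from the log
print(cudnn_runtime_ok("8.6.0", "8.6.0"))  # True
```

So installing 8.5.0.96 moves the runtime library further from what TensorFlow 2.12 was compiled against, not closer.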

2320sharon commented 1 year ago

Can you list the cudatoolkit and nvidia-cudnn-cu11 versions? I'm just trying to understand why it originally loaded CuDNN library: 8.5.0
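
For the pip-installed packages, something like the sketch below prints the versions Python actually sees. `installed_versions` is a hypothetical helper. cudatoolkit is a conda package, so it will not show up here; `conda list cudatoolkit` covers that one.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    wanted = ["tensorflow", "nvidia-cudnn-cu11", "torch"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```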

dbuscombe-usgs commented 1 year ago

It loaded 8.6 originally and it found my GPUs, but then I got an error telling me to install 8.5, so I did. I'm starting from scratch. So annoying!

2320sharon commented 1 year ago

Gotta love cuda issues. 🤞 Hope it goes better the second time around

dbuscombe-usgs commented 1 year ago

I do

conda create -n gym3 python=3.10 -y
conda activate gym3
conda install -c conda-forge cudatoolkit=11.8.0 -y
python3 -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.*

and on the last line I get this error

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch 1.13.1 requires nvidia-cudnn-cu11==8.5.0.96; platform_system == "Linux", but you have nvidia-cudnn-cu11 8.6.0.163 which is incompatible.
Successfully installed absl-py-1.4.0 astunparse-1.6.3 cachetools-5.3.1 flatbuffers-23.5.26 gast-0.4.0 google-auth-2.22.0 google-auth-oauthlib-1.0.0 google-pasta-0.2.0 grpcio-1.56.2 h5py-3.9.0 jax-0.4.13 keras-2.12.0 libclang-16.0.6 markdown-3.4.3 ml-dtypes-0.2.0 nvidia-cudnn-cu11-8.6.0.163 oauthlib-3.2.2 opt-einsum-3.3.0 protobuf-4.23.4 pyasn1-0.5.0 pyasn1-modules-0.3.0 requests-oauthlib-1.3.1 rsa-4.9 six-1.16.0 tensorboard-2.12.3 tensorboard-data-server-0.7.1 tensorflow-2.12.1 tensorflow-estimator-2.12.0 tensorflow-io-gcs-filesystem-0.32.0 termcolor-2.3.0 urllib3-1.26.16 werkzeug-2.3.6 wrapt-1.14.1
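
The conflict mentions torch 1.13.1 even though the environment was just created. One possible explanation, and it is only an assumption, is that torch lives in the per-user site directory (~/.local/lib/python3.10/site-packages), which every Python 3.10 environment can see. That would fit the CUDNN_PATH shown earlier, which points into /home/elwha/.local. A sketch for checking where a package is loaded from (helper names are made up):

```python
import os
import site
from importlib import metadata


def distribution_location(name):
    """Return the directory a distribution is installed in, or None if it is missing."""
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return None
    return str(dist.locate_file(""))


def in_user_site(location):
    """True if `location` is inside the per-user site-packages directory."""
    if location is None:
        return False
    user_site = os.path.normpath(site.getusersitepackages())
    path = os.path.normpath(location)
    return path == user_site or path.startswith(user_site + os.sep)


if __name__ == "__main__":
    for name in ("torch", "nvidia-cudnn-cu11", "tensorflow"):
        where = distribution_location(name)
        print(f"{name}: {where} (user site: {in_user_site(where)})")
```

If packages do turn out to come from the user site, setting PYTHONNOUSERSITE=1 before starting Python makes the interpreter ignore that directory.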
dbuscombe-usgs commented 1 year ago

I persevered, following the rest of the recipe, hoping the error message from pip didn't matter

and now I'm training models on WSL!! 🎉

No idea what happened the first time

dbuscombe-usgs commented 6 months ago

Update February 2024

WSL2 + GPU + SegFormer models have been problematic for some time

After a lot of research, the following works on one WSL2 (Ubuntu 22.04) installation. This conda env recipe is currently able to create an env that can train SegFormer models on multiple GPUs

Starting from a fresh installation, install miniconda

sudo apt-get update
sudo apt-get install wget
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh
bash Miniconda3-py39_4.12.0-Linux-x86_64.sh
bash

Housecleaning:

conda update -n base conda
conda clean --all -y
python3 -m pip install --upgrade pip 

Create env:

conda create -n gym_gpu python=3.10 -y
conda activate gym_gpu
conda install -c conda-forge cudatoolkit=11.8.0 -y
conda install -c nvidia cuda-nvcc -y

python3 -m pip install nvidia-cudnn-cu11 tensorflow[and-cuda]

conda install -c conda-forge scikit-image ipython tqdm pandas natsort matplotlib -y
python3 -m pip install doodleverse_utils chardet

python3 -m pip install transformers

Test:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

python -c "from transformers import TFSegformerForSemanticSegmentation"
dbuscombe-usgs commented 6 months ago

If anyone wants to help with the Doodleverse/Segmentation Gym/Zoo GPU conda env, you would need access to an NVIDIA-GPU enabled machine. It doesn't have to be a large machine - a laptop is fine.

The conda installation steps are detailed above. If the GPU test fails, you'll see output like this:

(gym) marda@IGSWAPWGLTW3120:~$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2024-02-26 18:25:36.666103: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-26 18:25:36.666145: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-26 18:25:36.667124: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-26 18:25:36.670805: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-26 18:25:37.186279: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[]

If it passes, it will list your GPUs, like this:

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU')]

If the two tests pass, then you could train a model using the example dataset https://zenodo.org/records/8170543

detailed here: https://github.com/Doodleverse/segmentation_gym/wiki/02_Case-Study-Demo

To get the data, use:

wget https://zenodo.org/records/8170543/files/my_segmentation_gym_datasets_v5.zip
sudo apt-get install unzip
unzip my_segmentation_gym_datasets_v5.zip

Then clone segmentation gym and train a model on the test dataset

git clone --depth 1 https://github.com/Doodleverse/segmentation_gym.git
cd segmentation_gym
python train_model.py

Keep tabs on your GPU using watch nvidia-smi
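
If you would rather log GPU usage from Python while a model trains, here is a rough sketch that calls the nvidia-smi executable with its CSV query options. `gpu_usage` is a made-up helper, and it returns None when nvidia-smi is not on the PATH or exits with an error.

```python
import shutil
import subprocess


def gpu_usage():
    """Return one CSV line per GPU from nvidia-smi, or None if it is unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    result = subprocess.run(
        [
            exe,
            "--query-gpu=index,name,utilization.gpu,memory.used,memory.total",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    usage = gpu_usage()
    if usage is None:
        print("nvidia-smi is not available in this shell")
    else:
        for line in usage:
            print(line)
```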

2320sharon commented 6 months ago

I'm trying out these instructions and writing some notes as I'm doing so.

For installing miniconda I already had an installation so instead of the command bash Miniconda3-py39_4.12.0-Linux-x86_64.sh I ran bash Miniconda3-py39_4.12.0-Linux-x86_64.sh -u.

I finished the installation commands and ran the test commands which both worked. Now onto testing by training a model.

Okay I ran into my first small hiccup. I ran watch nvidia smi and it gives me the output below. This is strange because when I ran python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" I got [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Every 2.0s: nvidia smi        gpu_laptop: Tue Feb 27 10:22:08 2024

sh: 1: nvidia: not found
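
The "nvidia: not found" message suggests the shell tried to run a program called nvidia with smi as its argument. The executable is named nvidia-smi, with a hyphen. A short sketch of how the two spellings are split into words (`command_name` is a made-up helper):

```python
import shlex
import shutil


def command_name(command_line):
    """Return the program a shell would try to run for `command_line`."""
    return shlex.split(command_line)[0]


print(shlex.split("nvidia smi"))   # ['nvidia', 'smi']
print(shlex.split("nvidia-smi"))   # ['nvidia-smi']

# shutil.which returns None when the program is not on the PATH.
for program in (command_name("nvidia smi"), command_name("nvidia-smi")):
    print(program, "->", shutil.which(program))
```

This also fits with TensorFlow still listing the GPU: the failure is in the command name, not in the driver.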
2320sharon commented 6 months ago

We should update the gym case study to tell the user they need to select a TRAIN and VALIDATION dataset before they can train a model. The current docs don't mention that. https://github.com/Doodleverse/segmentation_gym/wiki/02_Case-Study-Demo#:~:text=Step%203%3A%20run,warnings%20from%20Tensorflow%3A

Step 3: run python train_model.py.

It will first prompt you to select the output directory where model training files were written to, e.g. /Users/Someone/model_from_scratch_test. Then it will ask for a config file. Select the hatteras_l8_resunet.json config file, e.g. /Users/Someone/my_segmentation_gym_datasets/config/hatteras_l8_resunet.json. The model will then train. Your outputs will look like this, usually with some additional warnings from Tensorflow:

2320sharon commented 6 months ago

I'm getting this error when I'm running train_model.py. I'm pretty sure I'm selecting the wrong directories or something.

Steps

  1. Select TRAIN files:
    • I selected my_segmentation_gym_datasets_v5/capehatteras_data/npz4gym/train_data
  2. Select VALIDATION files:
    • I selected my_segmentation_gym_datasets_v5/capehatteras_data/npz4gym/val_data
  3. Select Config
    • I tried a few but here is one file I tried: my_segmentation_gym_datasets_v5/config/hatteras_l8_resunet_model2.json

Error Output

2024-02-27 10:38:44.544796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13513 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3080 Ti Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
Traceback (most recent call last):
  File "/home/sharon/gym/segmentation_gym/train_model.py", line 593, in <module>
    list_ds = tf.data.Dataset.list_files(train_filenames, shuffle=False) ##dont shuffle here
  File "/home/sharon/miniconda3/envs/gym_gpu/lib/python3.10/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1301, in list_files
    assert_not_empty = control_flow_assert.Assert(
  File "/home/sharon/miniconda3/envs/gym_gpu/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/sharon/miniconda3/envs/gym_gpu/lib/python3.10/site-packages/tensorflow/python/ops/control_flow_assert.py", line 102, in Assert
    raise errors.InvalidArgumentError(
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expected 'tf.Tensor(False, shape=(), dtype=bool)' to be true. Summarized data: b'No files matched pattern: '
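
The empty pattern after "No files matched pattern:" hints that the list of training files handed to tf.data.Dataset.list_files was empty. A quick way to check a folder before launching training is sketched below. It assumes the training data are .npz files, as the npz4gym folder name suggests, and `list_npz_files` is a hypothetical helper, not something in train_model.py.

```python
from pathlib import Path


def list_npz_files(folder):
    """Return the sorted .npz files directly inside `folder`."""
    root = Path(folder)
    if not root.is_dir():
        raise NotADirectoryError(f"{folder} is not a directory")
    return sorted(str(path) for path in root.glob("*.npz"))


if __name__ == "__main__":
    # Path taken from the steps above, relative to where the zip was extracted.
    train_dir = "my_segmentation_gym_datasets_v5/capehatteras_data/npz4gym/train_data"
    try:
        files = list_npz_files(train_dir)
        print(f"{len(files)} .npz files found in {train_dir}")
    except NotADirectoryError as err:
        print(err)
```

A count of zero would mean the selected folder is not the one holding the files, for example a parent or sibling directory.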
dbuscombe-usgs commented 6 months ago

Just saw this - answered on Slack. Yes, I should update the instructions

dbuscombe-usgs commented 6 months ago

I have another WSL2 environment set up now on a different machine. It sees the GPUs, but the segformer model import fails

dbuscombe-usgs commented 6 months ago

Fixed by sudo apt install libtiff5

dbuscombe-usgs commented 6 months ago

@2320sharon are you still not able to get any output when you run nvidia-smi?

dbuscombe-usgs commented 6 months ago

Closing because it seems like this workflow has been successfully tested on several machines