Doodleverse / segmentation_gym

A neural gym for training deep learning models to carry out geoscientific image segmentation. Works best with labels generated using https://github.com/Doodleverse/dash_doodler
MIT License

Set memory growth error in `seg_images_in_folder.py` #118

Closed. CameronBodine closed this issue 1 year ago

CameronBodine commented 1 year ago

Describe the bug

I receive a memory growth error when I run seg_images_in_folder.py. See the console output below.

To Reproduce

Steps to reproduce the behavior:

  1. Run seg_images_in_folder.py
  2. Select the sample directory
  3. Select the weights file
  4. The error shown below is thrown

Expected behavior

I expect the script to run without an error.

Screenshots

$ python seg_images_in_folder.py 
2023-02-21 17:12:30.162100: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Version:  2.11.0
Eager mode:  True
Version:  2.11.0
Eager mode:  True
2023-02-21 17:12:32.645756: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2023-02-21 17:12:32.645820: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: filfy-Thelio-Massive
2023-02-21 17:12:32.645838: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: filfy-Thelio-Massive
2023-02-21 17:12:32.646059: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 525.85.12
2023-02-21 17:12:32.646108: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 525.85.12
2023-02-21 17:12:32.646124: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 525.85.12
GPU name:  []
Num GPUs Available:  0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
Traceback (most recent call last):
  File "seg_images_in_folder.py", line 112, in <module>
    tf.config.experimental.set_memory_growth(i, True)
  File "/home/cbodine/miniconda3/envs/gym/lib/python3.8/site-packages/tensorflow/python/framework/config.py", line 716, in set_memory_growth
    context.context().set_memory_growth(device, enable)
  File "/home/cbodine/miniconda3/envs/gym/lib/python3.8/site-packages/tensorflow/python/eager/context.py", line 1636, in set_memory_growth
    raise ValueError(
ValueError: Cannot set memory growth on non-GPU and non-Pluggable devices


The script runs as expected when I comment out: https://github.com/Doodleverse/segmentation_gym/blob/909372e182dfc8c5c0c505c0fc465d51a3e54e31/seg_images_in_folder.py#L108-L109
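
For anyone hitting the same traceback: the device list printed just before the error contains only CPU:0, and `set_memory_growth` raises a ValueError for non-GPU devices. Below is a minimal sketch of a guard, assuming the script loops over every physical device (the linked lines are not reproduced here). The names `enable_gpu_memory_growth` and `tf_config` are hypothetical, chosen only for illustration. The only TensorFlow calls used are `tf.config.list_physical_devices` and `tf.config.experimental.set_memory_growth`.

```python
def enable_gpu_memory_growth(tf_config=None):
    """Enable memory growth on GPU devices only; do nothing on CPU-only hosts.

    `tf_config` is a hypothetical parameter: it is expected to behave like
    `tensorflow.config`. When omitted, TensorFlow is imported lazily so this
    sketch can be loaded on machines without TensorFlow installed.
    """
    if tf_config is None:
        import tensorflow as tf
        tf_config = tf.config

    # Asking for "GPU" explicitly means CPU devices never reach
    # set_memory_growth, which is what raised the ValueError above.
    gpus = tf_config.list_physical_devices("GPU")
    for gpu in gpus:
        # TensorFlow expects this to be called before the GPU is initialized,
        # so it should run before any model or tensor is created.
        tf_config.experimental.set_memory_growth(gpu, True)
    return gpus
```

Calling `enable_gpu_memory_growth()` once near the top of the script, before the model is loaded, should return an empty list on a CPU-only machine rather than raising.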

dbuscombe-usgs commented 1 year ago

Thank you! This is what happens when you make a big change (forcing users to use CPU rather than GPU) but don't test it... ha! I will make this change in the next version.

ebgoldstein commented 1 year ago

I want to chime in and just say that I actually like to run inference on my GPU, so if there is a way to keep that old GPU code, that would be cool.

dbuscombe-usgs commented 1 year ago

:pinched_fingers:

:handshake:

dbuscombe-usgs commented 1 year ago

We made this change to make things more consistent across the Doodleverse, because seg2map and coastseg use CPU only. It also cleans up the code a lot. I personally never use GPU for inference because my CPUs are many and fast, so I just assumed it was the same for everyone.

However, it's not a big deal to revert the changes.
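
One way to support both preferences is a single switch that hides or exposes the GPU before TensorFlow is imported. The sketch below is only an illustration of that idea and is not taken from the repository: `select_inference_device`, `use_gpu`, and `gpu_index` are hypothetical names. It relies on the `CUDA_VISIBLE_DEVICES` environment variable, where a value of "-1" hides all CUDA devices from the process.

```python
import os


def select_inference_device(use_gpu, gpu_index="0"):
    """Hypothetical helper: choose CPU-only or a specific GPU for inference.

    Must be called before TensorFlow is imported, because CUDA reads
    CUDA_VISIBLE_DEVICES when it initializes.
    """
    # "-1" matches no real device, so TensorFlow falls back to the CPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_index if use_gpu else "-1"
    return os.environ["CUDA_VISIBLE_DEVICES"]
```

With `use_gpu=False` the process sees no GPU at all, which would produce the same "no CUDA-capable device is detected" message as in the log above. Any later memory-growth call then has to tolerate an empty GPU list.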

dbuscombe-usgs commented 1 year ago

Should be fixed in https://github.com/Doodleverse/segmentation_gym/commit/5ff8821adae421ee04cfbc6cd7e7b87cbf709d7e

This commit also includes some minor tweaks to make_datasets that I already implemented without first branching (doh!). Those changes are:

  1. limit the number of printed examples from 100 batches to 10 batches (100 batches was taking too long)
  2. fix a small bug when FILTER<0 (this is a very specific situation when nclasses=2 and the 'in' class is specified first, not second)

I have tested with a resunet and a segformer model for NCLASSES=2.