BiaPyX / BiaPy-GUI

BiaPy GUI
MIT License

Error with unet training. Capabilities problem? #4

Open mcblache opened 2 months ago

mcblache commented 2 months ago

Hello,

While using biapy-gui to train a U-Net network, training stopped unexpectedly with this error message:

Internal Server Error ("could not select device driver "" with capabilities: [[gpu]]")

However, we have an NVIDIA GPU that is correctly configured and correctly detected by BiaPy.

nvidia-smi 
Wed Apr 24 16:57:38 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T600 Lap...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   50C    P8     2W /  35W |      4MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |

On another computer with a different card (GeForce RTX 3090), with the same installation and the same NVIDIA driver, the U-Net training works correctly.

Same installation, same NVIDIA driver, but different compute capabilities!

See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications

We suspect that you use bfloat16, which is unavailable on cards with compute capability 7.x.
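
One way to test this hypothesis is to read the compute capability of each card directly. The sketch below assumes PyTorch is installed in the environment (it is imported lazily, so the helper itself works without it). `supports_native_bf16` and `report_local_gpu` are hypothetical helper names, and the threshold reflects that native bfloat16 arithmetic arrived with compute capability 8.0 (Ampere). Note that the `capabilities: [[gpu]]` wording in the error above comes from Docker's device request rather than from CUDA, so this check may well rule the hypothesis out.

```python
def supports_native_bf16(major: int, minor: int) -> bool:
    """Return True if compute capability major.minor has native bfloat16.

    Native bfloat16 arithmetic was introduced with compute capability 8.0.
    """
    return (major, minor) >= (8, 0)


def report_local_gpu() -> None:
    """Print the capability of GPU 0, if PyTorch and a CUDA device are present."""
    try:
        import torch  # assumption: PyTorch is installed where this runs
    except ImportError:
        print("PyTorch is not installed; cannot query the GPU.")
        return
    if not torch.cuda.is_available():
        print("No CUDA device is visible to PyTorch.")
        return
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    print(f"{name}: compute capability {major}.{minor}, "
          f"native bfloat16: {supports_native_bf16(major, minor)}")


if __name__ == "__main__":
    report_local_gpu()
```

A T600 (Turing) reports 7.5 and an RTX 3090 (Ampere) reports 8.6, so the helper returns False and True respectively.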

Thanks

Here is the log:

Using biapyx/biapy:latest-11.8 container
Local GUI version: v1.0.6
Remote last version's hash: ffb24581dc7263a1aebbe076df443de37709ebf5
Remote last version: v1.0.6
Loaded: {'AUGMENTOR': {'ENABLE': False}, 'DATA': {'EXTRACT_RANDOM_PATCH': False, 'FORCE_RGB': True, 'PATCH_SIZE': '(256, 256, 3)', 'REFLECT_TO_COMPLETE_SHAPE': True, 'TEST': {'ARGMAX_TO_OUTPUT': True, 'CHECK_DATA': True, 'IN_MEMORY': True, 'LOAD_GT': False, 'OVERLAP': '(0,0)', 'PADDING': '(64, 64)', 'PATH': '/home/mcblache/prj/pepper/d10/test/images', 'RESOLUTION': '(1,1)'}, 'TRAIN': {'CHECK_DATA': True, 'GT_PATH': '/home/mcblache/prj/pepper/d10/test/masks', 'IN_MEMORY': True, 'MINIMUM_FOREGROUND_PER': 0.05, 'OVERLAP': '(0,0)', 'PADDING': '(0,0)', 'PATH': '/home/mcblache/prj/pepper/d10/test/images'}, 'VAL': {'FROM_TRAIN': True, 'RANDOM': True, 'RESOLUTION': '(1,1)', 'SPLIT_TRAIN': 0.1}}, 'MODEL': {'ARCHITECTURE': 'unet', 'DROPOUT_VALUES': [0.0, 0.0, 0.0, 0.0, 0.0], 'FEATURE_MAPS': [16, 32, 64, 128, 256]}, 'PROBLEM': {'NDIM': '2D', 'SEMANTIC_SEG': {'IGNORE_CLASS_ID': '0'}, 'TYPE': 'SEMANTIC_SEG'}, 'SYSTEM': {'NUM_CPUS': -1, 'NUM_WORKERS': 0, 'SEED': 0}, 'TEST': {'ENABLE': True, 'EVALUATE': True, 'VERBOSE': True}, 'TRAIN': {'ACCUM_ITER': 1, 'BATCH_SIZE': 2, 'ENABLE': True, 'EPOCHS': 10, 'LR': 0.001, 'LR_SCHEDULER': {'NAME': 'onecycle'}, 'OPTIMIZER': 'ADAMW', 'OPT_BETAS': '(0.9, 0.999)', 'PATIENCE': 2, 'W_DECAY': 0.02}}
Setting AUGMENTOR__ENABLE__INPUT : No (ENABLE)
...
Possible expected error during closing spin window: Internal C++ object (load_yaml_to_GUI_engine) already deleted.
Creating YAML file
{'status': 'Pulling from biapyx/biapy', 'id': 'latest-11.8'}
{'status': 'Digest: sha256:5b55f044be436fd00a82dd51b6f89e6411154bb247ab5041fb80179b33a5323d'}
{'status': 'Status: Image is up to date for biapyx/biapy:latest-11.8'}
Creating temporal input YAML file
Command: ['--config', '/BiaPy_files/input.yaml', '--result_dir', '/home/mcblache/prj/pepper/output', '--name', 'my_2d_semantic_segmentation', '--run_id', '1', '--dist_backend', 'nccl', '--gpu', '0']
Volumes:  {'/home/mcblache/prj/pepper/output/my_2d_semantic_segmentation/input_config/input20240424_163742.yaml': {'bind': '/BiaPy_files/input.yaml', 'mode': 'ro'}, '/home/mcblache/prj/pepper/output': {'bind': '/home/mcblache/prj/pepper/output', 'mode': 'rw'}, '/home/mcblache/prj/pepper/d10/test': {'bind': '/home/mcblache/prj/pepper/d10/test', 'mode': 'ro'}}
GPU (IDs): 0
CPUs: 5
GUI version: v1.0.6
Traceback (most recent call last):
  File "docker/api/client.py", line 268, in _raise_for_status
  File "requests/models.py", line 1021, in raise_for_status
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.45/containers/f4db26e13419ef76836e1dcc5f895815ae96a4770f38464ef27d1bf71e31dc20/start

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_functions.py", line 434, in run
  File "docker/models/containers.py", line 854, in run
  File "docker/models/containers.py", line 405, in start
  File "docker/utils/decorators.py", line 19, in wrapped
  File "docker/api/container.py", line 1126, in start
  File "docker/api/client.py", line 270, in _raise_for_status
  File "docker/errors.py", line 39, in create_api_error_from_http_exception
docker.errors.APIError: 500 Server Error for http+docker://localhost/v1.45/containers/f4db26e13419ef76836e1dcc5f895815ae96a4770f38464ef27d1bf71e31dc20/start: Internal Server Error ("could not select device driver "" with capabilities: [[gpu]]")

Internal Server Error ("could not select device driver "" with capabilities: [[gpu]]")

danifranco commented 2 months ago

Hello,

Thank you for reporting this error. We will look into it carefully so that we can fix or avoid it, or at least warn users with such a GPU, in future GUI releases.

Cheers,

danifranco commented 1 month ago

Hello,

I've just found an interesting discussion on this problem where a few solutions and links to useful tutorials are provided. Most of the time, it seems that a simple sudo systemctl restart docker does the trick.
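
To check whether the restart helped without launching a full training run, the GPU request can be reproduced directly. The sketch below assumes the Docker SDK for Python (the `docker` package seen in the traceback) is installed, and it reuses the `biapyx/biapy:latest-11.8` image from the log. Overriding the entrypoint with `nvidia-smi` is an assumption about that image, and both helper names are hypothetical.

```python
GPU_DRIVER_ERROR = "could not select device driver"


def is_gpu_driver_error(message: str) -> bool:
    """Return True if a Docker error message is the GPU device-driver failure."""
    return GPU_DRIVER_ERROR in message and "[[gpu]]" in message


def docker_can_use_gpu(image: str = "biapyx/biapy:latest-11.8") -> bool:
    """Try to start a container with a GPU device request and run nvidia-smi."""
    import docker  # assumption: Docker SDK for Python is installed
    from docker.types import DeviceRequest

    client = docker.from_env()
    try:
        output = client.containers.run(
            image,
            entrypoint="nvidia-smi",  # assumption: overrides the image entrypoint
            device_requests=[DeviceRequest(count=-1, capabilities=[["gpu"]])],
            remove=True,
        )
    except docker.errors.APIError as err:
        if is_gpu_driver_error(str(err)):
            return False  # same failure as in the log above
        raise  # any other Docker error is unrelated and should surface
    print(output.decode())
    return True
```

If `docker_can_use_gpu()` still returns False after restarting Docker, the NVIDIA Container Toolkit is probably missing or not registered with the Docker daemon, which is the usual cause of this message.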