OlafenwaMoses / ImageAI

A python library built to empower developers to build applications and systems with self-contained Computer Vision capabilities
https://www.genxr.co/#products
MIT License

Issues while using GPU. Loss = nan #742

Open JuICe1352 opened 2 years ago

JuICe1352 commented 2 years ago

Hello, I'm trying to train an image-detection model. Training works fine on the CPU but is extremely slow. When I run it on the GPU, the loss stays at nan for the entire run and no model is saved. The GPU also shows no utilization in Task Manager.

I'm running Windows 11 with a 3080 Laptop GPU.

I've got tensorflow-gpu = 1.13.1, imageai = 2.1.5, keras = 2.3.1, cudnn = 7.6.5 and cudatoolkit = 10.0.130 installed, as well as the other dependencies for imageai.
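For anyone trying to reproduce this, here is a minimal sketch for confirming which versions are actually importable in the active environment and whether TensorFlow lists a GPU device at all. The helper names `report_versions` and `visible_gpus` are hypothetical and not part of ImageAI. The sketch assumes `tensorflow.python.client.device_lib` is available, as it is in TensorFlow 1.x builds. It only reports what it finds and is not a fix.

```python
import importlib


def report_versions(packages=("tensorflow", "keras", "imageai")):
    """Return {package: version string, or None if the import fails}.

    Hypothetical helper for illustration only.
    """
    versions = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except Exception:
            # Missing package, or an import that fails for another reason
            # (for example a CUDA DLL that cannot be loaded on Windows).
            versions[name] = None
    return versions


def visible_gpus():
    """Return the GPU device names TensorFlow reports, or None if TensorFlow is unusable."""
    try:
        from tensorflow.python.client import device_lib
        devices = device_lib.list_local_devices()
    except Exception:
        return None
    return [d.name for d in devices if d.device_type == "GPU"]


if __name__ == "__main__":
    for package, version in report_versions().items():
        print(package, "->", version)
    # An empty list means TensorFlow is running without a GPU device.
    print("GPUs visible to TensorFlow:", visible_gpus())
```

If the GPU list is empty or `None`, training is not using the GPU, which would match the lack of GPU activity in Task Manager.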

My code looks like this:

from imageai.Detection.Custom import DetectionModelTrainer

trainer = DetectionModelTrainer()
trainer.setModelTypeAsYOLOv3()
trainer.setDataDirectory(data_directory='HardHatImages_Det')
trainer.setTrainConfig(object_names_array=['person hardhat'],
                        batch_size=4,
                        num_experiments=20,
                        train_from_pretrained_model=r'Models\yolo.h5')
trainer.trainModel()

And the output like this:

Generating anchor boxes for training images and annotation...
Average IOU for 9 anchors: 0.76
Anchor Boxes generated.
Detection configuration saved in  HardHatImages_Det\json\detection_config.json
Training on:    ['person hardhat']
Training with Batch Size:  4
Number of Experiments:  20
WARNING:tensorflow:From C:\Users\---\anaconda3\envs\ImageAIv3\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From C:\Users\---\anaconda3\envs\ImageAIv3\lib\site-packages\imageai\Detection\Custom\yolo.py:24: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Training with transfer learning from pretrained Model
C:\Users\---\anaconda3\envs\ImageAIv3\lib\site-packages\keras\callbacks\callbacks.py:998: UserWarning: `epsilon` argument is deprecated and will be removed, use `min_delta` instead.
  warnings.warn('`epsilon` argument is deprecated and '
WARNING:tensorflow:From C:\Users\---\anaconda3\envs\ImageAIv3\lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Epoch 1/20
944/944 [==============================] - 1512s 2s/step - loss: nan - yolo_layer_1_loss: nan - yolo_layer_2_loss: nan - yolo_layer_3_loss: nan - val_loss: nan - val_yolo_layer_1_loss: nan - val_yolo_layer_2_loss: nan - val_yolo_layer_3_loss: nan
Epoch 2/20
425/944 [============>.................] - ETA: 3:53 - loss: nan - yolo_layer_1_loss: nan - yolo_layer_2_loss: nan - yolo_layer_3_loss: nan

I hope someone can help me. I've tried quite a lot by now, but I'm not that experienced with Python, so I could be missing something obvious.

Thanks in advance, Juice

EDIT: Quick addition: it stays at Epoch 1/20 for a solid 15-20 minutes until the progress bar appears and it starts doing something. During this time the CPU occasionally works a bit and RAM usage stays at a consistent but fairly high level. I don't know whether that's normal, but I wanted to mention it.