OlafenwaMoses / ImageAI

A python library built to empower developers to build applications and systems with self-contained Computer Vision capabilities
https://www.genxr.co/#products
MIT License
8.56k stars · 2.19k forks

I get OOM error when trying to train. #470

Open haimmm opened 4 years ago

haimmm commented 4 years ago

I'm using PyCharm; I installed ImageAI and tensorflow-gpu 1.13.1 and did everything like in the guide.

When I try to run training (on the ready-made Hololens dataset), I get this:

Epoch 1/200

  1/960 [..............................] - ETA: 18:05:24 - loss: 124.0666 - yolo_layer_1_loss: 18.8371 - yolo_layer_2_loss: 33.7255 - yolo_layer_3_loss: 71.5040Traceback (most recent call last):
  File "C:/Users/haim8/Desktop/machine learning/train.py", line 7, in <module>
    trainer.trainModel()
  File "C:\Users\haim8\Desktop\machine learning\venv\lib\site-packages\imageai\Detection\Custom\__init__.py", line 291, in trainModel
    max_queue_size=8
  File "C:\Users\haim8\Desktop\machine learning\venv\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\haim8\Desktop\machine learning\venv\lib\site-packages\keras\engine\training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "C:\Users\haim8\Desktop\machine learning\venv\lib\site-packages\keras\engine\training_generator.py", line 220, in fit_generator
    reset_metrics=False)
  File "C:\Users\haim8\Desktop\machine learning\venv\lib\site-packages\keras\engine\training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "C:\Users\haim8\Desktop\machine learning\venv\lib\site-packages\tensorflow\python\keras\backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "C:\Users\haim8\Desktop\machine learning\venv\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
    run_metadata_ptr)
  File "C:\Users\haim8\Desktop\machine learning\venv\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,52,52,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node replica_1/model_1/leaky_104/LeakyRelu}}]]

I tried lowering the batch size, but I get this error even with a batch size of just 2. Can anyone help, please? I'm using a laptop, but it's quite good, so it shouldn't be a hardware issue.
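For context, note that the failed allocation in the traceback is on `device:GPU:0` with the `GPU_0_bfc` allocator, i.e. GPU memory rather than system RAM. A quick back-of-the-envelope sketch of the tensor named in the error shows why batch size still matters even when each individual tensor is small: every activation map scales linearly with batch size, and all of them together must fit in VRAM.

```python
from functools import reduce
from operator import mul

def tensor_bytes(shape, dtype_bytes=4):
    """Bytes for one dense float32 tensor of the given shape."""
    return reduce(mul, shape) * dtype_bytes

# The tensor the allocator failed on: shape [1, 52, 52, 256], float32.
one_map = tensor_bytes([1, 52, 52, 256])
print(f"{one_map / 2**20:.2f} MiB")  # ≈ 2.64 MiB for a single activation map

# Every activation in the network scales linearly with batch size, so even
# a batch size of 2 doubles the activation footprint that must fit in VRAM.
for batch in (1, 2, 4):
    print(batch, tensor_bytes([batch, 52, 52, 256]) / 2**20, "MiB")
```

One 2.64 MiB map looks harmless, but YOLOv3 keeps dozens of such maps (plus larger ones at higher resolutions) alive for backpropagation, which is why the cumulative total can exhaust a small GPU.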

rola93 commented 4 years ago

Can you provide a better description of your hardware and of what you are trying to train? PyCharm is very demanding in terms of resources; maybe you should avoid it while training. Also, check memory status with htop before launching the training.

haimmm commented 4 years ago

> Can you provide a better description of your hardware and of what you are trying to train? PyCharm is very demanding in terms of resources; maybe you should avoid it while training. Also, check memory status with htop before launching the training.

Hi, thanks for the reply. My laptop has an i7-9750H, a GTX 1650, and 8 GB of RAM. I also tried running it from an Anaconda environment, but nothing changed. I tried to train both the Hololens example and my own data, and both failed. My last option was to use colab.research.google.com, but it's really slow and limited in space...
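Worth noting: a GTX 1650 has 4 GB of VRAM, and the OOM above comes from the GPU allocator. A rough budget sketch (the parameter count and optimizer layout below are assumptions, not figures from the thread) shows how much of those 4 GB is consumed before a single activation is allocated:

```python
# Back-of-the-envelope VRAM budget before activations are counted.
# Assumptions (not from the thread): ~62M parameters for YOLOv3, float32
# weights, and an Adam-style optimizer keeping two moment buffers per weight.
PARAMS = 62_000_000
BYTES_PER_FLOAT = 4

weights   = PARAMS * BYTES_PER_FLOAT        # model weights
gradients = PARAMS * BYTES_PER_FLOAT        # one gradient per weight
adam      = PARAMS * BYTES_PER_FLOAT * 2    # first and second moment buffers

fixed_gib = (weights + gradients + adam) / 2**30
print(f"~{fixed_gib:.2f} GiB before any activations")  # ≈ 0.92 GiB of 4 GiB
```

Under those assumptions, roughly a quarter of the card is gone before training data, activations, or cuDNN workspaces claim their share, which is consistent with OOMs even at very small batch sizes.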

rola93 commented 4 years ago

You will definitely need to take care of RAM consumption. I'm not sure how much memory it will consume; I've run it with 16 GB of RAM with no problem.

Keep in mind that whether you have 8 GB or 16 GB, what really matters is how much memory is available when you start training; look at it closely with htop.
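If you want to log available memory from the training script itself rather than eyeballing htop, a minimal Linux-only sketch (it reads `/proc/meminfo` and degrades to `None` elsewhere) looks like this:

```python
# A scriptable alternative to watching htop (Linux only: reads /proc/meminfo).
def available_mib():
    """Return MemAvailable in MiB, or None if /proc/meminfo is absent."""
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) // 1024  # value is in kB
    except FileNotFoundError:
        return None  # not a Linux system
    return None

print(available_mib(), "MiB available")
```

Calling this right before `trainer.trainModel()` would record how much system RAM is actually free at launch, independent of the installed total.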

haimmm commented 4 years ago

> You will definitely need to take care of RAM consumption. I'm not sure how much memory it will consume; I've run it with 16 GB of RAM with no problem.
>
> Keep in mind that whether you have 8 GB or 16 GB, what really matters is how much memory is available when you start training; look at it closely with htop.

Every time I try to run it, RAM usage goes up to around 2.5-3 GB (not even hitting 90% of total usage), and it just crashes with the OOM error. Maybe one of my versions isn't compatible? I had to install Python 3.6, TensorFlow 1.13.1, the latest versions of ImageAI/Keras/OpenCV, and downgrade protobuf to 3.6 just to make it run. Is something wrong here?
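That RAM reading supports the GPU-memory diagnosis: system RAM never fills up, but the 4 GB card does. One knob that sometimes helps with bfc-allocator OOMs on small GPUs is TF 1.x's `allow_growth` option. `tf.ConfigProto` and `gpu_options.allow_growth` are real TF 1.13 APIs, but whether this alone rescues a 4 GB card for YOLOv3 training is an open question; it's a sketch to try, run before `trainer.trainModel()`, and guarded so it is a no-op where TF/Keras are missing:

```python
# Sketch: ask TF 1.x to allocate GPU memory on demand instead of reserving
# nearly all of it up front. Guarded so it degrades to a no-op when TF/Keras
# are not importable (or when a TF 2.x install lacks tf.ConfigProto).
def enable_gpu_memory_growth():
    try:
        import tensorflow as tf
        import keras.backend as K
        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True  # grow VRAM usage as needed
        K.set_session(tf.Session(config=config))
        return True
    except (ImportError, AttributeError):
        return False  # nothing to configure in this environment

enable_gpu_memory_growth()  # call this before trainer.trainModel()
```

Since Keras (and therefore ImageAI's trainer) uses whatever session `keras.backend` holds, setting the session this way before training should apply the option; if it still OOMs, the model at this input size simply may not fit in 4 GB.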