balancap / SSD-Tensorflow

Single Shot MultiBox Detector in TensorFlow

Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED #330

Closed tomrichardon closed 5 years ago

tomrichardon commented 5 years ago

Hello,

I'm trying to follow the training guide provided in the README, specifically the part "Fine-tuning existing SSD checkpoints". I used the exact command given in the guide, and every time I get a "Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED" error, followed by "Python has stopped working".

My environment uses tensorflow-gpu 1.10.0, cudnn 7.3.1 and cudatoolkit 9.0, which should be compatible?

Does anyone know where this problem comes from and how to solve it?

Here are the command I used and the output I got:

(tensorflowgpu-env) C:\Users\tomri\Documents\SSD-Tensorflow-master> python train_ssd_network.py --train_dir=./logs --dataset_dir=./tfrecords --dataset_name=pascalvoc_2007 --dataset_split_name=train --model_name=ssd_300_vgg --checkpoint_path=./checkpoints/VGG_VOC0712_SSD_300x300_iter_120000.ckpt --save_summaries_secs=60 --save_interval_secs=600 --weighgt_decay=0.0005 --optimizer=adam --learning_rate=0.001 --batch_size=2

WARNING:tensorflow:From train_ssd_network.py:200: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step

===========================================================================

Training | Evaluation flags:

===========================================================================

{'adadelta_rho': <absl.flags._flag.Flag object at 0x0000013F0D51DB38>, 'adagrad_initial_accumulator_value': <absl.flags._flag.Flag object at 0x0000013F0D51DBE0>, 'adam_beta1': <absl.flags._flag.Flag object at 0x0000013F0D51DC88>, 'adam_beta2': <absl.flags._flag.Flag object at 0x0000013F0D51DD30>, 'batch_size': <absl.flags._flag.Flag object at 0x0000013F0D525B00>, 'checkpoint_exclude_scopes': <absl.flags._flag.Flag object at 0x0000013F0D525D68>, 'checkpoint_model_scope': <absl.flags._flag.Flag object at 0x0000013F0D525CF8>, 'checkpoint_path': <absl.flags._flag.Flag object at 0x0000013F0D525C88>, 'clone_on_cpu': <absl.flags._flag.BooleanFlag object at 0x0000013F0D51D518>, 'dataset_dir': <absl.flags._flag.Flag object at 0x0000013F0D525908>, 'dataset_name': <absl.flags._flag.Flag object at 0x0000013F0D525748>, 'dataset_split_name': <absl.flags._flag.Flag object at 0x0000013F0D525860>, 'end_learning_rate': <absl.flags._flag.Flag object at 0x0000013F0D525470>, 'ftrl_initial_accumulator_value': <absl.flags._flag.Flag object at 0x0000013F0D51DF28>, 'ftrl_l1': <absl.flags._flag.Flag object at 0x0000013F0D51DFD0>, 'ftrl_l2': <absl.flags._flag.Flag object at 0x0000013F0D5250B8>, 'ftrl_learning_rate_power': <absl.flags._flag.Flag object at 0x0000013F0D51DE80>, 'gpu_memory_fraction': <absl.flags._flag.Flag object at 0x0000013F0D51D978>, 'h': <tensorflow.python.platform.app._HelpFlag object at 0x0000013F0D525E80>, 'help': <tensorflow.python.platform.app._HelpFlag object at 0x0000013F0D525E80>, 'helpfull': <tensorflow.python.platform.app._HelpfullFlag object at 0x0000013F0D525EF0>, 'helpshort': <tensorflow.python.platform.app._HelpshortFlag object at 0x0000013F0D525F60>, 'ignore_missing_vars': <absl.flags._flag.BooleanFlag object at 0x0000013F0D525E10>, 'label_smoothing': <absl.flags._flag.Flag object at 0x0000013F0D5254E0>, 'labels_offset': <absl.flags._flag.Flag object at 0x0000013F0D525978>, 'learning_rate': <absl.flags._flag.Flag object at 0x0000013F0D5253C8>, 
'learning_rate_decay_factor': <absl.flags._flag.Flag object at 0x0000013F0D525588>, 'learning_rate_decay_type': <absl.flags._flag.Flag object at 0x0000013F0D525358>, 'log_every_n_steps': <absl.flags._flag.Flag object at 0x0000013F0D51D780>, 'loss_alpha': <absl.flags._flag.Flag object at 0x0000013F0261D518>, 'match_threshold': <absl.flags._flag.Flag object at 0x0000013F0D51D390>, 'max_number_of_steps': <absl.flags._flag.Flag object at 0x0000013F0D525C18>, 'model_name': <absl.flags._flag.Flag object at 0x0000013F0D525A20>, 'momentum': <absl.flags._flag.Flag object at 0x0000013F0D525160>, 'moving_average_decay': <absl.flags._flag.Flag object at 0x0000013F0D5256D8>, 'negative_ratio': <absl.flags._flag.Flag object at 0x0000013F0D53ABA8>, 'num_classes': <absl.flags._flag.Flag object at 0x0000013F0D5257B8>, 'num_clones': <absl.flags._flag.Flag object at 0x0000013F0D51D4E0>, 'num_epochs_per_decay': <absl.flags._flag.Flag object at 0x0000013F0D525630>, 'num_preprocessing_threads': <absl.flags._flag.Flag object at 0x0000013F0D51D6D8>, 'num_readers': <absl.flags._flag.Flag object at 0x0000013F0D51D630>, 'opt_epsilon': <absl.flags._flag.Flag object at 0x0000013F0D51DDD8>, 'optimizer': <absl.flags._flag.Flag object at 0x0000013F0D51DAC8>, 'preprocessing_name': <absl.flags._flag.Flag object at 0x0000013F0D525A90>, 'rmsprop_decay': <absl.flags._flag.Flag object at 0x0000013F0D5252B0>, 'rmsprop_momentum': <absl.flags._flag.Flag object at 0x0000013F0D525208>, 'save_interval_secs': <absl.flags._flag.Flag object at 0x0000013F0D51D8D0>, 'save_summaries_secs': <absl.flags._flag.Flag object at 0x0000013F0D51D828>, 'train_dir': <absl.flags._flag.Flag object at 0x0000013F0D51D438>, 'train_image_size': <absl.flags._flag.Flag object at 0x0000013F0D525BA8>, 'trainable_scopes': <absl.flags._flag.Flag object at 0x0000013F0D525DD8>, 'weight_decay': <absl.flags._flag.Flag object at 0x0000013F0D51DA20>}

===========================================================================

SSD net parameters:

===========================================================================

{'anchor_offset': 0.5, 'anchor_ratios': [[2, 0.5], [2, 0.5, 3, 0.3333333333333333], [2, 0.5, 3, 0.3333333333333333], [2, 0.5, 3, 0.3333333333333333], [2, 0.5], [2, 0.5]], 'anchor_size_bounds': [0.15, 0.9], 'anchor_sizes': [(21.0, 45.0), (45.0, 99.0), (99.0, 153.0), (153.0, 207.0), (207.0, 261.0), (261.0, 315.0)], 'anchor_steps': [8, 16, 32, 64, 100, 300], 'feat_layers': ['block4', 'block7', 'block8', 'block9', 'block10', 'block11'], 'feat_shapes': [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)], 'img_shape': (300, 300), 'no_annotation_label': 21, 'normalizations': [20, -1, -1, -1, -1, -1], 'num_classes': 21, 'prior_scaling': [0.1, 0.1, 0.2, 0.2]}

===========================================================================

Training | Evaluation dataset files:

===========================================================================

['.\tfrecords\voc_2007_train_000.tfrecord', '.\tfrecords\voc_2007_train_001.tfrecord', '.\tfrecords\voc_2007_train_002.tfrecord', '.\tfrecords\voc_2007_train_003.tfrecord', '.\tfrecords\voc_2007_train_004.tfrecord', '.\tfrecords\voc_2007_train_005.tfrecord', '.\tfrecords\voc_2007_train_006.tfrecord', '.\tfrecords\voc_2007_train_007.tfrecord', '.\tfrecords\voc_2007_train_008.tfrecord', '.\tfrecords\voc_2007_train_009.tfrecord', '.\tfrecords\voc_2007_train_010.tfrecord', '.\tfrecords\voc_2007_train_011.tfrecord', '.\tfrecords\voc_2007_train_012.tfrecord', '.\tfrecords\voc_2007_train_013.tfrecord', '.\tfrecords\voc_2007_train_014.tfrecord', '.\tfrecords\voc_2007_train_015.tfrecord', '.\tfrecords\voc_2007_train_016.tfrecord', '.\tfrecords\voc_2007_train_017.tfrecord', '.\tfrecords\voc_2007_train_018.tfrecord', '.\tfrecords\voc_2007_train_019.tfrecord', '.\tfrecords\voc_2007_train_020.tfrecord', '.\tfrecords\voc_2007_train_021.tfrecord', '.\tfrecords\voc_2007_train_022.tfrecord', '.\tfrecords\voc_2007_train_023.tfrecord', '.\tfrecords\voc_2007_train_024.tfrecord', '.\tfrecords\voc_2007_train_025.tfrecord']

INFO:tensorflow:Ignoring --checkpoint_path because a checkpoint already exists in ./logs
WARNING:tensorflow:From C:\Users\tomri\anaconda3\envs\tensorflowgpu-env\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py:737: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-03-14 09:52:22.861243: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-03-14 09:52:25.137256: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1050 major: 6 minor: 1 memoryClockRate(GHz): 1.493
pciBusID: 0000:01:00.0
totalMemory: 4.00GiB freeMemory: 3.30GiB
2019-03-14 09:52:25.143998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-03-14 09:52:25.559193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-14 09:52:25.562986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-03-14 09:52:25.564744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-03-14 09:52:25.566650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3276 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from ./logs\model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./logs\model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
2019-03-14 09:53:03.276704: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-03-14 09:53:03.283832: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
INFO:tensorflow:global_step/sec: 0

tutu123133 commented 5 years ago

Hello, I have the same environment as you and ran into the same problem. I have solved it now: in my case it happened because other people on the server were using the GPU at the same time.
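
If the GPU may be shared with other processes, one way to check before starting training is to ask nvidia-smi how much memory is already in use. The following is only a sketch: it assumes nvidia-smi is on the PATH and reports plain integers, and the helper names parse_memory_csv and query_gpu_memory are made up for illustration.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' lines (values in MiB) into (used, total) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return per-GPU (used, total) memory in MiB as reported by nvidia-smi.

    Assumption: the NVIDIA driver tools are installed and on the PATH.
    """
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,
    )
    return parse_memory_csv(output)
```

Calling query_gpu_memory() returns one (used, total) pair per GPU. A large "used" value before training has started suggests that another process is already holding memory on that device.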

tomrichardon commented 5 years ago

It seems like we didn't have the same problem then. Reducing the --gpu_memory_fraction value fixed it for me.
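
For anyone wondering why a lower fraction helps, here is a rough back-of-the-envelope sketch using the numbers from my log (totalMemory 4.00GiB, freeMemory 3.30GiB). It assumes TensorFlow reserves about fraction x total memory up front, which matches the "3276 MB" in the "Created TensorFlow device" line, and that cuDNN has to fit into whatever free memory is left. The function name is made up, and whether this shortfall is the actual cause is my guess.

```python
MIB_PER_GIB = 1024


def reserved_and_headroom_mib(total_gib, free_gib, fraction):
    """Return (MiB reserved by TensorFlow, MiB of free memory left over).

    Assumption: TensorFlow reserves fraction * total memory for its own pool.
    """
    reserved = int(total_gib * MIB_PER_GIB * fraction)
    headroom = int(free_gib * MIB_PER_GIB) - reserved
    return reserved, headroom


for fraction in (0.8, 0.33):
    reserved, headroom = reserved_and_headroom_mib(4.00, 3.30, fraction)
    print(f"fraction={fraction}: reserved {reserved} MiB, headroom {headroom} MiB")
```

With these numbers, 0.8 reserves 3276 MiB and leaves only about 100 MiB of the reported free memory, while 0.33 leaves roughly 2 GiB. The free-memory figure in the log is rounded, so treat the headroom as approximate.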

piratepanther commented 5 years ago

Hi, please check the 'gpu_memory_fraction' parameter in train_ssd_network.py and reduce the GPU memory fraction from its initial value of 0.8 to a lower value such as 0.33.
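
For scripts that build their own session, the same limit can be set through the TensorFlow 1.x session config. This is a minimal sketch and not necessarily how train_ssd_network.py wires up its flag. The names checked_fraction and make_session_config are hypothetical, and allow_growth is an extra option that was not mentioned in this thread.

```python
def checked_fraction(value):
    """Return value as a float if it lies in (0, 1], otherwise raise ValueError."""
    fraction = float(value)
    if not 0.0 < fraction <= 1.0:
        raise ValueError("GPU memory fraction must be in (0, 1], got %r" % (value,))
    return fraction


def make_session_config(fraction=0.33, allow_growth=True):
    """Build a tf.ConfigProto that caps GPU memory use (TensorFlow 1.x API)."""
    # Imported here so the helper above can be used without TensorFlow installed.
    import tensorflow as tf

    gpu_options = tf.GPUOptions(
        per_process_gpu_memory_fraction=checked_fraction(fraction),
        allow_growth=allow_growth,  # assumption: allocate lazily instead of up front
    )
    return tf.ConfigProto(gpu_options=gpu_options)
```

The resulting config would be passed as tf.Session(config=make_session_config(0.33)). With the training script in this repository, passing --gpu_memory_fraction=0.33 on the command line should be enough.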