Open HRKpython opened 5 years ago
Is it a GPU/memory issue? I tried using Python 2 on the CPU and now it is training.
@HRKpython could you share your evaluation result?
Can you elaborate a bit more? I am having difficulty fitting the model, and you are asking about evaluation?
I have the same problem, but in a weird way. I have TensorFlow GPU (1.12) and Python 3.6.8 installed in a virtual environment inside Anaconda on Windows.
When I run the code in Jupyter Notebook (configured with a kernel to use that env), the code runs fine and I can train the network. But when I simply copy all the code into a .py script and run it in the conda prompt (cmd) in the same virtual environment, I get this error:
Epoch 1/100
2019-03-11 11:54:37.227929: W .\tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:237] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node loss/lambda_1_loss/ArithmeticOptimizer/ArithmeticOptimizer/HoistCommonFactor_Add_HoistCommonFactor_Add_add_13 is missing output properties at position :0 (num_outputs=0)
2019-03-11 11:54:41.217040: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-03-11 11:54:41.221304: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
Traceback (most recent call last):
  File "direct.py", line 362, in <module>
    max_queue_size = 3)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\keras\engine\training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\keras\engine\training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\keras\engine\training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(array_vals)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
    run_metadata_ptr)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
  [[{{node conv_1/convolution}} = Conv2D[T=DT_FLOAT, _class=["loc:@training/Adam/gradients/conv_1/convolution_grad/Conv2DBackpropFilter"], data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adam/gradients/conv_1/convolution_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv_1/kernel/read)]]
  [[{{node loss/lambda_1_loss/truediv_12/_625}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1865_loss/lambda_1_loss/truediv_12", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
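A possible angle on the CUDNN_STATUS_ALLOC_FAILED lines: cuDNN could not allocate GPU memory when creating its handle, which can happen if another process (for example a Jupyter kernel that is still running) already holds most of the GPU memory. This is only a guess from the log, not a confirmed diagnosis. A minimal sketch for TF 1.x with standalone Keras, assuming it is called once at the top of the script before the model is built; the helper name enable_gpu_memory_growth is made up for illustration:

```python
def enable_gpu_memory_growth():
    """Ask TF 1.x to allocate GPU memory on demand and register the session with Keras.

    Assumes TensorFlow 1.x and standalone Keras (the `keras` package seen in the
    traceback). The imports are local so the helper can be defined even on a
    machine where those packages are missing.
    """
    import tensorflow as tf
    from keras import backend as K

    config = tf.ConfigProto()
    # Grow the GPU allocation as needed instead of reserving nearly all memory up front.
    config.gpu_options.allow_growth = True

    session = tf.Session(config=config)
    # Make Keras use this session for everything it builds afterwards.
    K.set_session(session)
    return session
```

Calling enable_gpu_memory_growth() before the model is created, and shutting down any notebook kernel that still holds the GPU, may be enough to get past the cuDNN initialization failure. If it is not, the cause is probably something else.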
Running Adanet with GPU in TF 1.13.1 I get the following:
message: "Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node adanet/iteration_7/best_eval_metric_ops/strided_slice_9. Error: Pack node (adanet/iteration_7/best_eval_metric_ops/stack_9) axis attribute is out of bounds: 0"
pathname: "./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h"
I used to be able to run it without error. It gives this error in TensorFlow 1.9, 1.10, 1.11 and 1.12.
Adding this removes the error:

    from tensorflow.python.keras import backend
    import tensorflow as tf
    backend.get_session().run(tf.global_variables_initializer())

but I still get NaN after a while.
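To narrow down when the NaN first shows up, it may help to scan the recorded loss values after training. Below is a small sketch in plain Python; the function name first_non_finite is hypothetical, and it assumes the losses are available as a list of floats, for example from the "loss" entry of the History object that fit_generator returns:

```python
import math


def first_non_finite(values):
    """Return the index of the first NaN or infinite value, or None if all are finite."""
    for index, value in enumerate(values):
        if math.isnan(value) or math.isinf(value):
            return index
    return None


# Example with made-up loss values: the third entry (index 2) is NaN.
losses = [0.92, 0.57, float("nan"), 0.41]
bad_epoch = first_non_finite(losses)
```

Keras also ships a TerminateOnNaN callback that stops training as soon as the loss becomes NaN, which could be passed in the callbacks list instead of checking afterwards.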
I am trying to train the YOLO v2 model on custom images. I am using TensorFlow 1.11.0 with tensorflow.keras, so I modified the tutorial a bit to be able to run the YOLO model for predefined labels:
When I run this portion of the code, I get the error below:
Any help would be appreciated.