experiencor / keras-yolo2

Easy training on a custom dataset. Various backends (MobileNet and SqueezeNet) are supported. A YOLO demo that detects raccoons and runs entirely in the browser is accessible at https://git.io/vF7vI (not on Windows).
MIT License

Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor #397

Open HRKpython opened 5 years ago

HRKpython commented 5 years ago

I am trying to train the YOLO v2 model on custom images. I am using TensorFlow 1.11.0 with tensorflow.keras, so I modified the tutorial a bit to be able to run the YOLO model for predefined labels:

from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Reshape, Activation, Conv2D, Input, MaxPooling2D, BatchNormalization, Flatten, Dense, Lambda
from tensorflow.keras.layers import LeakyReLU, concatenate
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard
from tensorflow.keras.optimizers import SGD, Adam, RMSprop
#from tensorflow.keras.layers.merge import concatenate
import matplotlib.pyplot as plt
import tensorflow.keras.backend as K
import tensorflow as tf

When I run this portion of the code, I get the error below:

import os

# Number this run by counting the existing 'coco_' log directories.
tb_counter  = len([log for log in os.listdir(os.path.expanduser('/home/ec2-user/Hamid/files/logs/')) if 'coco_' in log]) + 1
tensorboard = TensorBoard(log_dir=os.path.expanduser('/home/ec2-user/Hamid/files/logs/') + 'coco_' + '_' + str(tb_counter), 
                          histogram_freq=0, 
                          write_graph=True, 
                          write_images=False)

optimizer = Adam(lr=0.5e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
#optimizer = SGD(lr=1e-4, decay=0.0005, momentum=0.9)
#optimizer = RMSprop(lr=1e-4, rho=0.9, epsilon=1e-08, decay=0.0)

model.compile(loss=custom_loss, optimizer=optimizer)

model.fit_generator(generator        = train_batch, 
                    steps_per_epoch  = len(train_batch), 
                    epochs           = 100, 
                    verbose          = 1,
                    validation_data  = valid_batch,
                    validation_steps = len(valid_batch),
                    callbacks        = [early_stop, checkpoint, tensorboard], 
                    max_queue_size   = 3)

/home/ec2-user/anaconda2/envs/python3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    524             None, None,
    525             compat.as_text(c_api.TF_Message(self.status.status)),
--> 526             c_api.TF_GetCode(self.status.status))
    527     # Delete the underlying status object from memory otherwise it stays alive
    528     # as there is a reference to status from this from the traceback due to

FailedPreconditionError: Attempting to use uninitialized value loss_1/lambda_1_loss/Variable
     [[{{node loss_1/lambda_1_loss/AssignAdd}} = AssignAdd[T=DT_FLOAT, use_locking=false, _device="/job:localhost/replica:0/task:0/device:CPU:0"](loss_1/lambda_1_loss/Variable, training_1/Adam/sub_38/x)]]

Any help would be appreciated.

HRKpython commented 5 years ago

Is it a GPU/memory issue? I tried using python2/CPU and now it is training.

YunYang1994 commented 5 years ago

@HRKpython could you share your evaluation result?

HRKpython commented 5 years ago

Can you elaborate a bit more? I am having difficulty fitting the model, and you are asking about evaluation?

abarajithan11 commented 5 years ago

I have the same problem, but in a weird way. I have TensorFlow GPU (1.12) and Python 3.6.8 installed in a virtual environment inside Anaconda on Windows.

When I run the code in Jupyter Notebook (configured with a kernel to use that env), the code runs fine and I can train the network. But when I simply copy all the code into a .py script and run it in the conda prompt (cmd) in the same virtual environment, I get this error:

Epoch 1/100
2019-03-11 11:54:37.227929: W .\tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:237] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node loss/lambda_1_loss/ArithmeticOptimizer/ArithmeticOptimizer/HoistCommonFactor_Add_HoistCommonFactor_Add_add_13 is missing output properties at position :0 (num_outputs=0)
2019-03-11 11:54:41.217040: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-03-11 11:54:41.221304: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
Traceback (most recent call last):
  File "direct.py", line 362, in <module>
    max_queue_size = 3)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\keras\engine\training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\keras\engine\training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\keras\engine\training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
    run_metadata_ptr)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node conv_1/convolution}} = Conv2D[T=DT_FLOAT, _class=["loc:@training/Adam/gradients/conv_1/convolution_grad/Conv2DBackpropFilter"], data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adam/gradients/conv_1/convolution_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv_1/kernel/read)]]
     [[{{node loss/lambda_1_loss/truediv_12/_625}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1865_loss/lambda_1_loss/truediv_12", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
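
A note on reading the log above: TensorFlow's C++ log lines carry a one-letter severity right after the timestamp (`W` for warning, `E` for error). Here the Grappler `ArithmeticOptimizer` message is a `W` line, while the two cuDNN handle failures are `E` lines, which suggests (without proving) that the crash comes from cuDNN initialization rather than from the optimizer warning. The following is a minimal Python sketch for separating the two. The regular expression and the helper name `parse_tf_log_line` are illustrative assumptions, not part of TensorFlow.

```python
import re

# Assumed layout of a TensorFlow C++ log line:
#   <date> <time>: <severity letter> <source file>:<line>] <message>
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>[\d:.]+): "
    r"(?P<severity>[IWEF]) (?P<source>\S+):(?P<line>\d+)\] (?P<message>.*)$"
)

SEVERITY_NAMES = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}


def parse_tf_log_line(line):
    """Return severity, source and message for a log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "severity": SEVERITY_NAMES[match.group("severity")],
        "source": match.group("source"),
        "message": match.group("message"),
    }


# Two lines copied from the log in this thread (raw strings keep the backslash intact).
SAMPLE = [
    r"2019-03-11 11:54:37.227929: W .\tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:237] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node loss/lambda_1_loss/ArithmeticOptimizer/ArithmeticOptimizer/HoistCommonFactor_Add_HoistCommonFactor_Add_add_13 is missing output properties at position :0 (num_outputs=0)",
    r"2019-03-11 11:54:41.217040: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED",
]

records = [parse_tf_log_line(line) for line in SAMPLE]
# Keep only the lines logged at error severity.
errors = [rec for rec in records if rec is not None and rec["severity"] == "ERROR"]

for rec in errors:
    print(rec["source"], "->", rec["message"])
```

Lines that do not follow this layout, such as `Epoch 1/100` or the Python traceback, make the helper return `None`, so they can be filtered out before looking at severities.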

gogasca commented 5 years ago

Running Adanet with GPU in TF 1.13.1 I get the following:

message:  "Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node adanet/iteration_7/best_eval_metric_ops/strided_slice_9. Error: Pack node (adanet/iteration_7/best_eval_metric_ops/stack_9) axis attribute is out of bounds: 0"   
pathname:  "./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h"

ksloke commented 5 years ago

I used to be able to run it without error. It gives this error in tensorflow 1.9, 1.10, 1.11, 1.12

Adding this removes the error:

from tensorflow.python.keras import backend
import tensorflow as tf
backend.get_session().run(tf.global_variables_initializer())

but I still get NaN after a while.
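
One caveat with the workaround above: `tf.global_variables_initializer()` runs the initializer of every global variable, so it also resets any weights that were already loaded or trained. A more conservative variant initializes only the variables that do not hold a value yet. The sketch below assumes TensorFlow 1.x graph mode (the versions discussed in this thread). The helper name `initialize_only_missing_variables` is hypothetical, and whether this has any effect on the NaN loss is untested.

```python
def initialize_only_missing_variables(session):
    """Initialize global variables that have no value yet (TF 1.x graph mode).

    Returns the names of the variables that were initialized.
    """
    # Imported here so the sketch can be defined without TensorFlow installed.
    import tensorflow as tf  # assumption: TensorFlow 1.x

    global_vars = tf.global_variables()
    if not global_vars:
        return []

    # Ask the session which variables already hold a value.
    is_initialized = session.run(
        [tf.is_variable_initialized(var) for var in global_vars]
    )
    missing = [var for var, ready in zip(global_vars, is_initialized) if not ready]

    # Run initializers only for the variables that are still empty.
    if missing:
        session.run(tf.variables_initializer(missing))
    return [var.name for var in missing]
```

With `tf.keras` it would presumably be called as `initialize_only_missing_variables(tf.keras.backend.get_session())` after `model.compile(...)` and before `model.fit_generator(...)`, so that variables created inside the custom loss already exist in the graph.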