GoogleCloudPlatform / cloudml-samples

Cloud ML Engine repo. Please visit the new Vertex AI samples repo at https://github.com/GoogleCloudPlatform/vertex-ai-samples
https://cloud.google.com/ai-platform/docs/
Apache License 2.0

Iris pipeline fails when run locally #4

Closed · gridcellcoder closed this issue 7 years ago

gridcellcoder commented 7 years ago

Using the current master branch (https://github.com/GoogleCloudPlatform/cloudml-samples/commit/7a002e7145af59f5f571c012ff1bada8114fa148) with the following TensorFlow version:

python -c 'import tensorflow as tf; print(tf.__version__)'  # for Python 2
0.11.0rc1

Ubuntu version:

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.1 LTS
Release:    16.04
Codename:   xenial

Python version and command:

python --version
Python 2.7.12

cd cloudml-samples2/iris
python preprocess.py  # no errors
python pipeline.py

I get the following warnings and error.

INFO:tensorflow:Loss for final step: 0.00384905.
WARNING:tensorflow:Given features: {'measurements': <tf.Tensor 'ParseExample/ParseExample:1' shape=(30, 4) dtype=float32>, 'key': <tf.Tensor 'ParseExample/ParseExample:0' shape=(30, 1) dtype=string>}, required signatures: {'measurements': TensorSignature(dtype=tf.float32, shape=TensorShape([Dimension(30), Dimension(4)]), is_sparse=False), 'key': TensorSignature(dtype=tf.string, shape=TensorShape([Dimension(30), Dimension(1)]), is_sparse=False)}.
WARNING:tensorflow:Given targets: Tensor("ParseExample/ParseExample:2", shape=(30, 1), dtype=int64), required signatures: TensorSignature(dtype=tf.int64, shape=TensorShape([Dimension(30), Dimension(1)]), is_sparse=False).
INFO:tensorflow:Transforming feature_column _RealValuedColumn(column_name='measurements', dimension=4, default_value=None, dtype=tf.float32, normalizer=None)
WARNING:tensorflow:Please specify metrics using MetricSpec. Using bare functions or (key, fn) tuples is deprecated and support for it will be removed on Oct 1, 2016.
WARNING:tensorflow:Please specify metrics using MetricSpec. Using bare functions or (key, fn) tuples is deprecated and support for it will be removed on Oct 1, 2016.
INFO:tensorflow:Restored model from /tmp/tmpxzTBGqtrain_161023_125858_f826/model/train
INFO:tensorflow:Eval steps [0,100) for training step 5000.
INFO:tensorflow:Results after 10 steps (0.002 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 20 steps (0.001 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 30 steps (0.001 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 40 steps (0.001 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 50 steps (0.001 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 60 steps (0.001 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 70 steps (0.001 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 80 steps (0.001 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 90 steps (0.002 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 100 steps (0.001 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
W tensorflow/core/kernels/queue_base.cc:294] _6_input_producer: Skipping cancelled enqueue attempt with queue not closed
W tensorflow/core/kernels/queue_base.cc:294] _8_batch/fifo_queue: Skipping cancelled enqueue attempt with queue not closed
W tensorflow/core/kernels/queue_base.cc:294] _8_batch/fifo_queue: Skipping cancelled enqueue attempt with queue not closed
W tensorflow/core/kernels/queue_base.cc:294] _8_batch/fifo_queue: Skipping cancelled enqueue attempt with queue not closed
W tensorflow/core/kernels/queue_base.cc:294] _8_batch/fifo_queue: Skipping cancelled enqueue attempt with queue not closed
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors.CancelledError'>, Enqueue operation was cancelled
     [[Node: input_producer/input_producer_EnqueueMany = QueueEnqueueMany[Tcomponents=[DT_STRING], _class=["loc:@input_producer"], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](input_producer, input_producer/Identity)]]

Caused by op u'input_producer/input_producer_EnqueueMany', defined at:
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/tensorflowworkspace/cloudml-samples2/iris/trainer/task.py", line 276, in <module>
    main()
  File "/home/tensorflowworkspace/cloudml-samples2/iris/trainer/task.py", line 271, in main
    output_dir=output_dir)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 105, in run
    return task()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 300, in train_and_evaluate
    name=eval_dir_suffix)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/dnn.py", line 461, in evaluate
    steps=steps, metrics=metrics, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 399, in evaluate
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 758, in _evaluate_model
    features, targets = input_fn()
  File "/home/tensorflowworkspace/cloudml-samples2/iris/trainer/task.py", line 85, in input_fn
    _, examples = util.read_examples(data_paths, batch_size, shuffle)
  File "trainer/util.py", line 100, in read_examples
    filename_queue = tf.train.string_input_producer(files, num_epochs, shuffle)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 196, in string_input_producer
    summary_name="fraction_of_%d_full" % capacity)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 140, in input_producer
    enq = q.enqueue_many([input_tensor])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 371, in enqueue_many
    self._queue_ref, vals, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1018, in _queue_enqueue_many
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 756, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
    self._traceback = _extract_stack()

CancelledError (see above for traceback): Enqueue operation was cancelled
     [[Node: input_producer/input_producer_EnqueueMany = QueueEnqueueMany[Tcomponents=[DT_STRING], _class=["loc:@input_producer"], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](input_producer, input_producer/Identity)]]

WARNING:tensorflow:Coordinator didn't stop cleanly: Enqueue operation was cancelled
     [[Node: input_producer/input_producer_EnqueueMany = QueueEnqueueMany[Tcomponents=[DT_STRING], _class=["loc:@input_producer"], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](input_producer, input_producer/Identity)]]

Caused by op u'input_producer/input_producer_EnqueueMany', defined at:
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/tensorflowworkspace/cloudml-samples2/iris/trainer/task.py", line 276, in <module>
    main()
  File "/home/tensorflowworkspace/cloudml-samples2/iris/trainer/task.py", line 271, in main
    output_dir=output_dir)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 105, in run
    return task()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 300, in train_and_evaluate
    name=eval_dir_suffix)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/dnn.py", line 461, in evaluate
    steps=steps, metrics=metrics, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 399, in evaluate
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 758, in _evaluate_model
    features, targets = input_fn()
  File "/home/tensorflowworkspace/cloudml-samples2/iris/trainer/task.py", line 85, in input_fn
    _, examples = util.read_examples(data_paths, batch_size, shuffle)
  File "trainer/util.py", line 100, in read_examples
    filename_queue = tf.train.string_input_producer(files, num_epochs, shuffle)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 196, in string_input_producer
    summary_name="fraction_of_%d_full" % capacity)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 140, in input_producer
    enq = q.enqueue_many([input_tensor])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 371, in enqueue_many
    self._queue_ref, vals, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1018, in _queue_enqueue_many
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 756, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
    self._traceback = _extract_stack()

CancelledError (see above for traceback): Enqueue operation was cancelled
     [[Node: input_producer/input_producer_EnqueueMany = QueueEnqueueMany[Tcomponents=[DT_STRING], _class=["loc:@input_producer"], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](input_producer, input_producer/Identity)]]

INFO:tensorflow:Saving evaluation summary for 5000 step: loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9
chmeyers commented 7 years ago

The call stacks above do not indicate a fatal error. The "Saving evaluation summary for 5000 step" line implies that the run reached the final step and saved a final evaluation; check the output/ directory to find it.

What is actually happening: this sample uses the TF.learn framework to train its model. TF.learn creates a queue, fills it with the input data, and loops over it; when the queue runs out, training is done. However, TF.learn currently signals that condition by raising an exception (which lets the training loop exit), and that exception is printed to STDERR. A minimal sketch of the pattern follows.
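
For illustration only, here is a hedged sketch of that queue-based input pattern using the TF 0.11-era queue APIs (the file path and variable names below are hypothetical, not the sample's actual code). With a finite num_epochs, the filename queue raises OutOfRangeError once its input is exhausted; the loop then exits, and stopping the coordinator cancels any pending enqueue ops, which is where the CancelledError and "Skipping cancelled enqueue" messages in the log above come from.

import tensorflow as tf

# Hypothetical input file; the Iris sample reads its own preprocessed data.
files = ['data/iris_eval.tfrecord']

# A finite-epoch filename queue: after every filename has been dequeued
# num_epochs times, further dequeues raise OutOfRangeError. That error is
# how the training/evaluation loop learns that the input is exhausted.
filename_queue = tf.train.string_input_producer(files, num_epochs=1, shuffle=False)
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

# num_epochs is tracked in a local variable, so local variables must be
# initialized too (TF 0.11-era initializer names).
init_op = tf.group(tf.initialize_all_variables(), tf.initialize_local_variables())

with tf.Session() as sess:
    sess.run(init_op)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            sess.run(serialized_example)
    except tf.errors.OutOfRangeError:
        # Expected: the queue is exhausted, so the loop exits normally.
        pass
    finally:
        # Stopping the coordinator cancels pending enqueues; those cancelled
        # enqueue ops are what produce the CancelledError messages above.
        coord.request_stop()
    coord.join(threads)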