GoogleCloudPlatform / cloudml-edge-automation

Automated building and packaging of TensorFlow models in the cloud, and running them on devices
https://cloud.google.com/solutions/automating-iot-machine-learning
Apache License 2.0

Training the Model fails (exceptions attached) #4

Open kaiwaehner opened 6 years ago

kaiwaehner commented 6 years ago

I followed all the steps from the tutorial (https://cloud.google.com/solutions/automating-iot-machine-learning). I also updated gcloud on my Mac before executing them.

Model training fails with the exceptions below, taken from the log. (It seems I cannot simply download the whole log and share it as a zip file.)

Do you need additional information to help?

2018-07-20 10:46:38.633 CEST worker-replica-1 gapic-google-cloud-logging-v2 0.91.3 has requirement google-gax<0.16dev,>=0.15.7, but you'll have google-gax 0.12.5 which is incompatible.

2018-07-20 10:46:38.634 CEST worker-replica-1 google-cloud-logging 1.0.0 has requirement google-cloud-core<0.25dev,>=0.24.0, but you'll have google-cloud-core 0.28.1 which is incompatible.

2018-07-20 10:46:39.018 CEST worker-replica-1 The script chardetect is installed in '/root/.local/bin' which is not on PATH.

The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 570, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 329, in main
    run(model, argv)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 465, in run
    dispatch(args, model, cluster, task)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 505, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 206, in run_training
    self.args.batch_size)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 307, in build_train_graph
    return self.build_graph(data_paths, batch_size, GraphMod.TRAIN)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 231, in build_graph
    num_epochs=None if is_training else 2)
  File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 47, in read_examples
    filename_queue = tf.train.string_input_producer(files, num_epochs, shuffle)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 217, in string_input_producer
    raise ValueError(not_null_err)
ValueError: string_input_producer requires a non-null input tensor

The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 570, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 329, in main
    run(model, argv)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 465, in run
    dispatch(args, model, cluster, task)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 509, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 206, in run_training
    self.args.batch_size)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 307, in build_train_graph
    return self.build_graph(data_paths, batch_size, GraphMod.TRAIN)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 231, in build_graph
    num_epochs=None if is_training else 2)
  File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 47, in read_examples
    filename_queue = tf.train.string_input_producer(files, num_epochs, shuffle)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 217, in string_input_producer
    raise ValueError(not_null_err)
ValueError: string_input_producer requires a non-null input tensor

The replica worker 1 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 570, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 329, in main
    run(model, argv)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 465, in run
    dispatch(args, model, cluster, task)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 509, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 206, in run_training
    self.args.batch_size)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 307, in build_train_graph
    return self.build_graph(data_paths, batch_size, GraphMod.TRAIN)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 231, in build_graph
    num_epochs=None if is_training else 2)
  File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 47, in read_examples
    filename_queue = tf.train.string_input_producer(files, num_epochs, shuffle)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 217, in string_input_producer
    raise ValueError(not_null_err)
ValueError: string_input_producer requires a non-null input tensor

The replica worker 2 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 570, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 329, in main
    run(model, argv)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 465, in run
    dispatch(args, model, cluster, task)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 509, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 206, in run_training
    self.args.batch_size)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 307, in build_train_graph
    return self.build_graph(data_paths, batch_size, GraphMod.TRAIN)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 231, in build_graph
    num_epochs=None if is_training else 2)
  File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 47, in read_examples
    filename_queue = tf.train.string_input_producer(files, num_epochs, shuffle)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 217, in string_input_producer
    raise ValueError(not_null_err)
ValueError: string_input_producer requires a non-null input tensor

The replica worker 3 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 570, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 329, in main
    run(model, argv)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 465, in run
    dispatch(args, model, cluster, task)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 509, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 206, in run_training
    self.args.batch_size)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 307, in build_train_graph
    return self.build_graph(data_paths, batch_size, GraphMod.TRAIN)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 231, in build_graph
    num_epochs=None if is_training else 2)
  File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 47, in read_examples
    filename_queue = tf.train.string_input_producer(files, num_epochs, shuffle)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 217, in string_input_producer
    raise ValueError(not_null_err)
ValueError: string_input_producer requires a non-null input tensor

To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=317506745947&resource=ml_job%2Fjob_id%2Fequipmentparts_1_1532076159&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22equipmentparts_1_1532076159%22

vamsekumar commented 5 years ago

I am facing the same issue. Did you fix it, or is there updated code?
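
A note on the traceback in the first comment: in TensorFlow 1.x, `tf.train.string_input_producer` raises this ValueError when it is handed an empty Python list, so one likely (but unconfirmed) cause is that the training data path matched no files. The sketch below is one way to check for that before the graph is built. `expand_patterns` and its `glob_fn` parameter are hypothetical names used for illustration and are not part of this repository's trainer code.

```python
import glob


def expand_patterns(patterns, glob_fn=glob.glob):
    """Expand file patterns into a sorted list of matching paths.

    Raises ValueError that names the patterns when nothing matches, so the
    failure points at the data path rather than at the input queue.
    glob_fn is a hypothetical hook for swapping in another glob function.
    """
    matched = []
    for pattern in patterns:
        matched.extend(glob_fn(pattern))
    if not matched:
        raise ValueError("no files matched: " + ", ".join(patterns))
    return sorted(matched)
```

The standard `glob` module only sees local files. For paths on Cloud Storage (`gs://...`), passing TensorFlow 1.x's `tf.gfile.Glob` as `glob_fn` should work, assuming the job has read access to the bucket.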