Azure / ai-toolkit-iot-edge

AI Toolkit for Azure IoT Edge
MIT License
191 stars 103 forks source link

"MNIST handwritten digit classification using TensorFlow" failed due to "tensorflow.python.framework.errors_impl.NotFoundError: /tmp/model" #7

Open ggsato opened 6 years ago

ggsato commented 6 years ago

Hi,

I tried to follow Quick Start to run a sample model service on my edge device, but failed as follows.

$ az ml service logs realtime -i my-mnist
2018-05-20 21:55:11,845 CRIT Supervisor running as root (no user in config file)
2018-05-20 21:55:11,847 INFO supervisord started with pid 1
2018-05-20 21:55:12,849 INFO spawned: 'rsyslog' with pid 9
2018-05-20 21:55:12,851 INFO spawned: 'program_exit' with pid 10
2018-05-20 21:55:12,853 INFO spawned: 'nginx' with pid 11
2018-05-20 21:55:12,854 INFO spawned: 'iot' with pid 12
2018-05-20 21:55:12,856 INFO spawned: 'gunicorn' with pid 13
2018-05-20 21:55:12,868 INFO success: iot entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2018-05-20 21:55:12,931 INFO exited: iot (exit status 1; expected)
2018-05-20 21:55:13,932 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-20 21:55:13,932 INFO success: program_exit entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-20 21:55:17,937 INFO success: nginx entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2018-05-20T21:55:18.305553Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting gunicorn 19.6.0
2018-05-20T21:55:18.307199Z, INFO, 00000000-0000-0000-0000-000000000000, , Listening at: http://127.0.0.1:9090 (13)
2018-05-20T21:55:18.307539Z, INFO, 00000000-0000-0000-0000-000000000000, , Using worker: sync
2018-05-20T21:55:18.308591Z, INFO, 00000000-0000-0000-0000-000000000000, , worker timeout is set to 300
2018-05-20T21:55:18.309772Z, INFO, 00000000-0000-0000-0000-000000000000, , Booting worker with pid: 31
Initializing logger
2018-05-20T21:55:18.773498Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insights client
2018-05-20T21:55:18.773791Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up request id generator
2018-05-20T21:55:18.773942Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insight hooks
2018-05-20T21:55:18.774278Z, INFO, 00000000-0000-0000-0000-000000000000, , Invoking user's init function
2018-05-20 21:55:22.053772: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20 21:55:22.053811: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20 21:55:22.053835: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20 21:55:22.053845: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20 21:55:22.053855: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20T21:55:22.976338Z, ERROR, 00000000-0000-0000-0000-000000000000, , User's init function failed
2018-05-20T21:55:22.977815Z, ERROR, 00000000-0000-0000-0000-000000000000, , Encountered Exception Traceback (most recent call last):
  File "/var/azureml-app/aml_blueprint.py", line 109, in register
    main.init()
  File "/var/azureml-app/main.py", line 41, in init
    user_init_func()
  File "/var/azureml-app/webservice_driver.py", line 9, in init
    saver.restore(sess,tf.train.latest_checkpoint('./'))
  File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1511, in latest_checkpoint
    if file_io.get_matching_files(v2_path) or file_io.get_matching_files(
  File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/lib/io/file_io.py", line 282, in get_matching_files
    compat.as_bytes(filename), status)]
  File "/home/mmlspark/lib/conda/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: /tmp/model

2018-05-20T21:55:22.978319Z, INFO, 00000000-0000-0000-0000-000000000000, , Worker exiting (pid: 31)
2018-05-20T21:55:23.243101Z, INFO, 00000000-0000-0000-0000-000000000000, , Shutting down: Master
2018-05-20T21:55:23.243485Z, INFO, 00000000-0000-0000-0000-000000000000, , Reason: Worker failed to boot.
2018-05-20 21:55:23,271 INFO exited: gunicorn (exit status 3; not expected)
2018-05-20 21:55:24,274 INFO spawned: 'gunicorn' with pid 41
2018-05-20T21:55:29.717610Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting gunicorn 19.6.0
2018-05-20T21:55:29.718823Z, INFO, 00000000-0000-0000-0000-000000000000, , Listening at: http://127.0.0.1:9090 (41)
2018-05-20T21:55:29.719061Z, INFO, 00000000-0000-0000-0000-000000000000, , Using worker: sync
2018-05-20T21:55:29.719820Z, INFO, 00000000-0000-0000-0000-000000000000, , worker timeout is set to 300
2018-05-20T21:55:29.720920Z, INFO, 00000000-0000-0000-0000-000000000000, , Booting worker with pid: 45
Initializing logger
2018-05-20T21:55:30.045352Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insights client
2018-05-20T21:55:30.045671Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up request id generator
2018-05-20T21:55:30.046003Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insight hooks
2018-05-20T21:55:30.046263Z, INFO, 00000000-0000-0000-0000-000000000000, , Invoking user's init function
2018-05-20 21:55:31.715341: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20 21:55:31.715390: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20 21:55:31.715402: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20 21:55:31.715412: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20 21:55:31.715422: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20T21:55:32.620562Z, ERROR, 00000000-0000-0000-0000-000000000000, , User's init function failed
2018-05-20T21:55:32.621796Z, ERROR, 00000000-0000-0000-0000-000000000000, , Encountered Exception Traceback (most recent call last):
  File "/var/azureml-app/aml_blueprint.py", line 109, in register
    main.init()
  File "/var/azureml-app/main.py", line 41, in init
    user_init_func()
  File "/var/azureml-app/webservice_driver.py", line 9, in init
    saver.restore(sess,tf.train.latest_checkpoint('./'))
  File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1511, in latest_checkpoint
    if file_io.get_matching_files(v2_path) or file_io.get_matching_files(
  File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/lib/io/file_io.py", line 282, in get_matching_files
    compat.as_bytes(filename), status)]
  File "/home/mmlspark/lib/conda/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: /tmp/model

2018-05-20T21:55:32.622017Z, INFO, 00000000-0000-0000-0000-000000000000, , Worker exiting (pid: 45)
2018-05-20T21:55:32.873411Z, INFO, 00000000-0000-0000-0000-000000000000, , Shutting down: Master
2018-05-20T21:55:32.873754Z, INFO, 00000000-0000-0000-0000-000000000000, , Reason: Worker failed to boot.
2018-05-20 21:55:32,905 INFO exited: gunicorn (exit status 3; not expected)
2018-05-20 21:55:33,906 INFO gave up: gunicorn entered FATAL state, too many start retries too quickly
2018-05-20 21:55:34,908 WARN program_exit: bad result line: 'Killing supervisor with this event: ver:3.0 server:supervisor serial:0 pool:program_exit poolserial:0 eventname:PROCESS_STATE_FATAL len:58'
2018-05-20 21:55:34,908 WARN program_exit: has entered the UNKNOWN state and will no longer receive events, this usually indicates the process violated the eventlistener protocol
2018-05-20 21:55:34,908 WARN received SIGQUIT indicating exit request
2018-05-20 21:55:34,908 INFO waiting for nginx, rsyslog, program_exit to die
2018-05-20 21:55:35,913 INFO stopped: nginx (exit status 0)
2018-05-20 21:55:35,914 INFO stopped: program_exit (terminated by SIGTERM)
2018-05-20 21:55:35,915 INFO stopped: rsyslog (exit status 0)

Received 64 lines of log

Is there anything missing that I should have done?

I am not sure if this has anything to do with this issue, but I choose the file name my_ConvNet_MNIST_model.data-00000-of-00001, as instructed in 2, which is different from my_ConvNet_MNIST_model.data-00001-of-00000 in the az ml service create command.

Thanks in advance.

girishkumarbk commented 6 years ago

I see the same issue as below. Can the author put some light on this issue ?

$ az ml service logs realtime -i mnist-model 2018-05-29 06:07:05,676 CRIT Supervisor running as root (no user in config file) 2018-05-29 06:07:05,679 INFO supervisord started with pid 1 2018-05-29 06:07:06,681 INFO spawned: 'rsyslog' with pid 9 2018-05-29 06:07:06,683 INFO spawned: 'program_exit' with pid 10 2018-05-29 06:07:06,685 INFO spawned: 'nginx' with pid 11 2018-05-29 06:07:06,686 INFO spawned: 'iot' with pid 12 2018-05-29 06:07:06,688 INFO spawned: 'gunicorn' with pid 13 2018-05-29 06:07:06,702 INFO success: iot entered RUNNING state, process has stayed up for > than 0 seconds (startsecs) 2018-05-29 06:07:06,757 INFO exited: iot (exit status 1; expected) 2018-05-29 06:07:07,758 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs) 2018-05-29 06:07:07,758 INFO success: program_exit entered RUNNING state, process has stayed up for > than 1 seconds (startsecs) 2018-05-29 06:07:11,763 INFO success: nginx entered RUNNING state, process has stayed up for > than 5 seconds (startsecs) 2018-05-29T06:07:12.127151Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting gunicorn 19.6.0 2018-05-29T06:07:12.128599Z, INFO, 00000000-0000-0000-0000-000000000000, , Listening at: http://127.0.0.1:9090 (13) 2018-05-29T06:07:12.128965Z, INFO, 00000000-0000-0000-0000-000000000000, , Using worker: sync 2018-05-29T06:07:12.129924Z, INFO, 00000000-0000-0000-0000-000000000000, , worker timeout is set to 300 2018-05-29T06:07:12.131108Z, INFO, 00000000-0000-0000-0000-000000000000, , Booting worker with pid: 31 Initializing logger 2018-05-29T06:07:12.609597Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insights client 2018-05-29T06:07:12.609886Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up request id generator 2018-05-29T06:07:12.610067Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insight hooks 2018-05-29T06:07:12.610244Z, INFO, 00000000-0000-0000-0000-000000000000, , Invoking user's init function 2018-05-29 06:07:16.064184: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. 2018-05-29 06:07:16.064237: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. 2018-05-29 06:07:16.064267: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 2018-05-29 06:07:16.064299: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. 2018-05-29 06:07:16.064308: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. 2018-05-29T06:07:16.941963Z, ERROR, 00000000-0000-0000-0000-000000000000, , User's init function failed 2018-05-29T06:07:16.943163Z, ERROR, 00000000-0000-0000-0000-000000000000, , Encountered Exception Traceback (most recent call last): File "/var/azureml-app/aml_blueprint.py", line 109, in register main.init() File "/var/azureml-app/main.py", line 41, in init user_init_func() File "/var/azureml-app/webservice_driver.py", line 9, in init saver.restore(sess,tf.train.latest_checkpoint('./')) File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1511, in latest_checkpoint if file_io.get_matching_files(v2_path) or file_io.get_matching_files( File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/lib/io/file_io.py", line 282, in get_matching_files compat.as_bytes(filename), status)] File "/home/mmlspark/lib/conda/lib/python3.5/contextlib.py", line 66, in exit next(self.gen) File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status pywrap_tensorflow.TF_GetCode(status)) tensorflow.python.framework.errors_impl.NotFoundError: /tmp/model

2018-05-29T06:07:16.943407Z, INFO, 00000000-0000-0000-0000-000000000000, , Worker exiting (pid: 31) 2018-05-29T06:07:17.162821Z, INFO, 00000000-0000-0000-0000-000000000000, , Shutting down: Master 2018-05-29T06:07:17.164038Z, INFO, 00000000-0000-0000-0000-000000000000, , Reason: Worker failed to boot. 2018-05-29 06:07:17,198 INFO exited: gunicorn (exit status 3; not expected) 2018-05-29 06:07:18,200 INFO spawned: 'gunicorn' with pid 41 2018-05-29T06:07:23.646741Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting gunicorn 19.6.0 2018-05-29T06:07:23.647964Z, INFO, 00000000-0000-0000-0000-000000000000, , Listening at: http://127.0.0.1:9090 (41) 2018-05-29T06:07:23.648163Z, INFO, 00000000-0000-0000-0000-000000000000, , Using worker: sync 2018-05-29T06:07:23.649027Z, INFO, 00000000-0000-0000-0000-000000000000, , worker timeout is set to 300 2018-05-29T06:07:23.650233Z, INFO, 00000000-0000-0000-0000-000000000000, , Booting worker with pid: 45 Initializing logger 2018-05-29T06:07:23.962799Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insights client 2018-05-29T06:07:23.963089Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up request id generator 2018-05-29T06:07:23.963266Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insight hooks 2018-05-29T06:07:23.963389Z, INFO, 00000000-0000-0000-0000-000000000000, , Invoking user's init function 2018-05-29 06:07:25.553304: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. 2018-05-29 06:07:25.553345: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. 2018-05-29 06:07:25.553358: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 2018-05-29 06:07:25.553368: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. 2018-05-29 06:07:25.553378: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. 2018-05-29T06:07:26.462690Z, ERROR, 00000000-0000-0000-0000-000000000000, , User's init function failed 2018-05-29T06:07:26.464196Z, ERROR, 00000000-0000-0000-0000-000000000000, , Encountered Exception Traceback (most recent call last): File "/var/azureml-app/aml_blueprint.py", line 109, in register main.init() File "/var/azureml-app/main.py", line 41, in init user_init_func() File "/var/azureml-app/webservice_driver.py", line 9, in init saver.restore(sess,tf.train.latest_checkpoint('./')) File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1511, in latest_checkpoint if file_io.get_matching_files(v2_path) or file_io.get_matching_files( File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/lib/io/file_io.py", line 282, in get_matching_files compat.as_bytes(filename), status)] File "/home/mmlspark/lib/conda/lib/python3.5/contextlib.py", line 66, in exit next(self.gen) File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status pywrap_tensorflow.TF_GetCode(status)) tensorflow.python.framework.errors_impl.NotFoundError: /tmp/model

2018-05-29T06:07:26.464671Z, INFO, 00000000-0000-0000-0000-000000000000, , Worker exiting (pid: 45) 2018-05-29T06:07:26.666009Z, INFO, 00000000-0000-0000-0000-000000000000, , Shutting down: Master 2018-05-29T06:07:26.666482Z, INFO, 00000000-0000-0000-0000-000000000000, , Reason: Worker failed to boot. 2018-05-29 06:07:26,697 INFO exited: gunicorn (exit status 3; not expected) 2018-05-29 06:07:27,699 INFO gave up: gunicorn entered FATAL state, too many start retries too quickly 2018-05-29 06:07:28,701 WARN program_exit: bad result line: 'Killing supervisor with this event: ver:3.0 server:supervisor serial:0 pool:program_exit poolserial:0 eventname:PROCESS_STATE_FATAL len:58' 2018-05-29 06:07:28,701 WARN program_exit: has entered the UNKNOWN state and will no longer receive events, this usually indicates the process violated the eventlistener protocol 2018-05-29 06:07:28,701 WARN received SIGQUIT indicating exit request 2018-05-29 06:07:28,701 INFO waiting for nginx, rsyslog, program_exit to die 2018-05-29 06:07:29,706 INFO stopped: nginx (exit status 0) 2018-05-29 06:07:29,707 INFO stopped: program_exit (terminated by SIGTERM) 2018-05-29 06:07:29,708 INFO stopped: rsyslog (exit status 0)

Received 64 lines of log

AdityaKurude commented 6 years ago

Hi guys, I faced the same issue while following the quick start tutorial. However, after debugging a little bit I resolved the issue. Since the checkpoint file provided in the repo contain path "/tmp/model/my_ConvNet_MNIST_model", the service tried to look for the model in the same path. Just change the path as "./my_ConvNet_MNIST_model" and then create the service.

command : $ az ml service create realtime -m my_ConvNet_MNIST_model.meta -d my_ConvNet_MNIST_model.data-00001-of-00000 -d my_ConvNet_MNIST_model.index -d checkpoint -f webservice_driver.py -n [your service name] -r python -c conda_dependencies.yml

I think this will resolve your issue.