Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

Errors in your scoring file's init() function #100

Closed. diegoschapira closed this issue 5 years ago.

diegoschapira commented 5 years ago

Hi, I'm trying to deploy a simple text classifier model using sklearn tfidf and LinearSVC. I managed to register the models, create the environment file, and create the image, but the deployment fails, suggesting there might be an issue with the init() function.

{'code': 'AciDeploymentFailed', 'message': 'Aci Deployment failed', 'details': [{'code': 'CrashLoopBackOff', 'message': "Your container application crashed. This may be caused by errors in your scoring file's init() function.\nPlease check the logs for your container instance high-priority.\nYou can also try to run image highpriority7173083745.azurecr.io/high-priority:7 locally. Please refer to http://aka.ms/debugimage for more information."}]}

I tested score.py locally, loading the persisted models via joblib, and it returns predictions as expected, so nothing seems wrong with the script or the models.

Any ideas?!

Here is the score.py code:

import pickle
import json
import numpy as np
import os
from sklearn.externals import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

from azureml.core.model import Model

def init():
    global model_1, model_2
    # get model path
    model_path1 = Model.get_model_path("tfidf_hp")
    model_path2 = Model.get_model_path("LinearSVC_hp")
    # deserialize the model file back into a sklearn model
    model_1, model_2 = joblib.load(model_path1), joblib.load(model_path2)

def run(raw_data):
    # make prediction
    data = json.loads(raw_data)['data']
    y_pred = model_2.predict(model_1.transform(data))
    return json.dumps({"prediction": y_pred.tolist()})
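
For reference, the local test was essentially this same transform/predict path, loading the pickles directly with joblib (the file names below are just my local copies, not necessarily what others will have):

# Local smoke test (run outside Azure): load the same pickles directly
# and exercise the transform/predict path that run() uses.
import json
from sklearn.externals import joblib

model_1 = joblib.load("tfidf_hp.pkl")       # local copy of the registered tfidf model
model_2 = joblib.load("LinearSVC_hp.pkl")   # local copy of the registered LinearSVC model

sample = json.dumps({"data": ["some example text to classify"]})
data = json.loads(sample)["data"]
print(json.dumps({"prediction": model_2.predict(model_1.transform(data)).tolist()}))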

rastala commented 5 years ago

You can use service.get_logs() call to get more detailed log of what exactly failed.
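
If you no longer have the service object in scope, you can fetch it by name first and print the logs, for example (the service name below is whatever you used at deployment):

from azureml.core.webservice import Webservice

# Retrieve the deployed ACI service by name and dump its container logs
service = Webservice(workspace=ws, name="high-priority")
print(service.get_logs())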

diegoschapira commented 5 years ago

Thanks @rastala. Going through the logs I'm seeing the error below, related to loading runs. For the time being I'm not using runs/experiments to record or log anything, just Workspace and Model to deploy the service via ACI.

azureml.core.run | DEBUG | Could not load run context Failed to load a submitted run, if outside of an execution context, use project.start_run to initialize an azureml.core.Run.,

Do I still need to start a run to create the service? If so, when should I start it and end it?

My script follows the steps in your notebook:

Create Workspace >> Load models from local >> Register Models to ws >> Create Dependencies >> Deploy Image/Service.
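
In code it's roughly this (a condensed sketch; model paths, file names, and the service name are placeholders rather than my exact script):

from azureml.core import Workspace
from azureml.core.model import Model
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.image import ContainerImage
from azureml.core.webservice import AciWebservice, Webservice

# Load the workspace (created earlier)
ws = Workspace.from_config()

# Register the two local pickles
tfidf_model = Model.register(workspace=ws, model_path="tfidf_hp.pkl", model_name="tfidf_hp")
svc_model = Model.register(workspace=ws, model_path="LinearSVC_hp.pkl", model_name="LinearSVC_hp")

# Conda dependencies for the image
myenv = CondaDependencies()
myenv.add_conda_package("scikit-learn")
with open("myenv.yml", "w") as f:
    f.write(myenv.serialize_to_string())

# Build the image and deploy it to ACI
image_config = ContainerImage.image_configuration(execution_script="score.py",
                                                  runtime="python",
                                                  conda_file="myenv.yml")
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
service = Webservice.deploy_from_model(workspace=ws,
                                       name="high-priority",
                                       models=[tfidf_model, svc_model],
                                       image_config=image_config,
                                       deployment_config=aci_config)
service.wait_for_deployment(show_output=True)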

Thanks for the prompt support. I'm working towards a tight deadline to deploy these models and want to take advantage of azureml, which seems straightforward, but I haven't managed to make it work yet!

rastala commented 5 years ago

You don't need to submit a run to create a web service. The "Failed to load a submitted run" message is a red herring; the real error message is probably deeper down in the log. Look for errors coming from "driver_module.init()", for example:

"logger": "root", "message": "{\"apiName\": \"\", \"message\": \"Encountered Exception Traceback (most recent call last):\\n File \\\"/var/azureml-app/aml_blueprint.py\\\", line 109, in register\\n main.init()\\n File \\\"/var/azureml-app/main.py\\\", line 79, in init\\n driver_module.init()\\n File \\\"pytorch_score.py\\\", line 14, in init\\n model_path = Model.get_model_path(\'pytorch-hymenoptera\')

diegoschapira commented 5 years ago

Here is the full log. Anything that jumps out?

2018-11-30 10:46:49,485 CRIT Supervisor running as root (no user in config file)
2018-11-30 10:46:49,487 INFO supervisord started with pid 1
2018-11-30 10:46:50,489 INFO spawned: 'rsyslog' with pid 8
2018-11-30 10:46:50,491 INFO spawned: 'program_exit' with pid 9
2018-11-30 10:46:50,492 INFO spawned: 'nginx' with pid 10
2018-11-30 10:46:50,493 INFO spawned: 'iot' with pid 11
2018-11-30 10:46:50,547 INFO spawned: 'gunicorn' with pid 14
2018-11-30 10:46:50,555 INFO success: iot entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
EdgeHubConnectionString and IOTEDGE_IOTHUBHOSTNAME are not set. Exiting...
2018-11-30 10:46:50,659 INFO exited: iot (exit status 1; expected)
2018-11-30 10:46:51,660 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-11-30 10:46:51,661 INFO success: program_exit entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-11-30 10:46:55,664 INFO success: nginx entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
{"stack_info": null, "msg": "Starting gunicorn %s", "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "timestamp": "2018-11-30T10:46:55.938311Z", "path": "/home/mmlspark/lib/conda/lib/python3.5/site-packages/gunicorn/glogging.py", "level": "INFO", "message": "Starting gunicorn 19.6.0", "host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "logger": "gunicorn.error"}
{"stack_info": null, "msg": "Listening at: %s (%s)", "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "timestamp": "2018-11-30T10:46:55.939105Z", "path": "/home/mmlspark/lib/conda/lib/python3.5/site-packages/gunicorn/glogging.py", "level": "INFO", "message": "Listening at: http://127.0.0.1:9090 (14)", "host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "logger": "gunicorn.error"}
{"stack_info": null, "msg": "Using worker: %s", "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "timestamp": "2018-11-30T10:46:55.939212Z", "path": "/home/mmlspark/lib/conda/lib/python3.5/site-packages/gunicorn/glogging.py", "level": "INFO", "message": "Using worker: sync", "host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "logger": "gunicorn.error"}
{"stack_info": null, "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "timestamp": "2018-11-30T10:46:55.939764Z", "path": "/home/mmlspark/lib/conda/lib/python3.5/site-packages/gunicorn/glogging.py", "level": "INFO", "message": "worker timeout is set to 300", "host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "logger": "gunicorn.error"}
{"stack_info": null, "msg": "Booting worker with pid: %s", "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "timestamp": "2018-11-30T10:46:55.940708Z", "path": "/home/mmlspark/lib/conda/lib/python3.5/site-packages/gunicorn/glogging.py", "level": "INFO", "message": "Booting worker with pid: 26", "host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "logger": "gunicorn.error"}
Initializing logger
{"stack_info": null, "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "timestamp": "2018-11-30T10:46:57.294508Z", "path": "/var/azureml-app/aml_logger.py", "level": "INFO", "message": "{\"message\": \"Starting up app insights client\", \"requestId\": \"00000000-0000-0000-0000-000000000000\", \"apiName\": \"\"}", "host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "logger": "root"}
{"stack_info": null, "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "timestamp": "2018-11-30T10:46:57.294764Z", "path": "/var/azureml-app/aml_logger.py", "level": "INFO", "message": "{\"message\": \"Starting up request id generator\", \"requestId\": \"00000000-0000-0000-0000-000000000000\", \"apiName\": \"\"}", "host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "logger": "root"}
{"stack_info": null, "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "timestamp": "2018-11-30T10:46:57.294863Z", "path": "/var/azureml-app/aml_logger.py", "level": "INFO", "message": "{\"message\": \"Starting up app insight hooks\", \"requestId\": \"00000000-0000-0000-0000-000000000000\", \"apiName\": \"\"}", "host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "logger": "root"}
{"stack_info": null, "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "timestamp": "2018-11-30T10:46:57.295042Z", "path": "/var/azureml-app/aml_logger.py", "level": "INFO", "message": "{\"message\": \"Invoking user's init function\", \"requestId\": \"00000000-0000-0000-0000-000000000000\", \"apiName\": \"\"}", "host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "logger": "root"}
2018-11-30 10:46:57,295 | azureml.core.run | DEBUG | Could not load run context Failed to load a submitted run, if outside of an execution context, use project.start_run to initialize an azureml.core.Run., switching offline: False
2018-11-30 10:46:57,295 | azureml.core.run | DEBUG | Could not load the run context and allow_offline set to False
2018-11-30 10:46:57,295 | azureml.core.model | DEBUG | RunEnvironmentException: Failed to load a submitted run, if outside of an execution context, use project.start_run to initialize an azureml.core.Run.
2018-11-30 10:46:57,295 | azureml.core.model | DEBUG | version is None. Latest version is 1
2018-11-30 10:46:57,296 | azureml.core.model | DEBUG | Found model path at azureml-models/tfidf_hp/1/tfidf_hp.pkl
2018-11-30 10:46:57,296 | azureml.core.run | DEBUG | Could not load run context Failed to load a submitted run, if outside of an execution context, use project.start_run to initialize an azureml.core.Run., switching offline: False
2018-11-30 10:46:57,296 | azureml.core.run | DEBUG | Could not load the run context and allow_offline set to False
2018-11-30 10:46:57,296 | azureml.core.model | DEBUG | RunEnvironmentException: Failed to load a submitted run, if outside of an execution context, use project.start_run to initialize an azureml.core.Run.
2018-11-30 10:46:57,296 | azureml.core.model | DEBUG | version is None. Latest version is 1
2018-11-30 10:46:57,296 | azureml.core.model | DEBUG | Found model path at azureml-models/LinearSVC_hp/1/LinearSVC_hp.pkl
2018-11-30 10:47:19,146 INFO exited: rsyslog (terminated by SIGKILL; not expected)
python: ../sysdeps/nptl/fork.c:156: __libc_fork: Assertion `THREAD_GETMEM (self, tid) != ppid' failed.
2018-11-30 10:47:19,555 INFO spawned: 'rsyslog' with pid 31
2018-11-30 10:47:19,556 INFO exited: program_exit (terminated by SIGKILL; not expected)
2018-11-30 10:47:19,556 INFO exited: nginx (terminated by SIGKILL; not expected)
2018-11-30 10:47:19,556 INFO exited: rsyslog (terminated by SIGSEGV (core dumped); not expected)
2018-11-30 10:47:19,748 INFO spawned: 'program_exit' with pid 32
2018-11-30 10:47:19,956 INFO spawned: 'nginx' with pid 33
2018-11-30 10:47:19,957 INFO exited: gunicorn (terminated by SIGKILL; not expected)
2018-11-30 10:47:20,094 INFO reaped unknown pid 26
2018-11-30 10:47:20,094 INFO reaped unknown pid 30
2018-11-30 10:47:20,584 INFO spawned: 'rsyslog' with pid 34
2018-11-30 10:47:21,084 INFO success: program_exit entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-11-30 10:47:21,085 INFO spawned: 'gunicorn' with pid 37
2018-11-30 10:47:21,584 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-11-30 10:47:22,585 INFO exited: nginx (exit status 1; not expected)
2018-11-30 10:47:23,587 INFO spawned: 'nginx' with pid 39
2018-11-30 10:47:26,107 INFO exited: nginx (exit status 1; not expected)
{"host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "path": "/home/mmlspark/lib/conda/lib/python3.5/site-packages/gunicorn/glogging.py", "stack_info": null, "level": "INFO", "timestamp": "2018-11-30T10:47:26.484448Z", "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "msg": "Starting gunicorn %s", "logger": "gunicorn.error", "message": "Starting gunicorn 19.6.0"}
{"host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "path": "/home/mmlspark/lib/conda/lib/python3.5/site-packages/gunicorn/glogging.py", "stack_info": null, "level": "INFO", "timestamp": "2018-11-30T10:47:26.485314Z", "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "msg": "Listening at: %s (%s)", "logger": "gunicorn.error", "message": "Listening at: http://127.0.0.1:9090 (37)"}
{"host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "path": "/home/mmlspark/lib/conda/lib/python3.5/site-packages/gunicorn/glogging.py", "stack_info": null, "level": "INFO", "timestamp": "2018-11-30T10:47:26.485423Z", "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "msg": "Using worker: %s", "logger": "gunicorn.error", "message": "Using worker: sync"}
{"host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "path": "/home/mmlspark/lib/conda/lib/python3.5/site-packages/gunicorn/glogging.py", "stack_info": null, "level": "INFO", "timestamp": "2018-11-30T10:47:26.486036Z", "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "logger": "gunicorn.error", "message": "worker timeout is set to 300"}
{"host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "path": "/home/mmlspark/lib/conda/lib/python3.5/site-packages/gunicorn/glogging.py", "stack_info": null, "level": "INFO", "timestamp": "2018-11-30T10:47:26.486965Z", "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "msg": "Booting worker with pid: %s", "logger": "gunicorn.error", "message": "Booting worker with pid: 42"}
Initializing logger
{"host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "path": "/var/azureml-app/aml_logger.py", "stack_info": null, "level": "INFO", "timestamp": "2018-11-30T10:47:27.764771Z", "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "logger": "root", "message": "{\"requestId\": \"00000000-0000-0000-0000-000000000000\", \"apiName\": \"\", \"message\": \"Starting up app insights client\"}"}
{"host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "path": "/var/azureml-app/aml_logger.py", "stack_info": null, "level": "INFO", "timestamp": "2018-11-30T10:47:27.764991Z", "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "logger": "root", "message": "{\"requestId\": \"00000000-0000-0000-0000-000000000000\", \"apiName\": \"\", \"message\": \"Starting up request id generator\"}"}
{"host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "path": "/var/azureml-app/aml_logger.py", "stack_info": null, "level": "INFO", "timestamp": "2018-11-30T10:47:27.765182Z", "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "logger": "root", "message": "{\"requestId\": \"00000000-0000-0000-0000-000000000000\", \"apiName\": \"\", \"message\": \"Starting up app insight hooks\"}"}
{"host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "path": "/var/azureml-app/aml_logger.py", "stack_info": null, "level": "INFO", "timestamp": "2018-11-30T10:47:27.765357Z", "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "logger": "root", "message": "{\"requestId\": \"00000000-0000-0000-0000-000000000000\", \"apiName\": \"\", \"message\": \"Invoking user's init function\"}"}
2018-11-30 10:47:27,766 | azureml.core.run | DEBUG | Could not load run context Failed to load a submitted run, if outside of an execution context, use project.start_run to initialize an azureml.core.Run., switching offline: False
2018-11-30 10:47:27,766 | azureml.core.run | DEBUG | Could not load the run context and allow_offline set to False
2018-11-30 10:47:27,766 | azureml.core.model | DEBUG | RunEnvironmentException: Failed to load a submitted run, if outside of an execution context, use project.start_run to initialize an azureml.core.Run.
2018-11-30 10:47:27,767 | azureml.core.model | DEBUG | version is None. Latest version is 1
2018-11-30 10:47:27,767 | azureml.core.model | DEBUG | Found model path at azureml-models/tfidf_hp/1/tfidf_hp.pkl
2018-11-30 10:47:27,767 | azureml.core.run | DEBUG | Could not load run context Failed to load a submitted run, if outside of an execution context, use project.start_run to initialize an azureml.core.Run., switching offline: False
2018-11-30 10:47:27,767 | azureml.core.run | DEBUG | Could not load the run context and allow_offline set to False
2018-11-30 10:47:27,767 | azureml.core.model | DEBUG | RunEnvironmentException: Failed to load a submitted run, if outside of an execution context, use project.start_run to initialize an azureml.core.Run.
2018-11-30 10:47:27,767 | azureml.core.model | DEBUG | version is None. Latest version is 1
2018-11-30 10:47:27,767 | azureml.core.model | DEBUG | Found model path at azureml-models/LinearSVC_hp/1/LinearSVC_hp.pkl
2018-11-30 10:47:28,770 INFO spawned: 'nginx' with pid 47
2018-11-30 10:47:31,285 INFO exited: nginx (exit status 1; not expected)
2018-11-30 10:47:32,286 INFO gave up: nginx entered FATAL state, too many start retries too quickly
2018-11-30 10:47:33,287 WARN program_exit: bad result line: 'Killing supervisor with this event: ver:3.0 server:supervisor serial:0 pool:program_exit poolserial:0 eventname:PROCESS_STATE_FATAL len:52'
2018-11-30 10:47:33,288 WARN program_exit: has entered the UNKNOWN state and will no longer receive events, this usually indicates the process violated the eventlistener protocol
2018-11-30 10:47:33,288 WARN received SIGQUIT indicating exit request
2018-11-30 10:47:33,288 INFO waiting for gunicorn, rsyslog, program_exit to die
{"host": "wk-caas-f040dbaa1e2744a3a1625e4209a1afd1-d292a1cb1e1f2ed953cb9b", "path": "/home/mmlspark/lib/conda/lib/python3.5/site-packages/gunicorn/glogging.py", "stack_info": null, "level": "INFO", "timestamp": "2018-11-30T10:47:33.288883Z", "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "msg": "Handling signal: %s", "logger": "gunicorn.error", "message": "Handling signal: term"}
2018-11-30 10:47:36,293 INFO waiting for gunicorn, rsyslog, program_exit to die
2018-11-30 10:47:39,296 INFO waiting for gunicorn, rsyslog, program_exit to die
2018-11-30 10:47:42,300 INFO waiting for gunicorn, rsyslog, program_exit to die
2018-11-30 10:47:43,301 WARN killing 'gunicorn' (37) with SIGKILL
2018-11-30 10:47:44,304 INFO stopped: gunicorn (terminated by SIGKILL)
2018-11-30 10:47:45,305 INFO waiting for rsyslog, program_exit to die
2018-11-30 10:47:46,307 INFO stopped: program_exit (terminated by SIGTERM)
2018-11-30 10:47:46,308 INFO stopped: rsyslog (exit status 0)

diegoschapira commented 5 years ago

Found out it was an issue with memory when creating the deployment. Increasing to 4 GB solved it.
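
For anyone hitting the same thing: the container memory is set through the ACI deploy configuration, e.g. (sketch; the other arguments stay as in the original deployment):

from azureml.core.webservice import AciWebservice

# Give the container enough memory to load both models; the smaller default allocation was not enough here
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=4)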