Unable to load model in score script - calls to init() are failing

abhijelly commented 1 year ago

I have modified the demand forecasting template for my use case, but I'm unable to load my model in the score script. The model is a CatBoost model, which has its own load_model() method. I've tried the following approaches, all of which fail:

Approach 1: Loading the model registered in the workspace
```
model_path = Model.get_model_path("forecast_model/model.cb", version=1, _workspace=Workspace.from_config())
model = CatBoostRegressor.load_model(model_path)
```
azureml.exceptions._azureml_exception.UserErrorException: UserErrorException: Message: The workspace configuration file config.json, could not be found in /tmp/549412ac-6f87-41e3-914c-be6f53b12cda/azureml-bi/106 or its parent directories. Please check whether the workspace configuration file exists, or provide the full path to the configuration file as an argument. You can download a configuration file for your workspace, via http://ml.azure.com and clicking on the name of your workspace in the right top. InnerException None ErrorResponse { "error": { "code": "UserError", "message": "The workspace configuration file config.json, could not be found in /tmp/549412ac-6f87-41e3-914c-be6f53b12cda/azureml-bi/106 or its parent directories. Please check whether the workspace configuration file exists, or provide the full path to the configuration file as an argument. You can download a configuration file for your workspace, via http://ml.azure.com and clicking on the name of your workspace in the right top." } }
Approach 2: Passing the model_path as an argument in the ParallelRunStep
```
parser = argparse.ArgumentParser(description="get model path")
parser.add_argument("--model_path", type=str)
args = parser.parse_args()
model = CatBoostRegressor().load_model(args.model_path)
```
azureml_common.parallel_run.exception.ParallelTaskException: ParallelTaskException error. Exit code: 42. Message: Run failed. Below is the error detail: EntryScriptException: Entry script error. All tries to load the entry script or calling init() failed. Please check logs/user/error/ and logs/sys/error/ to see if some errors have occurred. No mini batch has been completed. Consider a succeeded mini batch or failed mini batch reached the max tries as completed. The init() function in the entry script had raised exception for 38 times. Please check logs at logs/user/error/* for details.

Error '2. usage: main.py [-h] [--model_path MODEL_PATH] main.py: error: unrecognized arguments: --client_sdk_version 1.47.0 --scoring_module_name [REDACTED]-forecast.py --mini_batch_size 1048576 --error_threshold -1 --output_action append_row --logging_level DEBUG --run_invocation_timeout 60 --run_max_try 3 --create_snapshot_at_runtime True --allowed_failed_count 0 --output /mnt/azureml/cr/j/e676d0761e0048e9b6c39bda36794d61/cap/data-capability/wd/parallelRunOutput --input_ds_0 raw_data --aml_core_version 1.48.0 --dataprep_version 4.8.4 --bf89b5b0_523f_4782_aedc_61bd625ee81a {"working_dir": "/mnt/azureml/cr/j/e676d0761e0048e9b6c39bda36794d61/exe/wd/0ae00b1b-8c86-4f38-ba77-b0538d66ee0b", "snapshot_dir": "/mnt/azureml/cr/j/e676d0761e0048e9b6c39bda36794d61/exe/wd", "port": 42085, "input_format": "TabularDataset", "agent_name": "process000", "inputs": ["raw_data"], "gpu_index": -1, "mini_batch_size": 1048576}.' occurred 2 times.
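The "unrecognized arguments" message above suggests that ParallelRunStep passes its own flags (such as --client_sdk_version and --mini_batch_size) to the entry script, and that argparse's strict parse_args() rejects them. A minimal sketch of one possible workaround, assuming the only flag the script itself needs is --model_path, is to use parse_known_args(), which ignores flags it does not know. The helper name parse_model_path is hypothetical.

```python
import argparse


def parse_model_path(argv=None):
    """Return the value of --model_path, ignoring any flags the script does not define."""
    parser = argparse.ArgumentParser(description="get model path")
    parser.add_argument("--model_path", type=str)
    # parse_known_args() returns (namespace, leftover_args) instead of
    # exiting with an error when it meets flags it does not recognize.
    args, _unknown = parser.parse_known_args(argv)
    return args.model_path
```

Inside init(), the call would then be parse_model_path() with no arguments, so that it reads sys.argv as before.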
Approach 3: Using the AZUREML_MODEL_DIR environment variable
```
model_path = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "metals_1_month_model/model.cb")
model = CatBoostRegressor.load_model(model_path)
```
File "/mnt/azureml/cr/j/f7ea2e8b202a4c6ca67b2c7bd8777fda/exe/wd/[REDACTED]-forecast.py", line 17, in init model_path = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "metals_1_month_model/model.cb") File "/opt/miniconda/lib/python3.8/posixpath.py", line 76, in join a = os.fspath(a) TypeError: expected str, bytes or os.PathLike object, not NoneType

AZUREML_MODEL_DIR is None for some reason.
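The TypeError above comes from passing None to os.path.join, which hides the real problem: the variable is not set in the worker process. This may be because AZUREML_MODEL_DIR is populated for deployed inference services rather than for ParallelRunStep workers, though that is an assumption, not something the logs confirm. A small sketch of a guard that fails with a clearer message (the helper name get_model_dir is hypothetical):

```python
import os


def get_model_dir(env_var="AZUREML_MODEL_DIR"):
    """Return the directory named by env_var, or raise a descriptive error if it is unset."""
    value = os.getenv(env_var)
    if not value:
        # Fail early with a readable message instead of a TypeError from os.path.join.
        raise RuntimeError(
            f"{env_var} is not set in this process; "
            "resolve the model path another way (for example, Model.get_model_path)."
        )
    return value
```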

Approach 4: Placing model.cb in the source directory and loading it directly with CatBoost's load_model() method

Entry script error. All tries to load the entry script or calling init() failed. Please check logs/user/error/ and logs/sys/error/ to see if some errors have occurred.No mini batch has been completed. Consider a succeeded mini batch or failed mini batch reached the max tries as completed. The init() function in the entry script had raised exception for 39 times. Please check logs at logs/user/error/ for details. Error 'catboost/libs/model/model_import_interface.h:19: Model file doesn't exist: model.cb' occurred 78 times.
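The message "Model file doesn't exist: model.cb" is consistent with the relative path being resolved against the worker's current working directory rather than the directory that holds the entry script (the Approach 2 log shows separate working_dir and snapshot_dir values). That is a guess from the logs, not a confirmed diagnosis. A minimal sketch that builds the path relative to the script instead (the helper name path_next_to is hypothetical):

```python
from pathlib import Path


def path_next_to(script_file, filename):
    """Return the absolute path of filename in the same directory as script_file."""
    # resolve() makes the script path absolute, so the result does not
    # depend on the process's current working directory.
    return Path(script_file).resolve().parent / filename
```

In the entry script this would be called as path_next_to(__file__, "model.cb"), and the result converted with str() before being passed to load_model().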

Any thoughts on what I might be doing wrong in these approaches? Any help would be highly appreciated. Thank you!

Hung20736 commented 1 year ago

For approach 1, you don't have a config.json file that specifies which subscription, resource group, and workspace you are using. The config.json should look something like this: { "subscription_id": "", "resource_group": "", "workspace_name": "" }

You can add the config.json file to the source_directory which is configured in ParallelRunConfig.

Or you can hard-code the workspace like this: ws = Workspace(subscription_id="<subscription_id>", resource_group="<resource_group>", workspace_name="<workspace_name>"), followed by model_path = Model.get_model_path("forecast_model/model.cb", version=1, _workspace=ws)

abhijelly commented 1 year ago

Thank you for answering!

The hard-coding approach worked for me, because during batch processing the parallel worker was not recognizing the workspace config file. One correction to your answer: Model.get_model_path() should be given only the model folder name, not the complete path to the model file. After getting the model folder path, append "model.cb" to it so that the model can be loaded.
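Putting the two replies together, the working init() might look roughly like the sketch below. The workspace values are placeholders, the model name "forecast_model" and file name "model.cb" are taken from the snippets in this thread, and the helper model_file_path is hypothetical. The SDK imports sit inside init() only so that the helper can be exercised without azureml-core or catboost installed.

```python
import os


def model_file_path(model_dir, filename="model.cb"):
    """Join the folder returned by Model.get_model_path() with the model file name."""
    path = os.path.join(model_dir, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Expected model file at {path}")
    return path


def init():
    # Lazy imports: only needed when the entry script actually runs on a worker.
    from azureml.core import Workspace
    from azureml.core.model import Model
    from catboost import CatBoostRegressor

    global model
    # Placeholder values; replace with the real workspace details.
    ws = Workspace(
        subscription_id="<subscription_id>",
        resource_group="<resource_group>",
        workspace_name="<workspace_name>",
    )
    # Pass only the model name here, not the path to the file inside it.
    model_dir = Model.get_model_path("forecast_model", version=1, _workspace=ws)
    # load_model() is an instance method, so create the regressor first.
    model = CatBoostRegressor()
    model.load_model(model_file_path(model_dir))
```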

Azure / MachineLearningNotebooks

Unable to load model in score script - calls to init() are failing #1877