aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

Baseline Job for ModelExplainabilityMonitor failing due to ClientError: An error occurred (ModelError) when calling the InvokeEndpoint operation (reached max retries: 0): Received server error (500) #3494

Open irdanish11 opened 1 year ago

irdanish11 commented 1 year ago

Describe the bug

I am trying to create a SageMaker ModelExplainabilityMonitor for one of my ML models. The SHAPConfig requires a SHAP baseline, which I compute by taking the mean of each feature, as suggested here. When I call suggest_baseline(), it starts the SageMaker processing job and creates the shadow endpoint, but the processing job then fails with the endpoint invocation error given below:

ClientError: An error occurred (ModelError) when calling the InvokeEndpoint operation (reached max retries: 0): Received server error (500) from primary with message "<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> <title>500 Internal Server Error</title> <h1>Internal Server Error</h1> <p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p> ". See https://eu-west-2.console.aws.amazon.com/cloudwatch/home?region=eu-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/sm-clarify-pipelines-xqkqe9pekm5f-MACEModel-2Al-1669797780-45e1 in account 450538937006 for more information.

When I check the CloudWatch logs of the shadow endpoint created by the baseline job, they show the exception behind the 500 error:

ERROR - random_forest_training - Exception on /invocations [POST]

Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_functions.py", line 93, in wrapper
    return fn(*args, **kwargs)
  File "/opt/ml/code/random_forest_training.py", line 38, in predict_fn
    prediction = model[0].predict_proba(input_data)
  File "/miniconda3/lib/python3.7/site-packages/sklearn/ensemble/_forest.py", line 673, in predict_proba
    X = self._validate_X_predict(X)
  File "/miniconda3/lib/python3.7/site-packages/sklearn/ensemble/_forest.py", line 421, in _validate_X_predict
    return self.estimators_[0]._validate_X_predict(X, check_input=True)
  File "/miniconda3/lib/python3.7/site-packages/sklearn/tree/_classes.py", line 388, in _validate_X_predict
    X = check_array(X, dtype=DTYPE, accept_sparse="csr")
  File "/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 623, in check_array
    "if it contains a single sample.".format(array))

ValueError: Expected 2D array, got 1D array instead: array=[-0.07272727 -0.538843    0.21109799 -0.11960932  0.23030303 -0.09173553
 -0.17808585 -0.19966942 -0.06921487  0.01707989  0.          0.
 -0.02214876 -0.17888805  0.00661157 -0.04977043  0.01818182  0.15619835
  0.39504132 -0.05785124  0.01157025].

Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The array shown in the error contains the values of the SHAP baseline, which I computed from the mean of the features. I am already passing them as a nested list.
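One possible explanation, offered as an assumption rather than a confirmed diagnosis: the baseline is a single row, and a single CSV row can be parsed into a 1D array even when it was supplied as a nested list. The sketch below uses NumPy's genfromtxt as a stand-in for whatever CSV parsing the serving container performs (not verified here). The helper name parse_csv and the sample values are hypothetical.

```python
import io

import numpy as np


def parse_csv(payload: str) -> np.ndarray:
    """Parse a CSV payload into a NumPy array (hypothetical stand-in for the
    container's own deserialization, which is an assumption here)."""
    return np.genfromtxt(io.StringIO(payload), delimiter=",")


# A single-row payload, like a one-row SHAP baseline serialized as text/csv.
one_row = parse_csv("0.1,0.2,0.3")

# A multi-row payload, like several records from a dataset.
two_rows = parse_csv("0.1,0.2,0.3\n0.4,0.5,0.6")

# Promote the single row back to shape (1, n_features).
one_row_2d = np.atleast_2d(one_row)

print(one_row.shape, two_rows.shape, one_row_2d.shape)
```

If the container behaves like this, the nested list on the client side would not survive serialization, which would match the 1D array reported in the traceback.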

The data, model and shap config are given below:

import pandas as pd
from sagemaker.clarify import DataConfig, ModelConfig, SHAPConfig

df = pd.read_csv("test.csv")
columns = df.columns.to_list()

data_config = DataConfig(
    s3_data_input_path="test.csv",
    s3_output_path=baseline_results_uri,
    label=target_variable,
    headers=columns,
    dataset_type="text/csv",
)

model_config = ModelConfig(
    model_name=model_name,
    instance_count=1,
    instance_type="ml.m5.large",
    content_type="text/csv",
    accept_type="text/csv",
)

df_features = df.drop(target_variable, axis=1)
shap_baseline = [df_features.mean().to_list()]
shap_config = SHAPConfig(
    baseline=shap_baseline,
    num_samples=100,
    agg_method="mean_abs",
    save_local_shap_values=False,
)

My model expects 2D input (i.e. all features in a nested list such as [[0.32, 0.56, ..., 0.7]]). As I understand it, the error is raised in the predict call. What puzzles me is that the error shows the SHAP baseline array as the input rather than the data from my CSV. I also do not know how to transform the data that is fed to the shadow endpoint.
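A minimal sketch of one way to make the inference script tolerant of single-row input, assuming the predict_fn(input_data, model) signature and the model[0].predict_proba call seen in the traceback. The helper ensure_2d and the class _StubModel are hypothetical names introduced only for illustration. This is not a confirmed fix for the issue.

```python
import numpy as np


def ensure_2d(input_data) -> np.ndarray:
    """Return input_data as a 2D float array of shape (n_samples, n_features)."""
    arr = np.asarray(input_data, dtype=float)
    if arr.ndim == 1:
        # Treat a flat array as one sample with many features.
        arr = arr.reshape(1, -1)
    return arr


def predict_fn(input_data, model):
    """Coerce the payload to 2D before calling predict_proba."""
    return model[0].predict_proba(ensure_2d(input_data))


class _StubModel:
    """Hypothetical stand-in for the trained classifier, used only for this demo."""

    def predict_proba(self, X):
        if X.ndim != 2:
            raise ValueError("Expected 2D array")
        # Two dummy class probabilities per input row.
        return np.full((X.shape[0], 2), 0.5)


single = predict_fn([0.1, 0.2, 0.3], [_StubModel()])
batch = predict_fn([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], [_StubModel()])
print(single.shape, batch.shape)
```

The reshape treats any flat input as one sample, which suits a model with several features. A model with a single feature would need reshape(-1, 1) instead, as the scikit-learn error message notes.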

To reproduce

Run the suggest_baseline() method of ModelExplainabilityMonitor with an ML model that expects 2D input (i.e. all features in a nested list such as [[0.32, 0.56, ..., 0.7]]).

Expected behavior

The baseline job should run successfully and generate the reports.

System information

chinmay002 commented 7 months ago

Were you able to solve this? I am facing a similar issue. I tried the code from the SageMaker documentation.