boto / boto3

AWS SDK for Python
https://aws.amazon.com/sdk-for-python/
Apache License 2.0

EOF occurred in violation of protocol (_ssl.c:2426) #4162

Closed Dhruv-reviv closed 2 months ago

Dhruv-reviv commented 3 months ago

Describe the bug

I am trying to get inference from a deployed pretrained model in a SageMaker notebook environment. While executing the line `response = predictor.predict(serialized_data)`,

I receive the following SSL error: SSLError: SSL validation failed for https://runtime.sagemaker.us-east-1.amazonaws.com/endpoints/pytorch-inference-2024-06-11-13-39-50-210/invocations EOF occurred in violation of protocol (_ssl.c:2426)

Expected Behavior

I should receive a response. I tested the same piece of code manually in the notebook environment without deploying the model, just loading the weights from the S3 bucket and running inference.

Code: `input_data = (interaction_data, mt_data)` followed by `results = predict_fn(input_data, model)`

Result: Iteration at 0: auc 0.964, map 0.439

Current Behavior

Assuming that the model gets created properly as defined below,

```python
pytorch_model = PyTorchModel(
    model_data=f's3://{model_bucket}/{model_key}',
    role=role,
    entry_point='inference.py',
    framework_version='1.8.1',  # Specify the PyTorch version
    py_version='py3',
    sagemaker_session=sagemaker_session,
)
```

and the deployment takes place properly as

```python
predictor = pytorch_model.deploy(instance_type='ml.m5.large', initial_instance_count=1)
```

While executing `response = predictor.predict(serialized_data)`, I receive the error below:

SSLError: SSL validation failed for https://runtime.sagemaker.us-east-1.amazonaws.com/endpoints/pytorch-inference-2024-06-11-13-39-50-210/invocations EOF occurred in violation of protocol (_ssl.c:2426)

Reproduction Steps

1. Define your data paths and model bucket:

   ```python
   interaction_data = "s3://path_to_pkl/interaction.pkl"
   auxiliary_data = "s3://path_to_pkl/auxiliary.pkl"

   model_bucket = 'path_to_model_bucket'
   model_key = 'Model-Structure/model.tar.gz'
   ```

2. Arrange your data properly:

   ```python
   with open("interaction.pkl", 'rb') as f:
       data1 = CPU_Unpickler(f).load()
   with open("auxiliary.pkl", 'rb') as f:
       data2 = CPU_Unpickler(f).load()

   serialized_data = pickle.dumps({'data1': data1, 'data2': data2})
   ```

3. Define your model:

   ```python
   pytorch_model = PyTorchModel(
       model_data=f's3://{model_bucket}/{model_key}',
       role=role,
       entry_point='inference.py',
       framework_version='1.8.1',  # Specify the PyTorch version
       py_version='py3',
       sagemaker_session=sagemaker_session,
   )
   ```

4. Deploy the model:

   ```python
   predictor = pytorch_model.deploy(instance_type='ml.m5.large', initial_instance_count=1)
   ```

5. Get the response:

   ```python
   response = predictor.predict(serialized_data)
   ```

I also have an inference.py which I am using for model evaluation and obtaining the response, as is general practice for the SageMaker environment.
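For context, such a script follows the standard SageMaker PyTorch serving interface (model_fn / input_fn / predict_fn / output_fn). The sketch below is only an illustration of that interface; the model loading and payload handling are placeholders, not my actual code.

```python
# Hypothetical sketch of an inference.py for the SageMaker PyTorch container.
# The model artifact name and the pickled payload structure are assumptions.
import os
import pickle

import torch


def model_fn(model_dir):
    # Assumes model.tar.gz contains a fully pickled model object at model.pth.
    model = torch.load(os.path.join(model_dir, "model.pth"), map_location="cpu")
    model.eval()
    return model


def input_fn(request_body, request_content_type):
    # The client sends a pickled dict, so unpickle it here.
    return pickle.loads(request_body)


def predict_fn(input_data, model):
    # Run inference without tracking gradients.
    with torch.no_grad():
        return model(input_data["data1"], input_data["data2"])


def output_fn(prediction, accept):
    # Serialize the prediction back to the client.
    return pickle.dumps(prediction)
```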

Possible Solution

No response

Additional Information/Context

No response

SDK version used

1.34.101

Environment details (OS name and version, etc.)

AWS Sagemaker Jupyter Notebook

tim-finnigan commented 3 months ago

Thanks for reaching out. The `SSL validation failed...` error has been reported several times in the past, and I recommend looking through those issues. The troubleshooting section in the AWS CLI User Guide also highlights some possible causes (both the CLI and Boto3 use botocore under the hood, so the troubleshooting steps apply to both).


If you're still seeing an issue, please share a complete but minimal code snippet to reproduce the problem, along with debug logs (with sensitive info redacted), which you can get by adding boto3.set_stream_logger('') to your script.
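For reference, enabling those logs is just a couple of lines at the top of the script; this is a minimal sketch, nothing specific to your code:

```python
import boto3

# Emit boto3/botocore debug logs to stderr; redact credentials and payload
# contents before sharing the output.
boto3.set_stream_logger('')

# ...the rest of the script (deploy + predictor.predict) runs unchanged below.
```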

Dhruv-reviv commented 3 months ago

Hi @tim-finnigan, I have already tried various hacks/tricks from other SSL validation issues before opening a new one. I also tried downgrading boto3 to version 1.28.63, which someone suggested works; however, in the SageMaker environment I am not really able to activate the new environment I created, and going the other way, downgrading from the notebook itself does not work because it still picks up the original version.
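(As a general pattern, not my exact attempt: pinning a package inside the notebook kernel's own interpreter looks roughly like the sketch below, and the kernel still has to be restarted before the new version is picked up.)

```python
import subprocess
import sys

# Install the pinned version into the interpreter the notebook kernel uses,
# not whatever "pip" happens to be first on PATH.
subprocess.check_call([sys.executable, "-m", "pip", "install", "boto3==1.28.63"])

# Restart the kernel afterwards; an already-imported boto3 keeps the old version.
```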

Following your suggestion, I added boto3.set_stream_logger('') to my script; however, on executing `response = predictor.predict(serialized_data)`, the kernel died with the logging message attached in the image.

Here's the code for you to work with:

```python
import logging
import boto3
import pickle
import sagemaker
from sagemaker.pytorch import PyTorchModel

# Enable logging for boto3 and specify the logger name
boto3.set_stream_logger()

# Define the S3 bucket and key where the model artifacts are stored
model_bucket = 'Bucket Name'
model_key = 'Model-Structure/model.tar.gz'

# Load the interaction and auxiliary data
with open("interaction.pkl", 'rb') as f:
    data1 = CPU_Unpickler(f).load()
with open("auxiliary.pkl", 'rb') as f:
    data2 = CPU_Unpickler(f).load()

# Serialize the data for prediction
serialized_data = pickle.dumps({'data1': data1, 'data2': data2})

# Create a PyTorchModel object for deploying the model
pytorch_model = PyTorchModel(
    model_data=f's3://{model_bucket}/{model_key}',  # S3 URI for the model artifacts
    role=role,                                      # IAM role with necessary permissions
    entry_point='inference.py',                     # Path to the inference script
    framework_version='1.13.1',                     # PyTorch version
    py_version='py39',                              # Python version
    sagemaker_session=sagemaker_session             # SageMaker session
)

# Deploy the model to a SageMaker endpoint
predictor = pytorch_model.deploy(
    instance_type='ml.m5.xlarge',  # Instance type
    initial_instance_count=1       # Number of instances
)

# Use the deployed model to make predictions
response = predictor.predict(serialized_data)

# Print the prediction response
print(response)
```

Everything executes till the response command.

(attached screenshot: SSL_Error — kernel/logging output)
Dhruv-reviv commented 3 months ago

I think I know what the error implicitly points at: the test data was too big, hence the error. However, even taking 1% of the data, I am receiving a different error. If anyone has an idea on how to approach this, help is welcome. The code and inference.py still remain the same:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/pytorch-inference-2024-06-17-14-40-29-350 in account 953765082453 for more information.
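For anyone hitting the same timeout: the error message points at the endpoint's CloudWatch logs, and a minimal way to pull them from the notebook with boto3 is sketched below. This assumes the notebook role can read CloudWatch Logs, and the log group name follows the /aws/sagemaker/Endpoints/<endpoint-name> convention used in the error above.

```python
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Log group for the endpoint named in the ModelError message.
log_group = "/aws/sagemaker/Endpoints/pytorch-inference-2024-06-17-14-40-29-350"

# Fetch the most recent events across the endpoint's container log streams.
events = logs.filter_log_events(logGroupName=log_group, limit=50)
for event in events["events"]:
    print(event["message"])
```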

tim-finnigan commented 3 months ago

I saw a related issue for the Python Sagemaker SDK: https://github.com/aws/sagemaker-python-sdk/issues/1119 and documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-troubleshooting.html.

You can configure timeouts/retries in Boto3 as documented here: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html, but it looks like this is a limitation imposed by the SageMaker service. Again, if you want us to review further, please share logs, which you can get by adding boto3.set_stream_logger('') to your script.
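As a rough sketch (not your exact setup; the endpoint name, content type, and payload below are placeholders), configuring those client-side timeouts/retries against the sagemaker-runtime client could look like this. Note the service-side invocation timeout on the endpoint is not affected by these client settings.

```python
import boto3
from botocore.config import Config

# Raise the client-side read timeout and retry budget.
config = Config(
    read_timeout=120,
    connect_timeout=10,
    retries={"max_attempts": 3, "mode": "standard"},
)

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1", config=config)

serialized_data = b"..."  # placeholder for the pickled payload from the repro steps

response = runtime.invoke_endpoint(
    EndpointName="pytorch-inference-2024-06-17-14-40-29-350",  # placeholder endpoint name
    ContentType="application/octet-stream",
    Body=serialized_data,
)
print(response["Body"].read())
```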

Dhruv-reviv commented 3 months ago

Hey Tim, thanks for your email. I did check out the issue and all the surrounding ideas/implementations, but nothing is helping. I adapted async inference to try to get more time and data capacity; however, it ran for an hour (which is the limit for async inference) and the kernel died. I was testing with very limited data, maybe 50 MB tops, and still couldn't get any inferences out.
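(For anyone following along: switching to async inference with the SageMaker Python SDK generally looks something like the sketch below. This is a generic sketch with placeholder S3 paths, reusing the pytorch_model object from the snippet earlier in the thread; it is not the exact code used here.)

```python
# Generic sketch of deploying the same model behind an asynchronous endpoint.
# output_path and the input S3 URI are placeholders.
from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path="s3://path_to_output_bucket/async-results/",
)

# pytorch_model is the PyTorchModel object defined earlier in the thread.
async_predictor = pytorch_model.deploy(
    instance_type="ml.m5.xlarge",
    initial_instance_count=1,
    async_inference_config=async_config,
)

# Async invocations read the payload from S3 rather than the request body.
response = async_predictor.predict_async(
    input_path="s3://path_to_pkl/serialized_payload.pkl",
)

# The response object points at the S3 location where the result will land.
print(response.output_path)
```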

Thanks, Dhruv


github-actions[bot] commented 2 months ago

Greetings! It looks like this issue hasn’t been active in longer than five days. We encourage you to check if this is still an issue in the latest release. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or upvote with a reaction on the initial post to prevent automatic closure. If the issue is already closed, please feel free to open a new one.

RwGrid commented 2 weeks ago

Any update on this thread, @Dhruv-reviv?

Dhruv-reviv commented 2 weeks ago

Hi @RwGrid, I do not have any further updates. I stopped working on this and modified the requirements instead.