aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

Model cannot be loaded in the SageMaker endpoint after update of SageMaker SDK to 2.212 #4488

Open Neptun332 opened 8 months ago

Neptun332 commented 8 months ago

Describe the bug

Model cannot be loaded in the SageMaker endpoint after updating the SageMaker SDK to 2.212.

To reproduce

model_builder = ModelBuilder(
    model_path=model_path,
    schema_builder=SchemaBuilder(sample_input, sample_output, input_translator=InputTranslator()),
    content_type='application/x-image',
    mode=Mode.SAGEMAKER_ENDPOINT,
    role_arn=role_arn,
    image_uri=image,
    inference_spec=InferenceSpec()
)
built_model = model_builder.build()
built_model.deploy(
    instance_type="ml.c6i.2xlarge",
    endpoint_name="my-endpoint-name",  # endpoint names may contain only alphanumerics and hyphens
    initial_instance_count=1,
)
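For readers unfamiliar with the pieces referenced above: `InferenceSpec` and `InputTranslator` are user-supplied classes. The sketch below is hypothetical and standalone (it does not import `sagemaker`, whose real base classes live under `sagemaker.serve`); it only illustrates the shape of the load/invoke contract that `ModelBuilder` calls into.

```python
# Hypothetical, self-contained sketch of the user-defined classes from the
# repro above. In real code, MyInferenceSpec would subclass the InferenceSpec
# base class from sagemaker.serve; shown here without that import.

class MyInferenceSpec:
    """Tells the serving stack how to load and invoke the model."""

    def load(self, model_dir: str):
        # In a real spec this would deserialize the model artifact,
        # e.g. torch.load(f"{model_dir}/model.pt"). Placeholder here:
        return lambda x: x

    def invoke(self, input_object, model):
        # Run inference on an already-translated input.
        return model(input_object)


class MyInputTranslator:
    """Converts the raw request payload into the model's input format."""

    def serialize(self, payload):
        return payload  # placeholder: pass payload through unchanged

    def deserialize(self, stream, content_type="application/x-image"):
        return stream  # placeholder: pass stream through unchanged


spec = MyInferenceSpec()
model = spec.load("/opt/ml/model")
result = spec.invoke(42, model)
print(result)
```

The key point for this issue: the endpoint container imports the `sagemaker` package listed in the generated requirements.txt before any of this user code runs, which is why a bad SDK release can kill the worker before `load` is ever called.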

Expected behavior

The endpoint loads the model and continues serving requests, as it did before the SDK update.

Screenshots or logs

2024-03-07T10:23:04.572+01:00   Model server started.
2024-03-07T10:23:04.572+01:00   2024-03-07T09:23:04,338 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=64
2024-03-07T10:23:04.572+01:00   2024-03-07T09:23:04,341 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2024-03-07T10:23:04.572+01:00   2024-03-07T09:23:04,351 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Successfully loaded /opt/conda/lib/python3.10/site-packages/ts/configs/metrics.yaml.
2024-03-07T10:23:04.572+01:00   2024-03-07T09:23:04,351 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - [PID]64
2024-03-07T10:23:04.572+01:00   2024-03-07T09:23:04,352 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Torch worker started.
2024-03-07T10:23:04.572+01:00   2024-03-07T09:23:04,352 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Python runtime: 3.10.9
2024-03-07T10:23:04.572+01:00   2024-03-07T09:23:04,357 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2024-03-07T10:23:04.572+01:00   2024-03-07T09:23:04,366 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2024-03-07T10:23:04.572+01:00   2024-03-07T09:23:04,371 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1709803384370
2024-03-07T10:23:05.324+01:00   2024-03-07T09:23:04,409 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1
2024-03-07T10:23:05.324+01:00   2024-03-07T09:23:05,201 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
2024-03-07T10:23:05.575+01:00   2024-03-07T09:23:05,202 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
2024-03-07T10:23:05.575+01:00   2024-03-07T09:23:05,553 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend worker process died.
2024-03-07T10:23:05.575+01:00   2024-03-07T09:23:05,553 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2024-03-07T10:23:05.575+01:00   2024-03-07T09:23:05,553 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 253, in <module>
2024-03-07T10:23:05.575+01:00   2024-03-07T09:23:05,554 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - worker.run_server()
2024-03-07T10:23:05.575+01:00   2024-03-07T09:23:05,554 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 221, in run_server
2024-03-07T10:23:05.575+01:00   2024-03-07T09:23:05,554 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
2024-03-07T10:23:05.575+01:00   2024-03-07T09:23:05,555 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 184, in handle_connection
2024-03-07T10:23:05.575+01:00   2024-03-07T09:23:05,555 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service, result, code = self.load_model(msg)
2024-03-07T10:23:05.575+01:00   2024-03-07T09:23:05,555 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 131, in load_model
2024-03-07T10:23:05.575+01:00   2024-03-07T09:23:05,555 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2024-03-07T10:23:05.575+01:00   2024-03-07T09:23:05,556 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service = model_loader.load(
2024-03-07T10:23:05.575+01:00   2024-03-07T09:23:05,556 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.

System information

(not provided)

Additional context

The SageMaker endpoint had been working for a while and successfully processing requests. After a restart, the endpoint installed the latest version of the SageMaker SDK (2.212), stopped processing requests, and printed the logs above. I noticed that ModelBuilder creates a model package with a requirements.txt containing sagemaker>=2.199. Changing that to sagemaker==2.199 solved the issue.
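The pin workaround described above can be sketched as follows. This is illustrative only: the `code/requirements.txt` path is an assumption about where the ModelBuilder-generated file lives in a given model package, and the file is created here just to make the example self-contained.

```shell
# Hypothetical illustration of the workaround: ModelBuilder generates a
# requirements.txt containing an open-ended "sagemaker>=2.199" specifier,
# which lets the endpoint pull in the broken 2.212 release on restart.
mkdir -p code
printf 'sagemaker>=2.199\n' > code/requirements.txt  # what ModelBuilder generated

# Pin the SDK to the last known-good version before (re)deploying.
sed -i 's/^sagemaker>=2\.199$/sagemaker==2.199/' code/requirements.txt
cat code/requirements.txt
```

Pinning with `==` trades automatic updates for reproducibility, which is usually the right call for a deployed endpoint's runtime dependencies.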

samruds commented 8 months ago

Taking a look. Will reproduce the error locally today.

samruds commented 8 months ago

Hello, we have identified and fixed the problem; it was caused by an extra dependency that was added to ModelBuilder.

Please pull in the latest commit of the SDK if you are still seeing an issue with this version. Specifically, pull in this commit: https://github.com/aws/sagemaker-python-sdk/pull/4549

samruds commented 8 months ago

Short-term mitigations are:

  1. Pass a custom dependency:

model_builder = ModelBuilder(
    # mode=Mode.SAGEMAKER_ENDPOINT,  # use Mode.LOCAL_CONTAINER for local testing
    mode=Mode.LOCAL_CONTAINER,
    model_path=resnet_model_dir,
    inference_spec=my_inference_spec,
    schema_builder=my_schema,
    role_arn=execution_role,
    dependencies={
        "custom": [
            "accelerate==0.24.1",
        ],
    },
)

  2. Install the extras that bring in accelerate, if using the ModelBuilder interface: !pip install --force-reinstall --no-cache-dir --quiet "sagemaker[huggingface]>=2.212.0"
samruds commented 8 months ago

I will sync with the SDK team on Monday on next steps to work with the customer.