aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

SageMaker wrongly adds an additional `/` when using `S3DataSource` with a nested structure #4249

Open philschmid opened 8 months ago

philschmid commented 8 months ago

Describe the bug SageMaker wrongly adds a `/` when using `S3DataSource` with a model whose files are stored in a nested structure; see the screenshot of my S3 directory layout. (screenshot of the S3 directory structure)

To reproduce

  1. Have a model with a nested structure, e.g. Stable Diffusion
  2. try to deploy the model using S3DataSource, e.g. below
from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data={'S3DataSource':{'S3Uri': s3_model_uri + "/",'S3DataType': 'S3Prefix','CompressionType': 'None'}},
   role=role,                      # iam role with permissions to create an Endpoint
   transformers_version="4.34.1",  # transformers version used
   pytorch_version="1.13.1",       # pytorch version used
   py_version='py310',             # python version used
   model_server_workers=1,         # number of workers for the model server
)

# deploy the model to an endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,      # number of instances
    instance_type="ml.inf2.xlarge", # AWS Inferentia Instance
    volume_size = 100
)
# ignore the "Your model is not compiled. Please compile your model before using Inferentia." warning, we already compiled our model.
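A possible client-side workaround (a hedged sketch, not a fix in the SDK itself): strip any trailing slash from the prefix before appending exactly one, so the `S3Uri` passed to `S3DataSource` never ends up contributing a double slash. The helper name below is hypothetical.

```python
def normalize_s3_prefix(uri: str) -> str:
    """Ensure the S3 prefix ends with exactly one '/' (hypothetical helper)."""
    return uri.rstrip("/") + "/"

# Both spellings of the prefix yield the same single-slash URI:
print(normalize_s3_prefix("s3://mybucket/neuronx/sdxl"))   # s3://mybucket/neuronx/sdxl/
print(normalize_s3_prefix("s3://mybucket/neuronx/sdxl/"))  # s3://mybucket/neuronx/sdxl/
```

The normalized value could then be used in place of `s3_model_uri + "/"` when building `model_data`.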

Expected behavior Deployed endpoint

Screenshots or logs Error: UnexpectedStatusException: Error hosting endpoint huggingface-pytorch-inference-neuronx-2023-11-07-14-07-46-274: Failed. Reason: error: Key of model data S3 object 's3://sagemaker-us-east-2-558105141721/neuronx/sdxl//text_encoder/model.neuron' maps to invalid local file path..
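For illustration, the `//` in the failing key is what naive concatenation produces when the prefix already ends in `/` and another `/` is inserted before the relative file path. This is a sketch of the symptom under that assumption, not the SDK's actual code:

```python
# User-supplied prefix, already ending in '/':
prefix = "s3://sagemaker-us-east-2-558105141721/neuronx/sdxl/"
relative_path = "text_encoder/model.neuron"

# If a '/' separator is added unconditionally, the key gains a double slash,
# matching the one shown in the error message:
key = prefix + "/" + relative_path
print(key)  # s3://sagemaker-us-east-2-558105141721/neuronx/sdxl//text_encoder/model.neuron
```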


whittech1 commented 7 months ago

What is the value of s3_model_uri on this line?

model_data={'S3DataSource':{'S3Uri': s3_model_uri + "/",'S3DataType': 'S3Prefix','CompressionType': 'None'}},

philschmid commented 7 months ago

I tried with `s3://mybucket/neuronx/sdxl/` and `s3://mybucket/neuronx/sdxl`. The structure is as shown in the image.

philschmid commented 7 months ago

Here is a full example https://github.com/philschmid/huggingface-inferentia2-samples/blob/main/stable-diffusion-xl/sagemaker-notebook.ipynb

You just need to change the "3. Upload the neuron model and inference script to Amazon S3" section and then "4. Deploy a Real-time Inference Endpoint on Amazon SageMaker"

trungleduc commented 7 months ago

Hi @philschmid, I tried your repo but cannot reproduce the issue. Does the instance_type matter?

(screenshot attached)

philschmid commented 7 months ago

I don't develop the SDK, but I tested with inf2.xlarge; maybe there is something different.

trungleduc commented 7 months ago

Could you test your code with other instance types?

philschmid commented 7 months ago

The error occurs with inf2.xlarge, the instance I want to use to deploy the model. That's where the error appears. Why do you want me to test another one?

trungleduc commented 7 months ago

I want to confirm whether the issue is in the SDK logic or in another place.

philschmid commented 7 months ago

@trungleduc, I understand you are trying to troubleshoot the root cause of the issue, but asking me to test on other instance types doesn't seem helpful at this point. As I mentioned, the error only occurs on inf2.xlarge with the versions I shared. It would be more productive to dig deeper into what specifically is failing on inf2.xlarge, i.e. where this extra `/` gets added.