aws / sagemaker-huggingface-inference-toolkit


config.json file not found when loading model from AWS S3 bucket. #22

Open ZayTheory opened 3 years ago

ZayTheory commented 3 years ago

Ok, so I am trying to deploy this model: https://huggingface.co/sshleifer/distilbart-cnn-12-6 with a custom inference.py to an endpoint on Amazon Web Services. First, however, I am trying to deploy it as is, before I add the custom inference.py file.

I will step you through the steps I have taken so far in hopes that you can tell me what I am doing wrong.

1) Download the model files using git clone https://huggingface.co/sshleifer/distilbart-cnn-12-6

2) Compress the 5.2 GB model into a model.tar.gz using the command 'tar -czf model.tar.gz distilbart-cnn-12-6'

3) Upload the model.tar.gz to my S3 bucket

4) Deploy my model using this script

(screenshot of the deployment script)
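Since the script was only shared as a screenshot, here is a rough sketch of what a deployment script for a model archive on S3 typically looks like. The model_data path is a placeholder, and the framework versions are copied from the snippet later in this thread rather than from the screenshot:

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

# execution role with SageMaker permissions (assumes this runs in a SageMaker notebook)
role = sagemaker.get_execution_role()

# create Hugging Face Model Class pointing at the uploaded archive
huggingface_model = HuggingFaceModel(
    model_data='s3://my-bucket/model.tar.gz',  # placeholder S3 path
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    role=role,
)

# deploy the model to a real-time SageMaker endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,     # number of instances
    instance_type='ml.m5.xlarge'  # ec2 instance type
)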

Whenever I run this, the endpoint successfully deploys. However, when I try to run a prediction using

predictor.predict({ 'inputs': "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct." })

I get the following error in my AWS logs:

2021-07-22 16:02:37,654 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ValueError: ("You need to define one of the following ['feature-extraction', 'text-classification', 'token-classification', 'question-answering', 'table-question-answering', 'fill-mask', 'summarization', 'translation', 'text2text-generation', 'text-generation', 'zero-shot-classification', 'conversational', 'image-classification'] as env 'TASK'.", 403)

So I went into the source code here and found the relevant section (screenshot of the toolkit source where this error is raised).

So for some reason, my config.json file is not being loaded. I think it has something to do with the model directory not being in the right place, and I am kind of lost. Any help would be much appreciated!!!

philschmid commented 3 years ago

Hey @ZayTheory, can you share the structure of your model.tar.gz? And could you try it again by creating the archive with the following steps?

  1. Download the model

    git lfs install
    git clone https://huggingface.co/sshleifer/distilbart-cnn-12-6
  2. Create a tar file

    cd distilbart-cnn-12-6
    tar zcvf model.tar.gz *
  3. Upload model.tar.gz to s3

    aws s3 cp model.tar.gz <s3://mymodel>
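
If the archive is created this way (from inside the model folder), listing it, e.g. with tar -tzf model.tar.gz, should show the model files at the top level rather than nested inside a distilbart-cnn-12-6/ directory, roughly like this (file names as in the sshleifer/distilbart-cnn-12-6 repository; the exact set may differ):

    config.json
    pytorch_model.bin
    tokenizer_config.json
    vocab.json
    merges.txt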

You could also deploy the model directly from the Hugging Face Hub, without creating an archive and uploading it to S3, using the following snippet:

from sagemaker.huggingface import HuggingFaceModel
import boto3

iam_client = boto3.client('iam')
role = iam_client.get_role(RoleName='{IAM_ROLE_WITH_SAGEMAKER_PERMISSIONS}')['Role']['Arn']
# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID':'sshleifer/distilbart-cnn-12-6',
    'HF_TASK':'summarization'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    env=hub,
    role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1, # number of instances
    instance_type='ml.m5.xlarge' # ec2 instance type
)

predictor.predict({
    'inputs': "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."
})

ZayTheory commented 3 years ago

Thank you for your tips! I followed your steps and it uploaded fine, but now I have run into a new issue. I think the previous problem was that I was compressing the folder rather than the individual files, which caused the directory issue. Now my model works on AWS, but it is not utilizing the GPU. The CPU takes a long time to generate summaries, so GPU-based inference is necessary (it's the difference between 1 second and 16 seconds per inference). My goal is to create a custom inference.py that truncates the inputs at the endpoint so an overly long input doesn't trigger a CUDA assert. Here is my inference.py file:

(screenshot of the custom inference.py)
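Since the file itself was only attached as an image, here is a rough sketch of what a custom inference.py along these lines could look like. The model_fn/predict_fn names are the toolkit's standard override hooks, but the tokenizer/model classes, the max_length of 1024, and the returned format are illustrative assumptions rather than the actual code from the screenshot:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def model_fn(model_dir):
    # load tokenizer and model from the unpacked model.tar.gz
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
    # move the model to GPU when one is available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    return model, tokenizer

def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    device = next(model.parameters()).device
    # truncate overly long inputs so they never exceed the model's maximum length
    inputs = tokenizer(
        data["inputs"],
        truncation=True,
        max_length=1024,  # illustrative limit for distilbart-cnn-12-6
        return_tensors="pt",
    ).to(device)
    summary_ids = model.generate(**inputs)
    return {"summary_text": tokenizer.batch_decode(summary_ids, skip_special_tokens=True)}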

What edits should I make to this code to detect and utilize the GPU on my AWS instance?

I also thought that I should just not write a custom load function, given that the default load function seems to take care of hooking up the GPU already (screenshot of the toolkit's default load function).

But the problem I'm running into is figuring out how to access the tokenizer in either a custom predict_fn() or preprocess_fn().

Thanks so much for your help! Without you I would have been stuck for another week!

philschmid commented 3 years ago

Hey @ZayTheory,

I tried to extract your questions and answer them independently:

Putting a model on GPU:
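
(The original answer here appears to have been an image or snippet that has not survived. As a hedged sketch: if you do write a custom model_fn, the usual way to run the pipeline on GPU is to pass device=0 when CUDA is available; the code below is illustrative and not the toolkit's exact default implementation.)

import torch
from transformers import pipeline

def model_fn(model_dir):
    # device=0 selects the first GPU; device=-1 means CPU for transformers pipelines
    device = 0 if torch.cuda.is_available() else -1
    return pipeline("summarization", model=model_dir, tokenizer=model_dir, device=device)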

Truncate input:

predictor.predict({
    "inputs": long_input,
    "parameters": {
        "truncation":True
    }
})

P.S. You can also use the parameters to control generation options, e.g. max_length.
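For example (the max_length value here is purely illustrative):

predictor.predict({
    "inputs": long_input,
    "parameters": {
        "truncation": True,
        "max_length": 142  # illustrative cap on the generated summary length
    }
})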

miniweeds commented 2 years ago

My case is a bit different. I have a boto3 client to invoke the endpoint.

client = boto3.client('sagemaker-runtime')

I tried to add the truncation parameter in the request. It didn't work.

Request:

s = {"inputs": long_sentences, "parameters": {"truncation": True}}
b = json.dumps(s).encode('utf-8')
payload = b

content_type = "application/json"

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=content_type,
    Accept=accept,
    Body=payload,
)

Response:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from model with message "{ "code": 400, "type": "InternalServerException", "message": "The size of tensor a (577) must match the size of tensor b (512) at non-singleton dimension 1" }

philschmid commented 2 years ago

Hey @miniweeds

Could you please share more information, like which model you have deployed, how you created your endpoint, etc.?

This should work. You can also check out this forum thread where a different user had the same issue: https://discuss.huggingface.co/t/how-are-the-inputs-tokenized-when-model-deployment/9692/5. Feel free to open your own thread in the forum with more information: https://discuss.huggingface.co/c/sagemaker/17