aws / sagemaker-huggingface-inference-toolkit


Sagemaker HuggingfaceModel fails on phi3 model deployment #123

Open manikawnth opened 1 month ago

manikawnth commented 1 month ago

I'm not able to deploy the Phi3 model from the Hugging Face model hub to SageMaker. I tried multiple DLC containers, with and without trust_remote_code: true, but still can't get it to run.

I receive the following error:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 258, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 222, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 420, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
    model = FlashLlamaForCausalLM(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 368, in __init__
    self.model = FlashLlamaModel(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 292, in __init__
    [
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 293, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 232, in __init__
    self.self_attn = FlashLlamaAttention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 108, in __init__
    self.query_key_value = load_attention(config, prefix, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 43, in load_attention
    bias = config.attention_bias
  File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 263, in __getattribute__
    return super().__getattribute__(key)

AttributeError: 'Phi3Config' object has no attribute 'attention_bias' rank=0

2024-05-21T16:19:40.764815Z ERROR text_generation_launcher: Shard 0 failed to start
2024-05-21T16:19:40.764834Z  INFO text_generation_launcher: Shutting down shards

Error: ShardCannotStart


from sagemaker import get_execution_role, Session
import boto3
sagemaker_session = Session()
region = boto3.Session().region_name

# Get the execution role: use the notebook instance's execution role,
# or replace this with the ARN of a role that has the required permissions.
execution_role = get_execution_role()

image_uri = '763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.0.3-gpu-py310-cu121-ubuntu22.04-v2.0'

from sagemaker.huggingface import HuggingFaceModel

hub = {
  'HF_TASK': 'text-generation',
  'HF_MODEL_ID':'microsoft/Phi-3-mini-128k-instruct',
  'TRUST_REMOTE_CODE': 'true',
  'HF_MODEL_TRUST_REMOTE_CODE': 'true'
}

huggingface_model = HuggingFaceModel(
    env=hub,
    image_uri=image_uri,
    role=execution_role,
    sagemaker_session=sagemaker_session
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)
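
For completeness, once an endpoint does come up (e.g. with the fix discussed below), it can be invoked roughly like this. This is a minimal sketch: the payload keys follow the TGI text-generation schema, and the prompt and parameter values are placeholders.

# Invoke the endpoint; HuggingFacePredictor serializes the dict to JSON.
response = predictor.predict({
    "inputs": "Explain what Phi-3 is in one sentence.",
    "parameters": {"max_new_tokens": 128},
})
print(response)

# Clean up when finished to avoid ongoing charges.
# predictor.delete_endpoint()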
philschmid commented 1 month ago

We opened a PR to fix this. https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/68
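
Until that PR is merged, the endpoint can be pointed at the fixed revision. A minimal sketch, assuming the TGI DLC's SageMaker entrypoint honors the HF_MODEL_REVISION environment variable; the revision string is a placeholder for the Hub PR ref or commit that adds attention_bias, not a verified value.

hub = {
    'HF_MODEL_ID': 'microsoft/Phi-3-mini-128k-instruct',
    # Placeholder: Hub PRs are exposed as git refs of the form 'refs/pr/<number>'.
    'HF_MODEL_REVISION': 'refs/pr/68',
    'HF_MODEL_TRUST_REMOTE_CODE': 'true',
}
# The HuggingFaceModel(...) and deploy(...) calls are unchanged from the snippet above.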

manikawnth commented 1 month ago

@philschmid Thanks for that PR. It's working fine when I point it to that revision. However, shouldn't the issue actually be fixed upstream, by initializing config.attention_bias = False?

https://github.com/huggingface/text-generation-inference/blob/d32e33bd489f2419e579f5d423073791ee19f789/server/text_generation_server/models/flash_llama.py#L64

OR

https://github.com/huggingface/text-generation-inference/blob/d32e33bd489f2419e579f5d423073791ee19f789/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py#L51
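
For illustration, the suggested upstream change amounts to reading the flag with a default instead of an unconditional attribute access. A minimal, runnable sketch of that idea (not the actual TGI patch):

from transformers import AutoConfig

# Phi3Config does not define attention_bias, so a plain `config.attention_bias`
# read raises AttributeError; falling back to False mirrors LlamaConfig's default.
config = AutoConfig.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", trust_remote_code=True
)
bias = getattr(config, "attention_bias", False)
print(bias)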