deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

djl-inference and /ping API endpoint #2649

Closed: greenpau closed this issue 1 year ago

greenpau commented 1 year ago

Hi All, I am deploying a model on SageMaker and getting the following error. Any advice on how to resolve it?

UnexpectedStatusException: Error hosting endpoint gptneox-demo-v1-0-0-endpoint: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..

The CloudWatch logs say:

java.util.concurrent.CompletionException: java.io.FileNotFoundException: partitioned_model_ file not found in: /opt/ml/model/code
Caused by: java.io.FileNotFoundException: partitioned_model_ file not found in: /opt/ml/model/code

I am using the AWS-supplied framework container 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.3-cu117.

When I opened a case with support, they said that the container does not respond to requests to the /ping endpoint.

How do I supplement or modify the model so that it responds?
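For context, SageMaker's health check is a GET request to /ping on port 8080 of the container, and it expects an HTTP 200 response once the model has loaded; since model loading fails with the FileNotFoundException above, the server never reports healthy. A quick local smoke test (a sketch that assumes the container is also running on your own machine with its serving port published on localhost:8080, which is not part of this issue) would be:

import requests

# Assumes the djl-inference container is running locally with its serving
# port published on localhost:8080 (the port SageMaker probes with /ping).
resp = requests.get("http://localhost:8080/ping", timeout=5)
print(resp.status_code)  # SageMaker expects HTTP 200 once the model has loaded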


The model has 2 files:

%%writefile {model_base_path}/code/model.py
from djl_python import Input, Output
import os
from typing import Optional
import deepspeed
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

predictor = None  # generation pipeline, created lazily on the first request

def get_model():
    model_name = 'EleutherAI/gpt-neox-20b'
    tensor_parallel = int(os.getenv('TENSOR_PARALLEL_DEGREE', '1'))
    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    model = AutoModelForCausalLM.from_pretrained(model_name, revision="float32", torch_dtype=torch.float32)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    model = deepspeed.init_inference(model,
                                     mp_size=tensor_parallel,
                                     dtype=model.dtype,
                                     replace_method='auto',
                                     replace_with_kernel_inject=True)
    generator = pipeline(task='text-generation', model=model, tokenizer=tokenizer, device=local_rank)
    return generator

def handle(inputs: Input) -> Optional[Output]:
    global predictor
    if not predictor:
        predictor = get_model()

    if inputs.is_empty():
        # The model server makes an empty call to warm up the model on startup
        return None

    data = inputs.get_as_string()
    result = predictor(data, do_sample=True, min_new_tokens=200, max_new_tokens=256)
    return Output().add(result)

and

%%writefile {model_base_path}/code/serving.properties
engine=DeepSpeed

siddvenk commented 1 year ago

Thanks for raising this issue @greenpau. We are aware of this bug and it should be resolved in our next container release.

In the meantime, there are a couple of approaches that can help you work around this issue.

1) In your serving.properties file, add another property: option.tensor_parallel_degree=<value>. This is the easiest solution (a minimal sketch is shown after the Python example below).

2) You can leverage the DJLModel class from the SageMaker Python SDK to deploy this model with our built-in inference handlers. Sample code for this would be:

from sagemaker.djl_inference import DJLModel

role = <your_iam_role_arn>
model = DJLModel(
    "EleutherAI/gpt-neox-20b",
    role,
    number_of_partitions=8,
    dtype="fp32",
)

predictor = model.deploy("ml.g5.48xlarge")
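
For option 1, a minimal serving.properties sketch would look like this (the tensor parallel degree of 8 is illustrative; pick it to match your instance, as discussed below):

#serving.properties
engine=DeepSpeed
option.tensor_parallel_degree=8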

For the gpt-neox-20b model at full precision (fp32), the minimum total GPU memory required will be about 80GB. Using a g5.12xlarge with a tensor parallel degree of 4 will probably not be sufficient, since the additional memory required at runtime for inference will likely exceed the GPU memory available. I would recommend at least a g5.48xlarge with a tensor parallel degree of 8.
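
As a rough back-of-the-envelope check (an estimate of weight memory only; it ignores KV cache, activations, and DeepSpeed kernel overhead), the numbers work out like this:

# Rough estimate of GPU memory needed just for the gpt-neox-20b weights.
# Actual runtime usage is higher (KV cache, activations, DeepSpeed kernels).
params = 20e9  # ~20 billion parameters

for dtype, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2)]:
    total_gb = params * bytes_per_param / 1e9
    per_gpu_gb = total_gb / 8  # e.g. tensor parallel degree 8 on a g5.48xlarge
    print(f"{dtype}: ~{total_gb:.0f} GB total, ~{per_gpu_gb:.0f} GB per GPU at TP=8")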

greenpau commented 1 year ago

@siddvenk, thank you for the answer! What if I only have access to a g5.24xlarge? What do I need to change in my code?

siddvenk commented 1 year ago

For a g5.24xlarge, I would recommend that you use dtype="fp16" or dtype="bf16" with a tensor parallel degree of 4. Additionally, I would recommend that you set option.low_cpu_mem_usage=true. Your configuration/code would then look like this:

Option 1: Using your own inference code and serving.properties

#serving.properties
engine=DeepSpeed
option.entryPoint=model.py
option.tensor_parallel_degree=4
option.dtype=fp16
option.low_cpu_mem_usage=True

Option 2: Using the SageMaker Python SDK

from sagemaker.djl_inference import DJLModel

model = DJLModel(
    "EleutherAI/gpt-neox-20b",
    <your_iam_role>,
    number_of_partitions=4,
    dtype="fp16",
    low_cpu_mem_usage=True
)

predictor = model.deploy("ml.g5.24xlarge")
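
Once deployed, invoking the endpoint might look like the sketch below. The payload shape ({"inputs": ..., "parameters": ...}) and the default JSON serialization are assumptions about the built-in handlers, so check the DJL and SageMaker SDK docs for your container version:

# Assumes the DJLModel predictor defaults to JSON serialization and that the
# built-in handler accepts an "inputs"/"parameters" style payload (assumption).
result = predictor.predict({
    "inputs": "Large language models are",
    "parameters": {"do_sample": True, "max_new_tokens": 256},
})
print(result)
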
greenpau commented 1 year ago

@siddvenk, thank you very much! 👍 I went with option 2.