DJLServing supports both client-side batching and server-side batching (also called dynamic batching).
Here is a simple example that handles a server-side batch: https://github.com/deepjavalibrary/djl-serving/blob/master/engines/python/src/test/resources/resnet18/model.py#L89
For text_generation tasks, you can send a JSON input with multiple queries:
{
  "inputs": [
    "The new movie that got Oscar this year",
    "The Large Language Model is"
  ],
  "parameters": {
    "max_new_tokens": 256,
    "do_sample": true
  }
}
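A minimal client-side sketch of sending such a multi-query request from Python, assuming DJL Serving is listening on its default HTTP port 8080 and the model is registered under the name "model" (both are assumptions; adjust the URL for your deployment):

# Sketch only: one HTTP request carrying several prompts.
# The endpoint URL and model name below are assumptions, not from this thread.
import requests

payload = {
    "inputs": [
        "The new movie that got Oscar this year",
        "The Large Language Model is",
    ],
    "parameters": {"max_new_tokens": 256, "do_sample": True},
}

resp = requests.post(
    "http://localhost:8080/predictions/model",
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json())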
The built-in DeepSpeed handler does support both client-side and server-side batching.
@frankfliu thanks, I managed to get it to work, but I noticed that inference is still done sequentially when I pass in multiple inputs as per the example you gave.
I would expect the inference to be done in parallel, so that latency stays roughly constant as the number of inputs grows (assuming the inputs/outputs fit into memory), but instead it scales linearly.
I am not sure what settings I have to change, as I am already altering the default values for batch_size and max_batch_delay. I am running inference on an AWS EC2 g5.2xlarge instance:
serving.properties
engine=DeepSpeed
option.tensor_parallel_degree=1
option.parallel_loading=true
option.s3url=s3://BUCKET
batch_size=8
max_batch_delay=1000
model.py
import os
from typing import Any, Dict, List, Optional, Tuple, Union

import deepspeed
import torch
from djl_python import Input, Output
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline,
    Pipeline,
    PreTrainedTokenizerBase,
)

predictor = None


def get_inference_toolkit(properties: Dict[str, Any]) -> Pipeline:
    # Load the model once, wrap it with DeepSpeed inference, and build a
    # text-generation pipeline pinned to this worker's GPU.
    model_name = properties["model_id"]
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
        use_safetensors=True,
    )
    model = deepspeed.init_inference(
        model, mp_size=properties["tensor_parallel_degree"]
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    generator = pipeline(
        task="text-generation", model=model, tokenizer=tokenizer, device=local_rank
    )
    return generator


def format_input_for_task(input_values: Union[List[str], str]) -> List[str]:
    # Normalize a single prompt or a list of prompts into a flat list.
    if not isinstance(input_values, list):
        input_values = [input_values]
    batch_inputs = []
    for values in input_values:
        batch_inputs += [values]
    return batch_inputs


def handle(inputs: Input) -> Output:
    global predictor
    if not predictor:
        predictor = get_inference_toolkit(inputs.get_properties())
    if inputs.is_empty():
        # Model server makes an empty call to warm up the model on startup
        return None
    content_type = inputs.get_property("Content-Type")
    input_data: List[str] = []
    input_size: List[int] = []
    model_kwargs: Dict[str, Any] = {}
    batches = inputs.get_batches()
    device = int(os.getenv("LOCAL_RANK", "0"))
    if content_type is not None and content_type.startswith("application/json"):
        first = True
        for item in batches:
            json_input = item.get_as_json()
            if isinstance(json_input, dict):
                input_size.append(len(json_input.get("inputs")))
                input_data.extend(format_input_for_task(json_input.pop("inputs")))
                if first:
                    model_kwargs = json_input.pop("parameters", {})
                    first = False
                else:
                    if model_kwargs != json_input.pop("parameters", {}):
                        return Output().error(
                            "In order to enable dynamic batching, all input batches must have the same parameters"
                        )
            else:
                input_size.append(len(json_input))
                input_data.extend(json_input)
    else:
        for item in batches:
            input_size.append(1)
            input_data.append(item.get_as_string())
    outputs = Output()
    offset = 0
    results = predictor(input_data, **model_kwargs)
    batch_size = inputs.get_batch_size()
    # Split the flat result list back into one chunk per original request.
    for i in range(batch_size):
        result = results[offset : offset + input_size[i]]
        outputs.add(result, key=inputs.get_content().key_at(i))
        offset += input_size[i]
    outputs.add_property("content-type", "application/json")
    return outputs
Request Body
{
  "inputs": [
    "What is the purpose of life?",
    "What do you think lies ahead for humanity?",
    "What do you think lies ahead for humanity?"
  ],
  "parameters": {
    "max_length": 100
  }
}
Inference time for different numbers of inputs:
Number of inputs: 1
CPU times: user 6.11 ms, sys: 435 µs, total: 6.55 ms
Wall time: 4.22 s
Number of inputs: 2
CPU times: user 6.04 ms, sys: 265 µs, total: 6.3 ms
Wall time: 7.31 s
Number of inputs: 3
CPU times: user 2.81 ms, sys: 3.74 ms, total: 6.55 ms
Wall time: 10.4 s
Managed to fix it by specifying the batch_size in the pipeline call:
results = predictor(input_data, batch_size=len(input_data), **model_kwargs)
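As a hedged aside, the same call can also cap how many prompts go through the model in one forward pass, so a large server-side batch does not turn into a single oversized generation step. MAX_PIPELINE_BATCH below is an illustrative constant I am introducing, not a DJL or Transformers setting:

# Sketch only: drop-in variant of the call above that chunks the prompts.
# MAX_PIPELINE_BATCH is a made-up tuning knob; pick it by profiling your GPU.
MAX_PIPELINE_BATCH = 8

results = predictor(
    input_data,
    batch_size=min(len(input_data), MAX_PIPELINE_BATCH),
    **model_kwargs,
)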
One thing I am wondering is how batching works when I have multiple requests coming in. Let's say I have 3 clients that each make a request with 3 inputs at the same time, and the batch_size in serving.properties is set to 8.
Would 8 of those 9 requests get grouped into a single prediction that is run in parallel, and would the remainder then either be batched up with other requests or run on its own once the max_batch_delay value is reached?
Another question I have is about the batch_size in the pipeline: would I need to set that value dynamically, depending on how much memory is currently left (including a buffer value), to prevent OOM issues?
@yudhiesh
For dynamic batching, the DJLServing frontend has a workload manager that puts all HTTP requests into a job queue. It will try to group up to batch_size requests into a batch and send them to the backend to execute the inference. If max_batch_delay expires before enough requests have been collected, DJLServing just sends whatever requests are available to the backend.
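To make that concrete, here is a toy Python sketch (not DJL code) of the grouping described above, applied to the 3-clients-with-3-inputs example; the Job class and group_jobs helper are purely illustrative:

# Toy illustration of the frontend workload manager: each HTTP request -- not
# each prompt inside it -- is one job in the queue, and up to batch_size jobs
# are grouped per dispatch.
from dataclasses import dataclass
from typing import List

@dataclass
class Job:
    client: str
    prompts: List[str]

def group_jobs(queue: List[Job], batch_size: int) -> List[List[Job]]:
    # Group queued jobs into dispatch batches of at most batch_size jobs.
    return [queue[i:i + batch_size] for i in range(0, len(queue), batch_size)]

# Three clients each send one request containing three prompts at the same time.
queue = [Job(f"client-{i}", [f"prompt-{i}-{j}" for j in range(3)]) for i in range(3)]

for batch in group_jobs(queue, batch_size=8):
    prompts = [p for job in batch for p in job.prompts]
    print(f"dispatch {len(batch)} requests -> {len(prompts)} prompts in one handle() call")

# With batch_size=8, all three requests fit in one batch, so the backend handler
# sees 9 prompts via inputs.get_batches(). If fewer than batch_size jobs arrive,
# the frontend waits at most max_batch_delay ms before dispatching what it has.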
Problem
I am deploying Llama2-7b on AWS SageMaker Endpoints with DJL-Serving as the backend. It works as expected without many issues for a batch_size of 1, but when I increase the batch_size to a value > 1 I start getting a lot of errors:
I think the current inference code does not support batch_size > 1:
Is there an example that supports both batch_size = 1 and batch_size > 1 for inference? I have noticed that there is an example using batch predict here, but the links to model.py and the DeepSpeed handler do not work.