xidianwym opened this issue 1 year ago
Hi @xidianwym ,
Can you share the command-lines you used, please? We'd like to have a clear way to reproduce the issue.
Thanks, Julien
I ran the following command to build the TensorRT-LLM engine for the Hugging Face Baichuan 2 13B Base model in TensorRT-LLM/examples/baichuan/, using one A100:

```bash
python build.py \
    --dtype bfloat16 \
    --model_dir /xxxxx \
    --use_inflight_batching \
    --paged_kv_cache \
    --max_batch_size 4 \
    --max_input_len 2048 \
    --max_output_len 2048 \
    --use_gpt_attention_plugin bfloat16 \
    --enable_context_fmha \
    --use_gemm_plugin bfloat16 \
    --output_dir /xxxx
```
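(As a quick sanity check on which plugin options actually ended up in the built engine, the generated `config.json` can be read back; a minimal sketch using the same keys that `utils.py` below reads, with a placeholder engine path:)

```python
import json
import os

engine_dir = "/xxxx"  # placeholder: same path as --output_dir in the build command
with open(os.path.join(engine_dir, "config.json")) as f:
    config = json.load(f)

# These are the same fields the runtime script below reads.
print("paged_kv_cache      :", config["plugin_config"]["paged_kv_cache"])
print("remove_input_padding:", config["plugin_config"]["remove_input_padding"])
print("gpt_attention_plugin:", config["plugin_config"]["gpt_attention_plugin"])
print("tokens_per_block    :", config["plugin_config"]["tokens_per_block"])
print("precision           :", config["builder_config"]["precision"])
```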
I use the following script `utils.py` to construct the inference service:
```python
import argparse
import logging
import csv
import json
import os
from pathlib import Path
from typing import List, Union

import numpy as np
import torch
from transformers import AutoTokenizer

import tensorrt_llm
from tensorrt_llm.runtime import ModelConfig, SamplingConfig, GenerationSession
from tensorrt_llm.runtime.generation import Mapping
from tensorrt_llm.quantization import QuantMode

from build import get_engine_name

now_dir = os.path.dirname(os.path.abspath(__file__))


# copied from tensorrt_llm/runtime/generation.py for debugging
class BaichuanForCausalLMGenerationSession(GenerationSession):

    def __init__(
        self,
        model_config: ModelConfig,
        engine_buffer,
        mapping: Mapping,
        debug_mode=False,
        debug_tensors_to_save=None,
        cuda_graph_mode=False,
        stream: torch.cuda.Stream = None,
    ):
        super().__init__(
            model_config,
            engine_buffer,
            mapping,
            debug_mode,
            debug_tensors_to_save=debug_tensors_to_save,
            cuda_graph_mode=cuda_graph_mode,
            stream=stream,
        )

    def _prepare_for_chat(self, instruction, tokenizer):
        input_tokens = []
        input_lengths = []
        input_tokens.append(tokenizer.encode(instruction, add_special_tokens=False))
        input_tokens = torch.tensor(input_tokens, dtype=torch.int32, device="cuda")
        input_lengths.append(input_tokens[0].shape[-1])
        input_lengths = torch.IntTensor(input_lengths).type(torch.int32).cuda()
        return input_tokens, input_lengths

    def chat_stream(
        self,
        tokenizer,
        eos_token_id,
        sampling_config: SamplingConfig,
        input_text: Union[str, List[str]],
        max_input_length: Union[int, None] = None,
        max_new_tokens: Union[int, None] = None,
        runtime_rank: int = 0,
    ):
        input_ids, input_lengths = self._prepare_for_chat(
            instruction=input_text,
            tokenizer=tokenizer,
        )
        max_input_length = torch.max(input_lengths).item()
        # set up batch_size, max_context_length, max_new_tokens
        self.setup(
            batch_size=input_lengths.size(0),
            max_context_length=max_input_length,
            max_new_tokens=max_new_tokens,
        )
        with torch.no_grad():
            chunk_lengths = input_lengths.clone()
            for output_ids in self.decode(
                input_ids,
                input_lengths,
                sampling_config,
                streaming=True,
            ):
                torch.cuda.synchronize()
                if runtime_rank == 0:
                    output_texts = []
                    for i in range(output_ids.size(0)):
                        temp_ids = output_ids[i, 0, chunk_lengths[i]:]
                        pure_ids = []
                        for j in range(len(temp_ids)):
                            if temp_ids[j] == eos_token_id:
                                pure_ids = temp_ids[:j]
                                break
                        if len(pure_ids) == 0:
                            pure_ids = temp_ids
                        if pure_ids.size(0) == 0:
                            continue
                        temp_text = tokenizer.decode(pure_ids, skip_special_tokens=True)
                        # skip chunks that decode to an incomplete UTF-8 (replacement) character
                        if b"\xef\xbf\xbd" in temp_text.encode():
                            continue
                        chunk_lengths[i] += pure_ids.size(0)
                        output_texts.append(temp_text)
                    if len(output_texts) > 0:
                        yield output_texts
        torch.cuda.empty_cache()


def get_model(tokenizer_dir, engine_dir, log_level="error"):
    # load the tokenizer and engine
    tensorrt_llm.logger.set_level(log_level)
    tokenizer = AutoTokenizer.from_pretrained(
        tokenizer_dir, use_fast=False, trust_remote_code=True
    )
    config_path = os.path.join(engine_dir, "config.json")
    with open(config_path, "r") as f:
        config = json.load(f)
    gen_config_path = os.path.join(tokenizer_dir, "generation_config.json")
    with open(gen_config_path, "r") as f:
        gen_config = json.load(f)
    top_k = gen_config["top_k"]
    top_p = gen_config["top_p"]
    eos_token_id = gen_config["eos_token_id"]
    pad_token_id = gen_config["pad_token_id"]
    use_gpt_attention_plugin = config["plugin_config"]["gpt_attention_plugin"]
    remove_input_padding = config["plugin_config"]["remove_input_padding"]
    paged_kv_cache = config["plugin_config"]["paged_kv_cache"]
    tokens_per_block = config["plugin_config"]["tokens_per_block"]
    dtype = config["builder_config"]["precision"]
    tp_size = config["builder_config"]["tensor_parallel"]
    world_size = tp_size
    assert (
        world_size == tensorrt_llm.mpi_world_size()
    ), f"Engine world size ({world_size}) != Runtime world size ({tensorrt_llm.mpi_world_size()})"
    num_heads = config["builder_config"]["num_heads"] // world_size
    hidden_size = config["builder_config"]["hidden_size"] // world_size
    vocab_size = config["builder_config"]["vocab_size"]
    num_layers = config["builder_config"]["num_layers"]
    runtime_rank = tensorrt_llm.mpi_rank()
    runtime_mapping = tensorrt_llm.Mapping(world_size, runtime_rank, tp_size=world_size)
    torch.cuda.set_device(runtime_rank % runtime_mapping.gpus_per_node)
    repetition_penalty = 1.1
    temperature = 0.3
    top_k = 5
    top_p = 0.85
    model_config = ModelConfig(
        num_heads=num_heads,
        num_kv_heads=num_heads,
        hidden_size=hidden_size,
        vocab_size=vocab_size,
        num_layers=num_layers,
        gpt_attention_plugin=use_gpt_attention_plugin,
        paged_kv_cache=paged_kv_cache,
        tokens_per_block=tokens_per_block,
        remove_input_padding=remove_input_padding,
        dtype=dtype,
    )
    sampling_config = SamplingConfig(
        end_id=eos_token_id,
        pad_id=pad_token_id,
        num_beams=1,
        repetition_penalty=repetition_penalty,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
    )
    engine_name = get_engine_name("baichuan", dtype, world_size, runtime_rank)
    serialize_path = os.path.join(engine_dir, engine_name)
    print(f"Loading engine from {serialize_path}")
    return (
        model_config,
        sampling_config,
        runtime_mapping,
        runtime_rank,
        serialize_path,
        remove_input_padding,
        tokenizer,
        eos_token_id,
        pad_token_id,
    )
```
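For what it's worth, the same repeated `setup()`/`decode()` pattern can also be reproduced without the FastAPI layer by driving `chat_stream` in a plain loop; a minimal sketch (the paths and prompt are placeholders):

```python
# standalone_repro.py -- minimal sketch, paths are placeholders
from utils import get_model, BaichuanForCausalLMGenerationSession

(
    model_config,
    sampling_config,
    runtime_mapping,
    runtime_rank,
    serialize_path,
    remove_input_padding,
    tokenizer,
    eos_token_id,
    pad_token_id,
) = get_model("/path/to/tokenizer", "/path/to/engine")

with open(serialize_path, "rb") as f:
    engine_buffer = f.read()

session = BaichuanForCausalLMGenerationSession(
    model_config, engine_buffer, runtime_mapping
)

# Repeatedly call setup()/decode() via chat_stream to observe memory behavior.
for i in range(20):
    for chunk in session.chat_stream(
        tokenizer=tokenizer,
        eos_token_id=eos_token_id,
        sampling_config=sampling_config,
        input_text="please introduce China",
        max_new_tokens=200,
    ):
        pass  # discard the streamed text; only memory usage matters here
    print(f"request {i} done")
```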
I use the following script `api.py` to construct the FastAPI server:

```python
import asyncio
import datetime
import json
import time
from typing import List, Literal, Optional, Union

import uvicorn
from fastapi import FastAPI, HTTPException, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse, Response, StreamingResponse
from pydantic import BaseModel, Field

from utils import get_model, BaichuanForCausalLMGenerationSession

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

tokenizer_dir = "/xxx"
engine_dir = "/xxxx"
log_level = "error"

(
    model_config,
    sampling_config,
    runtime_mapping,
    runtime_rank,
    serialize_path,
    remove_input_padding,
    tokenizer,
    eos_token_id,
    pad_token_id,
) = get_model(tokenizer_dir, engine_dir, log_level)

with open(serialize_path, "rb") as f:
    engine_buffer = f.read()

decoder = BaichuanForCausalLMGenerationSession(
    model_config,
    engine_buffer,
    runtime_mapping,
)


@app.get("/")
async def root():
    return "Hello! This is QWen-Chat-7B API."


@app.post("/stream_chat/")
async def stream_chat(request: Request):
    data = await request.json()
    query = data["instruction"]
    max_output_length = data["max_new_tokens"]
    STREAM_DELAY = 1  # second
    RETRY_TIMEOUT = 15000  # millisecond

    async def event_generator(query, max_output_length, sampling_config):
        for new_text in decoder.chat_stream(
            tokenizer=tokenizer,
            eos_token_id=eos_token_id,
            sampling_config=sampling_config,
            input_text=query,
            max_new_tokens=max_output_length,
        ):
            # If the client closes the connection, stop sending events
            if await request.is_disconnected():
                break
            # Check for new messages and return them to the client, if any
            try:
                ret = {
                    "text": new_text[0],
                }
                yield (json.dumps(ret, ensure_ascii=False)).encode("utf-8")
            except StopIteration:
                await asyncio.sleep(STREAM_DELAY)

    return StreamingResponse(event_generator(query, max_output_length, sampling_config))


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=48000, workers=1)
```
Please note that the above two scripts are placed in the directory TensorRT-LLM/examples/baichuan/.
The test code is as follows:
```python
import requests
from http import HTTPStatus
import json
from concurrent.futures import ProcessPoolExecutor

url = "http://xxxxxxxx/stream_chat/"
data = {
    "instruction": "please introduce China",
    "max_new_tokens": 200,
}
headers = {"Content-Type": "application/json"}

response = requests.post(url, json=data, headers=headers)
if response.status_code == HTTPStatus.OK:
    output = ""
    for line in response.iter_lines():
        if line:
            json_data = line.decode("utf-8")
            try:
                json_obj = json.loads(json_data)
                output += json_obj["text"]
            except json.JSONDecodeError:
                continue
    print("[Output]\n", output)
else:
    print("Error:", response.status_code, response.reason)
```
I execute the test script and send requests serially. After the previous request finishes, the GPU memory is not released, and each subsequent request keeps increasing GPU memory usage until OOM.
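To make the growth easier to observe, allocator statistics can be logged between requests; a minimal sketch using standard PyTorch APIs (where to call it, e.g. at the end of `stream_chat`, is just an illustrative choice):

```python
import torch


def log_gpu_memory(tag: str) -> None:
    """Print PyTorch allocator statistics. Memory held by the TensorRT engine
    and KV cache outside PyTorch's allocator will not show up here, so the
    numbers should also be compared against nvidia-smi between requests."""
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    peak = torch.cuda.max_memory_allocated() / 1024**2
    print(f"[{tag}] allocated={allocated:.0f} MiB "
          f"reserved={reserved:.0f} MiB peak={peak:.0f} MiB")
```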
I noticed that you are using the `--paged_kv_cache` feature when building the TensorRT-LLM engine. I also observed GPU OOM errors in my tests and found that removing this feature resolved them. I opened (and subsequently closed) an issue about this here: https://github.com/NVIDIA/TensorRT-LLM/issues/237

Can you try removing `--paged_kv_cache` and re-running your script?
Thanks for your comment. The issue is resolved by removing this option. However, I need `--use_inflight_batching` at the build stage, and when that option is used, `--paged_kv_cache` is enabled by default.
I hit the same issue when running the test. Is there any way to solve this while keeping `--paged_kv_cache`?
@xidianwym we recently landed memory-leak and memory-usage fixes. Could you try the latest 0.6.1 release or the main branch to see if you still hit the issue?
Also, please note that the paged KV cache is better supported in the C++ runtime. Could you try running the engine with the C++ API, or with the Python bindings for the C++ runtime? See the doc: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_runtime.md
> Same issue when run test, any way to solve this with --paged_kv_cache kept?

I tested with v0.7.1 using Baichuan2-7B, without `paged_kv_cache`, and also got OOM after about 50 requests.
I use the code in TensorRT-LLM/examples/baichuan/build.py to compile the Baichuan model with the `--use_inflight_batching` option, then deploy the compiled model as a TensorRT-LLM inference service. When I run a stress test to measure QPS, I observe that GPU memory usage grows without bound across repeated calls to `GenerationSession.setup(...)` and `GenerationSession.decode(...)`. How can this be solved?
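For reference, the stress driver is essentially a concurrent version of the test script above; a minimal sketch (the endpoint, payload, request count, and concurrency level are placeholders):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder endpoint and payload; replace with the actual server address.
URL = "http://xxxxxxxx/stream_chat/"
PAYLOAD = {"instruction": "please introduce China", "max_new_tokens": 200}


def one_request(_):
    """Send a single streaming request and drain the response."""
    resp = requests.post(URL, json=PAYLOAD, stream=True)
    for _line in resp.iter_lines():
        pass  # drain the stream; only completion matters here
    return resp.status_code


def stress(num_requests=100, concurrency=4):
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        statuses = list(pool.map(one_request, range(num_requests)))
    elapsed = time.time() - start
    print(f"{num_requests} requests in {elapsed:.1f}s "
          f"-> {num_requests / elapsed:.2f} QPS")
    print("non-200 responses:", sum(s != 200 for s in statuses))


if __name__ == "__main__":
    stress()
```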