NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Running out of GPU memory when I perform a throughput stress test to get the QPS value #252

Open xidianwym opened 1 year ago

xidianwym commented 1 year ago

I use the code in TensorRT-LLM/examples/baichuan/build.py to compile the Baichuan model with the --use_inflight_batching option, then deploy the compiled model behind a TensorRT-LLM inference service. When I run a throughput stress test to measure QPS, I observe that GPU memory usage grows without bound as GenerationSession.setup(...) and GenerationSession.decode(...) are called repeatedly. How can I solve this problem?
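
(For anyone trying to reproduce this: an easy way to watch the growth is to poll `nvidia-smi` on the server host while the stress test runs. The snippet below is only an illustrative monitoring sketch, not part of the original report; it assumes `nvidia-smi` is on the PATH and the serving process uses GPU 0.)

```python
import subprocess
import time

# Poll nvidia-smi once per second and print the used memory of GPU 0,
# so the per-request growth during the stress test is easy to see.
# Stop with Ctrl-C once enough samples have been collected.
while True:
    used = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=memory.used",
            "--format=csv,noheader,nounits",
            "--id=0",
        ],
        text=True,
    ).strip()
    print(f"{time.strftime('%H:%M:%S')}  GPU0 memory used: {used} MiB")
    time.sleep(1)
```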

jdemouth-nvidia commented 1 year ago

Hi @xidianwym ,

Can you share the command-lines you used, please? We'd like to have a clear way to reproduce the issue.

Thanks, Julien

xidianwym commented 1 year ago

I ran the following command in TensorRT-LLM/examples/baichuan/ to build the TensorRT-LLM engine for the Baichuan 2 13B Base Hugging Face checkpoint on one A100:

```bash
python build.py \
    --dtype bfloat16 \
    --model_dir /xxxxx \
    --use_inflight_batching \
    --paged_kv_cache \
    --max_batch_size 4 \
    --max_input_len 2048 \
    --max_output_len 2048 \
    --use_gpt_attention_plugin bfloat16 \
    --enable_context_fmha \
    --use_gemm_plugin bfloat16 \
    --output_dir /xxxx
```

I use the following script `utils.py` to construct the inference service:

```python
import argparse
import logging
import csv
import json
import os
from pathlib import Path
from typing import List, Union

import numpy as np
import torch
from transformers import AutoTokenizer

import tensorrt_llm
from tensorrt_llm.runtime import ModelConfig, SamplingConfig, GenerationSession
from tensorrt_llm.runtime.generation import Mapping
from build import get_engine_name
from tensorrt_llm.quantization import QuantMode

now_dir = os.path.dirname(os.path.abspath(__file__))


# copy from tensorrt_llm/runtime/generation.py to debug
class BaichuanForCausalLMGenerationSession(GenerationSession):
    def __init__(
        self,
        model_config: ModelConfig,
        engine_buffer,
        mapping: Mapping,
        debug_mode=False,
        debug_tensors_to_save=None,
        cuda_graph_mode=False,
        stream: torch.cuda.Stream = None,
    ):
        super().__init__(
            model_config,
            engine_buffer,
            mapping,
            debug_mode,
            debug_tensors_to_save=debug_tensors_to_save,
            cuda_graph_mode=cuda_graph_mode,
            stream=stream,
        )

    def _prepare_for_chat(self, instruction, tokenizer):
        input_tokens = []
        input_lengths = []
        input_tokens.append(tokenizer.encode(instruction, add_special_tokens=False))
        input_tokens = torch.tensor(input_tokens, dtype=torch.int32, device="cuda")
        input_lengths.append(input_tokens[0].shape[-1])
        input_lengths = torch.IntTensor(input_lengths).type(torch.int32).cuda()

        return input_tokens, input_lengths

    def chat_stream(
        self,
        tokenizer,
        eos_token_id,
        sampling_config: SamplingConfig,
        input_text: Union[str, List[str]],
        max_input_length: Union[int, None] = None,
        max_new_tokens: Union[int, None] = None,
        runtime_rank: int = 0,
    ):
        input_ids, input_lengths = self._prepare_for_chat(
            instruction=input_text,
            tokenizer=tokenizer,
        )
        max_input_length = torch.max(input_lengths).item()
        # setup batch_size, max_input_length, max_output_len
        self.setup(
            batch_size=input_lengths.size(0),
            max_context_length=max_input_length,
            max_new_tokens=max_new_tokens,
        )
        with torch.no_grad():
            chunk_lengths = input_lengths.clone()
            for output_ids in self.decode(
                input_ids,
                input_lengths,
                sampling_config,
                streaming=True,
            ):
                torch.cuda.synchronize()
                if runtime_rank == 0:
                    output_texts = []
                    for i in range(output_ids.size(0)):
                        temp_ids = output_ids[i, 0, chunk_lengths[i] :]
                        pure_ids = []
                        for j in range(len(temp_ids)):
                            if temp_ids[j] == eos_token_id:
                                pure_ids = temp_ids[:j]
                                break
                        if len(pure_ids) == 0:
                            pure_ids = temp_ids
                        if pure_ids.size(0) == 0:
                            continue
                        temp_text = tokenizer.decode(pure_ids, skip_special_tokens=True)
                        # skip chunks that decode to the Unicode replacement character (incomplete UTF-8)
                        if b"\xef\xbf\xbd" in temp_text.encode():
                            continue
                        chunk_lengths[i] += pure_ids.size(0)
                        output_texts.append(temp_text)
                    if len(output_texts) > 0:
                        yield output_texts
        torch.cuda.empty_cache()


def get_model(tokenizer_dir, engine_dir, log_level="error"):
    # load the tokenizer and engine
    tensorrt_llm.logger.set_level(log_level)
    tokenizer = AutoTokenizer.from_pretrained(
        tokenizer_dir, use_fast=False, trust_remote_code=True
    )
    config_path = os.path.join(engine_dir, "config.json")
    with open(config_path, "r") as f:
        config = json.load(f)
    gen_config_path = os.path.join(tokenizer_dir, "generation_config.json")
    with open(gen_config_path, "r") as f:
        gen_config = json.load(f)
    top_k = gen_config["top_k"]
    top_p = gen_config["top_p"]
    eos_token_id = gen_config["eos_token_id"]
    pad_token_id = gen_config["pad_token_id"]

    use_gpt_attention_plugin = config["plugin_config"]["gpt_attention_plugin"]
    remove_input_padding = config["plugin_config"]["remove_input_padding"]
    paged_kv_cache = config["plugin_config"]["paged_kv_cache"]
    tokens_per_block = config["plugin_config"]["tokens_per_block"]
    dtype = config["builder_config"]["precision"]
    tp_size = config["builder_config"]["tensor_parallel"]
    world_size = tp_size
    assert (
        world_size == tensorrt_llm.mpi_world_size()
    ), f"Engine world size ({world_size}) != Runtime world size ({tensorrt_llm.mpi_world_size()})"
    num_heads = config["builder_config"]["num_heads"] // world_size
    hidden_size = config["builder_config"]["hidden_size"] // world_size
    vocab_size = config["builder_config"]["vocab_size"]
    num_layers = config["builder_config"]["num_layers"]

    runtime_rank = tensorrt_llm.mpi_rank()
    runtime_mapping = tensorrt_llm.Mapping(world_size, runtime_rank, tp_size=world_size)
    torch.cuda.set_device(runtime_rank % runtime_mapping.gpus_per_node)
    repetition_penalty = 1.1
    temperature = 0.3
    top_k = 5
    top_p = 0.85
    model_config = ModelConfig(
        num_heads=num_heads,
        num_kv_heads=num_heads,
        hidden_size=hidden_size,
        vocab_size=vocab_size,
        num_layers=num_layers,
        gpt_attention_plugin=use_gpt_attention_plugin,
        paged_kv_cache=paged_kv_cache,
        tokens_per_block=tokens_per_block,
        remove_input_padding=remove_input_padding,
        dtype=dtype,
    )
    sampling_config = SamplingConfig(
        end_id=eos_token_id,
        pad_id=pad_token_id,
        num_beams=1,
        repetition_penalty=repetition_penalty,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
    )

    engine_name = get_engine_name("baichuan", dtype, world_size, runtime_rank)
    serialize_path = os.path.join(engine_dir, engine_name)
    print(f"Loading engine from {serialize_path}")
    return (
        model_config,
        sampling_config,
        runtime_mapping,
        runtime_rank,
        serialize_path,
        remove_input_padding,
        tokenizer,
        eos_token_id,
        pad_token_id,
    )
```

I use the following script `api.py` to construct the FastAPI server:

```python
import asyncio
import datetime
import json
import time
from typing import List, Literal, Optional, Union

import uvicorn
from fastapi import FastAPI, HTTPException, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse, Response, StreamingResponse
from pydantic import BaseModel, Field

from utils import get_model, BaichuanForCausalLMGenerationSession

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
tokenizer_dir = "/xxx"
engine_dir = "/xxxx"
log_level = "error"
(
    model_config,
    sampling_config,
    runtime_mapping,
    runtime_rank,
    serialize_path,
    remove_input_padding,
    tokenizer,
    eos_token_id,
    pad_token_id,
) = get_model(tokenizer_dir, engine_dir, log_level)
with open(serialize_path, "rb") as f:
    engine_buffer = f.read()
decoder = BaichuanForCausalLMGenerationSession(
    model_config,
    engine_buffer,
    runtime_mapping,
)


@app.get("/")
async def root():
    return "Hello! This is QWen-Chat-7B API."


@app.post("/stream_chat/")
async def stream_chat(request: Request):
    data = await request.json()
    query = data["instruction"]
    max_output_length = data["max_new_tokens"]
    STREAM_DELAY = 1  # second
    RETRY_TIMEOUT = 15000  # millisecond

    async def event_generator(query, max_output_length, sampling_config):
        for new_text in decoder.chat_stream(
            tokenizer=tokenizer,
            eos_token_id=eos_token_id,
            sampling_config=sampling_config,
            input_text=query,
            max_new_tokens=max_output_length,
        ):
            # If client closes connection, stop sending events
            if await request.is_disconnected():
                break
            # Checks for new messages and return them to client if any
            try:
                ret = {
                    "text": new_text[0],
                }
                yield (json.dumps(ret, ensure_ascii=False)).encode("utf-8")
            except StopIteration:
                await asyncio.sleep(STREAM_DELAY)

    return StreamingResponse(event_generator(query, max_output_length, sampling_config))


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=48000, workers=1)
```

Please note that both scripts above are placed in the directory TensorRT-LLM/examples/baichuan/.

The test code is as follows:

```python
import requests
from http import HTTPStatus
import json
from concurrent.futures import ProcessPoolExecutor

url = "http://xxxxxxxx/stream_chat/"
data = {
    "instruction": "please introduce China",
    "max_new_tokens": 200,
}
headers = {"Content-Type": "application/json"}
response = requests.post(url, json=data, headers=headers)

if response.status_code == HTTPStatus.OK:
    output = ""
    for line in response.iter_lines():
        if line:
            json_data = line.decode("utf-8")
            try:
                json_obj = json.loads(json_data)
                output += json_obj["text"]
            except json.JSONDecodeError:
                continue
    print("[Output]\n", output)
else:
    print("Error:", response.status_code, response.reason)
```

When I execute the test script and send requests serially, the GPU memory allocated for one request is not released after it finishes, and each subsequent request increases GPU memory usage further until OOM.
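
For reference, a simple serial stress loop built around the same endpoint shows the growth request by request. This is only an illustrative sketch based on the test script above; the URL is a placeholder and the request count is arbitrary, and `stream=True` is added so the client consumes the response incrementally:

```python
import json
from http import HTTPStatus

import requests

url = "http://xxxxxxxx/stream_chat/"  # placeholder endpoint, same as in the test script
data = {"instruction": "please introduce China", "max_new_tokens": 200}
headers = {"Content-Type": "application/json"}

# Send requests one after another; GPU memory on the server should stay
# roughly flat between iterations if the session releases its buffers.
for i in range(100):
    response = requests.post(url, json=data, headers=headers, stream=True)
    if response.status_code != HTTPStatus.OK:
        print("Error:", response.status_code, response.reason)
        break
    output = ""
    for line in response.iter_lines():
        if line:
            try:
                output += json.loads(line.decode("utf-8"))["text"]
            except json.JSONDecodeError:
                continue
    print(f"request {i}: {len(output)} chars generated")
```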

cody-moveworks commented 1 year ago

I noticed that you are using the --paged_kv_cache feature when building the TensorRT-LLM engine. I also observed GPU OOM errors in my tests and I found that removing this feature resolved the GPU OOM errors. I opened (and subsequently closed) an issue about this here: https://github.com/NVIDIA/TensorRT-LLM/issues/237

Can you try removing --paged_kv_cache and re-running your script?

xidianwym commented 1 year ago

Thanks for your comment. The issue is resolved by removing this option. However, I need --use_inflight_batching at build time, and when that option is used, --paged_kv_cache is enabled by default. A rebuild without both flags is sketched below.
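
For reference, the rebuild that avoids the OOM would look roughly like the original command with both flags dropped. This is only a sketch derived from the command quoted earlier in the thread (paths remain placeholders), not a command posted by the reporter:

```bash
python build.py \
    --dtype bfloat16 \
    --model_dir /xxxxx \
    --max_batch_size 4 \
    --max_input_len 2048 \
    --max_output_len 2048 \
    --use_gpt_attention_plugin bfloat16 \
    --enable_context_fmha \
    --use_gemm_plugin bfloat16 \
    --output_dir /xxxx
```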

Chevolier commented 11 months ago

I hit the same issue when running tests. Is there any way to solve this while keeping --paged_kv_cache?

litaotju commented 10 months ago

@xidianwym We recently landed memory-leak and memory-usage fixes. Could you try the latest 0.6.1 release or the main branch to see if you still hit the issue?

Also, please note that the paged KV cache is better supported in the C++ runtime. Could you try running the engine through the C++ API or the Python bindings for the C++ runtime? See the doc: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_runtime.md

bnuzhanyu commented 9 months ago

Same issue when running tests; is there any way to solve this with --paged_kv_cache kept? I tested v0.7.1 with Baichuan2-7B, and even without paged_kv_cache I still get OOM after about 50 requests.