NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

System hang when setting penalty #865

Open Linzecong opened 10 months ago

Linzecong commented 10 months ago

Here is my build command.

python build.py --model_dir Yi-34B-Chat --dtype float16 --remove_input_padding  --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --world_size 2 --tp_size 2 --enable_context_fmha  --use_inflight_batching  --paged_kv_cache  --load_by_shard  --use_weight_only  --weight_only_precision int4 --output_dir /app/triton_model/tensorrt_llm/1

The model is Yi-34B with int4 weight-only quantization.

The system gets stuck after running for a while. This only happens when a penalty is set, and only under high concurrency.

Here is my test code.

import random
import threading
import traceback

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

triton_client = grpcclient.InferenceServerClient(url="localhost:14568")

def prepare_tensor(name, input):
    # Wrap a numpy array as a Triton gRPC InferInput with a matching dtype.
    t = grpcclient.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t

def get_output_e2e(raw_text, **kwargs):
    # End-to-end request through the Triton "ensemble" model
    # (tokenize -> tensorrt_llm -> detokenize); returns the generated text.
    global triton_client
    model_name = "ensemble"

    # One prompt per request; the ensemble expects string inputs as object
    # (BYTES) tensors and the sampling knobs as shape [1, 1] numeric tensors.
    input0 = [[raw_text]]
    input0_data = np.array(input0).astype(object)
    output0_len = np.ones_like(input0).astype(np.int32) * kwargs["max_tokens"]
    bad_words_list = np.array([[""]], dtype=object)
    stop_words_list = np.array([kwargs["stop"]], dtype=object)

    top_k_data = np.array([[kwargs["top_k"]]], dtype=np.int32)
    top_p_data = np.array([[kwargs["top_p"]]], dtype=np.float32)
    temperature_data = np.array([[kwargs["temperature"]]], dtype=np.float32)
    repetition_penalty_data = np.array([[kwargs["repetition_penalty"]]], dtype=np.float32)
    presence_penalty_data = np.array([[kwargs["presence_penalty"]]], dtype=np.float32)
    random_seed_data = np.array([[kwargs["random_seed"]]], dtype=np.uint64)

    streaming = [[False]]
    streaming_data = np.array(streaming, dtype=bool)

    inputs = [
        prepare_tensor("text_input", input0_data),
        prepare_tensor("max_tokens", output0_len),
        prepare_tensor("bad_words", bad_words_list),
        prepare_tensor("stop_words", stop_words_list),
        prepare_tensor("stream", streaming_data),

        prepare_tensor("top_k", top_k_data),
        prepare_tensor("top_p", top_p_data),
        prepare_tensor("temperature", temperature_data),
        prepare_tensor("random_seed", random_seed_data),
    ]

    if kwargs["presence_penalty"] != 0:
        inputs.append(prepare_tensor("presence_penalty", presence_penalty_data))
    if kwargs["repetition_penalty"] != 1:
        inputs.append(prepare_tensor("repetition_penalty", repetition_penalty_data))

    retry = 0
    while retry < 3:
        try:
            result = triton_client.infer(model_name, inputs)
            return result.as_numpy('text_output')[0].decode()
        except Exception:
            retry += 1
            print("==============retry: ", retry, "==============")
            traceback.print_exc()
            # Recreate the client if the server itself is no longer responding.
            if not triton_client.is_server_ready():
                triton_client.close()
                triton_client = grpcclient.InferenceServerClient(url="localhost:14568")

if __name__ == "__main__":
    # Launch 100 concurrent requests; the hang only appears under high load.
    def _run(i):
        print(i, get_output_e2e(
            raw_text="The quick brown fox jumps over the lazy dog",
            max_tokens=100,
            top_k=50,
            top_p=0.95,
            temperature=0.8,
            repetition_penalty=1.5,  # hangs when not equal to 1
            presence_penalty=0,      # hangs when not equal to 0
            stop=["\n"],
            random_seed=random.randint(0, 1000000000),
        ))

    threads = []
    for i in range(100):
        threads.append(threading.Thread(target=_run, args=(i,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("all done")

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:56:00.0 Off |                  Off |
| 30%   41C    P2              69W / 450W |  22526MiB / 24564MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:57:00.0 Off |                  Off |
| 30%   39C    P2              85W / 450W |  22524MiB / 24564MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

All requests get stuck.

I also tested the AWQ quantization method and an A30 GPU; both hang as well. Sometimes only one GPU is at 100% utilization while the other is idle. Both 0.6.1 and 0.7.1 have this problem.

That is all the information I have. I am not sure whether this is caused by tensor parallelism. I would appreciate any suggestions for debugging this.
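
On the client side, a cheap way to confirm where the request threads are blocked is Python's built-in faulthandler; this is a generic sketch, nothing TensorRT-LLM specific. On the server side, attaching gdb to the hung tritonserver process and running thread apply all bt should show which call each rank is stuck in.

import faulthandler
import sys

# Dump every client thread's stack to stderr every 60 s; if the hang is
# server-side, the worker threads should all appear blocked inside
# triton_client.infer().
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)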

BasicCoder commented 10 months ago

Possible solutions: #149

nv-guomingz commented 1 week ago

Hi, do you still have any further issues or questions? If not, we'll close this soon.