NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

triton infer llm #862

lyc728 opened this issue 8 months ago

lyc728 commented 8 months ago

{"error":"in ensemble 'ensemble', Failed to process the request(s) for model instance 'preprocessing_0_0', message: ValueError: invalid literal for int() with base 10: '<|im_end|>'\n\nAt:\n /usr/local/lib/python3.10/dist-packages/numpy/lib/arraypad.py(151): _set_pad_area\n /usr/local/lib/python3.10/dist-packages/numpy/lib/arraypad.py(808): pad\n /data/tensorrtllm_backend12/triton_model_repo/preprocessing/1/model.py(272): \n /data/tensorrtllm_backend12/triton_model_repo/preprocessing/1/model.py(271): _create_request\n /data/tensorrtllm_backend12/triton_model_repo/preprocessing/1/model.py(210): execute\n"

byshiue commented 8 months ago

Please share the steps to reproduce so that we can analyze the cause and debug.

lyc728 commented 8 months ago

I changed the code in preprocessing/1/model.py:

    self.tokenizer.pad_token = self.tokenizer.eos_token

    # self.tokenizer_end_id = self.tokenizer.encode(
    #     self.tokenizer.eos_token, add_special_tokens=False)[0]
    # self.tokenizer_pad_id = self.tokenizer.encode(
    #     self.tokenizer.pad_token, add_special_tokens=False)[0]

    # Parse model output configs and convert Triton types to numpy types
    output_names = [
        "INPUT_ID", "REQUEST_INPUT_LEN", "BAD_WORDS_IDS", "STOP_WORDS_IDS",
        "OUT_END_ID", "OUT_PAD_ID"
    ]
    import os
    gen_config_path = os.path.join(tokenizer_dir, 'generation_config.json')
    with open(gen_config_path, 'r') as f:
        gen_config = json.load(f)
    chat_format = gen_config['chat_format']
    if chat_format == "raw":
        self.eos_id = gen_config['eos_token_id']
        self.pad_id = gen_config['pad_token_id']
    elif chat_format == "chatml":
        self.pad_id = self.eos_id = self.tokenizer.im_end_id
    else:
        raise Exception("unknown chat format ", chat_format)
    eos_token = self.tokenizer.decode(self.eos_id)

    self.tokenizer.eos_token = eos_token
    self.tokenizer.pad_token = eos_token
    self.tokenizer_end_id = eos_token
    self.tokenizer_pad_id = eos_token
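
For reference, the branch on chat_format assumes a generation_config.json with keys along these lines (values are illustrative, not taken from the issue):

    {
        "chat_format": "raw",
        "eos_token_id": 151643,
        "pad_token_id": 151643
    }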

The server now starts successfully, but this request fails:

    curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你好,请问你叫什么?<|im_end|>\n<|im_start|>assistant\n", "max_tokens": 50, "bad_words": "", "stop_words": "", "end_id": [151643], "pad_id": [151643]}'

byshiue commented 8 months ago

You are setting self.tokenizer_pad_id (which should be an int) to a token (a string).
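
That is, the last two assignments in the snippet above store the decoded token text where integer ids are expected. A minimal sketch of the fix, reusing the snippet's own variables (self.eos_id and self.pad_id already hold integer ids):

    eos_token = self.tokenizer.decode(self.eos_id)

    # The tokenizer attributes take the token *string*...
    self.tokenizer.eos_token = eos_token
    self.tokenizer.pad_token = eos_token
    # ...but the *_id fields must stay integer token ids:
    self.tokenizer_end_id = self.eos_id
    self.tokenizer_pad_id = self.pad_id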

lyc728 commented 8 months ago

Thanks for your reply. I fixed self.tokenizer_pad_id, but now I hit a new error:

    [TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: input_ids: expected 2 dims, provided 1 dims (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:138)
    1  0x7f20234697fd /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x177fd) [0x7f20234697fd]
    2  0x7f20235797d8 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1277d8) [0x7f20235797d8]
    3  0x7f202353ac86 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xe8c86) [0x7f202353ac86]
    4  0x7f202353bce4 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xe9ce4) [0x7f202353bce4]
    5  0x7f202353cf69 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xeaf69) [0x7f202353cf69]
    6  0x7f20234e5dc0 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x93dc0) [0x7f20234e5dc0]
    7  0x7f20234bba28 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x69a28) [0x7f20234bba28]
    8  0x7f20234bffb5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6dfb5) [0x7f20234bffb5]
    9  0x7f21a0e4f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f21a0e4f253]
    10 0x7f21a0bdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f21a0bdfac3]
    11 0x7f21a0c70814 clone + 68
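
The assertion means the engine received a 1-D input_ids tensor where the runtime expects 2-D, i.e. [batch_size, seq_len]. A minimal sketch of the expected shape (variable names hypothetical, not from the backend code):

    import numpy as np

    token_ids = [151644, 8948, 198]                    # flat ids, shape (3,)
    input_ids = np.array([token_ids], dtype=np.int32)  # shape (1, 3): [batch, seq]
    assert input_ids.ndim == 2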

byshiue commented 8 months ago

Please share how you built the docker image and which TensorRT-LLM version you used. This issue is often caused by a mismatch between the TensorRT-LLM version used to build the engine and the one in the Triton backend.
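
One quick sanity check (a sketch): print the installed TensorRT-LLM version in both the engine-build environment and the Triton backend container; the two should report the same version.

    # Run in both environments; the reported versions should be identical.
    import tensorrt_llm
    print(tensorrt_llm.__version__)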