NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

ModelRunnerCpp.generate throws tensorrt_llm::common::TllmException on the second call #912

Closed: xesdiny closed this issue 8 months ago

xesdiny commented 8 months ago

System Info

- CPU architecture: x86_64
- GPU name: NVIDIA V100
- GPU memory size: 32 GB * 8
- TensorRT-LLM branch: v0.7.1
- TensorRT-LLM commit: 80bc075

Who can help?

@ncomly-nvidia

Information

My goal is to use the pybind-based ModelRunnerCpp through the Triton Python backend to deploy multi-GPU inference. With TP=1, the request path completes successfully end to end. But with a multi-GPU deployment (TP=4), the first request returns successfully, while the second one throws a GEMM error during the executeContextStep stage.
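For context, a minimal standalone sketch (not taken from this issue) of the same call pattern: it drives ModelRunnerCpp.generate twice per rank outside of Triton, launched for example with mpirun -n 4 python repro.py. The engine directory, token ids, and the max_* limits are placeholders that must match the actual engine build; the call pattern mirrors the model.py shown under Reproduction.

# repro.py - hypothetical standalone sketch, not taken from the issue.
# Assumes a TP=4 engine built with TensorRT-LLM v0.7.1; all literal values are placeholders.
import torch
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunnerCpp

ENGINE_DIR = "/path/to/engines"  # placeholder: directory containing config.json and *.engine files
rank = tensorrt_llm.mpi_rank()

# The max_* limits are placeholders and should match builder_config in the engine's config.json.
runner = ModelRunnerCpp.from_dir(engine_dir=ENGINE_DIR,
                                 rank=rank,
                                 max_batch_size=8,
                                 max_input_len=1024,
                                 max_output_len=256,
                                 max_beam_width=1)

# batch_input_ids is a list of 1-D int32 tensors, as expected by ModelRunnerCpp.generate.
batch_input_ids = [torch.tensor([1, 835, 1792, 29901], dtype=torch.int32)]

# Two back-to-back calls: in the TP=4 Triton deployment described above, the second
# call is the one that aborts with CUBLAS_STATUS_INTERNAL_ERROR.
for i in range(2):
    with torch.no_grad():
        outputs = runner.generate(batch_input_ids,
                                  max_new_tokens=16,
                                  end_id=2,
                                  pad_id=2,
                                  return_dict=True)
    torch.cuda.synchronize()
    if rank == 0:
        print(f"call {i}: output_ids shape = {outputs['output_ids'].shape}")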

Tasks

Reproduction

Below are the contents of tensorrt_llm/1/model.py and config.pbtxt:

import json
import os
import threading
import torch
import triton_python_backend_utils as pb_utils
from torch import from_numpy
import torch.nn.functional as F
import tensorrt_llm
from tensorrt_llm.runtime import GenerationSession, ModelConfig, SamplingConfig
from tensorrt_llm.quantization import QuantMode
from tensorrt_llm.runtime import PYTHON_BINDINGS, ModelRunner
if PYTHON_BINDINGS:
    from tensorrt_llm.runtime import ModelRunnerCpp
else:
    print("Python bindings of C++ session is unavailable, fallback to Python session.")
    exit()
import time

def mpi_comm():
    from mpi4py import MPI
    return MPI.COMM_WORLD

def mpi_rank():
    return mpi_comm().Get_rank()

def get_engine_name(model, dtype, tp_size, rank):
    return '{}_{}_tp{}_rank{}.engine'.format(model, dtype, tp_size, rank)

def get_input_tensor_by_name(request, name):
    tensor = pb_utils.get_input_tensor_by_name(request, name)
    if tensor is not None:
        # Triton tensor -> numpy tensor -> PyTorch tensor
        return from_numpy(tensor.as_numpy())
    else:
        return tensor

def get_input_scalar_by_name(request, name):
    tensor = pb_utils.get_input_tensor_by_name(request, name)
    if tensor is not None:
        # Triton tensor -> numpy tensor -> first scalar
        tensor = tensor.as_numpy()
        return tensor.reshape((tensor.size, ))[0]
    else:
        return tensor

class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        model_config = json.loads(args['model_config'])
        engine_dir = model_config['parameters']['engine_dir']['string_value']
        config_path = os.path.join(engine_dir, 'config.json')
        with open(config_path, 'r') as f:
            config = json.load(f)

        self.remove_input_padding = config['plugin_config'][
            'remove_input_padding']

        tensor_parallel = config['builder_config']['tensor_parallel']
        pipeline_parallel = 1
        if 'pipeline_parallel' in config['builder_config']:
            pipeline_parallel = config['builder_config']['pipeline_parallel']
        world_size = tensor_parallel * pipeline_parallel
        assert world_size == tensorrt_llm.mpi_world_size(), \
            f'Engine world size ({world_size}) != Runtime world size ({tensorrt_llm.mpi_world_size()})'
        if 'max_attention_window_size' in config['builder_config']:
            self.max_attention_window_size = int(config['builder_config']['max_attention_window_size'])
        else:
            self.max_attention_window_size = 4096

        self.comm = mpi_comm()
        self.rank = mpi_rank()

        runner_cls = ModelRunnerCpp
        runner_kwargs = dict(engine_dir=engine_dir,
                             lora_dir=None,
                             rank=self.rank,
                             debug_mode=False,
                             lora_ckpt_source=None)
        runner_kwargs.update(
            max_batch_size=int(config['builder_config']['max_batch_size']),
            max_input_len=int(config['builder_config']['max_input_len']),
            max_output_len=int(config['builder_config']['max_output_len']),
            max_beam_width=int(config['builder_config']['max_beam_width']),
            max_attention_window_size=self.max_attention_window_size)
        self.runner = runner_cls.from_dir(**runner_kwargs)

        self.cuda_device = torch.device(self.runner.session.device)
        self.inflight_thread_count = 0
        self.inflight_thread_count_lck = threading.Lock()
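        # Non-zero ranks never receive Triton requests directly; they spin here and
        # pick up the inputs broadcast from rank 0 inside process_request().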
        if self.rank != 0:
            while (True):
                self.execute([None])

    def execute(self, requests):
        """`execute` must be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference is requested
        for this model.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        responses = []

        # Every Python backend must iterate through list of requests and create
        # an instance of pb_utils.InferenceResponse class for each of them. You
        # should avoid storing any of the input Tensors in the class attributes
        # as they will be overridden in subsequent inference requests. You can
        # make a copy of the underlying NumPy array and store it if it is
        # required.
        # for request in requests:
        #     # Perform inference on the request and append it to responses list...
        for request in requests:
            print(f"{self.rank=} get request!~~~~!")
            self.process_request(request)
        # You must return a list of pb_utils.InferenceResponse. Length
        # of this list must match the length of `requests` list.
        return None

    def process_request(self, request):
        # parse request
        inputs = {}
        if self.rank == 0:
            inputs['input_ids'] = get_input_tensor_by_name(
                request, 'input_ids')
            inputs['input_lengths'] = get_input_tensor_by_name(
                request, 'input_lengths')
            inputs['request_output_len'] = get_input_scalar_by_name(
                request, 'request_output_len')
            inputs['end_id'] = get_input_scalar_by_name(request, 'end_id')
            inputs['pad_id'] = get_input_scalar_by_name(request, 'pad_id')
            inputs['beam_width'] = get_input_scalar_by_name(
                request, 'beam_width')
            inputs['temperature'] = get_input_scalar_by_name(
                request, 'temperature')
            inputs['runtime_top_k'] = get_input_scalar_by_name(
                request, 'runtime_top_k')
            inputs['runtime_top_p'] = get_input_scalar_by_name(
                request, 'runtime_top_p')
            inputs['len_penalty'] = get_input_scalar_by_name(
                request, 'len_penalty')
            inputs['repetition_penalty'] = get_input_scalar_by_name(
                request, 'repetition_penalty')
            inputs['min_length'] = get_input_scalar_by_name(
                request, 'min_length')
            inputs['presence_penalty'] = get_input_scalar_by_name(
                request, 'presence_penalty')
            inputs['random_seed'] = get_input_scalar_by_name(
                request, 'random_seed')
            inputs['output_log_probs'] = get_input_scalar_by_name(
                request, 'output_log_probs')
            inputs['stop_words_list'] = get_input_tensor_by_name(
                request, 'stop_words_list')
            inputs['bad_words_list'] = get_input_tensor_by_name(
                request, 'bad_words_list')
        # Broadcast requests to other clients
        inputs = self.comm.bcast(inputs, root=0)
        print(f"{self.rank=} {inputs=}")
        # Start a separate thread to send the responses for the request. The
        # sending back the responses is delegated to this thread.
        thread = threading.Thread(
            target=self.response_thread,
            args=(
                request.get_response_sender() if self.rank == 0 else None,
                inputs,
            ),
        )
        # A model using decoupled transaction policy is not required to send all
        # responses for the current request before returning from the execute.
        # To demonstrate the flexibility of the decoupled API, we are running
        # response thread entirely independent of the execute thread.
        thread.daemon = True

        with self.inflight_thread_count_lck:
            self.inflight_thread_count += 1

        thread.start()

    def response_thread(self, response_sender, inputs):
        # The response_sender is used to send response(s) associated with the
        # corresponding request.

        input_ids = inputs['input_ids'].to(self.cuda_device)
        input_lengths = inputs['input_lengths']
        end_id = inputs['end_id']
        pad_id = inputs['pad_id']

        sampling_config = SamplingConfig(end_id=end_id, pad_id=pad_id, return_dict=True)
        if inputs['request_output_len'] is not None:
            sampling_config.max_new_tokens = inputs['request_output_len']

        if inputs['beam_width'] is not None:
            sampling_config.num_beams = inputs['beam_width']
        if inputs['temperature'] is not None:
            sampling_config.temperature = inputs['temperature']

        if inputs['runtime_top_k'] is not None:
            sampling_config.top_k = inputs['runtime_top_k']

        if inputs['runtime_top_p'] is not None:
            sampling_config.top_p = inputs['runtime_top_p']

        if inputs['len_penalty'] is not None:
            sampling_config.length_penalty = inputs['len_penalty']

        if inputs['repetition_penalty'] is not None:
            sampling_config.repetition_penalty = inputs[
                'repetition_penalty']
        if inputs['min_length'] is not None:
            sampling_config.min_length = inputs['min_length']
        if inputs['presence_penalty'] is not None:
            sampling_config.presence_penalty = inputs['presence_penalty']
        if inputs['stop_words_list'] is not None:
            sampling_config.stop_words_list = inputs['stop_words_list'].to(self.cuda_device)
        if inputs['bad_words_list'] is not None:
            sampling_config.bad_words_list = inputs['bad_words_list']
        sampling_config.random_seed = inputs['random_seed']
        sampling_config.output_log_probs = inputs['output_log_probs']
        sampling_config.output_log_probs = True

        sampling_config.output_sequence_lengths = True
        sampling_config.max_attention_window_size = self.max_attention_window_size

        print(f'{self.rank=} {input_ids=} {sampling_config.stop_words_list=}')
        print(f'{self.rank=} {sampling_config=}')

        with torch.no_grad():
            output_dict = self.runner.generate(
                [input_ids],
                sampling_config=sampling_config,
                max_new_tokens=sampling_config.max_new_tokens,
                max_attention_window_size=self.max_attention_window_size,
                end_id=end_id,
                pad_id=pad_id,
                temperature=sampling_config.temperature,
                top_k=sampling_config.top_k,
                top_p=sampling_config.top_p,
                num_beams=sampling_config.num_beams,
                length_penalty=sampling_config.length_penalty,
                repetition_penalty=sampling_config.repetition_penalty,
                stop_words_list=sampling_config.stop_words_list,
                bad_words_list=sampling_config.bad_words_list,
                lora_uids=None,
                prompt_table_path=None,
                prompt_tasks=None,
                streaming=False,
                output_sequence_lengths=sampling_config.output_sequence_lengths,
                return_dict=sampling_config.return_dict)
            torch.cuda.synchronize()
        print(f'{self.rank=} {output_dict["output_ids"]=}')

        if self.rank == 0:
            # Create output tensors. You need pb_utils.Tensor
            # objects to create pb_utils.InferenceResponse.
            output_tensors = [
                pb_utils.Tensor("output_ids",
                                output_dict['output_ids'].cpu().numpy())
            ]

            if sampling_config.output_log_probs:

                context_logits = torch.stack(output_dict['context_logits']).cpu()
                logits = F.log_softmax(context_logits, dim=-1)
                detach_input_ids = input_ids.detach().unsqueeze(-1).to(torch.long).cpu()  # [b,c] -> [b,c,1]
                logits_score = torch.gather(logits[:, :- 1, :], -1, detach_input_ids[:, 1:input_lengths]).squeeze(-1)
                output_tensors.append(
                    pb_utils.Tensor("context_log_probs", logits_score.numpy()))
            if sampling_config.output_sequence_lengths:
                print(f"{sampling_config.output_sequence_lengths=} {output_dict['sequence_lengths']=}")
                output_tensors.append(
                    pb_utils.Tensor("sequence_lengths", output_dict['sequence_lengths'].cpu().numpy()))
            # Create InferenceResponse. You can set an error here in case
            # there was a problem with handling this inference request.
            # Below is an example of how you can set errors in inference
            # response:

            inference_response = pb_utils.InferenceResponse(output_tensors)

            # We must close the response sender to indicate to Triton that we are
            # done sending responses for the corresponding request. We can't use the
            # response sender after closing it. The response sender is closed by
            # setting the TRITONSERVER_RESPONSE_COMPLETE_FINAL.
            if response_sender is not None:
                response_sender.send(inference_response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)

        with self.inflight_thread_count_lck:
            self.inflight_thread_count -= 1

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean ups before exit.
        Here we will wait for all response threads to complete sending
        responses.
        """

        print("Finalize invoked")

        inflight_threads = True
        cycles = 0
        logging_time_sec = 5
        sleep_time_sec = 0.1
        cycle_to_log = logging_time_sec / sleep_time_sec
        while inflight_threads:
            with self.inflight_thread_count_lck:
                inflight_threads = self.inflight_thread_count != 0
                if cycles % cycle_to_log == 0:
                    print(
                        f"Waiting for {self.inflight_thread_count} response threads to complete..."
                    )
            if inflight_threads:
                time.sleep(sleep_time_sec)
                cycles += 1

        print("Finalize complete...")

config.pbtxt

name: "tensorrt_llm"
backend: "python"
max_batch_size: 8

model_transaction_policy {
  decoupled: True
}
# Uncomment this for dynamic_batching
dynamic_batching {
   max_queue_delay_microseconds: 50000
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "pad_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "output_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_lengths"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "context_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind : KIND_CPU
  }
]
parameters {
  key: "engine_dir"
  value: {
    string_value: "${engine_dir}"
  }
}
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "no"
  }
}
parameters: {
  key: "enable_trt_overlap"
  value: {
    string_value: "False"
  }
}

Expected behavior

self.rank=0 inputs={'input_ids': tensor([[    1,   835,  1792, 29901, 32284, 32001, 32577, 30214, 38922, 33750,
         30210, 32636, 33977, 30267,    13,  2277, 29937,  7451, 29901]],
       dtype=torch.int32), 'input_lengths': tensor([19], dtype=torch.int32), 'request_output_len': 256, 'end_id': 2, 'pad_id': 2, 'beam_width': 1, 'temperature': 0.95, 'runtime_top_k': 50, 'runtime_top_p': 0.7, 'len_penalty': 0.0, 'repetition_penalty': 1.0, 'min_length': None, 'presence_penalty': None, 'random_seed': 12, 'output_log_probs': False, 'stop_words_list': tensor([[[ 835,  355,  835, 1792,  835, 7451],
         [   2,    4,    6,   -1,   -1,   -1]]], dtype=torch.int32), 'bad_words_list': None}
self.rank=1 inputs={'input_ids': tensor([[    1,   835,  1792, 29901, 32284, 32001, 32577, 30214, 38922, 33750,
         30210, 32636, 33977, 30267,    13,  2277, 29937,  7451, 29901]],
       dtype=torch.int32), 'input_lengths': tensor([19], dtype=torch.int32), 'request_output_len': 256, 'end_id': 2, 'pad_id': 2, 'beam_width': 1, 'temperature': 0.95, 'runtime_top_k': 50, 'runtime_top_p': 0.7, 'len_penalty': 0.0, 'repetition_penalty': 1.0, 'min_length': None, 'presence_penalty': None, 'random_seed': 12, 'output_log_probs': False, 'stop_words_list': tensor([[[ 835,  355,  835, 1792,  835, 7451],
         [   2,    4,    6,   -1,   -1,   -1]]], dtype=torch.int32), 'bad_words_list': None}
self.rank=2 inputs={'input_ids': tensor([[    1,   835,  1792, 29901, 32284, 32001, 32577, 30214, 38922, 33750,
         30210, 32636, 33977, 30267,    13,  2277, 29937,  7451, 29901]],
       dtype=torch.int32), 'input_lengths': tensor([19], dtype=torch.int32), 'request_output_len': 256, 'end_id': 2, 'pad_id': 2, 'beam_width': 1, 'temperature': 0.95, 'runtime_top_k': 50, 'runtime_top_p': 0.7, 'len_penalty': 0.0, 'repetition_penalty': 1.0, 'min_length': None, 'presence_penalty': None, 'random_seed': 12, 'output_log_probs': False, 'stop_words_list': tensor([[[ 835,  355,  835, 1792,  835, 7451],
         [   2,    4,    6,   -1,   -1,   -1]]], dtype=torch.int32), 'bad_words_list': None}
self.rank=3 inputs={'input_ids': tensor([[    1,   835,  1792, 29901, 32284, 32001, 32577, 30214, 38922, 33750,
         30210, 32636, 33977, 30267,    13,  2277, 29937,  7451, 29901]],
       dtype=torch.int32), 'input_lengths': tensor([19], dtype=torch.int32), 'request_output_len': 256, 'end_id': 2, 'pad_id': 2, 'beam_width': 1, 'temperature': 0.95, 'runtime_top_k': 50, 'runtime_top_p': 0.7, 'len_penalty': 0.0, 'repetition_penalty': 1.0, 'min_length': None, 'presence_penalty': None, 'random_seed': 12, 'output_log_probs': False, 'stop_words_list': tensor([[[ 835,  355,  835, 1792,  835, 7451],
         [   2,    4,    6,   -1,   -1,   -1]]], dtype=torch.int32), 'bad_words_list': None}
self.rank=1 get request!~~~~!
self.rank=2 get request!~~~~!
self.rank=3 get request!~~~~!
self.rank=0 input_ids=tensor([[    1,   835,  1792, 29901, 32284, 32001, 32577, 30214, 38922, 33750,
         30210, 32636, 33977, 30267,    13,  2277, 29937,  7451, 29901]],
       device='cuda:0', dtype=torch.int32) sampling_config.stop_words_list=None
self.rank=0 sampling_config=SamplingConfig(end_id=2, pad_id=2, max_new_tokens=1, num_beams=1, max_attention_window_size=4096, output_sequence_lengths=True, return_dict=True, stop_words_list=None, bad_words_list=None, temperature=1.0, top_k=1, top_p=0.0, top_p_decay=None, top_p_min=None, top_p_reset_ids=None, length_penalty=1.0, repetition_penalty=1.0, min_length=1, presence_penalty=0.0, use_beam_hyps=True, beam_search_diversity_rate=0.0, random_seed=None, output_cum_log_probs=False, output_log_probs=True)
batch_input_ids=tensor([[    1,   835,  1792, 29901, 32284, 32001, 32577, 30214, 38922, 33750,
         30210, 32636, 33977, 30267,    13,  2277, 29937,  7451, 29901]],
       device='cuda:0', dtype=torch.int32) input_lengths=tensor([19], device='cuda:0', dtype=torch.int32)
sampling_config.stop_words_list=None
self.rank=1 input_ids=tensor([[    1,   835,  1792, 29901, 32284, 32001, 32577, 30214, 38922, 33750,
         30210, 32636, 33977, 30267,    13,  2277, 29937,  7451, 29901]],
       device='cuda:1', dtype=torch.int32) sampling_config.stop_words_list=None
self.rank=3 input_ids=tensor([[    1,   835,  1792, 29901, 32284, 32001, 32577, 30214, 38922, 33750,
         30210, 32636, 33977, 30267,    13,  2277, 29937,  7451, 29901]],
       device='cuda:3', dtype=torch.int32) sampling_config.stop_words_list=None
self.rank=1 sampling_config=SamplingConfig(end_id=2, pad_id=2, max_new_tokens=1, num_beams=1, max_attention_window_size=4096, output_sequence_lengths=True, return_dict=True, stop_words_list=None, bad_words_list=None, temperature=1.0, top_k=1, top_p=0.0, top_p_decay=None, top_p_min=None, top_p_reset_ids=None, length_penalty=1.0, repetition_penalty=1.0, min_length=1, presence_penalty=0.0, use_beam_hyps=True, beam_search_diversity_rate=0.0, random_seed=None, output_cum_log_probs=False, output_log_probs=True)
self.rank=3 sampling_config=SamplingConfig(end_id=2, pad_id=2, max_new_tokens=1, num_beams=1, max_attention_window_size=4096, output_sequence_lengths=True, return_dict=True, stop_words_list=None, bad_words_list=None, temperature=1.0, top_k=1, top_p=0.0, top_p_decay=None, top_p_min=None, top_p_reset_ids=None, length_penalty=1.0, repetition_penalty=1.0, min_length=1, presence_penalty=0.0, use_beam_hyps=True, beam_search_diversity_rate=0.0, random_seed=None, output_cum_log_probs=False, output_log_probs=True)
self.rank=2 input_ids=tensor([[    1,   835,  1792, 29901, 32284, 32001, 32577, 30214, 38922, 33750,
         30210, 32636, 33977, 30267,    13,  2277, 29937,  7451, 29901]],
       device='cuda:2', dtype=torch.int32) sampling_config.stop_words_list=None
self.rank=2 sampling_config=SamplingConfig(end_id=2, pad_id=2, max_new_tokens=1, num_beams=1, max_attention_window_size=4096, output_sequence_lengths=True, return_dict=True, stop_words_list=None, bad_words_list=None, temperature=1.0, top_k=1, top_p=0.0, top_p_decay=None, top_p_min=None, top_p_reset_ids=None, length_penalty=1.0, repetition_penalty=1.0, min_length=1, presence_penalty=0.0, use_beam_hyps=True, beam_search_diversity_rate=0.0, random_seed=None, output_cum_log_probs=False, output_log_probs=True)
batch_input_ids=tensor([[    1,   835,  1792, 29901, 32284, 32001, 32577, 30214, 38922, 33750,
         30210, 32636, 33977, 30267,    13,  2277, 29937,  7451, 29901]],
       device='cuda:3', dtype=torch.int32) input_lengths=tensor([19], device='cuda:3', dtype=torch.int32)
sampling_config.stop_words_list=None
batch_input_ids=tensor([[    1,   835,  1792, 29901, 32284, 32001, 32577, 30214, 38922, 33750,
         30210, 32636, 33977, 30267,    13,  2277, 29937,  7451, 29901]],
       device='cuda:1', dtype=torch.int32) input_lengths=tensor([19], device='cuda:1', dtype=torch.int32)
sampling_config.stop_words_list=None
batch_input_ids=tensor([[    1,   835,  1792, 29901, 32284, 32001, 32577, 30214, 38922, 33750,
         30210, 32636, 33977, 30267,    13,  2277, 29937,  7451, 29901]],
       device='cuda:2', dtype=torch.int32) input_lengths=tensor([19], device='cuda:2', dtype=torch.int32)
sampling_config.stop_words_list=None
sampling_config.output_sequence_lengths=True output_dict['sequence_lengths']=tensor([[19]], device='cuda:0', dtype=torch.int32)

actual behavior

When making the second request call:

...
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] CUDA runtime error in cublasLtMatmul(getCublasLtHandle(), mOperationDesc, alpha, A, mADesc, B, mBDesc, beta, C, mCDesc, C, mCDesc, (hasAlgo ? (&algo) : NULL), mCublasWorkspace, workspaceSize, mStream): CUBLAS_STATUS_INTERNAL_ERROR (/app/tensorrt_llm/cpp/tensorrt_llm/common/cublasMMWrapper.cpp:140)
1       0x7f3cc297818e /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xae18e) [0x7f3cc297818e]
2       0x7f3cc29d4fa5 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x10afa5) [0x7f3cc29d4fa5]
3       0x7f3cc29d535b /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x10b35b) [0x7f3cc29d535b]
4       0x7f3cc299e411 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xd4411) [0x7f3cc299e411]
5       0x7f3cc299ecc7 tensorrt_llm::plugins::GemmPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 263
6       0x7f3d110b6ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7f3d110b6ba9]
7       0x7f3d1108c6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7f3d1108c6af]
8       0x7f3d1108e320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7f3d1108e320]
9       0x7f3d09760747 tensorrt_llm::runtime::GptSession::executeContextStep(std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, std::vector<int, std::allocator<int> > const&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const*) + 951
10      0x7f3d09761744 tensorrt_llm::runtime::GptSession::generateBatched(std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, tensorrt_llm::runtime::SamplingConfig const&, std::function<void (int, bool)> const&) + 2772
11      0x7f3d097629c9 tensorrt_llm::runtime::GptSession::generate(tensorrt_llm::runtime::GenerationOutput&, tensorrt_llm::runtime::GenerationInput const&, tensorrt_llm::runtime::SamplingConfig const&) + 3097
12      0x7f3d096fc6f9 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xc86f9) [0x7f3d096fc6f9]
13      0x7f3d096e4e5e /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb0e5e) [0x7f3d096e4e5e]
14      0x7f3e72151023 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x128023) [0x7f3e72151023]
15      0x7f3e72108adc _PyObject_MakeTpCall + 140
16      0x7f3e7210b41a /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe241a) [0x7f3e7210b41a]
17      0x7f3e720a49c8 _PyEval_EvalFrameDefault + 40296
18      0x7f3e721eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f3e721eb3af]
19      0x7f3e7210b3d8 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe23d8) [0x7f3e7210b3d8]
20      0x7f3e7210aed8 PyVectorcall_Call + 168
21      0x7f3e7209f776 _PyEval_EvalFrameDefault + 19222
22      0x7f3e721eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f3e721eb3af]
23      0x7f3e7210b358 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe2358) [0x7f3e7210b358]
24      0x7f3e7209f776 _PyEval_EvalFrameDefault + 19222
25      0x7f3e721eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f3e721eb3af]
26      0x7f3e720a2efe _PyEval_EvalFrameDefault + 33438
27      0x7f3e721eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f3e721eb3af]
28      0x7f3e720a2efe _PyEval_EvalFrameDefault + 33438
29      0x7f3e721eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f3e721eb3af]
30      0x7f3e7210b46c /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe246c) [0x7f3e7210b46c]
31      0x7f3e722ae7bd /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0  what():  [TensorRT-LLM][ERROR] CUDA runtime error in cublasLtMatmul(getCublasLtHandle(), mOperationDesc, alpha, A, mADesc, B, mBDesc, beta, C, mCDesc, C, mCDesc, (hasAlgo ? (&algo) : NULL), mCublasWorkspace, workspaceSize, mStream): CUBLAS_STATUS_INTERNAL_ERROR (/app/tensorrt_llm/cpp/tensorrt_llm/common/cublasMMWrapper.cpp:140)
1       0x7f525277818e /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xae18e) [0x7f525277818e]
2       0x7f52527d4fa5 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x10afa5) [0x7f52527d4fa5]
3       0x7f52527d535b /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x10b35b) [0x7f52527d535b]
4       0x7f525279e411 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xd4411) [0x7f525279e411]
5       0x7f525279ecc7 tensorrt_llm::plugins::GemmPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 263
6       0x7f52a0eb6ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7f52a0eb6ba9]
7       0x7f52a0e8c6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7f52a0e8c6af]
8       0x7f52a0e8e320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7f52a0e8e320]
9       0x7f5299560747 tensorrt_llm::runtime::GptSession::executeContextStep(std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, std::vector<int, std::allocator<int> > const&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const*) + 951
10      0x7f5299561744 tensorrt_llm::runtime::GptSession::generateBatched(std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, tensorrt_llm::runtime::SamplingConfig const&, std::function<void (int, bool)> const&) + 2772
11      0x7f52995629c9 tensorrt_llm::runtime::GptSession::generate(tensorrt_llm::runtime::GenerationOutput&, tensorrt_llm::runtime::GenerationInput const&, tensorrt_llm::runtime::SamplingConfig const&) + 3097
12      0x7f52994fc6f9 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xc86f9) [0x7f52994fc6f9]
13      0x7f52994e4e5e /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb0e5e) [0x7f52994e4e5e]
14      0x7f5401d51023 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x128023) [0x7f5401d51023]
15      0x7f5401d08adc _PyObject_MakeTpCall + 140
16      0x7f5401d0b41a /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe241a) [0x7f5401d0b41a]
17      0x7f5401ca49c8 _PyEval_EvalFrameDefault + 40296
18      0x7f5401deb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f5401deb3af]
19      0x7f5401d0b3d8 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe23d8) [0x7f5401d0b3d8]
20      0x7f5401d0aed8 PyVectorcall_Call + 168
21      0x7f5401c9f776 _PyEval_EvalFrameDefault + 19222
22      0x7f5401deb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f5401deb3af]
23      0x7f5401d0b358 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe2358) [0x7f5401d0b358]
24      0x7f5401c9f776 _PyEval_EvalFrameDefault + 19222
25      0x7f5401deb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f5401deb3af]
26      0x7f5401ca2efe _PyEval_EvalFrameDefault + 33438
27      0x7f5401deb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f5401deb3af]
28      0x7f5401ca2efe _PyEval_EvalFrameDefault + 33438
29      0x7f5401deb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f5401deb3af]
30      0x7f5401d0b46c /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe246c) [0x7f5401d0b46c]
31      0x7f5401eae7bd /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x2857bd) [0x7f5401eae7bd]
32      0x7f5401e4058b /usr/lib/x86_64-linux-gnu/libpython3x2857bd) [0x7f3e722ae7bd]
32      0x7f3e7224058b /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x21758b) [0x7f3e7224058b]
33      0x7f3e71c69ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f3e71c69ac3]
34      0x7f3e71cfabf4 clone + 68
.10.so.1.0(+0x21758b) [0x7f5401e4058b]
33      0x7f5401869ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f5401869ac3]
34      0x7f54018fabf4 clone + 68
[dev-aigc-20:31438] *** Process received signal ***
[dev-aigc-20:31438] Signal: Aborted (6)
[dev-aigc-20:31438] Signal code:  (-6)
[dev-aigc-20:31471] *** Process received signal ***
[dev-aigc-20:31471] Signal: Aborted (6)
[dev-aigc-20:31471] Signal code:  (-6)
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] CUDA runtime error in cublasLtMatmul(getCublasLtHandle(), mOperationDesc, alpha, A, mADesc, B, mBDesc, beta, C, mCDesc, C, mCDesc, (hasAlgo ? (&algo) : NULL), mCublasWorkspace, workspaceSize, mStream): CUBLAS_STATUS_INTERNAL_ERROR (/app/tensorrt_llm/cpp/tensorrt_llm/common/cublasMMWrapper.cpp:140)
1       0x7f707937818e /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xae18e) [0x7f707937818e]
2       0x7f70793d4fa5 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x10afa5) [0x7f70793d4fa5]
3       0x7f70793d535b /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x10b35b) [0x7f70793d535b]
4       0x7f707939e411 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xd4411) [0x7f707939e411]
5       0x7f707939ecc7 tensorrt_llm::plugins::GemmPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 263
6       0x7f70c7ab6ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7f70c7ab6ba9]
7       0x7f70c7a8c6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7f70c7a8c6af]
8       0x7f70c7a8e320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7f70c7a8e320]
9       0x7f70c0160747 tensorrt_llm::runtime::GptSession::executeContextStep(std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, std::vector<int, std::allocator<int> > const&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const*) + 951
10      0x7f70c0161744 tensorrt_llm::runtime::GptSession::generateBatched(std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, tensorrt_llm::runtime::SamplingConfig const&, std::function<void (int, bool)> const&) + 2772
11      0x7f70c01629c9 tensorrt_llm::runtime::GptSession::generate(tensorrt_llm::runtime::GenerationOutput&, tensorrt_llm::runtime::GenerationInput const&, tensorrt_llm::runtime::SamplingConfig const&) + 3097
12      0x7f70c00fc6f9 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xc86f9) [0x7f70c00fc6f9]
13      0x7f70c00e4e5e /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb0e5e) [0x7f70c00e4e5e]
14      0x7f7228951023 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x128023) [0x7f7228951023]
15      0x7f7228908adc _PyObject_MakeTpCall + 140
16      0x7f722890b41a /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe241a) [0x7f722890b41a]
17      0x7f72288a49c8 _PyEval_EvalFrameDefault + 40296
18      0x7f72289eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f72289eb3af]
19      0x7f722890b3d8 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe23d8) [0x7f722890b3d8]
20      0x7f722890aed8 PyVectorcall_Call + 168
21      0x7f722889f776 _PyEval_EvalFrameDefault + 19222
22      0x7f72289eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f72289eb3af]
23      0x7f722890b358 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe2358) [0x7f722890b358]
24      0x7f722889f776 _PyEval_EvalFrameDefault + 19222
25      0x7f72289eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f72289eb3af]
26      0x7f72288a2efe _PyEval_EvalFrameDefault + 33438
27      0x7f72289eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f72289eb3af]
28      0x7f72288a2efe _PyEval_EvalFrameDefault + 33438
29      0x7f72289eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f72289eb3af]
30      0x7f722890b46c /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe246c) [0x7f722890b46c]
31      0x7f7228aae7bd /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x2857bd) [0x7f7228aae7bd]
32      0x7f7228a4058b /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x21758b) [0x7f7228a4058b]
33      0x7f7228469ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f7228469ac3]
34      0x7f72284fabf4 clone + 68
[dev-aigc-20:31439] *** Process received signal ***
[dev-aigc-20:31439] Signal: Aborted (6)
[dev-aigc-20:31439] Signal code:  (-6)
[dev-aigc-20:31438] [ 0] [dev-aigc-20:31471] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f5401817520]
[dev-aigc-20:31438] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f3e71c17520]
[dev-aigc-20:31471] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f540186b9fc]
[dev-aigc-20:31438] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f3e71c6b9fc]
[dev-aigc-20:31471] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f5401817476]
[dev-aigc-20:31438] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f3e71c17476]
[dev-aigc-20:31471] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f54017fd7f3]
[dev-aigc-20:31438] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f3e71bfd7f3]
[dev-aigc-20:31471] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f3e71e9fb9e]
[dev-aigc-20:31471] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f5401a9fb9e]
[dev-aigc-20:31438] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f3e71eab20c]
[dev-aigc-20:31471] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f5401aab20c]
[dev-aigc-20:31438] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f3e71eaa1e9]
[dev-aigc-20:31471] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f5401aaa1e9]
[dev-aigc-20:31438] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f3e71eaa959]
[dev-aigc-20:31471] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f5401aaa959]
[dev-aigc-20:31438] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f54025e9884]
/usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f3e728d0884]
[dev-aigc-20:31471] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7f3e728d12dd]
[dev-aigc-20:31438] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7f54025ea2dd]
[dev-aigc-20:31438] [10] [dev-aigc-20:31471] [10] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x4b109)[0x7f3cc2915109]
[dev-aigc-20:31471] [11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x4b109)[0x7f5252715109]
[dev-aigc-20:31438] [11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x10b35b)[0x7f3cc29d535b]
[dev-aigc-20:31471] [12] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x10b35b)[0x7f52527d535b]
[dev-aigc-20:31438] [12] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xd4411)[0x7f3cc299e411]
[dev-aigc-20:31471] [13] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xd4411)[0x7f525279e411]
[dev-aigc-20:31438] [13] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins10GemmPlugin7enqueueEPKN8nvinfer116PluginTensorDescES5_PKPKvPKPvSA_P11CUstream_st+0x107)[0x7f525279ecc7]
[dev-aigc-20:31438] [14] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9)[0x7f52a0eb6ba9]
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins10GemmPlugin7enqueueEPKN8nvinfer116PluginTensorDescES5_PKPKvPKPvSA_P11CUstream_st+0x107)[0x7f3cc299ecc7]
[dev-aigc-20:31471] [14] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9)[0x7f3d110b6ba9]
[dev-aigc-20:31471] [15] [dev-aigc-20:31438] [15] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af)[0x7f52a0e8c6af]
[dev-aigc-20:31438] [16] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320)[0x7f52a0e8e320]
[dev-aigc-20:31438] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af)[0x7f3d1108c6af]
[dev-aigc-20:31471] [16] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320)[0x7f3d1108e320]
[dev-aigc-20:31471] [17] [17] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(_ZN12tensorrt_llm7runtime10GptSession18executeContextStepERKSt6vectorINS0_15GenerationInputESaIS3_EERKS2_IiSaIiEEPKNS_13batch_manager16kv_cache_manager14KVCacheManagerE+0x3b7)[0x7f5299560747]
[dev-aigc-20:31438] [18] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(_ZN12tensorrt_llm7runtime10GptSession18executeContextStepERKSt6vectorINS0_15GenerationInputESaIS3_EERKS2_IiSaIiEEPKNS_13batch_manager16kv_cache_manager14KVCacheManagerE+0x3b7)[0x7f3d09760747]
[dev-aigc-20:31471] [18] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(_ZN12tensorrt_llm7runtime10GptSession15generateBatchedERSt6vectorINS0_16GenerationOutputESaIS3_EERKS2_INS0_15GenerationInputESaIS7_EERKNS0_14SamplingConfigERKSt8functionIFvibEE+0xad4)[0x7f5299561744]
[dev-aigc-20:31438] [19] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(_ZN12tensorrt_llm7runtime10GptSession15generateBatchedERSt6vectorINS0_16GenerationOutputESaIS3_EERKS2_INS0_15GenerationInputESaIS7_EERKNS0_14SamplingConfigERKSt8functionIFvibEE+0xad4)[0x7f3d09761744]
[dev-aigc-20:31471] [19] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(_ZN12tensorrt_llm7runtime10GptSession8generateERNS0_16GenerationOutputERKNS0_15GenerationInputERKNS0_14SamplingConfigE+0xc19)[0x7f3d097629c9]
[dev-aigc-20:31471] [20] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(_ZN12tensorrt_llm7runtime10GptSession8generateERNS0_16GenerationOutputERKNS0_15GenerationInputERKNS0_14SamplingConfigE+0xc19)[0x7f52995629c9]
[dev-aigc-20:31438] [20] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xc86f9)[0x7f3d096fc6f9]
[dev-aigc-20:31471] [21] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xc86f9)[0x7f52994fc6f9]
[dev-aigc-20:31438] [21] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb0e5e)[0x7f52994e4e5e]
[dev-aigc-20:31438] [22] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb0e5e)[0x7f3d096e4e5e]
[dev-aigc-20:31471] [22] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x128023)[0x7f5401d51023]
[dev-aigc-20:31438] [23] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x128023)[0x7f3e72151023]
[dev-aigc-20:31471] [23] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(_PyObject_MakeTpCall+0x8c)[0x7f5401d08adc]
[dev-aigc-20:31438] [24] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(_PyObject_MakeTpCall+0x8c)[0x7f3e72108adc]
[dev-aigc-20:31471] [24] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe241a)[0x7f3e7210b41a]
[dev-aigc-20:31471] [25] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe241a)[0x7f5401d0b41a]
[dev-aigc-20:31438] [25] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x9d68)[0x7f3e720a49c8]
[dev-aigc-20:31471] [26] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x9d68)[0x7f5401ca49c8]
[dev-aigc-20:31438] [26] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af)[0x7f5401deb3af]
[dev-aigc-20:31438] [27] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af)[0x7f3e721eb3af]
[dev-aigc-20:31471] [27] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe23d8)[0x7f3e7210b3d8]
[dev-aigc-20:31471] [28] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe23d8)[0x7f5401d0b3d8]
[dev-aigc-20:31438] [28] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(PyVectorcall_Call+0xa8)[0x7f5401d0aed8]
[dev-aigc-20:31438] [29] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(PyVectorcall_Call+0xa8)[0x7f3e7210aed8]
[dev-aigc-20:31471] [29] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x4b16)[0x7f5401c9f776]
[dev-aigc-20:31438] *** End of error message ***
/usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x4b16)[0x7f3e7209f776]
[dev-aigc-20:31471] *** End of error message ***
[dev-aigc-20:31439] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f7228417520]
[dev-aigc-20:31439] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f722846b9fc]
[dev-aigc-20:31439] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f7228417476]
[dev-aigc-20:31439] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f72283fd7f3]
[dev-aigc-20:31439] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f722869fb9e]
[dev-aigc-20:31439] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f72286ab20c]
[dev-aigc-20:31439] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f72286aa1e9]
[dev-aigc-20:31439] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f72286aa959]
[dev-aigc-20:31439] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f7229166884]
[dev-aigc-20:31439] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7f72291672dd]
[dev-aigc-20:31439] [10] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x4b109)[0x7f7079315109]
[dev-aigc-20:31439] [11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x10b35b)[0x7f70793d535b]
[dev-aigc-20:31439] [12] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xd4411)[0x7f707939e411]
[dev-aigc-20:31439] [13] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins10GemmPlugin7enqueueEPKN8nvinfer116PluginTensorDescES5_PKPKvPKPvSA_P11CUstream_st+0x107)[0x7f707939ecc7]
[dev-aigc-20:31439] [14] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9)[0x7f70c7ab6ba9]
[dev-aigc-20:31439] [15] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af)[0x7f70c7a8c6af]
[dev-aigc-20:31439] [16] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320)[0x7f70c7a8e320]
[dev-aigc-20:31439] [17] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(_ZN12tensorrt_llm7runtime10GptSession18executeContextStepERKSt6vectorINS0_15GenerationInputESaIS3_EERKS2_IiSaIiEEPKNS_13batch_manager16kv_cache_manager14KVCacheManagerE+0x3b7)[0x7f70c0160747]
[dev-aigc-20:31439] [18] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(_ZN12tensorrt_llm7runtime10GptSession15generateBatchedERSt6vectorINS0_16GenerationOutputESaIS3_EERKS2_INS0_15GenerationInputESaIS7_EERKNS0_14SamplingConfigERKSt8functionIFvibEE+0xad4)[0x7f70c0161744]
[dev-aigc-20:31439] [19] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(_ZN12tensorrt_llm7runtime10GptSession8generateERNS0_16GenerationOutputERKNS0_15GenerationInputERKNS0_14SamplingConfigE+0xc19)[0x7f70c01629c9]
[dev-aigc-20:31439] [20] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xc86f9)[0x7f70c00fc6f9]
[dev-aigc-20:31439] [21] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb0e5e)[0x7f70c00e4e5e]
[dev-aigc-20:31439] [22] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x128023)[0x7f7228951023]
[dev-aigc-20:31439] [23] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(_PyObject_MakeTpCall+0x8c)[0x7f7228908adc]
[dev-aigc-20:31439] [24] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe241a)[0x7f722890b41a]
[dev-aigc-20:31439] [25] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x9d68)[0x7f72288a49c8]
[dev-aigc-20:31439] [26] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af)[0x7f72289eb3af]
[dev-aigc-20:31439] [27] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe23d8)[0x7f722890b3d8]
[dev-aigc-20:31439] [28] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(PyVectorcall_Call+0xa8)[0x7f722890aed8]
[dev-aigc-20:31439] [29] /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x4b16)[0x7f722889f776]
[dev-aigc-20:31439] *** End of error message ***
E0118 09:57:27.290098 30847 backend_model.cc:634] ERROR: Failed to create instance: Stub process 'tensorrt_llm_0_0' is not healthy.
E0118 09:57:27.290204 30847 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: Stub process 'tensorrt_llm_0_0' is not healthy.
I0118 09:57:27.290219 30847 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
I0118 09:57:27.290312 30847 server.cc:592]
...

additional notes

The client I use is based on this code from tensorrtllm_backend (commit 3a61c37):

tools/gpt/client.py

#!/usr/bin/python

import os
import sys
import time
import traceback
from functools import partial

import queue

from tritonclient.utils import InferenceServerException

sys.path.append(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))
import argparse
from datetime import datetime

import numpy as np
from transformers import AutoTokenizer, LlamaTokenizer, T5Tokenizer
from tools.utils import utils

def run_inference(tokenizer):
    tokenizer.pad_token = tokenizer.eos_token
    pad_id = tokenizer.encode(tokenizer.pad_token, add_special_tokens=False)[0]
    end_id = tokenizer.encode(tokenizer.eos_token, add_special_tokens=False)[0]

    line = tokenizer.encode(FLAGS.text)
    input_start_ids = np.array([line], np.int32)
    input_len = np.array([[len(line)]], np.int32)
    inputs = utils.prepare_inputs(input_start_ids, input_len, pad_id, end_id,
                                  FLAGS)

    start_time = datetime.now()

    class UserData:
        def __init__(self):
            self._completed_requests = queue.Queue()

    user_data = UserData()

    def callback(user_data, result, error):
        if error:
            user_data._completed_requests.put(error)
        else:
            user_data._completed_requests.put(result)

    with utils.create_inference_server_client(FLAGS.protocol,
                                              FLAGS.url,
                                              concurrency=FLAGS.concurrency,
                                              verbose=FLAGS.verbose) as client:
        client.start_stream(callback=partial(callback, user_data), stream_timeout=None)
        request_id = np.random.default_rng().integers(
            0, np.iinfo(np.uint64).max, dtype=np.uint64)
        # results = utils.send_requests('tensorrt_llm',
        #                               inputs,
        #                               client,
        #                               request_parallelism=1)
        client.async_stream_infer(
            model_name='tensorrt_llm',
            inputs=inputs,
            request_id=str(request_id),
        )
        client.stop_stream()
        # Parse the responses
        while True:
            try:
                result = user_data._completed_requests.get(block=False)
            except Exception:
                break

            if type(result) == InferenceServerException:
                if result.status() == "StatusCode.CANCELLED":
                    print("Request is cancelled")
                else:
                    print("Received an error from server:")
                    print(result)
                    raise result
            else:
                output_ids = result.as_numpy("output_ids")
                context_log_probs = result.as_numpy('context_log_probs')
                sequence_lengths = result.as_numpy('sequence_lengths')

                stop_time = datetime.now()
                latency = (stop_time - start_time).total_seconds() * 1000.0
                latency = round(latency, 3)
                print(f"[INFO] Latency: {latency} ms")
                print(f"sequence lengths: {sequence_lengths=}")
                # print(f"context logits : {context_log_probs=}")
                output_ids = output_ids.reshape(
                    (output_ids.size,)).tolist()[input_start_ids.shape[1]:]
                output_text = tokenizer.decode(output_ids, skip_special_tokens=True)
                print(f'Input: {FLAGS.text}')
                print(f'Output: {output_text}')
    print()

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-v',
                        '--verbose',
                        action="store_true",
                        required=False,
                        default=False,
                        help='Enable verbose output')
    parser.add_argument('-u',
                        '--url',
                        type=str,
                        required=False,
                        help='Inference server URL.')
    parser.add_argument(
        '-i',
        '--protocol',
        type=str,
        required=False,
        default='http',
        help='Protocol ("http"/"grpc") used to ' +
        'communicate with inference service. Default is "http".')
    parser.add_argument(
        '-t',
        '--text',
        type=str,
        required=False,
        default='Born in north-east France, Soyer trained as a',
        help='Input text')
    parser.add_argument('-c',
                        '--concurrency',
                        type=int,
                        default=1,
                        required=False,
                        help='Specify concurrency')
    parser.add_argument('-beam',
                        '--beam_width',
                        type=int,
                        default=1,
                        required=False,
                        help='Specify beam width')
    parser.add_argument('-topk',
                        '--topk',
                        type=int,
                        default=1,
                        required=False,
                        help='topk for sampling')
    parser.add_argument('-topp',
                        '--topp',
                        type=float,
                        default=0.0,
                        required=False,
                        help='topp for sampling')
    parser.add_argument('-o',
                        '--output_len',
                        type=int,
                        default=10,
                        required=False,
                        help='Specify output length')
    parser.add_argument('--tokenizer_dir',
                        type=str,
                        required=True,
                        help='Specify tokenizer directory')
    parser.add_argument('--tokenizer_type',
                        type=str,
                        default='auto',
                        required=False,
                        choices=['auto', 't5', 'llama'],
                        help='Specify tokenizer type')

    FLAGS = parser.parse_args()
    if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"):
        print(
            "unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format(
                FLAGS.protocol))
        exit(1)

    if FLAGS.url is None:
        FLAGS.url = "localhost:8000" if FLAGS.protocol == "http" else "localhost:8001"

    if FLAGS.tokenizer_type == 't5':
        tokenizer = T5Tokenizer(vocab_file=FLAGS.tokenizer_dir,
                                padding_side='left', trust_remote_code=True)
    elif FLAGS.tokenizer_type == 'auto':
        tokenizer = AutoTokenizer.from_pretrained(FLAGS.tokenizer_dir,
                                                  padding_side='left', trust_remote_code=True)
    elif FLAGS.tokenizer_type == 'llama':
        tokenizer = LlamaTokenizer.from_pretrained(FLAGS.tokenizer_dir,
                                                   legacy=False,
                                                   padding_side='left', trust_remote_code=True)
    else:
        raise AttributeError(
            f'Unexpected tokenizer type: {FLAGS.tokenizer_type}')

    run_inference(tokenizer)
xesdiny commented 8 months ago

Well, I found that I could use AsyncLLMEngine for that.

MartinMarciniszyn commented 8 months ago

Yes, AsyncLLMEngine is the best option. There will also be a standard solution to integrate TRT-LLM with the Python backend of Triton soon. Leaving for @schetlur-nv to comment.