NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
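For context, the high-level Python API referred to above looks roughly like this (a minimal sketch; the model name and prompt are illustrative, see the docs linked above for the full workflow):

from tensorrt_llm import LLM, SamplingParams

# Build or load a TensorRT engine for the given model, then run generation.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)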

Inference RoBERTa on Triton server using TRT_LLM #2440

Open DeekshithaDPrakash opened 1 week ago

DeekshithaDPrakash commented 1 week ago

I am trying to deploy and run inference with the XLM-RoBERTa model on TRT-LLM.

I followed the BERT example guide and built the engine: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/bert

python3 run.py
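(For reference, the flow in that example is roughly the following; the exact flags are in the example README, and the XLM-RoBERTa paths here are mine:)

# Convert the Hugging Face checkpoint into a TensorRT-LLM checkpoint
python3 convert_checkpoint.py --model_dir ./xlm-roberta-base --output_dir ./trt_ckpt
# Build the TensorRT engine from the converted checkpoint
trtllm-build --checkpoint_dir ./trt_ckpt --output_dir ./trt_engines
# Sanity-check the engine locally
python3 run.py --engine_dir ./trt_engines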

However, I am not sure what to do next.

For Llama models there is a detailed guide on passing inputs and running inference, but for BERT models there is no information at all.
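(For reference, that Llama guide exercises the running server along these lines; the endpoint and request fields below are taken from the tensorrtllm_backend README and are shown here only as an illustration:)

curl -X POST localhost:8000/v2/models/ensemble/generate \
     -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'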

So, after building the engine, I tried applying the Llama instructions to the BERT model as follows:

But it throws an error:

python3 /opt/tritonserver/scripts/launch_triton_server.py --world_size 1 --model_repo=/opt/tritonserver/inflight_batcher_llm

I1113 01:00:58.337225 3545 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7fc35e000000' with size 268435456"
I1113 01:00:58.339345 3545 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1113 01:00:58.344421 3545 model_lifecycle.cc:472] "loading: postprocessing:1"
I1113 01:00:58.344454 3545 model_lifecycle.cc:472] "loading: preprocessing:1"
I1113 01:00:58.344522 3545 model_lifecycle.cc:472] "loading: tensorrt_llm:1"
I1113 01:00:58.344558 3545 model_lifecycle.cc:472] "loading: tensorrt_llm_bls:1"
I1113 01:00:58.409244 3545 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I1113 01:00:58.409258 3545 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I1113 01:00:58.459250 3545 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I1113 01:00:58.459275 3545 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I1113 01:00:58.459278 3545 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I1113 01:00:58.459282 3545 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"false\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I1113 01:00:58.466345 3545 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)"
I1113 01:00:58.467964 3545 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] gpt_model_path is not specified, will be left empty
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
E1113 01:00:58.469718 3545 backend_model.cc:692] "ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Both encoder and decoder model paths are empty (/tmp/tritonbuild/tensorrtllm/inflight_batcher_llm/src/model_instance_state.cc:535)\n1       0x7fc3a35e6e5b tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100\n2       0x7fc3a35e6557 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm_common.so(+0x17557) [0x7fc3a35e6557]\n3       0x7fc3a35f5182 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66\n4       0x7fc3a3633589 TRITONBACKEND_ModelInstanceInitialize + 153\n5       0x7fc3bd50efff /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1acfff) [0x7fc3bd50efff]\n6       0x7fc3bd510247 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae247) [0x7fc3bd510247]\n7       0x7fc3bd4f2615 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190615) [0x7fc3bd4f2615]\n8       0x7fc3bd4f2c66 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190c66) [0x7fc3bd4f2c66]\n9       0x7fc3bd4ff5dd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d5dd) [0x7fc3bd4ff5dd]\n10      0x7fc3bcb62ee8 /lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fc3bcb62ee8]\n11      0x7fc3bd4e8d6b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x186d6b) [0x7fc3bd4e8d6b]\n12      0x7fc3bd4fa1fa /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1981fa) [0x7fc3bd4fa1fa]\n13      0x7fc3bd4fea2c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ca2c) [0x7fc3bd4fea2c]\n14      0x7fc3bd5fa8ad /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2988ad) [0x7fc3bd5fa8ad]\n15      0x7fc3bd5fde8c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x29be8c) [0x7fc3bd5fde8c]\n16      0x7fc3bd75c4a2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3fa4a2) [0x7fc3bd75c4a2]\n17      0x7fc3bcdce253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fc3bcdce253]\n18      0x7fc3bcb5dac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc3bcb5dac3]\n19      0x7fc3bcbeea04 clone + 68"
E1113 01:00:58.469764 3545 model_lifecycle.cc:642] "failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Both encoder and decoder model paths are empty (/tmp/tritonbuild/tensorrtllm/inflight_batcher_llm/src/model_instance_state.cc:535)\n1       0x7fc3a35e6e5b tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100\n2       0x7fc3a35e6557 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm_common.so(+0x17557) [0x7fc3a35e6557]\n3       0x7fc3a35f5182 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66\n4       0x7fc3a3633589 TRITONBACKEND_ModelInstanceInitialize + 153\n5       0x7fc3bd50efff /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1acfff) [0x7fc3bd50efff]\n6       0x7fc3bd510247 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae247) [0x7fc3bd510247]\n7       0x7fc3bd4f2615 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190615) [0x7fc3bd4f2615]\n8       0x7fc3bd4f2c66 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190c66) [0x7fc3bd4f2c66]\n9       0x7fc3bd4ff5dd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d5dd) [0x7fc3bd4ff5dd]\n10      0x7fc3bcb62ee8 /lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fc3bcb62ee8]\n11      0x7fc3bd4e8d6b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x186d6b) [0x7fc3bd4e8d6b]\n12      0x7fc3bd4fa1fa /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1981fa) [0x7fc3bd4fa1fa]\n13      0x7fc3bd4fea2c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ca2c) [0x7fc3bd4fea2c]\n14      0x7fc3bd5fa8ad /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2988ad) [0x7fc3bd5fa8ad]\n15      0x7fc3bd5fde8c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x29be8c) [0x7fc3bd5fde8c]\n16      0x7fc3bd75c4a2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3fa4a2) [0x7fc3bd75c4a2]\n17      0x7fc3bcdce253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fc3bcdce253]\n18      0x7fc3bcb5dac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc3bcb5dac3]\n19      0x7fc3bcbeea04 clone + 68"
I1113 01:00:58.469797 3545 model_lifecycle.cc:777] "failed to load 'tensorrt_llm'"
I1113 01:00:58.716439 3545 model_lifecycle.cc:839] "successfully loaded 'tensorrt_llm_bls'"
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
I1113 01:01:00.321537 3545 model_lifecycle.cc:839] "successfully loaded 'preprocessing'"
I1113 01:01:00.327824 3545 model_lifecycle.cc:839] "successfully loaded 'postprocessing'"
E1113 01:01:00.327910 3545 model_repository_manager.cc:703] "Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Both encoder and decoder model paths are empty (/tmp/tritonbuild/tensorrtllm/inflight_batcher_llm/src/model_instance_state.cc:535)\n1       0x7fc3a35e6e5b tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100\n2       0x7fc3a35e6557 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm_common.so(+0x17557) [0x7fc3a35e6557]\n3       0x7fc3a35f5182 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66\n4       0x7fc3a3633589 TRITONBACKEND_ModelInstanceInitialize + 153\n5       0x7fc3bd50efff /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1acfff) [0x7fc3bd50efff]\n6       0x7fc3bd510247 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae247) [0x7fc3bd510247]\n7       0x7fc3bd4f2615 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190615) [0x7fc3bd4f2615]\n8       0x7fc3bd4f2c66 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190c66) [0x7fc3bd4f2c66]\n9       0x7fc3bd4ff5dd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d5dd) [0x7fc3bd4ff5dd]\n10      0x7fc3bcb62ee8 /lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fc3bcb62ee8]\n11      0x7fc3bd4e8d6b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x186d6b) [0x7fc3bd4e8d6b]\n12      0x7fc3bd4fa1fa /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1981fa) [0x7fc3bd4fa1fa]\n13      0x7fc3bd4fea2c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ca2c) [0x7fc3bd4fea2c]\n14      0x7fc3bd5fa8ad /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2988ad) [0x7fc3bd5fa8ad]\n15      0x7fc3bd5fde8c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x29be8c) [0x7fc3bd5fde8c]\n16      0x7fc3bd75c4a2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3fa4a2) [0x7fc3bd75c4a2]\n17      0x7fc3bcdce253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fc3bcdce253]\n18      0x7fc3bcb5dac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc3bcb5dac3]\n19      0x7fc3bcbeea04 clone + 68;"
I1113 01:01:00.327968 3545 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1113 01:01:00.327989 3545 server.cc:631]
+-------------+-------------------------------------------------+-------------------------------------------------+
| Backend     | Path                                            | Config                                          |
+-------------+-------------------------------------------------+-------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_pyt | {"cmdline":{"auto-complete-config":"false","bac |
|             | hon.so                                          | kend-directory":"/opt/tritonserver/backends","m |
|             |                                                 | in-compute-capability":"6.000000","shm-region-p |
|             |                                                 | refix-name":"prefix0_","default-max-batch-size" |
|             |                                                 | :"4"}}                                          |
|             |                                                 |                                                 |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtrito | {"cmdline":{"auto-complete-config":"false","bac |
|             | n_tensorrtllm.so                                | kend-directory":"/opt/tritonserver/backends","m |
|             |                                                 | in-compute-capability":"6.000000","default-max- |
|             |                                                 | batch-size":"4"}}                               |
|             |                                                 |                                                 |
+-------------+-------------------------------------------------+-------------------------------------------------+

I1113 01:01:00.328028 3545 server.cc:674]
+------------------+---------+------------------------------------------------------------------------------------+
| Model            | Version | Status                                                                             |
+------------------+---------+------------------------------------------------------------------------------------+
| postprocessing   | 1       | READY                                                                              |
| preprocessing    | 1       | READY                                                                              |
| tensorrt_llm     | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorR |
|                  |         | T-LLM][ERROR] Assertion failed: Both encoder and decoder model paths are empty (/t |
|                  |         | mp/tritonbuild/tensorrtllm/inflight_batcher_llm/src/model_instance_state.cc:535)   |
|                  |         | (stack trace trimmed; it is identical to the one logged above)                     |
| tensorrt_llm_bls | 1       | READY                                                                              |
+------------------+---------+------------------------------------------------------------------------------------+

I1113 01:01:00.401583 3545 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA A100 80GB PCIe"
I1113 01:01:00.408347 3545 metrics.cc:770] "Collecting CPU metrics"
I1113 01:01:00.408459 3545 tritonserver.cc:2598]
+----------------------------------+--------------------------------------------------------------------------------+
| Option                           | Value                                                                          |
+----------------------------------+--------------------------------------------------------------------------------+
| server_id                        | triton                                                                         |
| server_version                   | 2.49.0                                                                         |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) s |
|                                  | chedule_policy model_configuration system_shared_memory cuda_shared_memory bin |
|                                  | ary_tensor_data parameters statistics trace logging                            |
| model_repository_path[0]         | /opt/tritonserver/inflight_batcher_llm                                         |
| model_control_mode               | MODE_NONE                                                                      |
| strict_model_config              | 1                                                                              |
| model_config_name                |                                                                                |
| rate_limit                       | OFF                                                                            |
| pinned_memory_pool_byte_size     | 268435456                                                                      |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                       |
| min_supported_compute_capability | 6.0                                                                            |
| strict_readiness                 | 1                                                                              |
| exit_timeout                     | 30                                                                             |
| cache_enabled                    | 0                                                                              |
+----------------------------------+--------------------------------------------------------------------------------+

I1113 01:01:00.408493 3545 server.cc:305] "Waiting for in-flight requests to complete."
I1113 01:01:00.408501 3545 server.cc:321] "Timeout 30: Found 0 model versions that have in-flight inferences"
I1113 01:01:00.408961 3545 server.cc:336] "All models are stopped, unloading models"
I1113 01:01:00.408971 3545 server.cc:345] "Timeout 30: Found 3 live models and 0 in-flight non-inference requests"
I1113 01:01:01.409145 3545 server.cc:345] "Timeout 29: Found 3 live models and 0 in-flight non-inference requests"
Cleaning up...
Cleaning up...
Cleaning up...
I1113 01:01:01.659903 3545 model_lifecycle.cc:624] "successfully unloaded 'tensorrt_llm_bls' version 1"
I1113 01:01:01.916216 3545 model_lifecycle.cc:624] "successfully unloaded 'postprocessing' version 1"
I1113 01:01:01.941044 3545 model_lifecycle.cc:624] "successfully unloaded 'preprocessing' version 1"
I1113 01:01:02.409426 3545 server.cc:345] "Timeout 28: Found 0 live models and 0 in-flight non-inference requests"
error: creating server: Internal - failed to load all models
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[39537,1],0]
  Exit code:    1

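From the warnings above, it looks like the ${...} placeholders in the model repo's config.pbtxt files were never filled in: gpt_model_path and encoder_model_path are left empty (which is exactly what the assertion complains about), and the preprocessing model still sees the literal ${add_special_tokens}. I assume they have to be populated with tools/fill_template.py from tensorrtllm_backend, something like the sketch below (key names as in the tensorrtllm_backend README; the values and paths are mine, and I am not sure whether an encoder-only model should use engine_dir here or an encoder path instead):

# Point the tensorrt_llm model at the built engine
python3 tools/fill_template.py -i inflight_batcher_llm/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:4,decoupled_mode:False,max_beam_width:1,engine_dir:/opt/tritonserver/bert_engine,batching_strategy:inflight_fused_batching
# Point the tokenizer-based preprocessing model at the tokenizer
python3 tools/fill_template.py -i inflight_batcher_llm/preprocessing/config.pbtxt \
    tokenizer_dir:./xlm-roberta-base,triton_max_batch_size:4,preprocessing_instance_count:1,add_special_tokens:True

Even so, I am not sure what the intended next step is for a BERT-style encoder engine.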
It would be great if somebody could guide me!

ZhihaoGu11 commented 8 hours ago

I have encountered the same issue. Could someone kindly point out where the problem might be?