NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

CUDA runtime error in cudaMemcpyAsync when enabling kv cache reuse with prompt table and TP > 1. #2417

Closed: jxchenus closed this issue 4 days ago

jxchenus commented 2 weeks ago

System Info

AWS EC2 instance: g6e.48xlarge
TensorRT-LLM v0.13.0
Triton Inference Server v2.50.0
NVIDIA 24.09-py3-min used as the base image for the Docker template

Who can help?

@xuanzic

Reproduction

  1. Start docker container built with the above components with option --gpus '"device=4,5,6,7"'
  2. Build the model with the following:

    CONTAINER_LLM_CKPT_DIR="/home/agm-models/m3_288k/llm_only/"
    CONTAINER_LLM_REPO_DIR="/home/agm-models/m3_288k/agm/"
    LLM_TP_DEGREE=2

    trtllm-build --checkpoint_dir $CONTAINER_LLM_CKPT_DIR/ckpt_tp$LLM_TP_DEGREE/ \
        --gemm_plugin bfloat16 \
        --gpt_attention_plugin bfloat16 \
        --max_batch_size 1 \
        --output_dir $CONTAINER_LLM_REPO_DIR/tensorrt_llm/1/engine/ \
        --max_beam_width 1 \
        --max_input_len 1280 \
        --max_num_tokens 2048 \
        --max_prompt_embedding_table_size 1024 \
        --context_fmha enable \
        --remove_input_padding enable \
        --bert_attention_plugin bfloat16 \
        --paged_kv_cache enable \
        --use_paged_context_fmha enable \
        --use_fused_mlp enable \
        --max_seq_len 2048 \
        --max_multimodal_len 4096

  3. Configure the Triton server Python backend and use ModelRunnerCpp to execute inference requests, with the following in `model.py`:
              with torch.no_grad():
                    outputs = self.runner.generate(
                        inputs["input_ids"],
                        prompt_table=inputs["prompt_embedding_table"],
                        input_token_extra_ids=inputs["input_token_extra_ids"],
                        **sampling_params,
                    )
  4. Use `mpirun` to start `tritonserver` with the following options:
            f"--grpc-port={grpc_port}",
            "--reuse-grpc-port=1",
            f"--http-port={http_port}",
            "--reuse-http-port=1",
            f"--metrics-port={metrics_port}",
            # Following option avoids hang in python backend stub
            "--disable-auto-complete-config",
            f"--backend-config=python,shm-region-prefix-name={shm_region_prefix}_{i}_",
  5. Issue an inference request using `grpcclient.InferenceServerClient` (a condensed client sketch follows these steps):
                    triton_client.async_stream_infer(
                        model_name=MODEL_NAME,
                        inputs=inputs,
                        outputs=outputs,
                        request_id=f"request_{index}",
                        sequence_id=sequence_id,
                        sequence_start=True,
                        sequence_end=True,
                        enable_empty_final_response=True,
                    ) 
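
For completeness, here is a condensed sketch of the client side used in step 5. The model name, tensor names, shapes, and port are placeholders from my setup rather than anything prescribed by the backend, and the real request also carries the prompt embedding table consumed by `model.py` above:

    import numpy as np
    import tritonclient.grpc as grpcclient

    MODEL_NAME = "my_model"   # placeholder model name
    URL = "localhost:8001"    # placeholder; matches the --grpc-port passed to tritonserver

    def callback(result, error):
        # Streaming callback: surface errors or print the (possibly empty) final response.
        if error is not None:
            print(f"stream error: {error}")
        else:
            print(result.get_response())

    triton_client = grpcclient.InferenceServerClient(url=URL)
    triton_client.start_stream(callback=callback)

    # Illustrative inputs only; the real request also sets the prompt table tensors.
    input_ids = np.array([[1, 2, 3, 4]], dtype=np.int32)
    inputs = [grpcclient.InferInput("input_ids", list(input_ids.shape), "INT32")]
    inputs[0].set_data_from_numpy(input_ids)
    outputs = [grpcclient.InferRequestedOutput("output_ids")]

    triton_client.async_stream_infer(
        model_name=MODEL_NAME,
        inputs=inputs,
        outputs=outputs,
        request_id="request_0",
        sequence_id=1,
        sequence_start=True,
        sequence_end=True,
        enable_empty_final_response=True,
    )
    triton_client.stop_stream()  # close the stream once responses have been handled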

Expected behavior

Inference should succeed.

Actual behavior

Inference hangs on the client. The following stack trace is logged on tritonserver:

[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] CUDA runtime error in cudaMemcpyAsync(dst, src.data(), src.getSizeInBytes(), cudaMemcpyDefault, mStream->get()): invalid argument (/opt/amazon/alexa_triton_inference_engine/NeMoRT-TensorRT-LLM/cpp/tensorrt_llm/runtime/bufferManager.cpp:151)
1       0x7f2327480e25 void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 149
2       0x7f2329627d51 tensorrt_llm::batch_manager::PromptTuningBuffers::fill(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, tensorrt_llm::runtime::BufferManager const&, bool) + 1201
3       0x7f232962e996 tensorrt_llm::batch_manager::RuntimeBuffers::setFromInputs(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int, int, tensorrt_llm::batch_manager::DecoderBuffers&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, tensorrt_llm::batch_manager::rnn_state_manager::RnnStateManager*, std::map<unsigned long, std::shared_ptr<std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > > > > > const&, tensorrt_llm::runtime::TllmRuntime const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&) + 8406
4       0x7f2329631fc1 tensorrt_llm::batch_manager::RuntimeBuffers::prepareStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int, int, tensorrt_llm::batch_manager::DecoderBuffers&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, tensorrt_llm::batch_manager::rnn_state_manager::RnnStateManager*, std::map<unsigned long, std::shared_ptr<std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > > > > > const&, tensorrt_llm::runtime::TllmRuntime const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&) + 177
5       0x7f23296549c4 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) + 164
6       0x7f2329654bce tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 222
7       0x7f23296555ec tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 2492
8       0x7f232967c035 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 405
9       0x7f23296813cb tensorrt_llm::executor::Executor::Impl::executionLoop() + 1179
10      0x7f24f2ad9253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f24f2ad9253]
11      0x7f24f2868ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f24f2868ac3]
12      0x7f24f28f9a04 clone + 68

Additional notes

With the same model, configuration and model.py code, if I just change TP to 1 when converting and building the model, inference succeeds.

hello-11 commented 2 weeks ago

@jxchenus Thanks for your interest in TRT-LLM. Can you try using KV cache reuse according to the doc: https://nvidia.github.io/TensorRT-LLM/advanced/kv-cache-reuse.html

jxchenus commented 2 weeks ago

@hello-11 Thank you for sharing the doc link. Which part of the doc did you think I missed following?

hello-11 commented 2 weeks ago

@jxchenus If you are running a Triton server, you can enable kv cache reuse with a parameter: parameters: { key: "enable_kv_cache_reuse" value: { string_value: "true" } }
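
In the model's config.pbtxt that corresponds to an entry like the following excerpt (only this parameter block; the rest of the config stays unchanged):

    parameters: {
      key: "enable_kv_cache_reuse"
      value: {
        string_value: "true"
      }
    }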

jxchenus commented 2 weeks ago

Thanks @hello-11 !

Here's how I enabled kv cache reuse:

        self.runner = runner_cls.from_dir(
            engine_dir=engine_dir,
            rank=self.rank,
            kv_cache_enable_block_reuse=True,
        )

Here runner_cls is ModelRunnerCpp. I tried removing the line kv_cache_enable_block_reuse=True, and got the same error as reported above.

Let me try to use the parameter "enable_kv_cache_reuse" and see what I get. Thank you again!

jxchenus commented 2 weeks ago

I have tried adding parameters: { key: "enable_kv_cache_reuse" value: { string_value: "true" } } to config.pbtxt for my model, and passing it on to the sampling parameters. As long as TP>1, it produces the same error.

Funatiq commented 2 weeks ago

@jxchenus If I understood correctly, the error also occurs without KV cache reuse? So the issue is about the prompt table with TP > 1?

jxchenus commented 2 weeks ago

@Funatiq Thank you for looking into this!

I have a test case where the model is built with the `--use_paged_context_fmha enable` option, the runner is instantiated without the `kv_cache_enable_block_reuse=True` argument, and `runner.generate` is called without the `input_token_extra_ids` option; the same stack trace is still dumped.

Please let me know if you'd like me to try building the model differently to see if it reproduces.

akhoroshev commented 2 weeks ago

Same problem with prompt table and TP=4: https://github.com/NVIDIA/TensorRT-LLM/issues/2358 (KV cache reuse is disabled).

jxchenus commented 1 week ago

I can reproduce the same error reported in https://github.com/NVIDIA/TensorRT-LLM/issues/2417 by running a Python script with `mpirun` outside of tritonserver.

I am running with TP=2, and the root node is the one that logs this error stack, while rank 1 completes the generation successfully. This is consistent with what I'm seeing in tritonserver.
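
A condensed sketch of that script, launched as `mpirun -n 2 python3 repro.py`; the engine path, hidden size, vocabulary range, and token ids below are placeholders for my setup, not values required by the API:

    # repro.py -- minimal standalone repro sketch for prompt table + TP=2
    import torch

    import tensorrt_llm
    from tensorrt_llm.runtime import ModelRunnerCpp

    ENGINE_DIR = "/home/agm-models/m3_288k/agm/tensorrt_llm/1/engine/"  # placeholder path
    HIDDEN_SIZE = 4096          # placeholder; must match the engine's hidden size
    NUM_VIRTUAL_TOKENS = 1024   # <= --max_prompt_embedding_table_size used at build time

    rank = tensorrt_llm.mpi_rank()
    runner = ModelRunnerCpp.from_dir(
        engine_dir=ENGINE_DIR,
        rank=rank,
        kv_cache_enable_block_reuse=True,  # the error also reproduces without this flag
    )

    # Dummy prompt embedding table; dtype matches the bfloat16 engine. In the real
    # script the input ids also include the virtual-token ids that reference this table.
    prompt_table = torch.rand((1, NUM_VIRTUAL_TOKENS, HIDDEN_SIZE), dtype=torch.bfloat16)
    batch_input_ids = [torch.randint(0, 1000, (64,), dtype=torch.int32)]

    with torch.no_grad():
        outputs = runner.generate(
            batch_input_ids,
            prompt_table=prompt_table,
            max_new_tokens=32,
            end_id=2,  # placeholder EOS/pad ids
            pad_id=2,
        )

    if rank == 0:
        print(outputs.shape)  # rank 0 receives the output ids; other ranks just participate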

akhoroshev commented 1 week ago

This is not a Triton problem; I absolutely agree with you, @jxchenus (I don't use Triton).

jxchenus commented 4 days ago

I just tested inside a new container built with TensorRT-LLM@main 535c9cc, and confirm that this bug is no longer reproducible.

I was also provided with a patch, but the update is inside some closed-source code.

xuanzic commented 4 days ago

@akhoroshev We root-caused this as an issue in TRT-LLM, and it is resolved in the latest main (it will also be included in the next stable release). Please try it and see if it works for you. Thanks!