NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

fused_multihead_attention_v2 CUDA Error: CUDA_ERROR_INVALID_VALUE #1965

Open inkinworld opened 3 months ago

inkinworld commented 3 months ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

Load a Llama model, then send requests; a rough sketch of the request-sending side is shown below for illustration.
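The exact serving setup is not included in the report (the traces below point at the C++ GptManager / in-flight batching path), so the following is only a minimal sketch of sending requests through the high-level Python LLM API; the model path is a placeholder and the API shape may differ across versions.

# Minimal sketch, assuming the high-level Python LLM API.
# The model path is a placeholder, not the deployment used in this report.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/path/to/llama-checkpoint-or-engine")  # placeholder path
sampling = SamplingParams(temperature=0.8, top_p=0.95)

prompts = ["Hello, how are you?", "Summarize TensorRT-LLM in one sentence."]
for output in llm.generate(prompts, sampling):
    # each result carries the generated continuation for one prompt
    print(output.outputs[0].text)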

After serving some requests, the following error occurs:

CUDA Error: CUDA_ERROR_INVALID_VALUE

/app/tensorrt_llm/cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_v2.h 481
https://github.com/NVIDIA/TensorRT-LLM/blob/250d9c293d5edbc2a45c20775b3150b1eb68b364/cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_v2.h#L481

batch_manager log:

[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaEventSynchronize(get()): an illegal memory access was encountered (/app/tensorrt_llm/cpp/include/tensorrt_llm/runtime/cudaEvent.h:66)
1       0x7ff4b80c7305 void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 149
2       0x7ff41db12ae0 tensorrt_llm::runtime::GptDecoderBatch::forwardSync(tensorrt_llm::runtime::decoder_batch::Token const&) + 96
3       0x7ff41dc39d36 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 7622
4       0x7ff41dbea4d4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
5       0x7ff41dbf252f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
6       0x7ff4c05f1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7ff4c05f1253]
7       0x7ff4c0380ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7ff4c0380ac3]
8       0x7ff4c0412850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7ff4c0412850]
[TensorRT-LLM][WARNING] Step function failed, continuing.

Expected behavior

Requests should be processed normally and responses returned.

Actual behavior

An exception is thrown and requests can no longer be processed.

Additional notes

[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: hasValues == configValue.has_value() (/app/tensorrt_llm/cpp/include/tensorrt_llm/runtime/samplingConfig.h:46)
1       0x7faf940d9f31 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7faef9b2b008 tensorrt_llm::runtime::SamplingConfig::SamplingConfig(std::vector<tensorrt_llm::runtime::SamplingConfig, std::allocator<tensorrt_llm::runtime::SamplingConfig> > const&) + 1720
3       0x7faef9b1e514 tensorrt_llm::runtime::GptDecoderBatch::newRequests(std::vector<int, std::allocator<int> > const&, std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator<tensorrt_llm::runtime::decoder_batch::Request> > const&, std::vector<tensorrt_llm::runtime::SamplingConfig, std::allocator<tensorrt_llm::runtime::SamplingConfig> > const&) + 404
4       0x7faef9c37203 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&, std::vector<unsigned long, std::allocator<unsigned long> > const&) + 851
5       0x7faef9c394e7 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 5495
6       0x7faef9bea4d4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
7       0x7faef9bf252f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
8       0x7faf9bdf1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7faf9bdf1253]
9       0x7faf9bb80ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7faf9bb80ac3]
10      0x7faf9bc12850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7faf9bc12850]
[TensorRT-LLM][WARNING] Step function failed, continuing.
CUDA Error: CUDA_ERROR_INVALID_VALUE /app/tensorrt_llm/cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_v2.h 481
[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaEventSynchronize(get()): an illegal memory access was encountered (/app/tensorrt_llm/cpp/include/tensorrt_llm/runtime/cudaEvent.h:66)
1       0x7ff8cc198305 void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 149
2       0x7ff83db12ae0 tensorrt_llm::runtime::GptDecoderBatch::forwardSync(tensorrt_llm::runtime::decoder_batch::Token const&) + 96
3       0x7ff83dc39d36 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 7622
4       0x7ff83dbea4d4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
5       0x7ff83dbf252f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
6       0x7ff8e09f1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7ff8e09f1253]
7       0x7ff8e0780ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7ff8e0780ac3]
8       0x7ff8e0812850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7ff8e0812850]
[TensorRT-LLM][ERROR] Encountered error for requestId 144103273: Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaEventSynchronize(get()): an illegal memory access was encountered (/app/tensorrt_llm/cpp/include/tensorrt_llm/runtime/cudaEvent.h:66)
1       0x7ff8cc198305 void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 149
2       0x7ff83db12ae0 tensorrt_llm::runtime::GptDecoderBatch::forwardSync(tensorrt_llm::runtime::decoder_batch::Token const&) + 96
3       0x7ff83dc39d36 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 7622
4       0x7ff83dbea4d4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
5       0x7ff83dbf252f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
6       0x7ff8e09f1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7ff8e09f1253]
7       0x7ff8e0780ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7ff8e0780ac3]
8       0x7ff8e0812850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7ff8e0812850]
[TensorRT-LLM][ERROR] Encountered error for requestId 1997879650: Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaEventSynchronize(get()): an illegal memory access was encountered (/app/tensorrt_llm/cpp/include/tensorrt_llm/runtime/cudaEvent.h:66)
1       0x7ff8cc198305 void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 149
2       0x7ff83db12ae0 tensorrt_llm::runtime::GptDecoderBatch::forwardSync(tensorrt_llm::runtime::decoder_batch::Token const&) + 96
3       0x7ff83dc39d36 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 7622
4       0x7ff83dbea4d4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
5       0x7ff83dbf252f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
6       0x7ff8e09f1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7ff8e09f1253]
7       0x7ff8e0780ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7ff8e0780ac3]
8       0x7ff8e0812850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7ff8e0812850]
juney-nvidia commented 3 months ago

@inkinworld Hi, have you tried the latest main branch to see whether the issue still exists? Thanks, June
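As a hedged aside, one quick way to confirm which TensorRT-LLM version is actually loaded before retesting (the exact upgrade path depends on how it was installed; building from the main branch follows the repository's docker/build instructions rather than pip):

# Sketch only: check the installed version before re-running the reproduction.
# A newer prebuilt wheel could be pulled with, for example:
#   pip3 install --upgrade tensorrt_llm --extra-index-url https://pypi.nvidia.com
import tensorrt_llm
print(tensorrt_llm.__version__)  # confirm the version the runtime actually imports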

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.