zhaocc1106 opened this issue 2 weeks ago
@symphonylyh, could you please take a look at this?
Hi @zhaocc1106, where is this BuildPromptTuningForImages call? Did you implement it yourself?
In parallel, we currently plan to add end-to-end executor support for multimodal models. Once that's done, I think your case will work fine as well.
Yes, BuildPromptTuningForImages is my own function. I use the C++ API.
I encountered the same problem:
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaFreeHost(ptr): unspecified launch failure (/home/askhoroshev/tensorrt-llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:177)
1 0x7feb8af5e962 /home/askhoroshev/tensorrt-llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x70b962) [0x7feb8af5e962]
2 0x7feb8cc99fad virtual thunk to tensorrt_llm::runtime::GenericTensor<tensorrt_llm::runtime::PinnedAllocator>::~GenericTensor() + 125
3 0x7feb8d1453be tensorrt_llm::batch_manager::PromptTuningBuffers::fill(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, tensorrt_llm::runtime::BufferManager const&, bool) + 3758
4 0x7feb8d14d78f tensorrt_llm::batch_manager::RuntimeBuffers::setFromInputs(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int, int, tensorrt_llm::batch_manager::DecoderBuffers&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, tensorrt_llm::batch_manager::rnn_state_manager::RnnStateManager*, std::map<unsigned long, std::shared_ptr<std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > > > > > const&, tensorrt_llm::runtime::TllmRuntime const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&) + 8063
5 0x7feb8d14e052 tensorrt_llm::batch_manager::RuntimeBuffers::prepareStep[abi:cxx11](std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int, int, tensorrt_llm::batch_manager::DecoderBuffers&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, tensorrt_llm::batch_manager::rnn_state_manager::RnnStateManager*, std::map<unsigned long, std::shared_ptr<std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > > > > > const&, tensorrt_llm::runtime::TllmRuntime const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&) + 178
6 0x7feb8d16f6d4 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) + 164
7 0x7feb8d16f8de tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 222
8 0x7feb8d17003c tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1788
9 0x7feb8d19c231 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 353
10 0x7feb8d1a113f tensorrt_llm::executor::Executor::Impl::executionLoop() + 895
11 0x7feb772d1a80 /home/askhoroshev/tensorrt-llm/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32c5a80) [0x7feb772d1a80]
12 0x7feb2e2941ca /lib64/libpthread.so.0(+0x81ca) [0x7feb2e2941ca]
13 0x7feb2d5c08d3 clone + 67
TP 4, a LLaMA-like model, and the executor API.
I'm passing a pinned tensor as the embeddings. If I pass a kCpu tensor instead, everything is fine.
I guess a synchronization point is missing in the batch_manager code, because a transfer from kCpu to kGpu implies an implicit sync, but a transfer from kPinned to kGpu (and from kGpu to kGpu) does not.
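To illustrate the difference outside of TRT-LLM, here is a minimal CUDA-runtime sketch (buffer sizes and names are arbitrary; it only demonstrates the copy semantics, not the actual batch_manager code path):

```cpp
// Standalone illustration of the sync difference (pure CUDA runtime, no TRT-LLM).
#include <cuda_runtime.h>
#include <vector>

int main()
{
    constexpr size_t kBytes = 1 << 20;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    void* devDst = nullptr;
    cudaMalloc(&devDst, kBytes);

    // Pageable host source: for H2D, cudaMemcpyAsync is synchronous with respect to
    // the host (the data is staged before the call returns), so the source can be
    // reused or freed right away.
    std::vector<char> pageable(kBytes, 0);
    cudaMemcpyAsync(devDst, pageable.data(), kBytes, cudaMemcpyHostToDevice, stream);

    // Pinned host source: the copy is truly asynchronous; freeing or overwriting the
    // source before the stream finishes is a race.
    void* pinned = nullptr;
    cudaMallocHost(&pinned, kBytes);
    cudaMemcpyAsync(devDst, pinned, kBytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream); // the explicit sync point that must exist somewhere
    cudaFreeHost(pinned);          // safe only after the synchronization above

    cudaFree(devDst);
    cudaStreamDestroy(stream);
    return 0;
}
```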
@MartinMarciniszyn @symphonylyh
@zhaocc1106 try passing a kCPU tensor instead of a kGPU one as a workaround
But my vit_embedding is the output of a TensorRT engine. It's in GPU device memory, and I copy it into TRT-LLM GPU memory with a D2D copy; copying to the CPU would waste time.
It's strange that if the first request has no image, the following image requests are OK.
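For reference, the copy path I mean is roughly the following sketch (pure CUDA; the stream and buffer names are illustrative, the real buffers come from the ViT TensorRT engine and the TRT-LLM runtime):

```cpp
// Sketch of handing the ViT output to the LLM side without a CPU round trip.
#include <cuda_runtime.h>

void handOffEmbeddings(void* ptuningTable, const void* vitOutput, size_t numBytes,
                       cudaStream_t vitStream, cudaStream_t llmStream)
{
    cudaEvent_t vitDone;
    cudaEventCreateWithFlags(&vitDone, cudaEventDisableTiming);

    // Make sure the ViT engine has finished writing its output before it is read.
    cudaEventRecord(vitDone, vitStream);
    cudaStreamWaitEvent(llmStream, vitDone, 0);

    // Device-to-device copy into the buffer used as the prompt-tuning table.
    cudaMemcpyAsync(ptuningTable, vitOutput, numBytes, cudaMemcpyDeviceToDevice, llmStream);

    // If the consumer runs on a different stream (or assumes the data is already
    // resident), an explicit sync here prevents it from seeing a half-written table.
    cudaStreamSynchronize(llmStream);

    cudaEventDestroy(vitDone);
}
```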
I know, but it works for me )
Thanks, I will try.
@Superjomn @MartinMarciniszyn any updates here?
There is a critical error in the PromptTuningBuffers::fill function: https://github.com/NVIDIA/TensorRT-LLM/issues/1917
ENV:
ISSUE: I use the C++ API of "tensorrt_llm/batch_manager/" to deploy a multimodal LLM. I build the TensorRT-LLM engine with --tp 4 and deploy the service on 4 GPUs. If the first request is an image request, the following error occurs; but if the first request has no image, subsequent image requests work fine. Moreover, if I deploy on a single GPU, the first image request also works. An image request means it has a prompt-tuning table input, like the code below:
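A simplified sketch of that helper follows (illustrative only: the buffer handling uses plain CUDA, float is a placeholder for the real dtype, and the names are not the actual ones; the real code then passes the table and the fake prompt ids on to the batch_manager request):

```cpp
// Illustrative sketch of a BuildPromptTuningForImages-style helper (not the actual code).
// vitOutput: ViT engine output on the GPU, shape [numImageTokens, hiddenSize].
#include <cuda_runtime.h>
#include <cstdint>
#include <numeric>
#include <vector>

struct PromptTuningInput
{
    void* embeddingTable{nullptr};      // device copy of the image embeddings
    std::vector<int32_t> fakePromptIds; // "virtual" token ids >= vocabSize indexing the table
};

PromptTuningInput buildPromptTuningForImages(const void* vitOutput, int numImageTokens,
                                             int hiddenSize, int vocabSize, cudaStream_t stream)
{
    PromptTuningInput out;
    size_t numBytes = static_cast<size_t>(numImageTokens) * hiddenSize * sizeof(float);

    // D2D copy of the ViT output into the buffer that becomes the prompt-tuning table.
    cudaMalloc(&out.embeddingTable, numBytes);
    cudaMemcpyAsync(out.embeddingTable, vitOutput, numBytes, cudaMemcpyDeviceToDevice, stream);
    cudaStreamSynchronize(stream); // explicit sync before anything else touches the table

    // The image positions in the prompt get ids vocabSize, vocabSize+1, ... so the
    // runtime looks them up in the table instead of the word embedding.
    out.fakePromptIds.resize(numImageTokens);
    std::iota(out.fakePromptIds.begin(), out.fakePromptIds.end(), vocabSize);
    return out;
}
```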