[Bug] vl pipeline triggers cudaMemcpyAsync ERROR illegal memory access #1813

Open pupumao opened 3 months ago

pupumao commented 3 months ago


Describe the bug

We use vl_pipeline, following the Batch prompts inference example.

We got this error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] CUDA runtime error: an illegal memory access was encountered /lmdeploy/src/turbomind/models/llama/LlamaBatch.h:134

here is the code in LlamaBatch.h:

    // analogous to `std::copy_n`
    template<typename U>
    U* Copy(const U* src, size_t count, U* dst)
    {
        check_cuda_error(cudaMemcpyAsync(dst, src, sizeof(U) * count, cudaMemcpyDefault, stream_));
        return dst += count;
    }


command to reproduce: CUDA_VISIBLE_DEVICES=0 python

from lmdeploy import pipeline, ChatTemplateConfig
from lmdeploy.messages import VisonConfig
from lmdeploy.vl import load_image
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', log_level='INFO',vision_config=VisonConfig(max_batch_size=1))
image_urls=['' for i in range(40)]
prompt_list = [('describe this image', load_image(img_url)) for img_url in image_urls]
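for i in range(100):
    # The loop body is missing from the original post; a batch call like this is assumed.
    response = pipe(prompt_list)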

We found that the smaller max_batch_size is, the easier it is to reproduce this issue; if the size of prompt_list is equal to max_batch_size, the issue basically does not occur. We also found that it cannot be reproduced on an A100.
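To make the batch-size observation concrete, here is a minimal sketch of how 40 prompts would be split if at most `max_batch_size` images are encoded per forward pass. The `split_batches` helper is hypothetical and only illustrates the arithmetic; it is not lmdeploy's scheduling code.

```python
def split_batches(num_items, max_batch_size):
    """Return the sizes of consecutive batches of at most max_batch_size items."""
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    full, rest = divmod(num_items, max_batch_size)
    return [max_batch_size] * full + ([rest] if rest else [])


# 40 prompts with max_batch_size=1 means 40 separate encoder passes.
print(len(split_batches(40, 1)))
# 40 prompts with max_batch_size=40 means a single encoder pass.
print(len(split_batches(40, 40)))
```

If the crash depends on encoder work overlapping with language-model work, more encoder passes per request list would give it more chances to occur. That would be consistent with the observation above, but it is an interpretation, not a confirmed cause.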

## hardware
nvidia L40(sm 89)

## python env
conda create -n lmdeploy python=3.10


### conda list -n lmdeploy

lvhan028 commented 3 months ago

@AllentDan is it related to #1789 ?

AllentDan commented 3 months ago

> @AllentDan is it related to #1789 ?

No. That one only handles stuck problems, not illegal memory access. And as mentioned above, the bug cannot be reproduced on A100.

pupumao commented 3 months ago

There is a peculiar pattern: the issue seems to arise when inference of the torch module inside ImageEncoder and inference of the turbomind language model run at the same time. In vl_async_engine, if I use a torch rand tensor for the features and skip ImageEncoder inference, the problem does not occur. My experimental model is Llava. If I start a completely independent thread in Llava to continuously loop and perform encode_images, this issue will occur more quickly.
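The experiment described above (one thread looping over image encoding while generation runs elsewhere) can be sketched as a small stress harness. This is only an illustration under assumptions: `run_concurrently`, `fake_encode`, and `fake_generate` are hypothetical names, and the two placeholder callables would have to be replaced by the real encoder call and pipeline call.

```python
import threading
import time


def run_concurrently(encode_fn, generate_fn, duration_s=0.2):
    """Loop encode_fn in a background thread while generate_fn loops in the caller.

    Returns (encode_calls, generate_calls) made within duration_s seconds.
    """
    stop = threading.Event()
    counts = {"encode": 0, "generate": 0}

    def encode_loop():
        # Keep calling the encoder until the main loop signals a stop.
        while not stop.is_set():
            encode_fn()
            counts["encode"] += 1

    worker = threading.Thread(target=encode_loop, daemon=True)
    worker.start()
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline:
            generate_fn()
            counts["generate"] += 1
    finally:
        stop.set()
        worker.join()
    return counts["encode"], counts["generate"]


# Placeholders only: swap in the real vision-encoder call and pipeline call.
def fake_encode():
    time.sleep(0.001)


def fake_generate():
    time.sleep(0.001)


print(run_concurrently(fake_encode, fake_generate, duration_s=0.05))
```

Replacing only one of the two callables with a stub at a time mirrors the torch rand experiment: if the crash disappears when the encoder is stubbed out, the overlap between the two workloads is the likely trigger.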

irexyc commented 3 months ago

> My experimental model is Llava. If I start a completely independent thread in Llava to continuously loop and perform encode_images, this issue will occur more quickly.

@pupumao Could you share your experimental code?

The ImageEncoder code was updated recently. Could you try the latest code and see if the issue still happens?

pupumao commented 3 months ago

> My experimental model is Llava. If I start a completely independent thread in Llava to continuously loop and perform encode_images, this issue will occur more quickly.
>
> @pupumao Could you share your experimental code?
>
> The ImageEncoder code was updated recently. Could you try the latest code and see if the issue still happens?

I tried the latest code; the issue still happens.

pupumao commented 3 months ago

> My experimental model is Llava. If I start a completely independent thread in Llava to continuously loop and perform encode_images, this issue will occur more quickly.
>
> @pupumao Could you share your experimental code? The ImageEncoder code was updated recently. Could you try the latest code and see if the issue still happens?
>
> I tried the latest code; the issue still happens.

@irexyc I cloned the latest code from GitHub and built it from source, then used this llava experiment code with self.start_work(), which starts a separate thread for inference, and also got the error.

pupumao commented 3 months ago

I added a traceback to the C++ code and got two different error positions for "an illegal memory access" in different experiments:

stack trace:
  .../lmdeploy/lmdeploy/lib/ : turbomind::LlamaBatch<__nv_bfloat16>::Finish(turbomind::GenerationState&)+0x16a
  .../lmdeploy/lmdeploy/lib/ : turbomind::LlamaBatch<__nv_bfloat16>::InternalThreadEntry(int)+0x982
  .../lmdeploy/lmdeploy/lib/ : ()+0x27da84
  /lib64/ : ()+0x7ea5
  /lib64/ : clone()+0x6d
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] CUDA runtime error: an illegal memory access was encountered .../lmdeploy/src/turbomind/models/llama/LlamaBatch.h:136

Aborted (core dumped)

stack trace:
  .../lmdeploy/lmdeploy/lib/ : turbomind::NcclGuard::~NcclGuard()+0x106
  .../lmdeploy/lmdeploy/lib/ : turbomind::LlamaBatch<__nv_bfloat16>::AllocatePersistantBuffer(unsigned long, int)+0xa0b
  .../lmdeploy/lmdeploy/lib/ : turbomind::LlamaBatch<__nv_bfloat16>::LlamaBatch(turbomind::EngineParams const&, int, int, turbomind::LlamaV2<__nv_bfloat16>*)+0x771
  .../lmdeploy/lmdeploy/lib/ : turbomind::LlamaV2<__nv_bfloat16>::LlamaV2(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, float, turbomind::LlamaAttentionParams const&, int, int, int, int, bool, turbomind::EngineParams const&, turbomind::LoraParams const&, std::shared_ptr<turbomind::LlamaV2<__nv_bfloat16>::SharedState>, turbomind::LlamaWeight<__nv_bfloat16>*, turbomind::NcclParam, CUstream_st*, turbomind::cublasMMWrapper*, turbomind::IAllocator*, bool, cudaDeviceProp*)+0x448
  .../lmdeploy/lmdeploy/lib/ : LlamaTritonModel<__nv_bfloat16>::createSharedModelInstance(int, int, std::pair<std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> >, std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> > >, std::shared_ptr<turbomind::AbstractCustomComm>)+0x5af
  .../lmdeploy/lmdeploy/lib/ : LlamaTritonModel<__nv_bfloat16>::createModelInstance(int, int, CUstream_st*, std::pair<std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> >, std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> > >, std::shared_ptr<turbomind::AbstractCustomComm>)+0x6bf
  .../lmdeploy/lmdeploy/lib/ : ()+0xb506d
  .../lmdeploy/lmdeploy/lib/ : ()+0xcdfa2
  python : ()+0x1445a6
  python : _PyObject_MakeTpCall()+0x26b
  python : ()+0x150866
  python : _PyEval_EvalFrameDefault()+0x4c12
  python : ()+0x1506d8
  python : _PyEval_EvalFrameDefault()+0x2d83
  python : _PyFunction_Vectorcall()+0x6c
  python : _PyEval_EvalFrameDefault()+0x72c
  python : _PyFunction_Vectorcall()+0x6c
  python : _PyEval_EvalFrameDefault()+0x72c
  python : ()+0x150804
  python : ()+0x228372
  python : ()+0x228324
  /lib64/ : ()+0x7ea5
  /lib64/ : clone()+0x6d
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] CUDA runtime error: an illegal memory access was encountered .../lmdeploy/src/turbomind/utils/

Aborted (core dumped)

lzhangzz commented 3 months ago

Thanks for investigating the problem!

Please set the environment variable TM_DEBUG_LEVEL=DEBUG before trying to get the stack trace. It synchronizes kernel launches to get the accurate position of where things go wrong.
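One way to apply this without editing the shell session is to launch the reproduction script in a child process with the variable set. The sketch below is an assumption-laden convenience, not part of lmdeploy: `debug_env` and `run_repro` are hypothetical helpers, `repro.py` is a placeholder file name, and `CUDA_LAUNCH_BLOCKING=1` is a standard CUDA runtime variable added here as an optional extra that was not requested in this thread.

```python
import os
import subprocess
import sys


def debug_env(base=None):
    """Return a copy of the environment with debugging variables set."""
    env = dict(os.environ if base is None else base)
    # Requested above: makes turbomind synchronize so errors surface at their source.
    env["TM_DEBUG_LEVEL"] = "DEBUG"
    # Optional extra (assumption): makes CUDA kernel launches blocking.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_repro(script_path):
    """Run a reproduction script in a child process and return its exit code."""
    result = subprocess.run([sys.executable, script_path], env=debug_env(), check=False)
    return result.returncode


# Example (placeholder file name): exit_code = run_repro("repro.py")
```

Setting the variables in the child's environment, rather than inside the script, ensures they are in place before any CUDA or turbomind library is loaded.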

pupumao commented 3 months ago

> Thanks for investigating the problem!
>
> Please set the environment variable TM_DEBUG_LEVEL=DEBUG before trying to get the stack trace. It synchronizes kernel launches to get the accurate position of where things go wrong.

@lzhangzz here is part of the log for the other failed case:

2024-06-24 14:08:56,256 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.044s
2024-06-24 14:08:56,256 - lmdeploy - INFO - ImageEncoder done 1 images, left 0 images.
2024-06-24 14:08:56,257 - lmdeploy - INFO - ImageEncoder received 1 images, left 1 images.
2024-06-24 14:08:56,257 - lmdeploy - INFO - ImageEncoder process 1 images, left 0 images.
2024-06-24 14:08:56,257 - lmdeploy - INFO - prompt="A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <IMAGE_TOKEN>\ndescribe this image ASSISTANT:", gen_config=EngineGenerationConfig(n=1, max_new_tokens=1, top_p=0.8, top_k=40, temperature=0.8, repetition_penalty=1.0, ignore_eos=False, random_seed=14629431361338060508, stop_words=[2], bad_words=None, min_new_tokens=None, skip_special_tokens=True, logprobs=None), prompt_token_id=[1, 319, 13563, 1546, 263, 12758, 5199, 322, 385, 23116, 21082, 20255, 29889, 450, 20255, 4076, 8444, 29892, 13173, 29892, 322, 1248, 568, 6089, 304, 278, 5199, 29915, 29879, 5155, 29889, 3148, 1001, 29901, 29871, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 29871, 13, 2783, 29581, 445, 1967, 319, 1799, 9047, 13566, 29901], adapter_name=None.
2024-06-24 14:08:56,257 - lmdeploy - INFO - session_id=7, history_tokens=0, input_tokens=622, max_new_tokens=1, seq_start=True, seq_end=True, step=0, prep=True
[TM][DEBUG] Set logger level by DEBUG
[TM][DEBUG] std::shared_ptr<std::unordered_map<std::basic_string<char>, triton::Tensor> > LlamaTritonModelInstance<T>::forward(std::shared_ptr<std::unordered_map<std::basic_string<char>, triton::Tensor> >, turbomind::AbstractInstanceComm*) [with T = __half]
[TM][DEBUG] std::unordered_map<std::basic_string<char>, turbomind::Tensor> LlamaTritonModelInstance<T>::convert_inputs(std::shared_ptr<std::unordered_map<std::basic_string<char>, triton::Tensor> >) [with T = __half]
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x8c7027c00 with size 16416
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x8c702be00 with size 32
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: CORRID
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = long unsigned int] start
[TM][DEBUG] getVal with type x, but data type is: u8
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = long unsigned int; size_t = long unsigned int] start
[TM][DEBUG] getVal with type x, but data type is: u8
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: START
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: END
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: STOP
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
stack trace:
  .../lmdeploy/lmdeploy/lib/ : turbomind::LlamaBatch<__half>::Finish(turbomind::GenerationState&)+0x16a
  .../lmdeploy/lmdeploy/lib/ : turbomind::LlamaBatch<__half>::InternalThreadEntry(int)+0x982
  .../lmdeploy/lmdeploy/lib/ : ()+0x27da84
  /lib64/ : ()+0x7ea5
  /lib64/ : clone()+0x6d
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] CUDA runtime error: an illegal memory access was encountered .../lmdeploy/src/turbomind/models/llama/LlamaBatch.h:136

Aborted (core dumped)
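When comparing crash logs from several runs, it can help to reduce each one to its top stack frame and the location reported in `what()`. The following is a minimal sketch that assumes the log layout shown in this thread; `crash_sites` and both regular expressions are hypothetical helpers written for this format only.

```python
import re

# "<library path> : <symbol>+0x<offset>" as printed under "stack trace:".
FRAME_RE = re.compile(r"^\s*\S+ : (?P<symbol>.*)\+0x[0-9a-f]+\s*$")
# The last whitespace-free token of the what() line is the reported location.
WHAT_RE = re.compile(r"what\(\):\s+\[TM\]\[ERROR\] CUDA runtime error: (?P<msg>.+) (?P<location>\S+)$")


def crash_sites(log_text):
    """Return a list of (top_frame_symbol, reported_location), one per crash."""
    sites = []
    top = None
    expecting_top = False
    for line in log_text.splitlines():
        if line.strip() == "stack trace:":
            expecting_top, top = True, None
            continue
        if expecting_top:
            # The first line after "stack trace:" is the innermost frame.
            match = FRAME_RE.match(line)
            top = match.group("symbol") if match else None
            expecting_top = False
            continue
        match = WHAT_RE.search(line)
        if match:
            sites.append((top, match.group("location")))
            top = None
    return sites


sample = """stack trace:
  .../lmdeploy/lmdeploy/lib/ : turbomind::LlamaBatch<__half>::Finish(turbomind::GenerationState&)+0x16a
  .../lmdeploy/lmdeploy/lib/ : turbomind::LlamaBatch<__half>::InternalThreadEntry(int)+0x982
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] CUDA runtime error: an illegal memory access was encountered .../lmdeploy/src/turbomind/models/llama/LlamaBatch.h:136
"""
print(crash_sites(sample))
```

Applied to the traces in this thread, such a summary makes it easy to see whether a run with TM_DEBUG_LEVEL=DEBUG fails at the same place as a run without it.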