InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

terminate called after throwing an instance of 'std::runtime_error' #2230

Closed: lyc728 closed this issue 3 months ago

lyc728 commented 4 months ago

Inference with the minicpmv2.5 model fails with:

terminate called after throwing an instance of 'std::runtime_error' what(): [TM][ERROR] CUDA runtime error: CUBLAS_STATUS_NOT_SUPPORTED /lmdeploy/src/turbomind/utils/cublasMMWrapper.cc:307

Aborted (core dumped)

However, inference with Internvl2 works fine.

lyc728 commented 4 months ago

The native inference code now also fails with an error.

(screenshot of the error attached)
lvhan028 commented 4 months ago

Please share the output of lmdeploy check_env and the code to reproduce the issue.

lyc728 commented 4 months ago

Solved.

irexyc commented 4 months ago

@lyc728 How did you solve it?

lyc728 commented 4 months ago

lmdeploy can't run batched inference, otherwise GPU memory overflows; swift can handle it.

lvhan028 commented 3 months ago

lmdeploy can't batch? lmdeploy has always supported continuous batching. What exactly do you mean by batching?
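
For reference, a minimal sketch of what a batched call to the pipeline looks like (the checkpoint path and image files below are placeholders, not taken from this issue): a list of (prompt, image) pairs is submitted in a single call, and the engine schedules the requests internally with continuous batching.

from lmdeploy import pipeline, ChatTemplateConfig
from lmdeploy.vl import load_image

# placeholder checkpoint path and image files; substitute your own
pipe = pipeline('MiniCPM-Llama3-V-2_5', chat_template_config=ChatTemplateConfig('llama3'))
images = [load_image(p) for p in ['cat.jpg', 'dog.jpg']]

# a single call with a list of (prompt, image) tuples; the engine batches
# and schedules the requests internally
responses = pipe([('describe this image', im) for im in images])
for r in responses:
  print(r.text)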

lyc728 commented 3 months ago

pipe = pipeline(model_path, chat_template_config=ChatTemplateConfig('llama3'))
response = pipe([(ask, img_path), (ask, img_path)])

With two requests like this, GPU memory keeps growing.

lvhan028 commented 3 months ago

Do you mean that GPU memory keeps rising when you call response = pipe([(ask, img_path), (ask, img_path)]) in a loop?

lyc728 commented 3 months ago

Yes.

lvhan028 commented 3 months ago

@zhulinJulia24 may add this case to the memory-check test sets

irexyc commented 3 months ago

I can't reproduce it. It looks quite stable on my side; when I just call those two requests in a loop, GPU memory does not change.

lyc728 commented 3 months ago

Don't make every request identical. In my case each request is different, and GPU memory blows up once it reaches 40 GB.

irexyc commented 3 months ago

@lyc728

Are you using two GPUs? How much GPU memory is left after the model is loaded?

lyc728 commented 3 months ago

A single GPU.

irexyc commented 3 months ago

> A single GPU.

How many GB are left after the model is loaded?

lyc728 commented 3 months ago

About 4 GB remain.

irexyc commented 3 months ago

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
import numpy as np

pipe = pipeline('/home/chenxin/MiniCPM-Llama3-V-2_5/', backend_config=TurbomindEngineConfig(cache_max_entry_count=0.8))
im = load_image('/home/chenxin/ws3/vl/tiger.jpeg')

def random_resize(im):
  # pick a random width/height so every request carries a differently sized image
  length = list(range(384, 2056, 32))
  offset = list(range(-16, 16))
  width = np.random.choice(length) + np.random.choice(offset)
  height = np.random.choice(length) + np.random.choice(offset)
  res = im.resize((width, height))
  return res

while True:
  # keep sending batches of 4 prompts, each with a freshly resized image,
  # and watch whether GPU memory grows across iterations
  ask = 'describe this image'
  batch_size = 4
  batch_data = []
  for _ in range(batch_size):
    im_rnd = random_resize(im)
    batch_data.append((ask, im_rnd))
  response = pipe(batch_data)

I can't reproduce it here. Compared with right after loading, inference uses about 2 GB more. If the GPU itself has little memory left, fragmentation may cause higher consumption; you could try lowering cache_max_entry_count to 0.4.
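
As a concrete illustration of that suggestion, here is a minimal sketch (the checkpoint and image paths are placeholders) that lowers cache_max_entry_count, i.e. the fraction of the GPU memory remaining after weight loading that the k/v cache may occupy, from the default 0.8 to 0.4.

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# reserve only 40% of the post-loading free GPU memory for the k/v cache,
# leaving more headroom for the vision encoder and activation peaks
engine_config = TurbomindEngineConfig(cache_max_entry_count=0.4)
pipe = pipeline('MiniCPM-Llama3-V-2_5', backend_config=engine_config)

im = load_image('tiger.jpeg')
response = pipe(('describe this image', im))
print(response.text)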

lyc728 commented 3 months ago

Adding this limit fixed it (cache_max_entry_count=0.8).

irexyc commented 3 months ago

@lyc728

If you don't set this value, the default is already 0.8 😂

lyc728 commented 3 months ago

I changed it to 0.4.

irexyc commented 3 months ago

OK, so it looks like the issue is a GPU memory peak relative to right after the model starts up. I'll close this issue for now.