InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

Unable to infer on multiple CPUs #2008

Closed bonre closed 3 months ago

bonre commented 3 months ago

Thank you for your amazing work. I am learning to use lmdeploy. I have successfully applied awq 4-bit quantization to internlm-xcomposer2-4KHD and run inference with the pipeline. However, when I used the tp parameter to deploy for multi-GPU inference, I found that the model only distributed the weights evenly across the specified GPUs, while the inference process still used only a single CPU. How can I solve this problem? Thank you so much if you can help me solve it.

test code:

```python
import datetime
import os

import torch

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

engine_config = TurbomindEngineConfig(
    model_format='awq',
    cache_max_entry_count=0.8,
    tp=4,
)

model_path = '/workspace/awq/ixc2-4bit'
pipe = pipeline(model_path,
                backend_config=engine_config,
                chat_template_config=ChatTemplateConfig(model_name='internlm-xcomposer2-4khd'))
gen_config = GenerationConfig(
    temperature=0.8,
    # top_p=1.0,
    # top_k=1.0,
    # random_seed=4114,
    max_new_tokens=1024,
)

image_path = "/workspace/data/image.png"
image = load_image(image_path)
prompt = "描述这张图片中的内容"  # "Describe the content of this image"

# warm up
for i in range(10):
    print("Warm up...[{}/10]".format(i + 1))
    query = prompt
    gen_config = GenerationConfig(max_new_tokens=256)
    response = pipe((query, image), gen_config=gen_config)

# timed inference
times = 10
total_tokens = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = pipe((prompt, image), gen_config=gen_config)
    print(response.text)
    total_tokens += response.generate_token_len
end_time = datetime.datetime.now()
end_max_mem = torch.cuda.max_memory_allocated()

infer_time = end_time - start_time
infer_time = infer_time.seconds + infer_time.microseconds / 1000000.0
infer_speed = total_tokens / infer_time
per_infer_time = infer_time / times
print(f'Inference speed: {infer_speed:.2f} tokens/second')
print(f'Inference time: {per_infer_time:.2f} seconds/time')
print(f"Max Memory Used: {end_max_mem / 1024**2:.2f} MB")
```

irexyc commented 3 months ago

To get better parallelism, change the forward loop into a single call so that the multiple queries can be computed in parallel by the LLM engine:

response = pipe([(query, image) * 10], gen_config=gen_config)
bonre commented 3 months ago

To get better parallelism, change the forward loop into a single call so that the multiple queries can be computed in parallel by the LLM engine:

response = pipe([(query, image) * 10], gen_config=gen_config)

Thanks for your reply! However, I still get the same problem. Checking with nvtop, the CPU usage looks like this: [screenshot]

I'm sorry I didn't describe it clearly. BTW, I tried the modification you suggested, but got the following error: [screenshot], since my actual query is not that short. @irexyc

irexyc commented 3 months ago

`response = pipe([(query, image) * 10], gen_config=gen_config)` should be `response = pipe([(query, image)] * 10, gen_config=gen_config)`
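For context, a quick sketch of why the two forms differ (plain Python semantics; the names below are just placeholders):

```python
query, image = "describe the image", object()  # placeholder prompt and image

# (query, image) * 10 repeats the tuple itself, so the outer list holds
# a single 20-element tuple instead of 10 requests
bad = [(query, image) * 10]
assert len(bad) == 1 and len(bad[0]) == 20

# [(query, image)] * 10 repeats the list element, giving 10 (prompt, image) pairs
good = [(query, image)] * 10
assert len(good) == 10 and len(good[0]) == 2
```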

Using batch inference will reduce the total time cost.

The lmdeploy (TurboMind) backend doesn't use multiple processes. Each process has one PID, and the 1673% indicates that the process consumes about 16.7 CPU cores. If the process used only a single CPU core, the value would be 100%.
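Here is a minimal sketch of the timing section rewritten as one batched call, assuming the pipeline returns a list of Response objects when given a list of (prompt, image) pairs, and reusing `pipe`, `prompt`, `image`, and `gen_config` from the script above:

```python
import datetime

# submit the 10 identical requests in a single batched call
prompts = [(prompt, image)] * 10

start_time = datetime.datetime.now()
responses = pipe(prompts, gen_config=gen_config)  # list in, list of responses out
end_time = datetime.datetime.now()

total_tokens = sum(r.generate_token_len for r in responses)
infer_time = (end_time - start_time).total_seconds()
print(f'Inference speed: {total_tokens / infer_time:.2f} tokens/second')
```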

bonre commented 3 months ago

@irexyc I see. So if I want to use multiple processes at the same time while running this code:

```python
pipe = pipeline(model_path,
                backend_config=engine_config,
                chat_template_config=ChatTemplateConfig(model_name='internlm-xcomposer2-4khd'))
response = pipe([(query, image)] * 10, gen_config=gen_config)
```

it won't work, since lmdeploy doesn't support that yet? My aim is to deploy the AWQ model on multiple GPUs, with inference using every GPU's CPU, to further improve inference speed; however, what I can achieve now is a deployment that only uses a single GPU's CPU. BTW, my device is A100 40G * 4. My problem is that the inference speed is the same on a single GPU and on multiple GPUs; how can I improve it? I would greatly appreciate it if you could help me with this, thanks!!!