To have better parallelism, change the forward loop into a single call, so that the multiple queries can be computed in parallel by the LLM engine.
response = pipe([(query, image) * 10], gen_config=gen_config)
Thanks for your reply! However, I still get the same problem. Checking with nvtop, the CPU memory usage looks like this:
I'm sorry, I didn't describe it clearly. By the way, I tried to modify it as you suggested, but got the following error, since my actual query is not that short. @irexyc
response = pipe([(query, image) * 10], gen_config=gen_config)
should be
response = pipe([(query, image)] * 10, gen_config=gen_config)
Using batch inference will reduce the total time cost.
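The difference between the two calls is easy to miss: in the first form the * 10 applies to the tuple itself, so the pipeline receives a single long tuple instead of ten (prompt, image) pairs. A tiny standalone illustration of the two shapes (the placeholder values below are only for demonstration):

# (query, image) * 10 is tuple repetition: one 20-element tuple inside a list.
# [(query, image)] * 10 is list repetition: ten separate (prompt, image) pairs.
query, image = "describe the image", object()  # placeholders for illustration
wrong = [(query, image) * 10]
right = [(query, image)] * 10
print(len(wrong), len(wrong[0]))   # 1 20  -> one malformed item
print(len(right), len(right[0]))   # 10 2  -> ten (prompt, image) pairs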
The lmdeploy (turbomind) backend doesn't use multi-process. Each process has one PID, and the 1673% indicates that the process consumes about 16.7 CPU cores. If a process used only a single CPU core, the value would be 100%.
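For reference, per-process CPU percentages in tools like top or nvtop are summed over all threads, so a single PID can report far more than 100%. A hypothetical way to confirm this for the lmdeploy process (psutil and the placeholder PID are assumptions, not part of this thread):

# Hypothetical check: one multi-threaded process can report > 100% CPU.
import psutil

pid = 12345  # replace with the lmdeploy process PID shown in nvtop/top
proc = psutil.Process(pid)
print("threads:", proc.num_threads())
# cpu_percent is aggregated across all cores; ~1673% means ~16.7 cores busy
print("cpu %:", proc.cpu_percent(interval=1.0))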
@irexyc I see. So if I want to use multiple processes at the same time while running this code:
response = pipe([(query, image)] * 10, gen_config=gen_config)
pipe = pipeline(model_path, backend_config=engine_config, chat_template_config=ChatTemplateConfig(model_name='internlm-xcomposer2-4khd'))
It won't work, since lmdeploy doesn't support that yet?
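For what it's worth, combining the two does not require extra Python processes: the tp=4 engine shards the weights across GPUs, and the batched call relies on the engine's internal threading, as described above. A minimal sketch, assuming the model path, image path, and model name used elsewhere in this thread:

# Sketch: tensor parallelism (tp=4) plus a batched call, no multiprocessing.
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

engine_config = TurbomindEngineConfig(model_format='awq', tp=4)
pipe = pipeline('/workspace/awq/ixc2-4bit',
                backend_config=engine_config,
                chat_template_config=ChatTemplateConfig(model_name='internlm-xcomposer2-4khd'))

image = load_image('/workspace/data/image.png')
query = '描述这张图片中的内容'  # "Describe the content of this image"
gen_config = GenerationConfig(max_new_tokens=256)

# One call with ten (prompt, image) pairs; the engine batches them internally.
responses = pipe([(query, image)] * 10, gen_config=gen_config)
for r in responses:
    print(r.text)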
My aim is to deploy the AWQ model on multiple GPUs, with a CPU worker driving each GPU, to further improve inference speed; however, all I can achieve right now is a deployment driven by a single CPU.
By the way, my hardware is 4 x A100 40G.
My problem is that the inference speed is the same on a single GPU and on multiple GPUs. How can I improve it? I would greatly appreciate it if you could help me, thanks!
Thank you for your amazing work. I am learning to use lmdeploy, and I have successfully implemented AWQ 4-bit quantization of internlm-xcomposer2-4KHD and used the pipeline for inference. However, when I used the tp parameter to deploy to multiple GPUs, I found that the model only distributed the weights evenly across the specified GPUs, while the inference process still used only a single CPU. How can I solve this problem? Thank you so much if you can help me.
test code:
import datetime
import os

import torch
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

engine_config = TurbomindEngineConfig(
    model_format='awq',
    cache_max_entry_count=0.8,
    tp=4,
)
model_path = '/workspace/awq/ixc2-4bit'
pipe = pipeline(model_path,
                backend_config=engine_config,
                chat_template_config=ChatTemplateConfig(model_name='internlm-xcomposer2-4khd'))
gen_config = GenerationConfig(
    temperature=0.8,
)

image_path = "/workspace/data/image.png"
image = load_image(image_path)
prompt = "描述这张图片中的内容"  # "Describe the content of this image"

# Warm up
for i in range(10):
    print("Warm up...[{}/10]".format(i + 1))
    image = load_image(image_path)
    query = prompt
    gen_config = GenerationConfig(max_new_tokens=256)  # overrides the config above
    response = pipe((query, image), gen_config=gen_config)

# Timed run
times = 10
total_tokens = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = pipe((prompt, image), gen_config=gen_config)
    print(response.text)
    total_tokens += response.generate_token_len
end_time = datetime.datetime.now()
end_max_mem = torch.cuda.max_memory_allocated()

infer_time = end_time - start_time
infer_time = infer_time.seconds + infer_time.microseconds / 1000000.0
infer_speed = total_tokens / infer_time
per_infer_time = infer_time / times
print(f'Inference speed: {infer_speed:.2f} tokens/second')
print(f'Inference time: {per_infer_time:.2f} seconds/time')
print(f"Max Memory Used: {end_max_mem / 1024**2:.2f} MB")
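Applying the earlier batching suggestion to this benchmark, the timed loop could be replaced with a single batched call; a sketch that reuses the pipe, prompt, image, gen_config, and times objects defined in the test code above:

# Sketch: issue the ten timed requests as one batched call instead of a loop.
start_time = datetime.datetime.now()
responses = pipe([(prompt, image)] * times, gen_config=gen_config)
end_time = datetime.datetime.now()

total_tokens = sum(r.generate_token_len for r in responses)
batched_time = (end_time - start_time).total_seconds()
print(f'Batched inference speed: {total_tokens / batched_time:.2f} tokens/second')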