InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Feature] support llava1.5 w4a16 model? the model is much slower than the original fp16 model #1342

Open ganliqiang opened 4 months ago

ganliqiang commented 4 months ago

Motivation

When I run:

pipe = pipeline('liuhaotian/llava-v1.5-13b',
                chat_template_config=ChatTemplateConfig(model_name='vicuna'),
                cache_max_entry_count=0.1)

it sometimes gives different answers, but the original model does not. My prompt is task + choices + format:

task = "your task is to find out what actions and events are included in the given image?"
choices = '''\nA. Someone are fighting\nB. Climbing the tree\nC. Climbing the wall \nD. occupying roads to management and sell things\nE. Hanging clothes along the street\nF. Someone fell down, lay or sit on the ground \nG. Climbing over a guardrail on the street\nH. Haphazard piles of materials \nI. Talking on phone\nJ. Someone is smoking \nK. None of the above'''
format = "\nAnswer with one or more option's letters from the given choices directly."

Is there a bug in this version, or am I not using it correctly? Also, the model is about 3.0x slower than the original fp16 model.
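For reference, the three pieces described above assemble into a single prompt string. A minimal sketch using the exact strings from the comment (the variable is renamed to `answer_format` here only to avoid shadowing Python's built-in `format`):

```python
# Build the VQA prompt as task + choices + answer format, per the comment above.
task = "your task is to find out what actions and events are included in the given image?"

choices = '''
A. Someone are fighting
B. Climbing the tree
C. Climbing the wall
D. occupying roads to management and sell things
E. Hanging clothes along the street
F. Someone fell down, lay or sit on the ground
G. Climbing over a guardrail on the street
H. Haphazard piles of materials
I. Talking on phone
J. Someone is smoking
K. None of the above'''

# Renamed from `format` to avoid shadowing the built-in.
answer_format = "\nAnswer with one or more option's letters from the given choices directly."

prompt = task + choices + answer_format
```

The resulting `prompt` is what gets paired with the image in `pipe((prompt, image), ...)`.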

Related resources

No response

Additional context

No response

lzhangzz commented 4 months ago

What GPU model are you using?

ganliqiang commented 4 months ago

> What GPU model are you using?

A100 80G. The complete code is:

from lmdeploy import pipeline, ChatTemplateConfig, GenerationConfig
from lmdeploy.vl import load_image
import time

gen_config = GenerationConfig(top_k=1, temperature=0)
pipe = pipeline('liuhaotian/llava-v1.5-13b',
                chat_template_config=ChatTemplateConfig(model_name='vicuna'),
                cache_max_entry_count=0.1)

image = load_image(img_path)
vqa_time = time.time()
response = pipe((prompt, image), gen_config=gen_config)
print(f"vqa_time: {time.time() - vqa_time}")
print(response)
ans = find_ans(response.text)
print(ans)

I find that the GPU is idle most of the time, so I guess most of the time is spent preprocessing the image?
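One way to check that hypothesis is to time image preprocessing and generation separately instead of wrapping both in a single timer. A minimal sketch, assuming the `pipe`, `img_path`, `prompt`, and `gen_config` objects from the snippet above (the `timed` helper itself is hypothetical, not part of lmdeploy):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, records):
    """Record the wall-clock time of the enclosed block under `label`."""
    start = time.perf_counter()
    yield
    records[label] = time.perf_counter() - start

# Usage against the pipeline above (assumes `pipe`, `img_path`,
# `prompt`, and `gen_config` from the earlier snippet are in scope):
#
# records = {}
# with timed("load_image", records):
#     image = load_image(img_path)
# with timed("generate", records):
#     response = pipe((prompt, image), gen_config=gen_config)
# print(records)  # compare preprocessing time vs. generation time
```

If `load_image` (or the vision encoder inside the first `pipe` call) dominates, the slowdown is in preprocessing rather than in the w4a16 decoding itself.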