LDLINGLINGLING / adan_application

Personal project repository: applications of large language models and multimodal models.

MiniCPM-V_2_6_awq_int4 uses about 20G of VRAM #7

Open smilebetterworld opened 2 hours ago

smilebetterworld commented 2 hours ago

I pulled the quantized model with `git clone https://www.modelscope.cn/models/linglingdan/MiniCPM-V_2_6_awq_int4`. Running inference with this INT4 model, VRAM usage is about 20G, essentially the same as with the fp model. Could there be a problem with the quantization?
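As a quick disk-level sanity check on the quantization itself: an AWQ INT4 checkpoint of the 8B MiniCPM-V 2.6 should be far smaller on disk than the roughly 16GB an fp16 checkpoint occupies. A minimal sketch (it assumes the weights are stored as `.safetensors` shards under the path used in this issue):

```python
# Sanity check: total size of the downloaded weight files on disk.
# An AWQ INT4 quantization of an ~8B-parameter model should be well
# under the ~16GB that an fp16 checkpoint would occupy.
from pathlib import Path

model_dir = Path("/data/MiniCPM-V_2_6_awq_int4")  # path from this issue
total_bytes = sum(f.stat().st_size for f in model_dir.rglob("*.safetensors"))
print(f"weight files: {total_bytes / 1024**3:.2f} GiB")
```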

smilebetterworld commented 2 hours ago

The inference code is:

```python
from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
import time
import GPUtil

# List of image file paths
IMAGES = [
    "/data/1666770191808_crop_0.jpg",  # local image path
]

# Model name or path
MODEL_NAME = "/data/MiniCPM-V_2_6_awq_int4"

# Open and convert the image
image = Image.open(IMAGES[0]).convert("RGB")

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Initialize the language model
llm = LLM(model=MODEL_NAME,
          gpu_memory_utilization=1,  # use all of the GPU memory
          trust_remote_code=True,
          max_model_len=2048)  # adjust this value according to available memory

# Build the conversation messages
question = "extract only raw text from the given image.Don't add any information or commentary."
messages = [{'role': 'user', 'content': '(<image>./</image>)\n' + question}]

# Apply the chat template to the messages
prompt = tokenizer.apply_chat_template(messages,
                                       tokenize=False,
                                       add_generation_prompt=True)

# Set the stop token IDs
stop_tokens = ['<|im_end|>', '<|endoftext|>']
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]

# Set the generation parameters
sampling_params = SamplingParams(
    stop_token_ids=stop_token_ids,
    max_tokens=1024,
    temperature=0,
    best_of=1)

st = time.time()

# Run the model
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    }
}, sampling_params=sampling_params)

latency = time.time() - st
gpus = GPUtil.getGPUs()
for gpu in gpus:
    print(f"GPU ID: {gpu.id}, GPU load: {round(gpu.load*100,2)}%, "
          f"Memory Total: {gpu.memoryTotal}MB, Memory Used: {gpu.memoryUsed}MB, "
          f"Memory Free: {gpu.memoryFree}MB")
print('latency is {} seconds'.format(latency))
```
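Note that vLLM pre-allocates `gpu_memory_utilization` × total GPU memory at startup for the weights plus KV-cache blocks, so with `gpu_memory_utilization=1` the GPUtil reading reflects that reservation rather than the size of the INT4 weights. A minimal sketch of the same initialization with a lower fraction (0.5 here is only an illustrative value, not a recommendation):

```python
# Sketch: vLLM reserves gpu_memory_utilization * total VRAM up front,
# so the GPUtil reading above mostly measures this reservation.
# Lowering the fraction shrinks the reported usage accordingly.
from vllm import LLM

llm = LLM(model="/data/MiniCPM-V_2_6_awq_int4",
          trust_remote_code=True,
          max_model_len=2048,
          gpu_memory_utilization=0.5)  # illustrative value
```

If the model still loads and the reported occupancy drops to roughly half, the quantized weights themselves are much smaller than 20G and the original reading was dominated by the pre-allocated KV cache.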