THUDM / CogVLM2

GPT4V-level open-source multi-modal model based on Llama3-8B
Apache License 2.0

Problems running CogVLM2 on Google Colab Runtime T4 #133

Open KirilAngelov opened 1 month ago

KirilAngelov commented 1 month ago

System Info / 系統信息

Hello, I am having problems running CogVLM2 on the T4 instance provided by Google Colab. Here are the relevant versions: CUDA 12.2, transformers 4.41.2, xformers 0.0.26.post1.

Uploading a single JPG image and setting the correct image path and prompt results in a CUDA out-of-memory error.

I have tried running both THUDM/cogvlm2-llama3-chat-19B and THUDM/cogvlm2-llama3-chat-19B-int4, with 4-bit quantization in both cases, and I always get this error: OutOfMemoryError: CUDA out of memory. Tried to allocate 1.11 GiB. GPU

Here is some additional information: [screenshots attached]
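In case it is useful, a small check like the one below (plain torch.cuda calls, not part of the reproduction script above) prints how much of the T4's memory is in use at the point of failure:

import torch

# Illustrative memory check for a single-GPU Colab runtime; not part of the original script.
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, total memory: {props.total_memory / 1024**3:.1f} GiB")
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.1f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.1f} GiB")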

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    low_cpu_mem_usage=True
).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('You did not enter image path, the following will be a plain text conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("Human:")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                old_prompt = ''
                for _, (old_query, response) in enumerate(history):
                    old_prompt += old_query + " " + response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)
        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,
        }
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("<|end_of_text|>")[0]
            print("\nCogVLM2:", response)
        history.append((query, response))

Expected behavior / 期待表现

The expected behavior is that CogVLM2 runs, because the official documentation says that with 4-bit quantization the model should run on a GPU with 16 GB of memory.
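(Roughly: 19B parameters at 4 bits per parameter is about 9.5 GB of weights, so a 16 GB T4 should have headroom for activations once the model is actually resident in 4-bit.)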

zRzRzRzRzRzRzR commented 1 month ago
[screenshot attached]

I tested the latest code, and the int4 model only uses about 10 GB of VRAM in total. But I suspect usage exceeds 16 GB during loading. Try changing the script to load the model into CPU memory first and then move it to GPU memory, instead of loading it directly onto the GPU.
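One way to read this suggestion in transformers terms is to let accelerate place the weights via a device_map with a max_memory cap, so loading is staged through CPU RAM instead of materializing everything on the GPU at once. The sketch below is only an illustration of that idea on an assumed 16 GB T4 Colab runtime; the memory limits are guesses, not tested values, and it is not a confirmed fix for this issue:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"

# Sketch: cap GPU memory during loading and let accelerate stage/offload the rest
# through CPU RAM. The max_memory values are assumptions for a 16 GB T4.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,  # the T4 (compute capability 7.5) has no bfloat16 support
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        # assumption: if any layers spill to CPU, keep them un-quantized there
        llm_int8_enable_fp32_cpu_offload=True,
    ),
    low_cpu_mem_usage=True,
    device_map="auto",  # let accelerate place weights instead of loading straight onto the GPU
    max_memory={0: "14GiB", "cpu": "24GiB"},
).eval()

Whether this avoids the loading-time memory spike on the T4 would still need to be verified.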

KirilAngelov commented 1 month ago

Hello, thank you for the information. I am not sure how to load the model and then transfer it to video memory; I would appreciate any example or help with getting the model to run. It would also be nice if the official documentation included these instructions for moving the model to video memory after loading, so that users can run the example scripts successfully.