BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0

Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0 #43

Closed albert-haam closed 4 months ago

albert-haam commented 4 months ago

Dear,

I'm struggling to get the sample code working on my laptop with an Nvidia A2000 (8GB) card.

Does anyone have any advice?

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
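
As far as I understand, the error just means that a matrix multiplication received one operand on the CPU and one on cuda:0. A minimal illustration (not from my script, just to show the error class):

```python
import torch

a = torch.randn(2, 3)          # created on the CPU
b = torch.randn(3, 2).cuda()   # lives on cuda:0
a @ b                          # raises the same "Expected all tensors to be on the same device" RuntimeError
```

My full script: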

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings
import pathlib

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
torch.set_default_device('cuda')  # or 'cpu'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch_device = 'cuda'  # auto, cpu

model_name = 'BAAI/Bunny-v1_0-3B'  # or 'BAAI/Bunny-v1_0-2B-zh'

# create model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map=torch_device,
    trust_remote_code=True)
model.to(device)

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True)

# text prompt
prompt = 'What happened in the image?'
text = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n{prompt} ASSISTANT:"
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]

input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)
# other variants I tried:
# input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).to(torch_device).unsqueeze(0)
# input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=model.dtype, device=torch_device).unsqueeze(0)
# input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=model.dtype, device=torch_device).to(torch_device).unsqueeze(0)

# local image
file = pathlib.Path('C:/Users/Admin/Utils/Bunny-AI/slippery-person.jpeg')
image = Image.open(file)
image_tensor = model.process_images([image], model.config)

# generate
output_ids = model.generate(
    input_ids,
    # images=image_tensor,
    images=image_tensor.unsqueeze(0).to(dtype=model.dtype, device='cuda', non_blocking=True),
    max_new_tokens=100,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```

Isaachhh commented 4 months ago

Please add the code below just before `output_ids = model.generate(`:

```python
model.get_vision_tower().to('cuda')
input_ids = input_ids.to('cuda')
```
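
Applied to your script, the end would then look roughly like this (only the two added lines change, everything else stays as you have it):

```python
# make sure the vision tower and the prompt ids sit on the same device as the model
model.get_vision_tower().to('cuda')
input_ids = input_ids.to('cuda')

output_ids = model.generate(
    input_ids,
    images=image_tensor.unsqueeze(0).to(dtype=model.dtype, device='cuda', non_blocking=True),
    max_new_tokens=100,
    use_cache=True)[0]
```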

albert-haam commented 4 months ago

Thank you.

Do you know how much GPU memory is needed to run the sample?

The A2000 has 8GB, and I'm unable to run the sample on it; it fails with "CUDA out of memory".

RussRobin commented 4 months ago

Hi @albert-haam, thank you for your interest in Bunny.

8G isn't enough for the scripts in quick start. For quick start and CLI inference, we've tested on our device and it occupies about 9G. To reduce GPU memory consumption, please try quantizing the model. It's currently supported in the CLI, where you can set --load-8bit in bunny/serve/cli.py.
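
If you prefer to stay with the Python quick-start snippet instead of the CLI, a rough sketch of the same idea is to let transformers load the weights in 8-bit (this is a sketch rather than something covered in the quick start, and it needs the bitsandbytes package installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = 'BAAI/Bunny-v1_0-3B'

# 8-bit weights roughly halve GPU memory relative to float16
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map='auto',
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
```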

Feel free to comment on this issue if anything is still unclear.

Regards,
Russell
BAAI

RussRobin commented 4 months ago

I'll close this issue since we have provided some detailed info regarding GPU memory in Bunny. Reopen it if anything is still unclear.