import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings
import numpy as np
# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')
# set device
device = 'cuda' # or cpu
model_name = 'RussRobin/SpatialBot-3B'
offset_bos = 0
# create model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16, # float32 for cpu
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True)
# text prompt
prompt = 'What is the depth value of point <0.5,0.2>? Answer directly from depth map.'
text = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image 1>\n<image 2>\n{prompt} ASSISTANT:"
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image 1>\n<image 2>\n')]
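# -201 and -202 stand in for the <image 1> and <image 2> placeholders split out of the prompt above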
input_ids = torch.tensor(text_chunks[0] + [-201] + [-202] + text_chunks[1][offset_bos:], dtype=torch.long).unsqueeze(0).to(device)
image1 = Image.open('rgb.jpg')
image2 = Image.open('depth.png')
channels = len(image2.getbands())
if channels == 1:
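    # pack the single-channel (presumably 16-bit) depth map into three uint8 channels so it can be handled like an RGB image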
    img = np.array(image2)
    height, width = img.shape
    three_channel_array = np.zeros((height, width, 3), dtype=np.uint8)
    three_channel_array[:, :, 0] = (img // 1024) * 4
    three_channel_array[:, :, 1] = (img // 32) * 8
    three_channel_array[:, :, 2] = (img % 32) * 8
    image2 = Image.fromarray(three_channel_array, 'RGB')
image_tensor = model.process_images([image1,image2], model.config).to(dtype=model.dtype, device=device)
# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=True,
    repetition_penalty=1.0 # increase this to avoid chattering
)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
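As an aside on the depth branch: the block above packs the single-channel depth map into three uint8 channels (high bits, middle bits, low bits). A minimal sketch of the inverse mapping, where the helper name decode_depth and the assumption of 16-bit depth values are mine rather than part of the QuickStart:

import numpy as np

def decode_depth(packed: np.ndarray) -> np.ndarray:
    # undo the per-channel scaling, then recombine the bit groups:
    # channel 0 carried bits 10-15, channel 1 bits 5-9, channel 2 bits 0-4
    c0 = packed[:, :, 0].astype(np.uint32) // 4
    c1 = packed[:, :, 1].astype(np.uint32) // 8
    c2 = packed[:, :, 2].astype(np.uint32) // 8
    return c0 * 1024 + c1 * 32 + c2

# usage: decode_depth(np.array(image2)) should reproduce the original depth values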
Hi, I also encountered the same problem. You can add model.model.vision_tower = model.model.vision_tower.to(device) after loading the model.
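Applied to the QuickStart above, that workaround would sit right after the model is created; roughly like this (a placement sketch reusing the script's own variable names):

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16, # float32 for cpu
    device_map='auto',
    trust_remote_code=True)
# move the vision tower onto the same device as the rest of the model
model.model.vision_tower = model.model.vision_tower.to(device)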
Thank you @chopinchenx for your interest in our work, and thanks @Yuxin916 for your answer!
Can you locate where the error is thrown? Let me double-check the code and update the QuickStart.
The error appears when running the QuickStart script. Although model = AutoModelForCausalLM.from_pretrained(...) loads the model onto the device, it seems that in the modeling_bunny_phi.py file the vision_tower does not get the self.device entry, so every other module ends up on cuda except this vision tower. A temporary workaround is therefore to add model.model.vision_tower = model.model.vision_tower.to(device) right after loading the model. Hopefully there is a better long-term fix by modifying the modeling_bunny_phi.py file.
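A quick way to confirm this (a sketch using only standard PyTorch calls; the devices in the comments are what the description above implies, not captured output):

# check where the weights actually ended up after from_pretrained
print(next(model.model.vision_tower.parameters()).device)  # cpu before the workaround
print(next(model.parameters()).device)                     # cuda:0 for the rest
# more than one distinct device here explains the RuntimeError
print({p.device for p in model.parameters()})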
Best Regards
Thanks for your debugging! In fact we can reference the vision tower from the model and move it to cuda. I wonder if this works for you:
# add this line to the QuickStart script before generation
model.get_vision_tower().to('cuda')
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=True,
    repetition_penalty=1.0 # increase this to avoid chattering
)[0]
This will solve the problem. Thank you very much!
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
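For reference, this is the generic message PyTorch raises whenever a matrix multiply receives operands on different devices; a standalone illustration (unrelated to SpatialBot, assuming a CUDA-capable machine):

import torch

a = torch.randn(2, 3)             # left on the cpu
b = torch.randn(3, 4).to('cuda')  # moved to cuda:0
torch.mm(a, b)  # RuntimeError: Expected all tensors to be on the same device ...

In the QuickStart, the mismatch reportedly comes from the vision tower still sitting on the cpu, which is why moving it to cuda resolves the error.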