BAAI-DCAI / SpatialBot

The official repo for "SpatialBot: Precise Spatial Understanding with Vision Language Models".

Running quickstart without depth image #14

Closed · zwenyu closed this 1 month ago

zwenyu commented 1 month ago

Thank you for the interesting work! What is the correct way to run inference with only an RGB image, following the code provided under Quickstart? Using model.process_images([image1], model.config).to(dtype=model.dtype, device=device) returns IndexError: list index out of range, so it appears two images are expected.

RussRobin commented 1 month ago

Hi @zwenyu Thank you for your interest in our work.

You may want to modify

text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image 1>\n<image 2>\n')]
input_ids = torch.tensor(text_chunks[0] + [-201] + [-202] + text_chunks[1][offset_bos:], dtype=torch.long).unsqueeze(0).to(device)

to

text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image 1>\n')]
input_ids = torch.tensor(text_chunks[0] + [-201] + text_chunks[1][offset_bos:], dtype=torch.long).unsqueeze(0).to(device)

We use [-201] and [-202] to represent the two images in the input text tokens. Hope this makes sense to you.
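
Putting it together, a minimal RGB-only version of the Quickstart might look like the sketch below. It assumes model, tokenizer, device, offset_bos, and the PIL image image1 are already set up as in the Quickstart script, and that the prompt template and generate arguments match that script; the question string is just an illustrative example. Only the placeholder handling and the image list change from the two-image version.

import torch

# Hypothetical question; keep the Quickstart's prompt template, but with a
# single image placeholder instead of two.
prompt = "What objects are on the table?"
text = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    f"USER: <image 1>\n{prompt} ASSISTANT:"
)

# Split on the single placeholder and splice in the RGB image token id (-201).
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image 1>\n')]
input_ids = torch.tensor(
    text_chunks[0] + [-201] + text_chunks[1][offset_bos:],
    dtype=torch.long,
).unsqueeze(0).to(device)

# Only one image is passed to process_images now.
image_tensor = model.process_images([image1], model.config).to(
    dtype=model.dtype, device=device
)

# Generation call as in the Quickstart (argument names assumed from that script).
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=True,
)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())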

Regards

zwenyu commented 1 month ago

I get it now. Thanks!