Hi! I would like to ask a few questions about the visual encoder part.
How does the SpatialBot model load the SigLIP pre-trained model? I downloaded the `siglip-so400m-patch14-384` model from Hugging Face as well and changed `mm_vision_tower` in `config.json` to the path of that model folder. However, the output says:

```
Some weights of the model checkpoint at /xxx/spatial_bot/SpatialBot-3B were not used when initializing BunnyPhiForCausalLM: ['model.vision_tower.vision_tower.vision_model.embeddings.patch_embedding.bias' ....
```

and the list covers all `vision_model` layers. Any insight into that?
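For what it's worth, my guess is that this warning comes from a key-prefix mismatch: the checkpoint stores the SigLIP weights under a wrapper prefix like `model.vision_tower.vision_tower.`, while loading the tower from a local folder re-initializes it separately, so those checkpoint keys go unmatched. A tiny illustration of the kind of remapping involved (the function and variable names here are mine, not the actual Bunny loading code):

```python
# Hypothetical illustration: checkpoint keys carry a wrapper prefix that a
# standalone SigLIP vision model does not expect, so load_state_dict reports
# them as "not used" unless the prefix is stripped first.
ckpt_keys = [
    "model.vision_tower.vision_tower.vision_model.embeddings.patch_embedding.bias",
    "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.weight",
]

PREFIX = "model.vision_tower.vision_tower."

def strip_prefix(keys, prefix):
    """Map wrapped checkpoint keys to the names a bare vision model expects."""
    return [k[len(prefix):] for k in keys if k.startswith(prefix)]

bare_keys = strip_prefix(ckpt_keys, PREFIX)
# bare_keys[0] == "vision_model.embeddings.patch_embedding.bias"
```

Does the repo's loading code do something like this internally, or is the warning expected and harmless?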
I also tried to get the output of the vision tower to inspect it further, because I suspect my loading of SigLIP is incorrect. What I do is `model.model.vision_tower(image_tensor)`. Is that correct?
I noticed how the quickstart preprocesses the one-channel depth:

```python
three_channel_array[:, :, 0] = (img // 1024) * 4
three_channel_array[:, :, 1] = (img // 32) * 8
three_channel_array[:, :, 2] = (img % 32) * 8
```

What is the purpose of this? Should I also do it for my own one-channel depth? The output visualization looks quite weird as a depth image.
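For reference, my understanding of this snippet (assuming it is meant as a lossless packing of 16-bit depth, e.g. millimetres, into three 8-bit channels; the function names below are mine) is that it splits the depth value into a high, middle, and low chunk, which is why it does not look like a normal depth image when visualized:

```python
import numpy as np

def encode_depth(img):
    """Pack a 16-bit depth map into three 8-bit channels (my reading of the quickstart)."""
    h, w = img.shape
    out = np.zeros((h, w, 3), dtype=np.uint8)
    out[:, :, 0] = (img // 1024) * 4         # high 6 bits, scaled into 0..252
    # The quickstart writes (img // 32) * 8; assigning into uint8 wraps mod 256,
    # which is equivalent to taking the middle 5 bits explicitly:
    out[:, :, 1] = ((img // 32) % 32) * 8    # middle 5 bits
    out[:, :, 2] = (img % 32) * 8            # low 5 bits
    return out

def decode_depth(enc):
    """Invert the packing to recover the original depth values."""
    return (enc[:, :, 0].astype(np.uint16) // 4) * 1024 \
         + (enc[:, :, 1].astype(np.uint16) // 8) * 32 \
         + (enc[:, :, 2].astype(np.uint16) // 8)
```

If that reading is right, the round trip `decode_depth(encode_depth(img))` should reproduce the input exactly, so the "weird" colors would just be the bit layout, not a bug. Is that the intent, and should I apply the same packing to my own one-channel depth?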
Looking forward to your reply.
Best regards