Hi! I would like to ask a few questions about the visual encoder part.
How does the SpatialBot model load the SigLIP pre-trained model? I downloaded the `siglip-so400m-patch14-384` model from Hugging Face as well and changed `mm_vision_tower` in `config.json` to the path of that model folder. However, the output says:

```
Some weights of the model checkpoint at /xxx/spatial_bot/SpatialBot-3B were not used when initializing BunnyPhiForCausalLM: ['model.vision_tower.vision_tower.vision_model.embeddings.patch_embedding.bias' ....
```

and the list covers all `vision_model` layers. Any insight into that?
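For what it's worth, my guess is that this warning comes from a key-prefix mismatch: the checkpoint stores the SigLIP weights under a wrapper prefix like `model.vision_tower.vision_tower.`, while loading the tower from a local folder re-initializes it separately, so those checkpoint keys go unmatched. A tiny illustration of the kind of remapping involved (the function and variable names here are mine, not the actual Bunny loading code):

```python
# Hypothetical illustration: checkpoint keys carry a wrapper prefix that a
# standalone SigLIP vision model does not expect, so load_state_dict reports
# them as "not used" unless the prefix is stripped first.
ckpt_keys = [
    "model.vision_tower.vision_tower.vision_model.embeddings.patch_embedding.bias",
    "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.weight",
]

PREFIX = "model.vision_tower.vision_tower."

def strip_prefix(keys, prefix):
    """Map wrapped checkpoint keys to the names a bare vision model expects."""
    return [k[len(prefix):] for k in keys if k.startswith(prefix)]

bare_keys = strip_prefix(ckpt_keys, PREFIX)
# bare_keys[0] == "vision_model.embeddings.patch_embedding.bias"
```

Does the repo's loading code do something like this internally, or is the warning expected and harmless?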
I also tried to get the output of the vision tower to inspect it further, because I suspect my loading of SigLIP is incorrect. What I do is `model.model.vision_tower(image_tensor)`. Is that correct?
I noticed how the quickstart preprocesses the one-channel depth:

```python
three_channel_array[:, :, 0] = (img // 1024) * 4
three_channel_array[:, :, 1] = (img // 32) * 8
three_channel_array[:, :, 2] = (img % 32) * 8
```

What is the purpose of this? Should I also do it for my own one-channel depth? The output visualization looks quite weird as a depth image.
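For reference, my understanding of this snippet (assuming it is meant as a lossless packing of 16-bit depth, e.g. millimetres, into three 8-bit channels; the function names below are mine) is that it splits the depth value into a high, middle, and low chunk, which is why it does not look like a normal depth image when visualized:

```python
import numpy as np

def encode_depth(img):
    """Pack a 16-bit depth map into three 8-bit channels (my reading of the quickstart)."""
    h, w = img.shape
    out = np.zeros((h, w, 3), dtype=np.uint8)
    out[:, :, 0] = (img // 1024) * 4         # high 6 bits, scaled into 0..252
    # The quickstart writes (img // 32) * 8; assigning into uint8 wraps mod 256,
    # which is equivalent to taking the middle 5 bits explicitly:
    out[:, :, 1] = ((img // 32) % 32) * 8    # middle 5 bits
    out[:, :, 2] = (img % 32) * 8            # low 5 bits
    return out

def decode_depth(enc):
    """Invert the packing to recover the original depth values."""
    return (enc[:, :, 0].astype(np.uint16) // 4) * 1024 \
         + (enc[:, :, 1].astype(np.uint16) // 8) * 32 \
         + (enc[:, :, 2].astype(np.uint16) // 8)
```

If that reading is right, the round trip `decode_depth(encode_depth(img))` should reproduce the input exactly, so the "weird" colors would just be the bit layout, not a bug. Is that the intent, and should I apply the same packing to my own one-channel depth?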
Looking forward to your reply.
Best regards