Closed Yuxin916 closed 2 months ago
Hi @Yuxin916
What is the model and prompt you use?
Regards
Hi! The input RGBD tensor looks like below:
The prompts I tried are:
The model I am using is SpatialBot-3B, which I downloaded from Hugging Face.
The RGBD input and prompts are updated at every time step. The two prompts above produce irrelevant answers from time to time (roughly 3 to 4 out of every 10 inferences).
Do you have any suggestions for prompting, such as the QA questions you used in training?
Best Regards
Can you share how you use model.generate? e.g. The temperature…
Hi! This is how I use it.
I modified it to do batch inference, and it outputs the same results as single inference in a for loop. The problem persists with for-loop generation as well.
I tried temperature=0.8 and 0.9, but that did not solve the problem. Do you have any suggestions for other parameters?
Best regards
There’s no specific format for spatial reasoning in training. The issue seems to be under-training, as the model outputs something that looks like the pretraining data. But this is not always seen with 3B models. I’m trying to reproduce the error now.
BTW, is the depth map uint16?
Also, please try temperature=0.
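To illustrate why lowering the temperature can help here, below is a minimal sketch of how temperature reshapes next-token probabilities. The logits are made-up values, not taken from SpatialBot, and `softmax_with_temperature` is a hypothetical helper written only for this illustration. In Hugging Face transformers, greedy decoding is usually selected by passing `do_sample=False` to `generate`; whether the SpatialBot code path accepts a literal `temperature=0` is not confirmed here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for `logits` at the given temperature."""
    if temperature <= 0:
        # The temperature -> 0 limit is greedy decoding: all mass on the argmax.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits: one on-topic token and two off-topic tokens.
logits = [2.0, 1.0, 0.5]
for t in (0.9, 0.8, 0.2, 0.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.8 or 0.9 the off-topic tokens keep a noticeable share of the probability mass, so occasional irrelevant continuations are expected; as the temperature approaches 0 the most likely token is always chosen.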
Have you modified the code for model input pre-processing in data_utils yourself?
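For the depth map question above, here is a rough sketch of a sanity check that can be run before the depth map reaches the model. It assumes the depth is expected as unsigned 16-bit integers in millimeters; the millimeter unit and both helper names are assumptions made for this illustration, not something confirmed in this thread, and the code is independent of data_utils.

```python
import math

UINT16_MAX = 65535


def is_uint16_depth(depth):
    """True if every value in a 2-D list is an int in the uint16 range."""
    return all(
        isinstance(v, int) and 0 <= v <= UINT16_MAX
        for row in depth
        for v in row
    )


def depth_m_to_uint16_mm(depth_m):
    """Convert depths in meters (assumed unit) to integer millimeters.

    Invalid readings (NaN, infinity, negative) become 0, and values beyond
    the uint16 range are clipped to 65535.
    """
    out = []
    for row in depth_m:
        out_row = []
        for d in row:
            if not math.isfinite(d) or d < 0:
                mm = 0
            else:
                mm = min(int(round(d * 1000.0)), UINT16_MAX)
            out_row.append(mm)
        out.append(out_row)
    return out


# Hypothetical 2x2 depth map in meters from a simulator or sensor.
depth_m = [[0.5, 70.0], [-1.0, float("nan")]]
depth_mm = depth_m_to_uint16_mm(depth_m)
print(depth_mm, is_uint16_depth(depth_mm))
```

A float depth map in meters passed where uint16 millimeters are expected would silently produce a nearly all-zero input, so a check like this is cheap to run at every time step.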
USER: Find the door. If the object is not visible or the images provided are not clear, list objects that goal object is typically found near in home scenes.
ASSISTANT: The door is white and located on the right side of the image.
USER: Ignore wall, floor, ceiling, and window. List all objects detected and describe the spatial relationship among the objects?
ASSISTANT: The objects detected are a television, a dresser, a door, a wall, and a cabinet. The television is on the left side of the image, the dresser is in the foreground, and the door is in the background. The cabinet is on the right side of the image, and the wall is behind the dresser and the door.
Hi! Thank you so much for your help! It is very helpful.
Following your reply, I double-checked the depth image preprocessing. Unfortunately, the issue still persists. I then did some manual prompt engineering and inspected the outcome at every time step. It looks like the model is very sensitive to the prompt. The problem is now solved for a specific prompt I designed. I understand that this problem comes from my task having shifted away from how your original model was trained; it is not an issue with your model. I will look into fine-tuning your model to better fit my task in the future.
Again, thank you so much for your help!
Best regards
Cool! I’m happy to help if you are willing to disclose more failure cases. If the model always generates irrelevant words, please let me know.
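One possible way to collect failure cases like the ones requested above is to flag answers that never mention the queried object. The keyword check is a crude heuristic assumed for this sketch, and `is_off_topic` and `collect_failures` are hypothetical helpers; the two sample answers are taken from this thread.

```python
import json


def is_off_topic(answer, keywords):
    """Flag an answer that mentions none of the expected keywords."""
    text = answer.lower()
    return not any(k.lower() in text for k in keywords)


def collect_failures(records):
    """Return the records whose answer looks unrelated to the prompt."""
    return [r for r in records if is_off_topic(r["answer"], r["keywords"])]


records = [
    {
        "prompt": "Find the door.",
        "answer": "The door is white and located on the right side of the image.",
        "keywords": ["door"],
    },
    {
        "prompt": "Find the door.",
        "answer": "Paragraph: The following is a list of the top 10 most "
                  "popular songs of all time, according to Billboard magazine.",
        "keywords": ["door"],
    },
]

failures = collect_failures(records)
print(f"{len(failures)}/{len(records)} answers flagged")
for r in failures:
    # One JSON object per line, easy to paste into an issue comment.
    print(json.dumps({"prompt": r["prompt"], "answer": r["answer"]}))
```

Logging the flagged prompt and answer pairs at each time step also gives a rough failure rate to compare across prompts and decoding settings.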
For fine-tuning on a specific task: prepare some QAs for that task, and try to make the question and answer formats diverse (maybe prompt GPT to get some templates). You can find the checkpoint of SpatialBot-3B in our README.
Hope it helps.
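The template idea above could be sketched roughly as follows. The templates, annotation fields, file names, and output record layout are all invented for illustration; the actual fine-tuning data format expected by SpatialBot should be taken from its README, not from this snippet.

```python
import json
import random

# Hypothetical templates; in practice these could be drafted with GPT.
QUESTION_TEMPLATES = [
    "Where is the {obj}?",
    "Find the {obj} in the image.",
    "Can you see a {obj}? If so, where is it?",
    "Describe the location of the {obj}.",
]
ANSWER_TEMPLATES = [
    "The {obj} is {loc}.",
    "I can see the {obj} {loc}.",
    "There is a {obj} {loc}.",
]


def make_qa_pairs(annotations, n_per_item=2, seed=0):
    """Build varied QA pairs by sampling a question and answer template."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    pairs = []
    for ann in annotations:
        for _ in range(n_per_item):
            question = rng.choice(QUESTION_TEMPLATES).format(obj=ann["object"])
            answer = rng.choice(ANSWER_TEMPLATES).format(
                obj=ann["object"], loc=ann["location"]
            )
            pairs.append(
                {"image": ann["image"], "question": question, "answer": answer}
            )
    return pairs


# Hypothetical task-specific annotations.
annotations = [
    {"image": "frame_000.png", "object": "door",
     "location": "on the right side of the image"},
    {"image": "frame_001.png", "object": "television",
     "location": "on the left side of the image"},
]

qa_pairs = make_qa_pairs(annotations)
print(json.dumps(qa_pairs, indent=2))
```

Sampling question and answer templates independently multiplies the number of distinct phrasings, which is the kind of format diversity suggested above.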
Hi! Hope you are doing well.
I have encountered a strange issue when using the VLM for inference. Sometimes it generates text sequences such as:
'Paragraph: The following is a list of the top 10 most popular songs of all time, according to Billboard magazine. The list is based on the number of weeks a song has spent on the Billboard Hot 100 chart.'
It reads like garbage. Any insight into that?
Thank you!