BAAI-DCAI / SpatialBot

The official repo for "SpatialBot: Precise Spatial Understanding with Vision Language Models".
MIT License
157 stars 10 forks

Unstable output #8

Closed. Yuxin916 closed this issue 2 months ago

Yuxin916 commented 2 months ago

Hi! Hope you are doing well.

I have encountered a strange issue when using the VLM for inference. Sometimes it generates text sequences such as: 'Paragraph: The following is a list of the top 10 most popular songs of all time, according to Billboard magazine. The list is based on the number of weeks a song has spent on the Billboard Hot 100 chart.', which is unrelated garbage text.

Any insight into this?

Thank you!

RussRobin commented 2 months ago

Hi @Yuxin916

Which model and prompt are you using?

Regards

Yuxin916 commented 2 months ago

Hi! The input RGB and depth images look like this: [image: rgb] [image: depth]

The prompts I tried are:

  1. "Find the {GOAL_NAME}. If the object is not visible or the images provided are not clear, list objects that goal object is typically found near in home scenes."
  2. "Ignore wall, floor, ceiling, and window. List all objects detected and describe the spatial relationship among the objects? "

The model I am using is SpatialBot-3B, which I downloaded from Hugging Face.

The RGBD input and the prompt are updated at every time step. The two prompts above produce irrelevant answers from time to time (roughly 3-4 out of every 10 inference runs).

Do you have any suggestions for prompting, such as the QA questions you used in training?

Best Regards

RussRobin commented 2 months ago

Can you share how you call model.generate? E.g. the temperature…

Yuxin916 commented 2 months ago

Hi! This is how I use it.

I modified it to run batch inference, and it outputs the same as single inference in a for loop. The problem persists with for-loop generation as well.

[image]

I tried temperature=0.8 and 0.9, but that did not solve the problem. Do you have any suggestions for other parameters?

Best regards

RussRobin commented 2 months ago

There’s no specific format for spatial reasoning questions in training. The issue looks like under-training, since the model outputs something resembling the pretraining data. But this is not always seen for 3B models. I’m trying to reproduce the error now.

BTW, is the depth map uint16?

RussRobin commented 2 months ago

Also, please try temperature=0.
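
Editor's note: a minimal sketch of what "temperature=0" usually means in practice, assuming the model is called through a Hugging Face transformers-style `generate`. There, deterministic output is normally requested with `do_sample=False` (greedy decoding), and recent versions may reject a temperature of exactly 0 when sampling is enabled. The helper names below are hypothetical; only the keyword names (`do_sample`, `temperature`, `num_beams`, `max_new_tokens`, `use_cache`) are standard generation arguments.

```python
def greedy_generate_kwargs(max_new_tokens=100):
    """Keyword arguments for deterministic (greedy) decoding."""
    return {
        "do_sample": False,  # always pick the most likely next token
        "num_beams": 1,      # plain greedy search, no beam search
        "max_new_tokens": max_new_tokens,
        "use_cache": True,
    }


def sampling_generate_kwargs(temperature, max_new_tokens=100):
    """Keyword arguments for sampling; temperature must be positive."""
    if temperature <= 0:
        raise ValueError(
            "temperature must be > 0 when sampling; use greedy decoding instead"
        )
    return {
        "do_sample": True,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_cache": True,
    }


# Hypothetical usage; `model`, `input_ids` and `image_tensor` are placeholders,
# and the `images=` argument is an assumption about the model's interface:
# output_ids = model.generate(input_ids, images=image_tensor,
#                             **greedy_generate_kwargs())
```

With greedy decoding, repeated runs on the same input should give the same text, which makes it easier to tell whether the irrelevant output comes from sampling noise or from the input itself.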

RussRobin commented 2 months ago

  1. I downloaded the depth image and found that it is a 4-channel (RGBA) uint8 image. The depth input should be single-channel uint16 if you follow our instructions to convert RGB to depth. If it is depth from sensors or ground-truth depth, please convert it to the required data format: millimeters, uint16, single channel.

Have you modified the model input pre-processing code in data_utils yourself?

  2. SpatialBot-3B responses I got (not batch inference):
    USER: Find the door.  If the object is not visible or the images provided are not clear, list objects that goal object is typically found near in home scenes.
    ASSISTANT:  The door is white and located on the right side of the image.
    USER: Ignore wall, floor, ceiling, and window. List all objects detected and describe the spatial relationship among the objects?
    ASSISTANT:  The objects detected are a television, a dresser, a door, a wall, and a cabinet. The television is on the left side of the image, the dresser is in the foreground, and the door is in the background. The cabinet is on the right side of the image, and the wall is behind the dresser and the door.
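
Editor's note: the depth format described in item 1 (millimeters, uint16, single channel) could be produced from a metric depth map roughly as follows. This is a sketch using NumPy, not the repository's own pre-processing; the function names and the `scale_to_mm` parameter are hypothetical. It deliberately rejects multi-channel input, because an RGBA uint8 visualization of depth generally cannot be turned back into metric depth.

```python
import numpy as np


def depth_to_uint16_mm(depth, scale_to_mm=1000.0):
    """Convert a metric depth map to single-channel uint16 millimeters.

    depth: array of shape (H, W) or (H, W, 1).
    scale_to_mm: factor from the input unit to millimeters
                 (1000.0 for meters, 1.0 if already in millimeters).
    """
    depth = np.asarray(depth)
    if depth.ndim == 3 and depth.shape[2] == 1:
        depth = depth[:, :, 0]  # drop a trailing singleton channel
    if depth.ndim != 2:
        raise ValueError(f"expected single-channel depth, got shape {depth.shape}")
    mm = depth.astype(np.float64) * scale_to_mm
    # Treat NaN / infinite readings as invalid and map them to 0.
    mm = np.nan_to_num(mm, nan=0.0, posinf=0.0, neginf=0.0)
    # Round, then clip to the representable uint16 range.
    mm = np.clip(np.rint(mm), 0, np.iinfo(np.uint16).max)
    return mm.astype(np.uint16)


def is_expected_depth_format(depth):
    """True if the array is 2-D (single channel) with dtype uint16."""
    depth = np.asarray(depth)
    return depth.ndim == 2 and depth.dtype == np.uint16
```

A quick `is_expected_depth_format(depth)` check right before the depth map is handed to the model is a cheap way to catch a 4-channel uint8 image like the one in this thread.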

Yuxin916 commented 2 months ago

Hi! Thank you so much for your help! It is very helpful.

I followed your reply and double-checked the depth image preprocessing. Unfortunately, the issue still persists. I then did some manual prompt engineering and inspected the output at every time step. The model appears to be very sensitive to the prompt. The problem is now solved for a specific prompt I designed. I understand that this happens because my task has shifted away from how your original model was trained; it is not an issue with your model. I will look into fine-tuning your model to better fit my task in the future.

Again, thank you so much for your help!

Best regards

RussRobin commented 2 months ago

Cool! I’m happy to help if you are willing to disclose more failure cases. If the model always generates irrelevant words, please let me know.

For fine-tuning on a specific task: prepare some QAs for that task, and try to make the question and answer formats diverse (for example, prompt GPT to get some templates). You can find the SpatialBot-3B checkpoint in our README.
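
Editor's note: one way to get diverse question and answer formats is to fill several phrasing templates from the same annotations. The sketch below is illustrative only: the templates, the annotation fields (`image`, `object`, `location`) and the output record layout are all hypothetical, and the records would still need converting to whatever format the training script expects.

```python
import random

# Hypothetical phrasing templates; in practice these could be written by
# hand or collected by prompting an LLM, as suggested above.
QUESTION_TEMPLATES = [
    "Where is the {obj}?",
    "Find the {obj} and describe its position in the image.",
    "Is there a {obj} in view? If so, where is it?",
]
ANSWER_TEMPLATES = [
    "The {obj} is {loc}.",
    "I can see the {obj} {loc}.",
]


def build_qa_records(annotations, seed=0):
    """Build one QA pair per annotation with randomly chosen phrasings."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    records = []
    for ann in annotations:
        question = rng.choice(QUESTION_TEMPLATES).format(obj=ann["object"])
        answer = rng.choice(ANSWER_TEMPLATES).format(
            obj=ann["object"], loc=ann["location"]
        )
        records.append(
            {"image": ann["image"], "question": question, "answer": answer}
        )
    return records


if __name__ == "__main__":
    demo = [
        {"image": "step_000.png", "object": "door",
         "location": "on the right side of the image"},
    ]
    for record in build_qa_records(demo):
        print(record)
```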

Hope it helps.