Really? Is the gap in results that large? Maybe try some other tasks.
TLDR: Sample the same prompt more times, and try an easier task such as pick-up-banana.
Hello, thanks for raising the point!
Zero-shot transfer to real-world scenarios remains a challenge (though not impossible) for the model trained on Bridge data, primarily due to the limited object diversity in the Bridge dataset, which mainly comprises toy kitchen objects.
In our experience, the model has effectively learned the "verbs" in the task. For instance, given a prompt like "put A on B", the model consistently synthesizes a gripper picking something up and moving it somewhere else; given "open C", it tries to grasp the handle of C and synthesizes feasible gripper movements in the subsequent frames. However, the model often grabs the wrong object, or moves the object to the wrong place, if the object's appearance differs from or is uncommon in the training data. This is also why we had to fine-tune the model for our real-world Franka Emika Panda Arm experiment: we needed to teach the model the appearance of the new objects in the new environment.
In both of the video synthesis results you shared, it looks like the model was trying to synthesize a gripper moving something toward the top of the pot, but it failed to grab the correct "lid". While this task is challenging for the video model, you should be able to reproduce our result with more tries (see the sketch below). You might also want to try the pick-up-banana task mentioned in the TLDR, which is much easier.
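Since sampling is stochastic, the quickest way to get "more tries" is to re-run the same image/prompt pair under a handful of seeds and keep the rollout that grabs the right object. Below is a rough sketch of that loop, assuming a PyTorch setup; `model.sample` is a hypothetical stand-in for whatever the provided inference script exposes, not the repo's actual API:

```python
# Minimal sketch of "more tries": re-sample the same prompt under several
# random seeds and keep the best rollout. `model.sample` is a hypothetical
# placeholder -- substitute the entry point exposed by the provided
# inference script.
import torch

def sample_with_retries(model, image, prompt, seeds=range(5)):
    """Return one synthesized video per seed so you can pick the best one."""
    videos = []
    for seed in seeds:
        torch.manual_seed(seed)  # only the sampling noise changes between tries
        video = model.sample(image=image, prompt=prompt)  # hypothetical API
        videos.append((seed, video))
    return videos
```

The same loop works for the pick-up-banana prompt, which should usually succeed within far fewer tries.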
Let me know if anything is still unclear!
I've downloaded the checkpoint for the Bridge dataset and used your provided script for inference. However, I'm struggling to reproduce the results shown on your website. Here's what I've got so far:
I obtained the image and tried it with two prompts:
I would appreciate any guidance on achieving results similar to those on your website. Any suggestions or insights?