flow-diffusion / AVDC

Official repository of Learning to Act from Actionless Videos through Dense Correspondences.
https://flow-diffusion.github.io/
MIT License

Issue Reproducing Results with Bridge Dataset Checkpoint #3

Closed XuweiyiChen closed 11 months ago

XuweiyiChen commented 12 months ago

I've downloaded the checkpoint for the bridge dataset and used your provided script for inference. However, I'm struggling to reproduce the results as shown on your website. Here's what I've got so far:

I obtained the image and tried two prompts:

  1. Used the prompt "Put lid on pot" and obtained this: IMG_2121_out51
  2. For the prompt "Task: Put lid on pot", I got this result: IMG_2121_out50

I would appreciate any guidance to achieve results similar to those on your website. Any suggestions or insights?

Bailey-24 commented 12 months ago

Really? Is the difference in results really that big? Try some other tasks.

kbkbowo commented 11 months ago

TLDR:

  1. You can use the prompt "Put lid on pot" or "put lid on pot" (upper/lower case does not matter)
  2. Try more times (probably a few dozen samples, since this task is challenging)
  3. Also try the task "pick up banana", which is much easier. The first-frame image (IMG_2115) and some synthesis results (IMG_21245, IMG_21244, IMG_21243, IMG_21242, IMG_21241) are shown below.

Hello, thanks for raising the point!

Zero-shot transfer to real-world scenarios remains a challenge (though not impossible) for the model trained on Bridge data, primarily due to the limited object diversity in the Bridge dataset, which mainly comprises toy kitchen objects.

In our experience, the model has effectively understood the "verbs" in the task. For instance, with a prompt like "put A on B", the model consistently synthesizes a gripper picking something up and moving it somewhere else; with "open C", it tries to grasp the handle of C and synthesizes feasible gripper movements in the subsequent frames. However, the model often grabs the wrong object or moves the object to the wrong place if the object's appearance differs from, or is uncommon in, the training data. This is also why we had to fine-tune the model for our real-world Franka Emika Panda Arm experiment: we needed to teach the model the appearance of the new objects in the new environment.
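
As a rough illustration, that fine-tuning step is conceptually just continued training of the pretrained checkpoint on a small set of demonstrations that contain the new objects. The sketch below assumes a generic PyTorch setup; the dataset class, model wrapper, file names, and hyper-parameters are illustrative placeholders, not this repo's actual code:

```python
import torch
from torch.utils.data import DataLoader

# Placeholders -- substitute the repo's real dataset and diffusion-model classes.
from finetune_utils import NewObjectVideoDataset, load_pretrained_diffusion  # hypothetical

model = load_pretrained_diffusion("bridge_checkpoint.pt").cuda()  # start from the Bridge weights
dataset = NewObjectVideoDataset("franka_demos/")                  # short videos showing the new objects
loader = DataLoader(dataset, batch_size=4, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)        # small learning rate for fine-tuning

model.train()
for epoch in range(10):
    for first_frame, video, text in loader:
        # Hypothetical forward pass: the wrapper returns the diffusion training loss
        # for synthesizing `video` conditioned on `first_frame` and the text prompt.
        loss = model(video.cuda(), first_frame.cuda(), text)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```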

In both of the video synthesis results you have shown, it looks like the model was trying to synthesize a gripper moving something toward the top of the pot but failed to grab the correct "lid". While this task is challenging for the video model, you should be able to reproduce our result with more tries. You might also want to try the pick-up-banana task mentioned in the TLDR, which is much easier.
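
For what "more tries" could look like in practice, here is a minimal sketch of re-sampling with different seeds and keeping every rollout for inspection. The `load_model`, `sample_video`, and `save_gif` helpers and the file names are placeholders for illustration, not this repo's actual API:

```python
import torch
from PIL import Image
from torchvision import transforms

# Placeholders -- substitute the repo's real loading / sampling utilities.
from inference_utils import load_model, sample_video, save_gif  # hypothetical

model = load_model("bridge_checkpoint.pt", device="cuda")        # hypothetical loader
first_frame = transforms.ToTensor()(Image.open("IMG_2115.png"))  # conditioning first frame
prompt = "pick up banana"                                        # upper/lower case does not matter

# Sampling is stochastic: draw a few dozen rollouts with different seeds
# and inspect which ones actually grab the right object.
for seed in range(24):
    torch.manual_seed(seed)
    video = sample_video(model, first_frame, prompt)             # hypothetical signature
    save_gif(video, f"pick_up_banana_seed{seed:02d}.gif")
```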

Let me know if anything is still unclear!

kbkbowo commented 11 months ago

Hello, we have released an update to this repo here, which might also help with reproducing the results.

I'm closing this issue; feel free to re-open it if you run into any other problems.