flow-diffusion / AVDC

Official repository of Learning to Act from Actionless Videos through Dense Correspondences.
https://flow-diffusion.github.io/
MIT License

Issue reproducing demo results #4

Closed: FelixHegg closed this issue 11 months ago

FelixHegg commented 11 months ago

Thanks for the work, the idea seems great. But we have trouble reproducing your results, even for the demo examples you provided:

I downloaded the model with ./download.sh metaworld. Then I ran it with python train_mw.py --mode inference -c 24 -p ../examples/assembly.png -t assembly. The results look bad: the robot base moves, the arm moves very fast in a single frame, and the grasped object changes size. (image: assembly_out)

Another issue mentioned a problem with reproducibility as well, and you told the reporter to try the banana example. So I downloaded the bridge model with ./download.sh bridge and ran python train_bridge.py --mode inference -c 42 -p ../examples/banana.jpg -t "pick up banana". Again the results look very disappointing: the manipulator doesn't even touch the banana. (image: banana_out)

I really like the new approach, but these are the demo examples, and even they don't seem to work at all. What could be the problem? I hope the mistake is on my side, but since there were only a few steps, I don't see where I could have gone wrong.

Please let me know if you need any further information to debug this problem.

kbkbowo commented 11 months ago

TL;DR: We found some issues on our side and fixed them (along with some other minor ones) in this commit for better reproducibility. Thanks for raising the issue!


Hello! Thanks for reporting the issue.

Since multiple issues about inference quality have been raised recently, I looked into the code and ran a thorough test to see whether there is any problem on our side. (The code in this repo was cleaned up from the code used in our experiments, so there may be minor issues we have not noticed.)

We found that we had not been using the EMA model for sampling, which we should have. This likely led to unstable inference results like the ones you and the other issue describe (e.g., why the banana task did not seem to work). Also, the Meta-World video model requires a center-cropped image as input (specifically, a 128x128 crop taken from the center of a 320x240 image). We fixed these issues, along with some others, in a commit for better reproducibility.
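For reference, here is a minimal sketch of the center-crop step described above, using PIL. The repo's own preprocessing may differ in details; this is only meant to illustrate taking a 128x128 patch from the center of a 320x240 frame.

```python
# Illustrative only: crop a 128x128 patch from the center of a 320x240
# Meta-World conditioning image before passing it to the video model.
from PIL import Image

def center_crop(path, size=128):
    img = Image.open(path)  # e.g. a 320x240 frame such as ../examples/assembly.png
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))

center_crop("../examples/assembly.png").save("assembly_cropped.png")
```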

The demo examples should work as expected right now. Note that the diffusion model can still sometimes generate bad samples due to its probabilistic nature. We recommend running 5~10 times on each task to get a better understanding of the quality of synthesized videos.
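If it helps, one hypothetical way to script the repeated runs is to loop over the same inference command from above (how the script names or overwrites its outputs is up to the script itself, so treat this only as a sketch):

```python
# Sketch: rerun the Meta-World demo several times to get a feel for sample
# quality, since individual diffusion samples can vary between runs.
import subprocess

cmd = [
    "python", "train_mw.py", "--mode", "inference",
    "-c", "24", "-p", "../examples/assembly.png", "-t", "assembly",
]

for i in range(5):
    print(f"run {i + 1} of 5")
    subprocess.run(cmd, check=True)
```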

Sorry for the inconvenience, and thank you again for reporting the issue. Feel free to let me know if any other problems remain.

FelixHegg commented 11 months ago

Thank you for the quick answer. The quality is now significantly better. The assembly now looks like this: (image: assembly_out)

And the banana looks like this: (image: banana_out)