Hi, for generating videos with SVD, I recommend a motion bucket between 5 and 10. You can use either the SVD-XT or the original checkpoint. I also recommend batch-generating 10 samples each time and picking the best one. A closer-up, front-facing viewpoint typically helps as well.
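For concreteness, here is a minimal sketch of that setup using the diffusers `StableVideoDiffusionPipeline` (my own example, not this repo's script; the image path and the exact parameter values are assumptions):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# The "xt" variant; the original img2vid checkpoint also works.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Conditioning image; the path is a placeholder. 1024x576 (W x H) is the training resolution.
image = load_image("flower.png").resize((1024, 576))

# Batch-generate 10 samples with a small motion bucket, then pick the best by eye.
for i in range(10):
    generator = torch.manual_seed(i)
    frames = pipe(
        image,
        motion_bucket_id=7,      # a low value (5-10) favors object motion over camera motion
        noise_aug_strength=0.02,
        decode_chunk_size=8,     # decode latents in chunks to limit VRAM use
        generator=generator,
    ).frames[0]
    export_to_video(frames, f"sample_{i:02d}.mp4", fps=7)
```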
I think another important aspect you might be missing when prompting SVD is to use the same spatial resolution as in model training (576x1024). I tried different resolutions, and performance drops quickly when I deviate from it. I also investigated why: I believe the reason is that SVD does not use spatial positional encoding in its attention modules. The "effective positional information" therefore comes from the convolutional layers with padding, and it tells the model how far each patch is from the video's 2D borders. That information depends strongly on the resolution. I will release the video training code later.
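As a practical consequence, it is worth forcing the conditioning image to exactly 1024x576 before sampling. A small helper for that (my own sketch, not from this repo; the resize-then-center-crop strategy is just one reasonable way to avoid distortion):

```python
from PIL import Image

def to_svd_resolution(img: Image.Image, size=(1024, 576)) -> Image.Image:
    """Resize so the image covers the target box, then center-crop to it."""
    tw, th = size
    scale = max(tw / img.width, th / img.height)
    img = img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)
    left = (img.width - tw) // 2
    top = (img.height - th) // 2
    return img.crop((left, top, left + tw, top + th))

image = to_svd_resolution(Image.open("flower.png"))  # placeholder path
```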
Thank you for your reply. Looking forward to the training code.
I'm trying to generate the reference video myself with SVD, using the Stability video diffusion API. I use a small motion bucket (from 5 to 15) and different seeds, roughly as in the sketch after the attached video. I've spent a few bucks but have not yet gotten a generated video with an acceptable physics prior: most results have a static flower and a moving camera, rather than "mostly object motion and little camera motion". I wonder what the trick to generating the video is, and how many videos are generated before a good one is selected.
https://github.com/a1600012888/PhysDreamer/assets/97866915/44bcd44b-873b-4d12-b528-eb0db3cfa652
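For reference, my sweep looks roughly like this (a hedged sketch: it assumes the public Stability AI v2beta image-to-video endpoint and its documented fields; the API key, image path, and cfg_scale value are placeholders):

```python
import time
import requests

API_KEY = "sk-..."  # placeholder
HOST = "https://api.stability.ai/v2beta/image-to-video"

def generate(seed: int, motion_bucket_id: int, out_path: str) -> None:
    # Start an async generation job; the response carries a job id.
    resp = requests.post(
        HOST,
        headers={"authorization": f"Bearer {API_KEY}"},
        files={"image": open("flower_1024x576.png", "rb")},
        data={"seed": seed, "cfg_scale": 1.8, "motion_bucket_id": motion_bucket_id},
    )
    resp.raise_for_status()
    job_id = resp.json()["id"]

    # Poll for the result: 202 means still running, 200 returns the video bytes.
    while True:
        r = requests.get(
            f"{HOST}/result/{job_id}",
            headers={"authorization": f"Bearer {API_KEY}", "accept": "video/*"},
        )
        if r.status_code == 200:
            with open(out_path, "wb") as f:
                f.write(r.content)
            return
        if r.status_code != 202:
            r.raise_for_status()
        time.sleep(10)

# Sweep low motion buckets across a few seeds, then pick the best by eye.
for mb in (5, 10, 15):
    for seed in range(3):
        generate(seed, mb, f"svd_mb{mb}_seed{seed}.mp4")
```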