PKU-YuanGroup / Open-Sora-Plan

This project aims to reproduce Sora (OpenAI's T2V model); we hope the open source community will contribute to it.
Apache License 2.0

Options for frames, width and height? #295

Open SoftologyPro opened 1 month ago

SoftologyPro commented 1 month ago

The gradio_web_server.py locks the number of frames to 65 and the resolution to 512x512, i.e.

if __name__ == '__main__':
    args = type('args', (), {
        'ae': 'CausalVAEModel_4x8x8',
        'force_images': False,
        'model_path': 'LanguageBind/Open-Sora-Plan-v1.1.0',
        'text_encoder_name': 'DeepFloyd/t5-v1_1-xxl',
        'version': '65x512x512'
    })

The readme shows 221-frame examples. To make longer videos, do I just change the 65x512x512 to 221x512x512, or are other changes needed? Can I try other resolutions by changing the 512x512? I tried 'version': '65x640x384' and it gave an error when starting. Can gradio_web_server.py be modified to include options for frames, width and height? They can default to the current 65, 512, 512.
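For what it's worth, one way the hard-coded version string could be replaced with command-line options is sketched below. The argument names and the frames-height-width ordering of the version string are my own assumptions, not the repo's actual interface:

```python
# Hypothetical sketch: exposing frames/width/height as CLI options
# instead of a hard-coded 'version' string. Names are placeholders.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--num-frames', type=int, default=65)
parser.add_argument('--height', type=int, default=512)
parser.add_argument('--width', type=int, default=512)
# Pass [] here to use the defaults; drop the argument to read sys.argv.
args = parser.parse_args([])

# Recompose the version string the server expects (ordering assumed).
version = f"{args.num_frames}x{args.height}x{args.width}"
print(version)  # -> 65x512x512
```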

Why does 221x512x512 work and 222x512x512 fail?

LinB203 commented 1 month ago

Sorry for that. Just change the 65x512x512 to 221x512x512; updated here.

SoftologyPro commented 1 month ago

OK, thanks. But what are the valid values for frames, width and height?

221x512x512 works; 222x512x512 fails. 65x512x512 works; 65x640x384 fails.

LinB203 commented 1 month ago

Here, I can provide a detailed explanation for those who may need it in the future.

In version 1.1.0, we initially aimed to support arbitrary resolutions, so we enabled multi-scale training (--multi_scale in the stage 1 training script). However, because the dataset was limited, with almost a single aspect ratio, the stage 1 weights performed particularly poorly at arbitrary-resolution inference. In stage 2 we disabled multi-scale training, so 221x512x512 does not support specifying height or width.

To summarize: in version 1.1.0, stage 1 used 65-frame training and stage 2 used 221-frame training, and neither supports a custom frame count at inference. Only the stage 1 weights support specifying height or width.
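As a guess at why 221 works while 222 does not: the VAE name CausalVAEModel_4x8x8 suggests 4x temporal compression with a causal first frame and 8x spatial compression, which would make valid frame counts those of the form 4k + 1 (65 = 4*16 + 1, 221 = 4*55 + 1, but 222 is not). A small sketch under that assumption (the helper is mine, not the repo's):

```python
# Hypothetical check, assuming CausalVAEModel_4x8x8 means 4x temporal
# compression (causal first frame) and 8x spatial compression.
def is_valid_size(frames: int, height: int, width: int) -> bool:
    temporal_ok = (frames - 1) % 4 == 0          # frames must be 4k + 1
    spatial_ok = height % 8 == 0 and width % 8 == 0  # divisible by 8
    return temporal_ok and spatial_ok

for f in (65, 221, 222):
    print(f, is_valid_size(f, 512, 512))  # 65 True, 221 True, 222 False
```

Note that spatial divisibility would only be a necessary condition: 640x384 is divisible by 8 yet still fails in v1.1.0, consistent with the explanation above that the stage 2 weights were trained at a fixed 512x512.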

In the next version, we will support inference at mainstream resolutions and arbitrary durations.

SoftologyPro commented 1 month ago

OK, thank you. Looking forward to the next update. For now, I have added support for Open-Sora-Plan to Visions of Chaos.

Also, a minor issue: can you add more stats during generation? There are the initial 50/50 sampler stats, i.e.

88%| 44/50 [00:44<00:05,  1.19it/s]
90%| 45/50 [00:45<00:04,  1.19it/s]
92%| 46/50 [00:46<00:03,  1.19it/s]
94%| 47/50 [00:47<00:02,  1.19it/s]
96%| 48/50 [00:47<00:01,  1.19it/s]
98%| 49/50 [00:48<00:00,  1.19it/s]
100%| 50/50 [00:49<00:00,  1.19it/s]
100%| 50/50 [00:49<00:00,  1.01it/s]

but then there are no further stats for minutes until the movie appears in the UI.
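In case it helps, a minimal way to avoid the silent gap would be to print a status line around the slow post-sampling steps (VAE decode, video export). The function names below are placeholders, not the repo's actual API:

```python
# Hypothetical sketch: wrap slow post-sampling steps in a timed status
# message so the console is not silent after the 50/50 bar finishes.
import time

def with_status(label, fn, *args, **kwargs):
    t0 = time.time()
    print(f"{label}...", flush=True)
    out = fn(*args, **kwargs)
    print(f"{label} done in {time.time() - t0:.1f}s", flush=True)
    return out

# Usage in the server might look like (names assumed):
#   video = with_status("Decoding latents", vae.decode, latents)
#   with_status("Exporting video", export_to_video, video, "out.mp4")
result = with_status("Decoding latents", lambda: sum(range(1000)))
```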