PKU-YuanGroup / Open-Sora-Plan

This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to this project.
Apache License 2.0
10.89k stars 973 forks

curating high-quality video data #6

Open pengzhangzhi opened 4 months ago

pengzhangzhi commented 4 months ago

Hi team members, I would attribute Sora's success to its training data, much like OpenAI's approach with GPT. Any ideas on curating high-quality video data?

pengzhangzhi commented 4 months ago

A random thought: as a community, it would be great to have a Discord channel for discussions and updates.

bhack commented 4 months ago

You could also check https://laion.ai/blog/video2dataset/

aluo-x commented 4 months ago

I should be able to help with the high-quality video data aspect (along with transcripts), although the source of captions is a more difficult problem.

cxh0519 commented 4 months ago

Our preliminary goal is to achieve impressive results in a specific data domain to verify the effectiveness of our pipeline, and then to extend the plan toward generalization. Please keep an eye on our project, and feel free to contact us for discussion and potential cooperation.

sayakpaul commented 4 months ago

Is the domain of the data fixed? If so, do you have more public information available on that?

Regardless, I think having some sort of a data curation pipeline (like the one used in the Stable Video Diffusion paper) would be really nice.

cxh0519 commented 4 months ago

Agreed. After the preliminary validation, we will build a data curation pipeline modeled on successful projects and pay more attention to data.

sayakpaul commented 4 months ago

Cool, sounds like a plan. FWIW, I put together a simple repository that walks through the primary steps of the Stable Video Diffusion curation pipeline: https://github.com/sayakpaul/single-video-curation-svd.

cxh0519 commented 4 months ago

We will check it out. Thanks for your effort!

yuxuan-li92 commented 4 months ago

You could also check out data-juicer; it seems useful for video data curation.

bhack commented 3 months ago

https://arxiv.org/abs/2403.06098