Open pengzhangzhi opened 8 months ago
A random thought: as a community, it would be great to have a Discord channel for discussions and updates.
You could also check https://laion.ai/blog/video2dataset/
I should be able to help with the high-quality video data aspect (along with transcripts), although sourcing captions is a harder problem.
Our preliminary goal is to achieve impressive results in one specific data domain to verify the effectiveness of our pipeline, and then to extend the plan toward generalization. Please keep an eye on our project, and feel free to contact us for discussion and potential cooperation.
Is the domain of the data fixed? If so, do you have more public information available on that?
Regardless, I think having some sort of a data curation pipeline (like the one used in the Stable Video Diffusion paper) would be really nice.
Agreed. After the preliminary validation, we will build a data curation pipeline modeled on successful projects and pay more attention to data.
Cool, sounds like a plan. FWIW, I put together a simple repository that walks through the primary steps of the Stable Video Diffusion curation pipeline: https://github.com/sayakpaul/single-video-curation-svd.
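For readers following along, here is a rough sketch of the filtering stage of such a pipeline. It assumes each clip has already been split by cut detection and annotated with a motion score, an on-screen text fraction, and an aesthetic score. The `ClipStats` fields, the function names, and all threshold values are hypothetical placeholders chosen for illustration; they are not taken from the linked repository or from the paper.

```python
from dataclasses import dataclass


@dataclass
class ClipStats:
    """Per-clip annotations; field names are illustrative assumptions."""
    clip_id: str
    mean_optical_flow: float   # average motion magnitude; low means a near-static clip
    text_area_fraction: float  # fraction of the frame area covered by detected text
    aesthetic_score: float     # score from some aesthetic predictor


def keep_clip(stats, min_flow=2.0, max_text_area=0.07, min_aesthetic=4.5):
    """Return True if the clip passes every filter (thresholds are placeholders)."""
    if stats.mean_optical_flow < min_flow:
        return False  # drop near-static clips
    if stats.text_area_fraction > max_text_area:
        return False  # drop clips dominated by on-screen text
    return stats.aesthetic_score >= min_aesthetic


def curate(clips, **thresholds):
    """Keep only the clips that pass all filters, preserving input order."""
    return [c for c in clips if keep_clip(c, **thresholds)]
```

In practice the thresholds would need tuning on a held-out sample, and the annotation steps themselves (cut detection, optical flow, OCR, aesthetic scoring, captioning) would each rely on a separate model or tool.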
We will check it out. Thanks for your effort!
You could also check out data-juicer; it seems useful for video data curation.
Hi team members, I would attribute the success of Sora largely to its training data, much as with OpenAI's GPT models. Any ideas on curating high-quality video data?