Regarding the training dataset, may I ask how you collected the 35 million single-shot text-video pairs from the long public datasets? My initial understanding was that a label written for a long video may not be suitable for short video clips. Many thanks.