Hi, thanks for your great work! I noticed that some raw videos in your Hugging Face dataset are longer than the time span covered by the timestamps recorded in the JSON file. For example, an Ego4D video may last 60 seconds, while the timestamps cover only 12 seconds of captions.
Do we need to clip the videos based on the timestamps when training the model?
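To make the question concrete, here is a minimal sketch of the kind of clipping I have in mind. It assumes each JSON entry gives a start and end time in seconds; the function name `build_clip_command` and its parameters are hypothetical and not taken from your code. It only builds an ffmpeg command using the standard `-ss` (seek) and `-t` (duration) input options.

```python
from typing import List


def build_clip_command(src: str, dst: str, start_s: float, end_s: float) -> List[str]:
    """Build an ffmpeg command that keeps only [start_s, end_s] of src.

    Hypothetical helper: start_s and end_s are assumed to come from the
    timestamps in the annotation JSON, expressed in seconds.
    """
    if start_s < 0 or end_s <= start_s:
        raise ValueError("expected 0 <= start_s < end_s")
    duration_s = end_s - start_s
    return [
        "ffmpeg",
        "-y",                       # overwrite the output file if it exists
        "-ss", f"{start_s:.3f}",    # seek to the start of the captioned span
        "-t", f"{duration_s:.3f}",  # read only the captioned duration
        "-i", src,
        dst,                        # re-encodes with ffmpeg's default codecs
    ]
```

For the 60-second example above, `build_clip_command("raw.mp4", "clip.mp4", 0.0, 12.0)` would produce a command that keeps only the first 12 seconds; it could then be run with `subprocess.run(cmd, check=True)`. Is this the intended preprocessing, or should the full raw video be fed to the model?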