Data for training MineCLIP

yifan123 commented 1 year ago

Hi, Thanks for these amazing results and for releasing the code! Do you plan to also release the 640K video-language pairs for training MineCLIP you mention in the paper ?

yunfanjiang commented 1 year ago

Hey, please follow the instruction to download the YouTube data. Let me know if you have further questions. Thanks!

yifan123 commented 1 year ago

MINEDOJO: knowledge base: 730K+(~300K hours)

MINECLIP: 640k video-language pairs, 640K*16s=2.8K hours.

I know how to download the YouTube data (300K hours), but I can't find the filterd 640K data used to train MineCLIP. Where can I download it? Thanks~

wangguanzhi commented 1 year ago

Hi @yifan123! Thanks for your interest. Here is how we extracted 640K pairs. Feel free to use other heuristics to extract training pairs from 730K+ videos.

Collect a list of keywords corresponding to the supported entities, blocks, and items in Minecraft;
Perform string matching over our YouTube video transcripts to obtain 640K text segments;
For each matched transcript segment, randomly grow it to 16 ∼ 77 tokens (limited by CLIP’s context length);
Randomly sample a timestamp within the start and end time of the matched transcript as the center for the video clip;
Randomly grow the video clip from the center timestamp to 8 ∼ 16 seconds.

yifan123 commented 1 year ago

Thanks for your reply. I have a few more questions.

Compared with the 400M data used by CLIP, MineCLIP uses a little less data of 640K to finetune, does it seriously limit the performance?
In comparative learning, the larger the batch, the better. What is the reason for mineCLIP to choose small batch(64*8)?
What is the reason for finetuning only two epochs?

Thanks~

wangguanzhi commented 1 year ago

We initialized MineCLIP weights from OpenAI CLIP’s public checkpoint and only finetuned the last two layers during training. Therefore, MineCLIP was not trained from scratch and we found that 640K video-text pairs already gave a satisfactory performance.
We followed VideoCLIP which also uses 512 batch size. Note that our visual inputs are videos instead of images. We sampled 16 RGB frames from each video uniformly.
We did not do too much hyperparameter tuning. So it's possible that finetuning for more epochs would give better performance.

yunfanjiang commented 1 year ago

Thanks @wangguanzhi for the detailed explanation. I'm closing this issue now. Feel free to reopen with new questions. Thanks @yifan123 again for your interest.

yifan123 commented 1 year ago

Thanks for your replay!

Could you share your script of cleaning data?

Hi @yifan123! Thanks for your interest. Here is how we extracted 640K pairs. Feel free to use other heuristics to extract training pairs from 730K+ videos.

Collect a list of keywords corresponding to the supported entities, blocks, and items in Minecraft;

Perform string matching over our YouTube video transcripts to obtain 640K text segments;

For each matched transcript segment, randomly grow it to 16 ∼ 77 tokens (limited by CLIP’s context length);

Randomly sample a timestamp within the start and end time of the matched transcript as the center for the video clip;

Randomly grow the video clip from the center timestamp to 8 ∼ 16 seconds.

wangguanzhi commented 1 year ago

DM you!

LightHouse2000 commented 1 year ago

DM you!

hey! Thanks for your amazing work!

would you mind sharing your scripts for cleanning data once more?

wangguanzhi commented 1 year ago

DM you!

hey! Thanks for your amazing work!

would you mind sharing your scripts for cleanning data once more?

Can you give me your email?

yifan123 commented 1 year ago

Hi @yifan123! Thanks for your interest. Here is how we extracted 640K pairs. Feel free to use other heuristics to extract training pairs from 730K+ videos.

Collect a list of keywords corresponding to the supported entities, blocks, and items in Minecraft;

Perform string matching over our YouTube video transcripts to obtain 640K text segments;

For each matched transcript segment, randomly grow it to 16 ∼ 77 tokens (limited by CLIP’s context length);

Randomly sample a timestamp within the start and end time of the matched transcript as the center for the video clip;

Randomly grow the video clip from the center timestamp to 8 ∼ 16 seconds.

Hey,

Depending on the keyword you provide, each video can matches about 30 keywords. Therefore, we can extract about 700k*30=20M pairs from 730K+ videos following the above filtering process. Intuitively, these pairs have low quality. For example, the voice often talks about something in the future, which does not match the current video, such as diamond. As mentioned in the paper, only 640K pair was extracted at last. Is there any other post-processing?

LightHouse2000 commented 1 year ago

Thanks a lot！ pkulighthouse@163.com

wangguanzhi commented 1 year ago

Hi @yifan123! Thanks for your interest. Here is how we extracted 640K pairs. Feel free to use other heuristics to extract training pairs from 730K+ videos.

Collect a list of keywords corresponding to the supported entities, blocks, and items in Minecraft;

Perform string matching over our YouTube video transcripts to obtain 640K text segments;

For each matched transcript segment, randomly grow it to 16 ∼ 77 tokens (limited by CLIP’s context length);

Randomly sample a timestamp within the start and end time of the matched transcript as the center for the video clip;

Randomly grow the video clip from the center timestamp to 8 ∼ 16 seconds.

Hey,

Depending on the keyword you provide, each video can matches about 30 keywords. Therefore, we can extract about 700k*30=20M pairs from 730K+ videos following the above filtering process. Intuitively, these pairs have low quality. For example, the voice often talks about something in the future, which does not match the current video, such as diamond. As mentioned in the paper, only 640K pair was extracted at last. Is there any other post-processing?

Just replied to your email!