facebookresearch / MetaCLIP

ICLR2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Experts via Clustering

data files are missing during the metadata generation. #43

Closed: gmfirefox1 closed this issue 7 months ago

gmfirefox1 commented 7 months ago

I am trying to reproduce the metadata build process. build_metadata.py needs some data files that are not in the repository:
data/wiki/enwiki-unigram.txt
data/wiki/1gram.txt.gz
data/wiki/2gram.txt.gz

Could you please share them, or describe how you generated them? Thanks!

howardhsu commented 7 months ago

Thanks for your interest. You can compute the unigram/bigram counts yourself on a wiki corpus, or you can find pre-computed versions online. For example:
https://github.com/IlyaSemenov/wikipedia-word-frequency/raw/master/results/enwiki-2022-08-29.txt
or
https://nlp.cs.nyu.edu/wikipedia-data/ngram/wp_1gram.txt.gz
https://nlp.cs.nyu.edu/wikipedia-data/ngram/wp_2gram.txt.gz

Hope this helps.
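
For the "compute it yourself" route, a minimal sketch of counting unigrams and bigrams over a plain-text corpus might look like the following. This is not the MetaCLIP authors' script: the tokenizer regex, the function names (count_ngrams, write_counts), and the tab-separated "ngram<TAB>count" output layout are all assumptions made for illustration, so check what build_metadata.py actually parses before using the output.

```python
import gzip
import re
from collections import Counter
from typing import Iterable, Tuple

# Assumed tokenizer: lowercase alphanumeric runs, with an optional
# apostrophe suffix (e.g. "don't"). The real pipeline may differ.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def count_ngrams(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count unigrams and bigrams; bigrams never span two lines."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for line in lines:
        tokens = TOKEN_RE.findall(line.lower())
        unigrams.update(tokens)
        # Pair each token with its successor on the same line.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def write_counts(counter: Counter, path: str, min_count: int = 1) -> None:
    """Write 'ngram<TAB>count' lines, most frequent first (assumed layout).

    A path ending in '.gz' is written gzip-compressed.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt", encoding="utf-8") as f:
        for key, n in counter.most_common():
            if n < min_count:
                break  # most_common() is sorted, so the rest are rarer
            text = key if isinstance(key, str) else " ".join(key)
            f.write(f"{text}\t{n}\n")


if __name__ == "__main__":
    # Tiny in-memory stand-in for a Wikipedia text dump.
    sample = ["The cat sat on the mat.", "The cat ran."]
    uni, bi = count_ngrams(sample)
    print(uni.most_common(3))
    print(bi.most_common(3))
```

On a real dump you would pass an open file handle (one article or paragraph per line) to count_ngrams and then call write_counts for each counter; a min_count threshold keeps the bigram file from growing too large.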

gmfirefox1 commented 7 months ago

Thank you, Howard! Your assistance is greatly appreciated. It seems enwiki-2022-08-29.txt is no longer available, so I'll try enwiki-2023-04-13.txt instead.