shangw-nvidia opened this issue 2 years ago
Additional question: it seems like the_pile/pile.py only downloads and interleaves the data from the various data sources. processing_scripts contains many processing scripts; however, how do we know which script is supposed to be run on which data source, and how are those scripts supposed to be run?
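For context on the interleaving question, I'd guess pile.py mixes sources by weighted random sampling (each dataset drawn in proportion to its target share). This is only a hypothetical sketch, not the repo's actual code; the function name `interleave` and its parameters are illustrative:

```python
import itertools
import random

def interleave(datasets, weights, n, seed=0):
    """Draw n records, picking a source each time with probability
    proportional to its weight. `datasets` maps source name -> iterator.
    Hypothetical sketch; not the actual pile.py implementation."""
    rng = random.Random(seed)
    names = list(datasets)
    samples = []
    for _ in range(n):
        name = rng.choices(names, weights=weights, k=1)[0]
        samples.append((name, next(datasets[name])))
    return samples

# Toy sources standing in for real datasets.
sources = {
    "cc": (f"cc-doc-{i}" for i in itertools.count()),
    "wiki": (f"wiki-doc-{i}" for i in itertools.count()),
}
mixed = interleave(sources, weights=[0.8, 0.2], n=10)
```

With a fixed seed the mix is reproducible, which matters if the goal is to regenerate the published dataset exactly.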
Hi,
I notice that the download URL for the CommonCrawlDataset is http://eaidata.bmk.sh/data/pile_cc_filtered_deduped.jsonl.zst. In other words, this CC dataset is already deduplicated and filtered? However, the README of https://github.com/leogao2/commoncrawl_downloader doesn't seem to include the scripts for deduplication and filtering. I'm wondering where I can find out exactly how deduplication and filtering for Pile CC were done. Thanks!
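To make the question concrete: I believe the published Pile CC pipeline used fuzzy (near-duplicate) deduplication plus quality filtering, which is exactly what I can't find the scripts for. The simplest baseline I can imagine is exact-match dedup by content hash; the sketch below shows only that baseline (the `dedup_exact` helper and the `{"text": ...}` record shape are my assumptions, not the repo's API):

```python
import hashlib

def dedup_exact(docs):
    """Yield documents whose exact text hasn't been seen before.
    Only an exact-match baseline sketch; the real Pile CC pipeline
    presumably did fuzzy dedup, which this does NOT reproduce."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

records = [
    {"text": "a web page"},
    {"text": "another page"},
    {"text": "a web page"},  # exact duplicate, dropped
]
unique = list(dedup_exact(records))
```

Exact hashing obviously misses near-duplicates (boilerplate variations, trailing whitespace), so pointers to the actual fuzzy-dedup and filtering scripts would still be needed to reproduce pile_cc_filtered_deduped.jsonl.zst.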