shangw-nvidia opened this issue 2 years ago
Additional question: it seems like the_pile/pile.py only downloads and interleaves the data from the various data sources. processing_scripts contains many processing scripts; however, how do we know which script is supposed to be run on which data source, and how are those scripts supposed to be run?
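For context on the interleaving question, I'd guess pile.py mixes sources by weighted random sampling (each dataset drawn in proportion to its target share). This is only a hypothetical sketch, not the repo's actual code; the function name `interleave` and its parameters are illustrative:

```python
import itertools
import random

def interleave(datasets, weights, n, seed=0):
    """Draw n records, picking a source each time with probability
    proportional to its weight. `datasets` maps source name -> iterator.
    Hypothetical sketch; not the actual pile.py implementation."""
    rng = random.Random(seed)
    names = list(datasets)
    samples = []
    for _ in range(n):
        name = rng.choices(names, weights=weights, k=1)[0]
        samples.append((name, next(datasets[name])))
    return samples

# Toy sources standing in for real datasets.
sources = {
    "cc": (f"cc-doc-{i}" for i in itertools.count()),
    "wiki": (f"wiki-doc-{i}" for i in itertools.count()),
}
mixed = interleave(sources, weights=[0.8, 0.2], n=10)
```

With a fixed seed the mix is reproducible, which matters if the goal is to regenerate the published dataset exactly.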
Hi,
I notice that the download URL for the CommonCrawlDataset is http://eaidata.bmk.sh/data/pile_cc_filtered_deduped.jsonl.zst. In other words, this CC dataset is already deduplicated and filtered? However, the README of https://github.com/leogao2/commoncrawl_downloader doesn't seem to include the scripts for deduplication and filtering. I'm wondering where I can find out exactly how deduplication and filtering for Pile CC were done. Thanks!
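To make the question concrete: I believe the published Pile CC pipeline used fuzzy (near-duplicate) deduplication plus quality filtering, which is exactly what I can't find the scripts for. The simplest baseline I can imagine is exact-match dedup by content hash; the sketch below shows only that baseline (the `dedup_exact` helper and the `{"text": ...}` record shape are my assumptions, not the repo's API):

```python
import hashlib

def dedup_exact(docs):
    """Yield documents whose exact text hasn't been seen before.
    Only an exact-match baseline sketch; the real Pile CC pipeline
    presumably did fuzzy dedup, which this does NOT reproduce."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

records = [
    {"text": "a web page"},
    {"text": "another page"},
    {"text": "a web page"},  # exact duplicate, dropped
]
unique = list(dedup_exact(records))
```

Exact hashing obviously misses near-duplicates (boilerplate variations, trailing whitespace), so pointers to the actual fuzzy-dedup and filtering scripts would still be needed to reproduce pile_cc_filtered_deduped.jsonl.zst.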