facebookresearch / cc_net

Tools to download and cleanup Common Crawl data
MIT License

Inquiries about Korean datasets utilized in the CCNet pipeline #39

Open hyunmokky opened 1 year ago

hyunmokky commented 1 year ago

While studying data pipelines, I found CCNet, and it is very intriguing to me. I plan to use CCNet to build a better data pipeline for Korean datasets, and I have a few questions. The paper states that the study was conducted on the "Feb. 2019 snapshot of Common Crawl"; I wonder how much Korean data that snapshot contains. Also, the dataset sizes in Table 6 are reported after preprocessing. Is that preprocessing only deduplication? I am also curious about the size of the Korean data before preprocessing. If you could share the size of the Korean dataset, it would be of great help to my research using CCNet.

gwenzek commented 1 year ago

Hi, the processing pipeline is mostly: CommonCrawl -> deduplication at the paragraph level -> language detection -> optional LM-based filtering. In the case of Korean, we trained an LM on Korean Wikipedia and kept only the top 30% of the text according to this LM. The final size of the "clean" Korean data is reported in the paper.
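
For anyone following along, here is a minimal sketch of that flow in Python, under some assumptions: the model file names (`lid.bin`, `ko.arpa.bin`), the `keep_ratio` cutoff logic, and the helper functions are illustrative only, not the actual cc_net implementation (see `cc_net/mine.py` for the real, sharded pipeline). It uses fastText for language ID and a KenLM model for perplexity scoring, as described in the paper.

```python
import hashlib

import fasttext  # language identification (pretrained lid.bin model)
import kenlm     # 5-gram LM trained on Wikipedia, as in the paper

def dedup_paragraphs(paragraphs, seen_hashes):
    """Paragraph-level deduplication: drop any paragraph whose
    normalized SHA-1 hash has already been seen (across documents)."""
    kept = []
    for p in paragraphs:
        h = hashlib.sha1(" ".join(p.lower().split()).encode("utf-8")).digest()
        if h not in seen_hashes:
            seen_hashes.add(h)
            kept.append(p)
    return kept

def filter_korean(docs, lid_path="lid.bin", lm_path="ko.arpa.bin",
                  keep_ratio=0.30):
    """Keep documents detected as Korean, then keep the keep_ratio
    fraction with the lowest perplexity under the Wikipedia LM."""
    lid = fasttext.load_model(lid_path)
    lm = kenlm.Model(lm_path)

    korean = []
    for doc in docs:
        # fastText predict() expects a single line of text.
        labels, _probs = lid.predict(doc.replace("\n", " "))
        if labels[0] == "__label__ko":
            korean.append((lm.perplexity(doc), doc))

    # Lower perplexity = closer to Wikipedia-style text ("head" bucket).
    korean.sort(key=lambda pair: pair[0])
    cutoff = int(len(korean) * keep_ratio)
    return [doc for _, doc in korean[:cutoff]]
```

Note that deduplication runs before language detection, which matters for counting "Korean data before preprocessing": boilerplate paragraphs are removed before any per-language statistics are computed.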