facebookresearch / cc_net

Tools to download and cleanup Common Crawl data
MIT License

Inquiries about Korean datasets utilized in the CCNet pipeline #39

Open hyunmokky opened 1 year ago

hyunmokky commented 1 year ago

While studying data pipelines, I found CCNet, and it is very intriguing to me. I plan to use CCNet to build a better data pipeline for Korean datasets, and I have a few questions. The paper states that the study was conducted on the "Feb. 2019 snapshot of Common Crawl"; I wonder how much Korean data that snapshot contains. Also, the dataset sizes in Table 6 are reported after preprocessing. Is that preprocessing only deduplication? I am also curious about the size of the Korean data before preprocessing. If you could share the size of the Korean dataset, it would be of great help to my research using CCNet.

gwenzek commented 1 year ago

Hi, the processing pipeline is mostly: CommonCrawl -> deduplication at the paragraph level -> language detection -> optional LM-based filtering. In the case of Korean, we trained an LM on Korean Wikipedia and kept only the top 30% of the text according to this LM. The final size of the "clean" Korean data is reported in the paper.
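
For anyone following along, here is a minimal sketch of that flow in Python, under some assumptions: the model file names (`lid.bin`, `ko.arpa.bin`), the `keep_ratio` cutoff logic, and the helper functions are illustrative only, not the actual cc_net implementation (see `cc_net/mine.py` for the real, sharded pipeline). It uses fastText for language ID and a KenLM model for perplexity scoring, as described in the paper.

```python
import hashlib

import fasttext  # language identification (pretrained lid.bin model)
import kenlm     # 5-gram LM trained on Wikipedia, as in the paper

def dedup_paragraphs(paragraphs, seen_hashes):
    """Paragraph-level deduplication: drop any paragraph whose
    normalized SHA-1 hash has already been seen (across documents)."""
    kept = []
    for p in paragraphs:
        h = hashlib.sha1(" ".join(p.lower().split()).encode("utf-8")).digest()
        if h not in seen_hashes:
            seen_hashes.add(h)
            kept.append(p)
    return kept

def filter_korean(docs, lid_path="lid.bin", lm_path="ko.arpa.bin",
                  keep_ratio=0.30):
    """Keep documents detected as Korean, then keep the keep_ratio
    fraction with the lowest perplexity under the Wikipedia LM."""
    lid = fasttext.load_model(lid_path)
    lm = kenlm.Model(lm_path)

    korean = []
    for doc in docs:
        # fastText predict() expects a single line of text.
        labels, _probs = lid.predict(doc.replace("\n", " "))
        if labels[0] == "__label__ko":
            korean.append((lm.perplexity(doc), doc))

    # Lower perplexity = closer to Wikipedia-style text ("head" bucket).
    korean.sort(key=lambda pair: pair[0])
    cutoff = int(len(korean) * keep_ratio)
    return [doc for _, doc in korean[:cutoff]]
```

Note that deduplication runs before language detection, which matters for counting "Korean data before preprocessing": boilerplate paragraphs are removed before any per-language statistics are computed.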