facebookresearch / cc_net

Tools to download and cleanup Common Crawl data
MIT License
932 stars, 138 forks

Inquiry about using Common Crawl snapshots collected in 2022 #40

Open hyunmokky opened 1 year ago

hyunmokky commented 1 year ago

The paper states that CCNet was run on the "common crawl snapshot in February 2019". I would like to use Common Crawl snapshots collected from 2022 onward. Can the CCNet GitHub code also classify these newer snapshots by language?