IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0
260 stars, 62 forks

Create dataset loader for KoPI-CC_News #223

Closed: SamuelCahyawijaya closed this issue 2 years ago

SamuelCahyawijaya commented 2 years ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?kopi_cc_news

Dataset: kopi_cc_news
Description: KoPI (Korpus Perayapan Indonesia)-CC_News is an Indonesian-only extract of the CC-News Common Crawl data from 2016 through July 2022. Each snapshot is extracted using warcio and filtered using fastText.
License: CC0
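
For anyone writing the loader, the extract-then-filter step in the description could look roughly like the sketch below. It is not the KoPI authors' actual pipeline. The helper names, the 0.5 confidence threshold, and the use of a fastText language-identification model are all assumptions made for illustration. The language predictor is passed in as a plain callable so the filtering logic can be exercised without warcio or fastText installed.

```python
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical predictor type: text -> (language code, confidence).
Predictor = Callable[[str], Tuple[str, float]]


def iter_warc_texts(path: str) -> Iterator[str]:
    """Yield the decoded payload of every response record in a WARC file."""
    # Third-party dependency, imported lazily so the rest of the module
    # stays importable without it.
    from warcio.archiveiterator import ArchiveIterator

    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                raw = record.content_stream().read()
                # The payload is usually HTML; a real pipeline would strip
                # markup before language identification.
                yield raw.decode("utf-8", errors="replace")


def fasttext_predictor(model_path: str) -> Predictor:
    """Wrap a fastText language-identification model as a Predictor."""
    import fasttext  # third-party, imported lazily

    model = fasttext.load_model(model_path)

    def predict(text: str) -> Tuple[str, float]:
        # fastText predicts on a single line, so newlines are flattened.
        labels, probs = model.predict(text.replace("\n", " "))
        return labels[0].replace("__label__", ""), float(probs[0])

    return predict


def keep_indonesian(
    texts: Iterable[str],
    predict: Predictor,
    threshold: float = 0.5,  # assumed cut-off, not taken from the dataset card
) -> Iterator[str]:
    """Yield only texts labelled "id" with confidence >= threshold."""
    for text in texts:
        lang, score = predict(text)
        if lang == "id" and score >= threshold:
            yield text
```

A call such as `keep_indonesian(iter_warc_texts("snapshot.warc.gz"), fasttext_predictor("lid.176.bin"))` would chain the two steps; both file names here are placeholders.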

acul3 commented 2 years ago

self-assign