IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0
261 stars, 61 forks

Create dataset loader for KoPI-CC (Korpus Perayapan Indonesia) #215

Closed. SamuelCahyawijaya closed this issue 2 years ago

SamuelCahyawijaya commented 2 years ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?kopi_cc

Dataset kopi_cc
Description KoPI-CC (Korpus Perayapan Indonesia)-CC is an Indonesian-only extract from Common Crawl snapshots. Each snapshot is extracted using the Ungoliant OSCAR tools and then further filtered with deduplication techniques (exact-hash deduplication and MinHash LSH).
License CC0
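
For readers unfamiliar with the two deduplication steps named in the description, here is a minimal standard-library sketch of the general idea: exact-hash deduplication followed by MinHash with LSH banding. It is not the KoPI-CC pipeline itself. The function names (`exact_dedup`, `minhash_signature`, `near_dedup`) and all parameters (3-word shingles, 64 hash functions, 16 bands, a 0.8 similarity threshold) are assumptions chosen for illustration.

```python
import hashlib


def exact_dedup(docs):
    """Drop documents whose full text hashes to an already-seen digest."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (hypothetical choice of k)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash64(value, seed):
    """Seeded 64-bit hash; each seed acts as one MinHash hash function."""
    h = hashlib.blake2b(value.encode("utf-8"), digest_size=8,
                        salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "little")


def minhash_signature(text, num_perm=64):
    """MinHash signature: the minimum hash of the shingle set per seed."""
    sh = shingles(text)
    return tuple(min(_hash64(s, seed) for s in sh) for seed in range(num_perm))


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_dedup(docs, num_perm=64, bands=16, threshold=0.8):
    """Keep a document only if no earlier kept document looks near-identical.

    LSH banding: signatures are split into bands, and only documents that
    share at least one identical band are compared.
    """
    rows = num_perm // bands
    buckets, kept, sigs = {}, [], []
    for doc in docs:
        sig = minhash_signature(doc, num_perm)
        keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(bands)]
        candidates = {i for key in keys for i in buckets.get(key, ())}
        if any(estimate_jaccard(sig, sigs[i]) >= threshold for i in candidates):
            continue  # treated as a near-duplicate of a kept document
        idx = len(kept)
        kept.append(doc)
        sigs.append(sig)
        for key in keys:
            buckets.setdefault(key, []).append(idx)
    return kept
```

A typical use would be `near_dedup(exact_dedup(docs))`, so that the cheap exact pass removes verbatim copies before the costlier MinHash pass runs. A production pipeline over Common Crawl would more likely rely on a dedicated library and tuned parameters.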
acul3 commented 2 years ago

self-assign