IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0

Close #223 | KoPI-CC_News Loader #231

Closed acul3 closed 2 years ago

acul3 commented 2 years ago

Close #223


Accepted config name format:

kopi_cc_news_{year}_{schema}

Available years = ["2016", "2017", "2018", "2019", "2020", "2021", "2022", "all"]

If you use "all", it will load every available year.

Example: to load only the 2016 news with the nusantara_ssp schema, the code looks like this:

from datasets import load_dataset
# The config name follows kopi_cc_news_{year}_{schema}
dataset = load_dataset("/data/nusa-crowd/nusantara/nusa_datasets/kopi_cc_news/kopi_cc_news.py", name="kopi_cc_news_2016_nusantara_ssp")
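The naming scheme above can be sketched as a small helper that assembles and validates a config name before it is passed to `load_dataset`. This is only an illustration: `build_config_name` and `AVAILABLE_YEARS` are hypothetical names, not part of the loader, and the schema string is passed through unchecked.

```python
# Years listed in this PR description; "all" loads every available year.
AVAILABLE_YEARS = ["2016", "2017", "2018", "2019", "2020", "2021", "2022", "all"]


def build_config_name(year: str, schema: str) -> str:
    """Hypothetical helper: return a name of the form kopi_cc_news_{year}_{schema}."""
    if year not in AVAILABLE_YEARS:
        raise ValueError(f"Unknown year {year!r}; expected one of {AVAILABLE_YEARS}")
    return f"kopi_cc_news_{year}_{schema}"


print(build_config_name("2016", "nusantara_ssp"))
```

The returned string can then be used as the `name` argument in the `load_dataset` call shown above.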

Unit test

python -m tests.test_nusantara nusantara/nusa_datasets/kopi_cc_news/kopi_cc_news.py --subset_id kopi_cc_news_2016

bryanwilie commented 2 years ago

Approving this. Thanks @acul3 for your contribution!

acul3 commented 2 years ago

Hi @holylovenia, yes. I'm just following the steps from the OSCAR tool paper; you can read it here, in sections 1 and 3.

TL;DR: this just keeps the text line-oriented without destroying document boundaries.

holylovenia commented 2 years ago

Noted, @acul3! Thanks for explaining.