Close #223 | KoPI-CC_News Loader

acul3 commented 2 years ago

Close #223

Accepted config name format:

kopi_cc_news_{year}_{schema}

Available years: ["2016", "2017", "2018", "2019", "2020", "2021", "2022", "all"]

If you use "all", every available year is loaded.
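As a minimal sketch of how these names are put together (this helper is hypothetical and not part of the dataloader; the "source" schema suffix is an assumption, only "nusantara_ssp" appears in this PR):

```python
# Hypothetical helper: builds config names of the form kopi_cc_news_{year}_{schema}.
YEARS = ["2016", "2017", "2018", "2019", "2020", "2021", "2022", "all"]
SCHEMAS = ["source", "nusantara_ssp"]  # "source" is an assumed suffix


def config_name(year: str, schema: str) -> str:
    """Return the config name for one year (or "all") and one schema."""
    if year not in YEARS:
        raise ValueError(f"unknown year: {year!r}")
    if schema not in SCHEMAS:
        raise ValueError(f"unknown schema: {schema!r}")
    return f"kopi_cc_news_{year}_{schema}"


def all_config_names() -> list:
    """Enumerate every year/schema combination."""
    return [config_name(y, s) for y in YEARS for s in SCHEMAS]
```

For instance, `config_name("2016", "nusantara_ssp")` gives the name for the 2016 subset in the nusantara schema.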

Example: to load only the news for 2016 with the nusantara schema, the code looks like this:

from datasets import load_dataset
dataset = load_dataset("/data/nusa-crowd/nusantara/nusa_datasets/kopi_cc_news/kopi_cc_news.py", name="kopi_cc_news_2016_nusantara_ssp")

Unit test

python -m tests.test_nusantara nusantara/nusa_datasets/kopi_cc_news/kopi_cc_news.py --subset_id kopi_cc_news_2016

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script nusantara/nusa_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _NUSANTARA_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one NusantaraConfig for the source schema and one for a nusantara schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_nusantara --path=nusantara/nusa_datasets/my_dataset/my_dataset.py.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

bryanwilie commented 2 years ago

Approving this. Thanks @acul3 for your contribution!

acul3 commented 2 years ago

Hi @holylovenia, yes. I'm just following the steps from the OSCAR tool paper; you can read it here, in sections 1 and 3.

TL;DR: this just keeps the text line-oriented without destroying document boundaries.
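A rough sketch of that idea, assuming documents are separated by a blank line and each document keeps one text line per row (the separator is an assumption here; the thread does not state what the loader actually uses, and both function names are hypothetical):

```python
def split_documents(text: str) -> list:
    """Split line-oriented text into documents at blank lines.

    Each document is returned as a list of its non-empty lines.
    """
    docs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the current document.
            docs.append(current)
            current = []
    if current:
        docs.append(current)
    return docs


def join_documents(docs: list) -> str:
    """Inverse of split_documents: one line per row, blank line between documents."""
    return "\n\n".join("\n".join(doc) for doc in docs)
```

With this layout the text can still be read line by line, while the blank lines mark where one document ends and the next begins.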

holylovenia commented 2 years ago

Noted, @acul3! Thanks for explaining.

IndoNLP / nusa-crowd

Close #223 | KoPI-CC_News Loader #231