add thaigov - Githubissues

TysonYu commented 8 months ago

Closes #357.

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
[x] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

sabilmakbar commented 8 months ago

Hi @TysonYu, a suggestion: please change the initial PR message to Closes #{ISSUE_NUMBER} in future PRs so that they are linked to the dataloader issue (I've already done it on this one, though).

TysonYu commented 8 months ago

> Hi @TysonYu, a suggestion: please change the initial PR message to Closes #{ISSUE_NUMBER} in future PRs so that they are linked to the dataloader issue (I've already done it on this one, though).

Okay, will do it for later ones.

TysonYu commented 8 months ago

> Rather than writing the data in _split_generators and re-reading it in _generate_examples, why don't we pass the all_data list through gen_kwargs in _split_generators and use it directly in _generate_examples? I think passing it this way is possible (see this SEACrowd implementation).

Hey, I did it this way because it seems logically correct and clear. I agree that the approach you mentioned is another valid implementation, but my current approach should still be fine. I think some other dataloaders also do it this way, such as indosum.
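
For reference, the two approaches discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `fetch_all_data`, the two loader classes, and `run` are hypothetical names, and plain dicts stand in for `datasets.SplitGenerator` so that the sketch runs without the `datasets` library. In a real dataloader, `_split_generators` receives a `dl_manager` rather than a `cache_dir`.

```python
import json
import os
import tempfile


def fetch_all_data():
    """Hypothetical stand-in for the download/parsing step of a dataloader."""
    return [
        {"id": "0", "text": "first document"},
        {"id": "1", "text": "second document"},
    ]


class WriteThenReadLoader:
    """Write the rows to disk in _split_generators, re-read them in _generate_examples."""

    def _split_generators(self, cache_dir):
        all_data = fetch_all_data()
        filepath = os.path.join(cache_dir, "all_data.jsonl")
        with open(filepath, "w", encoding="utf-8") as f:
            for row in all_data:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
        # gen_kwargs carries only the path to the intermediate file.
        return [{"name": "train", "gen_kwargs": {"filepath": filepath}}]

    def _generate_examples(self, filepath):
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                yield idx, json.loads(line)


class PassListLoader:
    """Pass the in-memory list through gen_kwargs, with no intermediate file."""

    def _split_generators(self, cache_dir):
        all_data = fetch_all_data()
        # gen_kwargs carries the rows themselves.
        return [{"name": "train", "gen_kwargs": {"all_data": all_data}}]

    def _generate_examples(self, all_data):
        for idx, row in enumerate(all_data):
            yield idx, row


def run(loader):
    """Mimic how a builder forwards gen_kwargs to _generate_examples."""
    with tempfile.TemporaryDirectory() as cache_dir:
        (split,) = loader._split_generators(cache_dir)
        # Consume the generator while the temporary directory still exists.
        return list(loader._generate_examples(**split["gen_kwargs"]))
```

Both variants yield the same examples in this sketch. The difference is only whether an intermediate file is written, which is a trade-off between keeping everything in memory and having the data available on disk.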

SEACrowd / seacrowd-datahub

add thaigov #412