Closes #309 | Create dataset loader for Vietnamese Hate Speech Detection (UIT-ViHSD)

Gyyz commented 5 months ago

Closes #309 | Add/Update Dataloader {UIT-ViHSD}

First line PR Message: Closes #{ISSUE_NUMBER}

where you replace the {ISSUE_NUMBER} with the one corresponding to your dataset.

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
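
The BUILDER_CONFIGS item in the checklist above can be sanity-checked in a few lines. This is only a minimal sketch and not part of the SEACrowd test suite: `StubConfig` is a hypothetical stand-in for SEACrowdConfig, and the config and schema names used here (`uit_vihsd_source`, `uit_vihsd_seacrowd_text`, `seacrowd_text`) are assumptions for illustration, not taken from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StubConfig:
    """Hypothetical stand-in for SEACrowdConfig; only the fields used here."""

    name: str
    schema: str


# Assumed names for illustration; the real loader defines its own configs.
BUILDER_CONFIGS = [
    StubConfig(name="uit_vihsd_source", schema="source"),
    StubConfig(name="uit_vihsd_seacrowd_text", schema="seacrowd_text"),
]


def has_required_schemas(configs):
    """Return True if there is at least one source and one seacrowd config."""
    has_source = any(c.schema == "source" for c in configs)
    has_seacrowd = any(c.schema.startswith("seacrowd_") for c in configs)
    return has_source and has_seacrowd
```

Calling `has_required_schemas(BUILDER_CONFIGS)` returns True for the list above and False if either config is removed.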

Gyyz commented 5 months ago

Sorry, I opened a new PR because of commit problems in the previous one, #453.

ljvmiranda921 commented 5 months ago

Thank you for updating the commits! I appreciate it. I've approved it now. Just waiting for @raileymontalan's review 👍, then we can merge.

Gyyz commented 5 months ago

> Hi @Gyyz, it isn't clear which of the [0, 1, 2] labels here correspond to the [CLEAN, OFFENSIVE, HATE] labels specified in the paper. Could you please specify it accordingly? Thanks.

Hi @raileymontalan, I added a logging.INFO message to print the details.

raileymontalan commented 5 months ago

Hi @Gyyz, after consulting with @holylovenia, we think it would be better for the labels to be the proper class names [CLEAN, OFFENSIVE, HATE] for the SEACrowd schema. I believe the labels for the source schema can be left as is ([0, 1, 2]).

Gyyz commented 5 months ago

Sure. Will update this shortly.
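
For reference, the requested change could look roughly like the sketch below: integer labels stay as they are in the source schema and become class names in the SEACrowd schema. The id-to-name order (0 = CLEAN, 1 = OFFENSIVE, 2 = HATE) is an assumption based on the order the labels are listed in this thread and should be checked against the paper. The helper names and the `id`/`text`/`label` fields are hypothetical, not the code in this PR.

```python
# Assumption: ids follow the order in which the class names are listed above.
ID_TO_LABEL = {0: "CLEAN", 1: "OFFENSIVE", 2: "HATE"}


def to_source_example(row_id, text, label_id):
    """Source schema: keep the integer label unchanged."""
    return {"id": str(row_id), "text": text, "label": int(label_id)}


def to_seacrowd_example(row_id, text, label_id):
    """SEACrowd schema: replace the integer label with its class name."""
    return {"id": str(row_id), "text": text, "label": ID_TO_LABEL[int(label_id)]}
```

An id outside [0, 1, 2] raises a KeyError in `to_seacrowd_example`, which surfaces unexpected labels instead of silently passing them through.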

SEACrowd / seacrowd-datahub

Closes #309 | Create dataset loader for Vietnamese Hate Speech Detection (UIT-ViHSD) #501