Closes #617 | Add Dataloader SQuAD-ID-NLI

muhammadravi251001 commented 3 months ago

Title: Add Dataloader SQuAD-ID-NLI

First line PR Message: Closes https://github.com/SEACrowd/seacrowd-datahub/issues/617

Notes

I can't fill in the _CITATION field yet, because the notification for my workshop comes on 18 April. I will revisit this PR on 18 April and update _CITATION.

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
[x] Provide values for the _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

muhammadravi251001 commented 2 months ago

@muhammadravi251001 Checked, LGTM. Thank you for the great work. Just one small issue, the same as in the previous one: the comment needs to be deleted.

Thanks for the review, Sir!

muhammadravi251001 commented 2 months ago

Hi @muhammadravi251001, thanks for your hard work! The dataloader works well on my end. Just confirming, out of curiosity, I checked the label distribution per split and found that none of the data instances is labeled as "neutral" (i.e., 1).
train {0: 118445, 1: 0, 2: 118445}
validation {0: 11874, 1: 0, 2: 11874}
test {0: 11873, 1: 0, 2: 11873}
Is this intentional? If it is, I'll proceed with the merge.
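
A per-split label count like the one above can be reproduced with a few lines of Python. This is only a minimal sketch, not the script the reviewer ran: the helper name `label_distribution` and the toy label lists are hypothetical, and with the real dataset the labels would come from the loaded splits (for example a list built from each split's label column, whose name is an assumption here).

```python
from collections import Counter


def label_distribution(splits, num_classes=3):
    """Count labels per split, reporting 0 for classes that never occur.

    `splits` maps a split name to an iterable of integer label IDs.
    `num_classes=3` assumes the three NLI label IDs 0, 1, 2.
    """
    result = {}
    for split_name, labels in splits.items():
        counts = Counter(labels)
        # Fill in every expected class so a missing label shows up as 0.
        result[split_name] = {label: counts.get(label, 0) for label in range(num_classes)}
    return result


# Toy stand-in data (hypothetical), not the real SQuAD-ID-NLI labels.
toy_splits = {
    "train": [0, 2, 0, 2],
    "validation": [0, 2],
}

distribution = label_distribution(toy_splits)
for split_name, counts in distribution.items():
    print(split_name, counts)
```

Filling in every expected class ID explicitly is what makes an absent label, such as "neutral" (1) here, visible as a zero instead of being silently left out.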

Hi, Ms. Holy.

Yes, it is intentional. My model does binary classification (entailment or contradiction) to get rid of the "gray characteristic" of neutral. It also tells the QA model (for my research) to avoid the low-confidence answers that this "gray characteristic" causes.

Even so, neutral is still needed in my NLI dataset, like this dataset.

holylovenia commented 2 months ago

Thanks for the clarification, @muhammadravi251001! Merging now.

PS: No need to call me "Ms.", no worries. 😂

SEACrowd / seacrowd-datahub

Closes #617 | Add Dataloader SQuAD-ID-NLI #633
