Closes #341 | Create dataset loader for myParaphrase

djanibekov commented 7 months ago

Closes #341

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

djanibekov commented 7 months ago

File "..../seacrowd-datahub/tests/test_seacrowd.py", line 498, in test_schema
    self.assertEqual(type(features[key]), type(split.info.features[key]))
AssertionError: <class 'datasets.features.features.Value'> != <class 'datasets.features.features.ClassLabel'>

can't pass this test for column "is_paraphrase" in seacrowd pairs features schema

djanibekov commented 7 months ago

https://github.com/SEACrowd/seacrowd-datahub/blob/1467d4d865114083dad7710f25acd047cd450e90/seacrowd/utils/constants.py#L303

https://github.com/SEACrowd/seacrowd-datahub/blob/1467d4d865114083dad7710f25acd047cd450e90/seacrowd/utils/schemas/__init__.py#L4-L5

https://github.com/SEACrowd/seacrowd-datahub/blob/1467d4d865114083dad7710f25acd047cd450e90/seacrowd/utils/schemas/pairs.py#L7-L26

@sabilmakbar found a mistake in constants.py, default "PAIRS_SCORE" schema function is pairs_features_score but it should be pairs_features. That's why during the test in default schema label has datasets.Value type while in my code I introduce it as datasets.ClassLabel

sabilmakbar commented 7 months ago

@sabilmakbar found a mistake in constants.py, default "PAIRS_SCORE" schema function is pairs_features_score but it should be pairs_features. That's why during the test in default schema label has datasets.Value type while in my code I introduce it as datasets.ClassLabel

This dataset is a paraphrasing task, and the is_paraphrasing values are binary ("0" or "1"). Hence this is not a semantic similarity (the latter ones should have float-based scoring, i.e., cosine similarity). Furthermore, in SEACrowd, we have a few other tasks that use the PAIRS_SCORE schema aside from Semantic Similarity; therefore, making the change per your proposal can affect the previously implemented data loaders.

My suggestion is you might want to look on the Paraphrasing Task as mentioned here: https://github.com/SEACrowd/seacrowd-datahub/blob/1467d4d865114083dad7710f25acd047cd450e90/seacrowd/utils/constants.py#L96

which uses T2T Schema: https://github.com/SEACrowd/seacrowd-datahub/blob/1467d4d865114083dad7710f25acd047cd450e90/seacrowd/utils/constants.py#L245

and has the following columns: https://github.com/SEACrowd/seacrowd-datahub/blob/1467d4d865114083dad7710f25acd047cd450e90/seacrowd/utils/schemas/text_to_text.py#L12-L20

And we can use is_paraphrased columns by using this conditional

'text_1_name': "anchor_text"
'text_2_name': "paraphrased_text" if row["is_paraphrase"] else "non_paraphrased_text"

wdyt?

djanibekov commented 7 months ago

good idea, I will do new commit, @sabilmakbar

djanibekov commented 7 months ago

@sabilmakbar @MJonibek , done

holylovenia commented 7 months ago

@sabilmakbar found a mistake in constants.py, default "PAIRS_SCORE" schema function is pairs_features_score but it should be pairs_features. That's why during the test in default schema label has datasets.Value type while in my code I introduce it as datasets.ClassLabel

This dataset is a paraphrasing task, and the is_paraphrasing values are binary ("0" or "1"). Hence this is not a semantic similarity (the latter ones should have float-based scoring, i.e., cosine similarity). Furthermore, in SEACrowd, we have a few other tasks that use the PAIRS_SCORE schema aside from Semantic Similarity; therefore, making the change per your proposal can affect the previously implemented data loaders.

My suggestion is you might want to look on the Paraphrasing Task as mentioned here:

https://github.com/SEACrowd/seacrowd-datahub/blob/1467d4d865114083dad7710f25acd047cd450e90/seacrowd/utils/constants.py#L96

which uses T2T Schema:

https://github.com/SEACrowd/seacrowd-datahub/blob/1467d4d865114083dad7710f25acd047cd450e90/seacrowd/utils/constants.py#L245

and has the following columns:

https://github.com/SEACrowd/seacrowd-datahub/blob/1467d4d865114083dad7710f25acd047cd450e90/seacrowd/utils/schemas/text_to_text.py#L12-L20

And we can use is_paraphrased columns by using this conditional
'text_1_name': "anchor_text"
'text_2_name': "paraphrased_text" if row["is_paraphrase"] else "non_paraphrased_text"
wdyt?

This is a good idea, but I doubt all future users will inspect the dataloader this deeply. May I suggest having 2 subset IDs?

{_DATALOADERNAME}_paraphrase --> make this as the default subset
{_DATALOADERNAME}_non_paraphrase

What do you think, @djanibekov @sabilmakbar @MJonibek?

PS: I've changed the datasheet and the issue ticket.

sabilmakbar commented 7 months ago

This is a good idea, but I doubt all future users will inspect the dataloader this deeply. May I suggest having 2 subset IDs?

{_DATALOADERNAME}_paraphrase --> make this as the default subset {_DATALOADERNAME}_non_paraphrase What do you think, @djanibekov @sabilmakbar @MJonibek?

Sounds great, but I think rather than having {_DATALOADER_NAME}_non_paraphrase, it'd be better changing it to the combined ones (paraphrased texts + non paraphrased text) using {_DATALOADER_NAME}_all and give an appropriate inline comment on the config creation + _generate_examples to give sufficient info on how we do the implementation.

and we have to add the subsets info accordingly in the issues :)

let me know if you have better ways to do this

djanibekov commented 7 months ago

@holylovenia @sabilmakbar If you doubt that future use will inspect dataloader I suggest changing pairs_features to pairs_features_score making label parameter of type Value rather than ClassLabel

MJonibek commented 7 months ago

I also think why not return to the beginning and just use pairs_features_score instead of pairs_features, and use labels 0 and 1 as continuous values instead of ClassLabel?

holylovenia commented 7 months ago

I also think why not return to the beginning and just use pairs_features_score instead of pairs_features, and use labels 0 and 1 as continuous values instead of ClassLabel?

This is a solid suggestion in my opinion. Are you on board with this idea, @djanibekov @sabilmakbar?

sabilmakbar commented 7 months ago

I also think why not return to the beginning and just use pairs_features_score instead of pairs_features, and use labels 0 and 1 as continuous values instead of ClassLabel?

This will alter the existing implementation that we have for this paraphrasing task (incl carry-over from NC). And their source usually contains positive paraphrase pairs. Choosing this suggestion will add one column which are not only be useless to the existing dataloaders, but potentially misleading since future users might presume the possible values are more than 1 (where it is not) -- and nulling it shouldn't be our option either.

I'm more inclined towards @holylovenia's initial suggestion to only add paraphrase datapoints into our dataloader; if we still want to add the non-paraphrased pairs too, we can add one more subset with _all naming.

edit: also, the paraphrase tasks uses T2T format, not pairs_score nor pairs_features_score

holylovenia commented 7 months ago

I also think why not return to the beginning and just use pairs_features_score instead of pairs_features, and use labels 0 and 1 as continuous values instead of ClassLabel?

This will alter the existing implementation that we have for this paraphrasing task (incl carry-over from NC). And their source usually contains positive paraphrase pairs. Choosing this suggestion will add one column which are not only be useless to the existing dataloaders, but potentially misleading since future users might presume the possible values are more than 1 (where it is not) -- and nulling it shouldn't be our option either.

I'm more inclined towards @holylovenia's initial suggestion to only add paraphrase datapoints into our dataloader; if we still want to add the non-paraphrased pairs too, we can add one more subset with _all naming.

edit: also, the paraphrase tasks uses T2T format, not pairs_score nor pairs_features_score

The dataset format is more like a paraphrase detection task rather than paraphrasing itself. I guess whichever is more useful is fine.

I'm leaning towards keeping the t2t schema because similar tasks are also grouped into Tasks.PARAPHRASING.

{_DATALOADERNAME}_paraphrase --> make this as the default subset
{_DATALOADERNAME}_non_paraphrase
{_DATALOADERNAME}_all

djanibekov commented 7 months ago

@holylovenia @sabilmakbar just a small clarification, how do we make subset as default subset? (P.S. I assumed that I need to change schema name for {_DATALOADERNAME}_paraphrase.

sabilmakbar commented 7 months ago

default subset is denoted by variable DEFAULT_CONFIG_NAME, @djanibekov

djanibekov commented 7 months ago

@sabilmakbar is it okay to leave naming without _paraphrase ending like on screenshot below

name=f"{_DATASETNAME}_seacrowd_{SEACROWD_SCHEMA_NAME}" not name=f"{_DATASETNAME}_seacrowd_{SEACROWD_SCHEMA_NAME}_paraphrase"

Otherwise, data loader does not pass test.py because of this line:

ValueError: BuilderConfig 'my_paraphrase_seacrowd_t2t' not found. Available: ['my_paraphrase_source', 'my_paraphrase_seacrowd_t2t_paraphrase', 'my_paraphrase_seacrowd_t2t_non_paraphrase', 'my_paraphrase_seacrowd_t2t_all']

sabilmakbar commented 7 months ago

@sabilmakbar is it okay to leave naming without _paraphrase ending like on screenshot below

name=f"{_DATASETNAME}seacrowd{SEACROWD_SCHEMA_NAME}" not name=f"{_DATASETNAME}seacrowd{SEACROWD_SCHEMA_NAME}_paraphrase"

it should be okay.

but, this naming my_paraphrase_seacrowd_t2t_non_paraphrase should be my_paraphrase_non_paraphrase_seacrowd_t2t (also applicable to the rest) and its source schema should be implemented too so all of the dataset correctness on all defined config can be checked via the test-cases.

ex:

#this will check the implementation of myparaphrase_non_paraphrase_source and myparaphrase_non_paraphrase_seacrowd_t2t
python -m tests.test_seacrowd seacrowd/sea_datasets/myparaphrase/myparaphrase.py --subset_id myparaphrase_non_paraphrase

#this will check the implementation of myparaphrase_all_source and myparaphrase_all_seacrowd_t2t
python -m tests.test_seacrowd seacrowd/sea_datasets/myparaphrase/myparaphrase.py --subset_id myparaphrase_all

djanibekov commented 7 months ago

@sabilmakbar, done wrote source schema for each [paraphrase, non_paraphrase, all]

these test went successfully

# paraphrase
python -m tests.test_seacrowd seacrowd/sea_datasets/myparaphrase/myparaphrase.py 
#non_paraphrase
python -m tests.test_seacrowd seacrowd/sea_datasets/myparaphrase/myparaphrase.py --subset_id myparaphrase_non_paraphrase
#all
python -m tests.test_seacrowd seacrowd/sea_datasets/myparaphrase/myparaphrase.py --subset_id myparaphrase_all

djanibekov commented 7 months ago

@sabilmakbar @MJonibek, done

MJonibek commented 7 months ago

I will merge this PR then. Thank you everybody for these great work

SEACrowd / seacrowd-datahub

Closes #341 | Create dataset loader for myParaphrase #436

Checkbox