Closed djanibekov closed 7 months ago
File "..../seacrowd-datahub/tests/test_seacrowd.py", line 498, in test_schema
self.assertEqual(type(features[key]), type(split.info.features[key]))
AssertionError: <class 'datasets.features.features.Value'> != <class 'datasets.features.features.ClassLabel'>
I can't pass this test for the column "is_paraphrase" in the SEACrowd pairs features schema.
@sabilmakbar found a mistake in constants.py: the default "PAIRS_SCORE" schema function is pairs_features_score, but it should be pairs_features. That's why, during the test, the label in the default schema has the datasets.Value type, while in my code I declare it as datasets.ClassLabel.
This dataset is a paraphrasing task, and the is_paraphrase values are binary ("0" or "1"). Hence this is not semantic similarity (the latter should have float-based scoring, e.g., cosine similarity). Furthermore, in SEACrowd we have a few other tasks that use the PAIRS_SCORE schema aside from semantic similarity; therefore, making the change you propose could affect the previously implemented data loaders.
My suggestion is that you might want to look at the Paraphrasing Task as mentioned here: https://github.com/SEACrowd/seacrowd-datahub/blob/1467d4d865114083dad7710f25acd047cd450e90/seacrowd/utils/constants.py#L96
which uses T2T Schema: https://github.com/SEACrowd/seacrowd-datahub/blob/1467d4d865114083dad7710f25acd047cd450e90/seacrowd/utils/constants.py#L245
and has the following columns: https://github.com/SEACrowd/seacrowd-datahub/blob/1467d4d865114083dad7710f25acd047cd450e90/seacrowd/utils/schemas/text_to_text.py#L12-L20
And we can use the is_paraphrase column by using this conditional:
'text_1_name': "anchor_text"
'text_2_name': "paraphrased_text" if row["is_paraphrase"] else "non_paraphrased_text"
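The conditional above could be sketched as a row-to-example mapping for the T2T schema (the helper name and the row keys "anchor" and "candidate" are hypothetical, not from the actual dataloader):

```python
def row_to_t2t_example(idx, row):
    """Map one source row to the T2T schema, using the paraphrase flag
    to name the second text column (hypothetical helper)."""
    return {
        "id": str(idx),
        "text_1": row["anchor"],
        "text_2": row["candidate"],
        "text_1_name": "anchor_text",
        "text_2_name": "paraphrased_text" if row["is_paraphrase"] else "non_paraphrased_text",
    }

example = row_to_t2t_example(0, {"anchor": "a", "candidate": "b", "is_paraphrase": 1})
print(example["text_2_name"])  # paraphrased_text
```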
wdyt?
Good idea, I will make a new commit, @sabilmakbar.
@sabilmakbar @MJonibek , done
This is a good idea, but I doubt all future users will inspect the dataloader this deeply. May I suggest having 2 subset IDs?
{_DATALOADERNAME}_paraphrase --> make this the default subset
{_DATALOADERNAME}_non_paraphrase
What do you think, @djanibekov @sabilmakbar @MJonibek?
PS: I've changed the datasheet and the issue ticket.
Sounds great, but I think rather than having {_DATALOADER_NAME}_non_paraphrase, it'd be better to change it to a combined subset (paraphrased texts + non-paraphrased texts) using {_DATALOADER_NAME}_all, and to give an appropriate inline comment on the config creation + _generate_examples to give sufficient info on how the implementation works.
We also have to add the subset info accordingly in the issues :)
Let me know if you have better ways to do this.
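One way to serve all the proposed subsets from a single generator is to filter rows by the paraphrase flag inside _generate_examples; a minimal sketch (the helper name and row layout are hypothetical):

```python
def iter_subset(rows, subset):
    """Yield (id, row) pairs for a given subset id suffix:
    'paraphrase', 'non_paraphrase', or 'all'."""
    for idx, row in enumerate(rows):
        if subset == "paraphrase" and not row["is_paraphrase"]:
            continue
        if subset == "non_paraphrase" and row["is_paraphrase"]:
            continue
        yield idx, row

rows = [{"is_paraphrase": 1}, {"is_paraphrase": 0}]
print(len(list(iter_subset(rows, "all"))))  # 2
```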
@holylovenia @sabilmakbar
If you doubt that future users will inspect the dataloader, I suggest changing pairs_features to pairs_features_score, making the label parameter of type Value rather than ClassLabel.
I also think: why not return to the beginning and just use pairs_features_score instead of pairs_features, and use the labels 0 and 1 as continuous values instead of ClassLabel?
This is a solid suggestion in my opinion. Are you on board with this idea, @djanibekov @sabilmakbar?
This would alter the existing implementation that we have for this paraphrasing task (incl. carry-overs from NC), and their sources usually contain only positive paraphrase pairs. Choosing this suggestion would add one column which is not only useless to the existing dataloaders, but also potentially misleading, since future users might presume there is more than one possible value (when there is not) -- and nulling it shouldn't be our option either.
I'm more inclined towards @holylovenia's initial suggestion to only add paraphrase datapoints into our dataloader; if we still want to add the non-paraphrased pairs too, we can add one more subset with the _all naming.
Edit: also, the paraphrasing task uses the T2T format, not pairs_score nor pairs_features_score.
The dataset format is more like a paraphrase detection task than paraphrasing itself. I guess whichever is more useful is fine.
I'm leaning towards keeping the t2t schema because similar tasks are also grouped into Tasks.PARAPHRASING.
{_DATALOADERNAME}_paraphrase --> make this the default subset
{_DATALOADERNAME}_non_paraphrase
{_DATALOADERNAME}_all
@holylovenia @sabilmakbar just a small clarification: how do we make a subset the default subset? (P.S. I assumed that I need to change the schema name for {_DATALOADERNAME}_paraphrase.)
The default subset is denoted by the variable DEFAULT_CONFIG_NAME, @djanibekov.
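Concretely, the default config is just a class attribute on the dataset builder; a minimal sketch (the class and config names are placeholders, and a real SEACrowd dataloader would subclass datasets.GeneratorBasedBuilder):

```python
class MyParaphraseDataset:
    # When datasets.load_dataset is called without an explicit config
    # name, this config is used. Here we point it at a hypothetical
    # paraphrase subset's seacrowd schema.
    DEFAULT_CONFIG_NAME = "my_paraphrase_paraphrase_seacrowd_t2t"

print(MyParaphraseDataset.DEFAULT_CONFIG_NAME)
```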
@sabilmakbar is it okay to leave the naming without the _paraphrase ending, like on the screenshot below?
name=f"{_DATASETNAME}_seacrowd_{SEACROWD_SCHEMA_NAME}"
not name=f"{_DATASETNAME}_seacrowd_{SEACROWD_SCHEMA_NAME}_paraphrase"
Otherwise, the data loader does not pass test.py because of this line:
ValueError: BuilderConfig 'my_paraphrase_seacrowd_t2t' not found. Available: ['my_paraphrase_source', 'my_paraphrase_seacrowd_t2t_paraphrase', 'my_paraphrase_seacrowd_t2t_non_paraphrase', 'my_paraphrase_seacrowd_t2t_all']
It should be okay. But this naming, my_paraphrase_seacrowd_t2t_non_paraphrase, should be my_paraphrase_non_paraphrase_seacrowd_t2t (also applicable to the rest), and its source schema should be implemented too, so that the correctness of all defined configs can be checked via the test cases.
For example:
# this will check the implementation of myparaphrase_non_paraphrase_source and myparaphrase_non_paraphrase_seacrowd_t2t
python -m tests.test_seacrowd seacrowd/sea_datasets/myparaphrase/myparaphrase.py --subset_id myparaphrase_non_paraphrase
# this will check the implementation of myparaphrase_all_source and myparaphrase_all_seacrowd_t2t
python -m tests.test_seacrowd seacrowd/sea_datasets/myparaphrase/myparaphrase.py --subset_id myparaphrase_all
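The naming convention above (subset id placed before the schema suffix) can be generated programmatically; a sketch with placeholder names:

```python
_DATASETNAME = "my_paraphrase"  # placeholder dataset name
SUBSETS = ["paraphrase", "non_paraphrase", "all"]
SCHEMAS = ["source", "seacrowd_t2t"]

# The subset id goes *before* the schema suffix, e.g.
# my_paraphrase_non_paraphrase_seacrowd_t2t, not
# my_paraphrase_seacrowd_t2t_non_paraphrase.
config_names = [f"{_DATASETNAME}_{subset}_{schema}"
                for subset in SUBSETS for schema in SCHEMAS]
print(config_names[2])  # my_paraphrase_non_paraphrase_source
```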
@sabilmakbar, done. I wrote a source schema for each of [paraphrase, non_paraphrase, all].
These tests ran successfully:
# paraphrase
python -m tests.test_seacrowd seacrowd/sea_datasets/myparaphrase/myparaphrase.py
# non_paraphrase
python -m tests.test_seacrowd seacrowd/sea_datasets/myparaphrase/myparaphrase.py --subset_id myparaphrase_non_paraphrase
# all
python -m tests.test_seacrowd seacrowd/sea_datasets/myparaphrase/myparaphrase.py --subset_id myparaphrase_all
@sabilmakbar @MJonibek, done
I will merge this PR then. Thank you, everybody, for this great work.
Closes #341
Checkbox
- Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
- Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
- Implement _info(), _split_generators(), and _generate_examples() in the dataloader script.
- Ensure the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
- Confirm the dataloader script works with the datasets.load_dataset function.
- Confirm the dataloader script passes the tests run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.