Closed SamuelCahyawijaya closed 7 months ago
Hi @holylovenia @SamuelCahyawijaya, for the NER task, how should we handle it if a token is assigned multiple labels?
Hi @raileymontalan, I noticed that although the dataset sometimes has 2 labels (label_1|label_2) for a token instead of 1 label (label_1), we cannot use label_1 at position n in the sequence and then use label_2 at position n+1. For example, given this example from lines 1003-1006:
shu B-PRO|B-BRA
water I-PRO|B-TYP
perfect I-PRO|I-TYP
liquid I-PRO|I-TYP
If we use label_1 for water (i.e., I-PRO), we cannot use label_2 for perfect (i.e., I-TYP), because the beginning of a new label should use B- instead of I-. Considering this, maybe we should have 2 subsets for the source schema and 4 subsets for the seacrowd schema:
ind_proner_automatic_source
ind_proner_manual_source
ind_proner_automatic_v1_seacrowd_seq_label --> Use the single labels as usual. For data samples with two labels, only use label_1.
ind_proner_automatic_v2_seacrowd_seq_label --> Use the single labels as usual. For data samples with two labels, only use label_2.
ind_proner_manual_v1_seacrowd_seq_label --> Use the single labels as usual. For data samples with two labels, only use label_1.
ind_proner_manual_v2_seacrowd_seq_label --> Use the single labels as usual. For data samples with two labels, only use label_2.
Let me know what you think about this.
cc: @raileymontalan @SamuelCahyawijaya @sabilmakbar
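To make the label_1 / label_2 point concrete, here is a minimal sketch using the four rows quoted above. The helper names pick_label and is_valid_bio are hypothetical (they are not from the dataloader), and the BIO check is a simplified one that only verifies that every I- tag continues an entity of the same type.

```python
from typing import List


def pick_label(tag: str, choice: int) -> str:
    """Pick one label from a tag such as 'B-PRO|B-BRA'.

    choice=0 selects label_1 and choice=1 selects label_2.
    Tags that carry a single label are returned unchanged.
    """
    parts = tag.split("|")
    return parts[choice] if len(parts) == 2 else parts[0]


def is_valid_bio(labels: List[str]) -> bool:
    """Simplified BIO check: an I-X tag must follow a B-X or I-X tag."""
    prev = "O"
    for label in labels:
        if label.startswith("I-"):
            # prev[2:] is the entity type of the previous tag ('' for 'O').
            if prev == "O" or prev[2:] != label[2:]:
                return False
        prev = label
    return True


# The four rows quoted above (lines 1003-1006 of the data file).
tags = ["B-PRO|B-BRA", "I-PRO|B-TYP", "I-PRO|I-TYP", "I-PRO|I-TYP"]

label_1_seq = [pick_label(t, 0) for t in tags]  # always label_1
label_2_seq = [pick_label(t, 1) for t in tags]  # always label_2

# Mixing: label_1 for 'shu' and 'water', label_2 for 'perfect' and 'liquid'.
mixed_seq = [pick_label(t, 0) for t in tags[:2]] + [pick_label(t, 1) for t in tags[2:]]

print(label_1_seq, is_valid_bio(label_1_seq))
print(label_2_seq, is_valid_bio(label_2_seq))
print(mixed_seq, is_valid_bio(mixed_seq))
```

Taking label_1 throughout or label_2 throughout keeps each sequence well-formed, while the mixed sequence puts I-TYP directly after I-PRO, which is why the two label columns are proposed as separate subsets.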
Got this, will implement this recommendation. Thanks @holylovenia !
Hi @holylovenia, could the dataset have 2 splits instead? One split could be automatic, the other could be manual. Thoughts?
I think it's better as subsets. We reserve splits for train/val/test splits.
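As a rough illustration of treating automatic and manual as subsets rather than splits, the sketch below enumerates the subset names proposed earlier in this thread. The function subset_config_names and the constants are hypothetical; in the real dataloader these names would be attached to its builder configs, which is not shown here.

```python
from itertools import product
from typing import List, Sequence

DATASET_NAME = "ind_proner"
SUBSETS = ("automatic", "manual")     # annotation source, one subset each
LABEL_VARIANTS = ("v1", "v2")         # v1 keeps label_1, v2 keeps label_2
SEACROWD_SCHEMA = "seacrowd_seq_label"


def subset_config_names(
    dataset: str = DATASET_NAME,
    subsets: Sequence[str] = SUBSETS,
    variants: Sequence[str] = LABEL_VARIANTS,
    schema: str = SEACROWD_SCHEMA,
) -> List[str]:
    """Return 2 source names plus one seacrowd name per (subset, variant)."""
    source = [f"{dataset}_{s}_source" for s in subsets]
    seacrowd = [f"{dataset}_{s}_{v}_{schema}" for s, v in product(subsets, variants)]
    return source + seacrowd


for name in subset_config_names():
    print(name)
```

Each of these names can still expose its own train/val/test splits, so the split dimension stays free for its usual purpose.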
Hi @holylovenia, so I've gone ahead and implemented the 6 schemas above that you suggested. However, this now causes the test to fail, given that the schemas are no longer just source and seacrowd_<task>, but are now automatic_source, manual_source, automatic_l1_seacrowd_seq_label, automatic_l2_seacrowd_seq_label, etc. Is that alright?
May I know what's the error?
% python -m tests.test_seacrowd seacrowd/sea_datasets/ind_proner/ind_proner.py
INFO:__main__:args: Namespace(path='seacrowd/sea_datasets/ind_proner/ind_proner.py', schema=None, subset_id=None, data_dir=None, use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/ind_proner/ind_proner.py
INFO:__main__:self.SUBSET_ID: ind_proner
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: None
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.ind_proner.ind_proner
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.NAMED_ENTITY_RECOGNITION: 'NER'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'SEQ_LABEL'}
INFO:__main__:schemas_to_check: {'SEQ_LABEL'}
INFO:__main__:Checking load_dataset with config name ind_proner_source
/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py:2483: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
warnings.warn(
/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for ind_proner contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/ind_proner/ind_proner.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
warnings.warn(
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
Run all tests that check:
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/raileymontalan/Documents/seacrowd-datahub/tests/test_seacrowd.py", line 134, in setUp
    self.dataset_source = datasets.load_dataset(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py", line 2523, in load_dataset
    builder_instance = load_dataset_builder(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py", line 2232, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 371, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 592, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'ind_proner_source' not found. Available: ['ind_proner_automatic_source', 'ind_proner_manual_source', 'ind_proner_automatic_l1_seacrowd_seq_label', 'ind_proner_manual_l1_seacrowd_seq_label', 'ind_proner_automatic_l2_seacrowd_seq_label', 'ind_proner_manual_l2_seacrowd_seq_label']
----------------------------------------------------------------------
Ran 1 test in 0.014s
FAILED (errors=1)
(env-seacrowd) raileymontalan@Raileys-MacBook-Pro-2023 seacrowd-datahub %
Would I need to define a subset or schema variable somewhere?
Hi @raileymontalan, instead of automatic_source and manual_source, can you use ind_proner_automatic_source and ind_proner_manual_source?
Also, can you try python -m tests.test_seacrowd seacrowd/sea_datasets/ind_proner/ind_proner.py --subset_id=ind_proner_automatic?
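For reference, here is a small sketch of why the subset ID matters. In the log above, SUBSET_ID ind_proner led the test to look for ind_proner_source, so the source config name appears to be built as <subset_id>_source. The <subset_id>_seacrowd_<schema> pattern below is an assumption carried over from that, and expected_config_names / missing_configs are hypothetical helpers, not functions from tests.test_seacrowd.

```python
from typing import List, Sequence

# Config names reported as available in the ValueError above.
AVAILABLE = [
    "ind_proner_automatic_source",
    "ind_proner_manual_source",
    "ind_proner_automatic_l1_seacrowd_seq_label",
    "ind_proner_manual_l1_seacrowd_seq_label",
    "ind_proner_automatic_l2_seacrowd_seq_label",
    "ind_proner_manual_l2_seacrowd_seq_label",
]


def expected_config_names(subset_id: str, schemas: Sequence[str] = ("SEQ_LABEL",)) -> List[str]:
    """Names a test would look up, ASSUMING the patterns described above."""
    names = [f"{subset_id}_source"]
    names += [f"{subset_id}_seacrowd_{schema.lower()}" for schema in schemas]
    return names


def missing_configs(subset_id: str) -> List[str]:
    """Expected names that are not among the available configs."""
    return [name for name in expected_config_names(subset_id) if name not in AVAILABLE]


print(missing_configs("ind_proner"))            # default subset ID from the log
print(missing_configs("ind_proner_automatic"))  # subset ID suggested above
```

Under these assumptions, passing --subset_id=ind_proner_automatic resolves the source config, while the seacrowd config may still need the l1/l2 part in the subset ID to be found.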
Dataloader name: ind_proner/ind_proner.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?ind_proner