SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Indonesian PRONER #350

Closed SamuelCahyawijaya closed 7 months ago

SamuelCahyawijaya commented 8 months ago

Dataloader name: ind_proner/ind_proner.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?ind_proner

Dataset: ind_proner
Description: A corpus for Indonesian Product Named Entity Recognition (PRONER). We labeled a small amount of data and implemented a semi-supervised learning approach to label the rest of the data. We used conditional random fields (CRF) as the classifier.
Subsets: Automatically Labeled Data, Manually Labeled Data
Languages: ind
Tasks: Named Entity Recognition
License: Creative Commons Attribution 4.0 (cc-by-4.0)
Homepage: https://github.com/dziem/proner-labeled-text/tree/master
HF URL: -
Paper URL: http://www.winlp.org/wp-content/uploads/2020/final_papers/37_Paper.pdf

raileymontalan commented 8 months ago

self-assign

raileymontalan commented 8 months ago

Hi @holylovenia @SamuelCahyawijaya, for the NER task, how should we handle it if a token is assigned multiple labels?

holylovenia commented 8 months ago

Hi @raileymontalan, I noticed that the dataset sometimes has two labels (label_1|label_2) for a token instead of a single label (label_1). In those cases we cannot use label_1 for token n and then switch to label_2 for token n+1. For example, take lines 1003-1006:

shu     B-PRO|B-BRA
water   I-PRO|B-TYP
perfect I-PRO|I-TYP
liquid  I-PRO|I-TYP

If we use label_1 for water (i.e., I-PRO), we cannot use label_2 for perfect (i.e., I-TYP), because a new entity must begin with B- rather than I-. Considering this, maybe we should have 2 subsets for the source schema and 4 subsets for the seacrowd schema:

  1. ind_proner_automatic_source
  2. ind_proner_manual_source
  3. ind_proner_automatic_v1_seacrowd_seq_label --> Use the single labels as usual. For data samples with two labels, only use label_1.
  4. ind_proner_automatic_v2_seacrowd_seq_label --> Use the single labels as usual. For data samples with two labels, only use label_2.
  5. ind_proner_manual_v1_seacrowd_seq_label --> Use the single labels as usual. For data samples with two labels, only use label_1.
  6. ind_proner_manual_v2_seacrowd_seq_label --> Use the single labels as usual. For data samples with two labels, only use label_2.
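
A minimal sketch of the label selection described in the list above, in plain Python. The helper names `pick_label` and `is_valid_bio` are hypothetical and not part of the SEACrowd codebase; the `|` separator and the sample tokens come from the example earlier in this comment.

```python
from typing import List


def pick_label(raw: str, index: int) -> str:
    """Return the label at `index` (0 for label_1, 1 for label_2).

    Tokens that carry a single label return it regardless of `index`.
    """
    parts = raw.split("|")
    return parts[index] if len(parts) > 1 else parts[0]


def is_valid_bio(labels: List[str]) -> bool:
    """Check that every I-X directly follows a B-X or I-X of the same type."""
    prev_type = None
    for label in labels:
        if label == "O":
            prev_type = None
            continue
        prefix, _, ent_type = label.partition("-")
        if prefix == "I" and prev_type != ent_type:
            return False
        prev_type = ent_type
    return True


# Raw labels for "shu water perfect liquid" (lines 1003-1006 of the data).
raw = ["B-PRO|B-BRA", "I-PRO|B-TYP", "I-PRO|I-TYP", "I-PRO|I-TYP"]
v1 = [pick_label(r, 0) for r in raw]  # always label_1
v2 = [pick_label(r, 1) for r in raw]  # always label_2
# Switching from label_1 to label_2 mid-sequence breaks the BIO scheme.
mixed = [v1[0], v1[1], v2[2], v2[3]]
```

Here `v1` and `v2` each pass `is_valid_bio`, while `mixed` does not, which is why each subset sticks to one label position throughout.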

Let me know what you think about this.

cc: @raileymontalan @SamuelCahyawijaya @sabilmakbar

raileymontalan commented 8 months ago

Got it, I will implement this recommendation. Thanks @holylovenia!

raileymontalan commented 8 months ago

Hi @holylovenia, could the dataset have 2 splits instead? One split could be automatic, the other could be manual. Thoughts?

holylovenia commented 8 months ago

I think it's better as subsets. We reserve splits for train/val/test splits.
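
To illustrate the distinction, here is a rough sketch of how the six proposed subsets could be enumerated as config names, with train/val/test left as a separate dimension inside each subset. This is plain Python; `build_config_names` is a hypothetical helper, and the naming follows the list proposed earlier in this thread rather than any confirmed SEACrowd API.

```python
from typing import List

DATASET = "ind_proner"
SUBSETS = ["automatic", "manual"]  # how the data was labeled
LABEL_VARIANTS = ["v1", "v2"]      # which of the two labels is kept


def build_config_names() -> List[str]:
    """List one config name per subset/schema combination."""
    # Source schema: one config per subset.
    names = [f"{DATASET}_{s}_source" for s in SUBSETS]
    # seacrowd schema: one config per subset and label variant.
    names += [
        f"{DATASET}_{s}_{v}_seacrowd_seq_label"
        for s in SUBSETS
        for v in LABEL_VARIANTS
    ]
    return names


config_names = build_config_names()
```

Each name selects a subset; any train/val/test division would then live inside that subset as splits.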

raileymontalan commented 8 months ago

Hi @holylovenia, I've gone ahead and implemented the 6 schemas you suggested above. However, this now causes the test to fail, because the schemas are no longer just source and seacrowd_<task> but automatic_source, manual_source, automatic_l1_seacrowd_seq_label, automatic_l2_seacrowd_seq_label, etc. Is that alright?

holylovenia commented 8 months ago

May I know what's the error?

raileymontalan commented 8 months ago

% python -m tests.test_seacrowd seacrowd/sea_datasets/ind_proner/ind_proner.py
INFO:__main__:args: Namespace(path='seacrowd/sea_datasets/ind_proner/ind_proner.py', schema=None, subset_id=None, data_dir=None, use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/ind_proner/ind_proner.py
INFO:__main__:self.SUBSET_ID: ind_proner
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: None
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.ind_proner.ind_proner
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.NAMED_ENTITY_RECOGNITION: 'NER'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'SEQ_LABEL'}
INFO:__main__:schemas_to_check: {'SEQ_LABEL'}
INFO:__main__:Checking load_dataset with config name ind_proner_source
/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py:2483: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for ind_proner contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/ind_proner/ind_proner.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
Run all tests that check:
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/raileymontalan/Documents/seacrowd-datahub/tests/test_seacrowd.py", line 134, in setUp
    self.dataset_source = datasets.load_dataset(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py", line 2523, in load_dataset
    builder_instance = load_dataset_builder(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py", line 2232, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 371, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 592, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'ind_proner_source' not found. Available: ['ind_proner_automatic_source', 'ind_proner_manual_source', 'ind_proner_automatic_l1_seacrowd_seq_label', 'ind_proner_manual_l1_seacrowd_seq_label', 'ind_proner_automatic_l2_seacrowd_seq_label', 'ind_proner_manual_l2_seacrowd_seq_label']

----------------------------------------------------------------------
Ran 1 test in 0.014s

FAILED (errors=1)
(env-seacrowd) raileymontalan@Raileys-MacBook-Pro-2023 seacrowd-datahub % 

Would I need to define a subset or schema variable somewhere?

holylovenia commented 8 months ago

Hi @raileymontalan, instead of automatic_source and manual_source, can you use ind_proner_automatic_source and ind_proner_manual_source?

Also, can you try python -m tests.test_seacrowd seacrowd/sea_datasets/ind_proner/ind_proner.py --subset_id=ind_proner_automatic?
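
For reference, a small sketch of why the default run fails and why passing a subset ID may help. It assumes the test derives the source config name as `<subset_id>_source`, which is what the log line `Checking load_dataset with config name ind_proner_source` suggests. The functions `source_config_name` and `resolves` are hypothetical and are not the actual test code; the list of available configs is copied from the error message.

```python
# Config names reported as available in the ValueError above.
AVAILABLE = [
    "ind_proner_automatic_source",
    "ind_proner_manual_source",
    "ind_proner_automatic_l1_seacrowd_seq_label",
    "ind_proner_manual_l1_seacrowd_seq_label",
    "ind_proner_automatic_l2_seacrowd_seq_label",
    "ind_proner_manual_l2_seacrowd_seq_label",
]


def source_config_name(subset_id: str) -> str:
    """Assumed naming rule: the source config is `<subset_id>_source`."""
    return f"{subset_id}_source"


def resolves(subset_id: str) -> bool:
    """Return True if the derived source config name is available."""
    return source_config_name(subset_id) in AVAILABLE


default_ok = resolves("ind_proner")              # default subset ID from the log
automatic_ok = resolves("ind_proner_automatic")  # the suggested --subset_id
```

Under this assumption the default subset ID does not match any config, while `ind_proner_automatic` and `ind_proner_manual` do. How the test derives the seacrowd schema config name is not shown in the log, so this sketch does not cover it.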