SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #264 | Create dataset loader for mySentence #264 #291

Closed Gyyz closed 7 months ago

Gyyz commented 9 months ago

Closes #264


Gyyz commented 9 months ago

Test script output:

# yuz @ null in ~/workspace/seacrowd-datahub on git:mysentence x [21:27:22] 
$ python -m tests.test_seacrowd seacrowd/sea_datasets/mysentence/mysentence.py --subset_id mysentence 
INFO:__main__:args: Namespace(path='seacrowd/sea_datasets/mysentence/mysentence.py', schema=None, subset_id='mysentence', data_dir=None, use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/mysentence/mysentence.py
INFO:__main__:self.SUBSET_ID: mysentence
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: None
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.mysentence.mysentence
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.POS_TAGGING: 'POS'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'SEQ_LABEL'}
INFO:__main__:schemas_to_check: {'SEQ_LABEL'}
INFO:__main__:Checking load_dataset with config name mysentence_source
/Users/yuz/anaconda3/envs/ani/lib/python3.10/site-packages/datasets/load.py:2479: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
/Users/yuz/anaconda3/envs/ani/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for mysentence contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/mysentence/mysentence.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Generating train split: 40000 examples [00:02, 16399.02 examples/s]
Generating test split: 4712 examples [00:00, 16852.41 examples/s]
Generating validation split: 2414 examples [00:00, 17002.37 examples/s]
INFO:__main__:Checking load_dataset with config name mysentence_seacrowd_seq_label
/Users/yuz/anaconda3/envs/ani/lib/python3.10/site-packages/datasets/load.py:2479: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
/Users/yuz/anaconda3/envs/ani/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for mysentence contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/mysentence/mysentence.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Generating train split: 40000 examples [00:02, 16179.66 examples/s]
Generating test split: 4712 examples [00:00, 15659.38 examples/s]
Generating validation split: 2414 examples [00:00, 16716.31 examples/s]
INFO:__main__:Dataset sample [source]
{'id': '0', 'tokens': ['ဘာ', 'ရယ်', 'လို့', 'တိတိကျကျ', 'ထောက်မပြ', 'နိုင်', 'ပေမဲ့', 'ပြဿနာ', 'တစ်', 'ခု', 'ခု', 'ရှိ', 'တယ်', 'နဲ့', 'တူ', 'တယ်'], 'labels': ['B', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'N', 'N', 'N', 'E']}
INFO:__main__:Dataset sample [seacrowd_seq_label]
{'id': '0', 'tokens': ['ဘာ', 'ရယ်', 'လို့', 'တိတိကျကျ', 'ထောက်မပြ', 'နိုင်', 'ပေမဲ့', 'ပြဿနာ', 'တစ်', 'ခု', 'ခု', 'ရှိ', 'တယ်', 'နဲ့', 'တူ', 'တယ်'], 'labels': [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3]}
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 2414 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
train
==========
id: 40000
tokens: 543541
labels: 543541

test
==========
id: 4712
tokens: 63622
labels: 63622

validation
==========
id: 2414
tokens: 32315
labels: 32315

.
----------------------------------------------------------------------
Ran 1 test in 13.703s

OK
(ani) 
# yuz @ null in ~/workspace/seacrowd-datahub on git:mysentence x [21:27:45] 
$ python -m tests.test_seacrowd seacrowd/sea_datasets/mysentence/mysentence.py --subset_id mysentence+paragraphs
INFO:__main__:args: Namespace(path='seacrowd/sea_datasets/mysentence/mysentence.py', schema=None, subset_id='mysentence+paragraphs', data_dir=None, use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/mysentence/mysentence.py
INFO:__main__:self.SUBSET_ID: mysentence+paragraphs
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: None
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.mysentence.mysentence
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.POS_TAGGING: 'POS'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'SEQ_LABEL'}
INFO:__main__:schemas_to_check: {'SEQ_LABEL'}
INFO:__main__:Checking load_dataset with config name mysentence+paragraphs_source
/Users/yuz/anaconda3/envs/ani/lib/python3.10/site-packages/datasets/load.py:2479: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
/Users/yuz/anaconda3/envs/ani/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for mysentence contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/mysentence/mysentence.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Generating train split: 47002 examples [00:03, 13637.80 examples/s]
Generating test split: 5512 examples [00:00, 14003.42 examples/s]
Generating validation split: 3079 examples [00:00, 12728.72 examples/s]
INFO:__main__:Checking load_dataset with config name mysentence+paragraphs_seacrowd_seq_label
/Users/yuz/anaconda3/envs/ani/lib/python3.10/site-packages/datasets/load.py:2479: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
/Users/yuz/anaconda3/envs/ani/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for mysentence contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/mysentence/mysentence.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Generating train split: 47002 examples [00:03, 13424.81 examples/s]
Generating test split: 5512 examples [00:00, 13862.90 examples/s]
Generating validation split: 3079 examples [00:00, 12583.52 examples/s]
INFO:__main__:Dataset sample [source]
{'id': '0', 'tokens': ['နားလည်', 'ပါ', 'ပြီ'], 'labels': ['B', 'N', 'E']}
INFO:__main__:Dataset sample [seacrowd_seq_label]
{'id': '0', 'tokens': ['နားလည်', 'ပါ', 'ပြီ'], 'labels': [0, 2, 3]}
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 3079 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
train
==========
id: 47002
tokens: 834243
labels: 834243

test
==========
id: 5512
tokens: 96632
labels: 96632

validation
==========
id: 3079
tokens: 61782
labels: 61782

.
----------------------------------------------------------------------
Ran 1 test in 16.272s

OK
(ani) 
# yuz @ null in ~/workspace/seacrowd-datahub on git:mysentence x [21:28:08] 
$ python -m tests.test_seacrowd seacrowd/sea_datasets/mysentence/mysentence.py                                  
INFO:__main__:args: Namespace(path='seacrowd/sea_datasets/mysentence/mysentence.py', schema=None, subset_id=None, data_dir=None, use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/mysentence/mysentence.py
INFO:__main__:self.SUBSET_ID: mysentence
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: None
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.mysentence.mysentence
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.POS_TAGGING: 'POS'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'SEQ_LABEL'}
INFO:__main__:schemas_to_check: {'SEQ_LABEL'}
INFO:__main__:Checking load_dataset with config name mysentence_source
/Users/yuz/anaconda3/envs/ani/lib/python3.10/site-packages/datasets/load.py:2479: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
/Users/yuz/anaconda3/envs/ani/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for mysentence contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/mysentence/mysentence.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Generating train split: 40000 examples [00:02, 16274.95 examples/s]
Generating test split: 4712 examples [00:00, 16691.35 examples/s]
Generating validation split: 2414 examples [00:00, 16784.20 examples/s]
INFO:__main__:Checking load_dataset with config name mysentence_seacrowd_seq_label
/Users/yuz/anaconda3/envs/ani/lib/python3.10/site-packages/datasets/load.py:2479: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
/Users/yuz/anaconda3/envs/ani/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for mysentence contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/mysentence/mysentence.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Generating train split: 40000 examples [00:02, 16075.43 examples/s]
Generating test split: 4712 examples [00:00, 15542.23 examples/s]
Generating validation split: 2414 examples [00:00, 16705.88 examples/s]
INFO:__main__:Dataset sample [source]
{'id': '0', 'tokens': ['ဘာ', 'ရယ်', 'လို့', 'တိတိကျကျ', 'ထောက်မပြ', 'နိုင်', 'ပေမဲ့', 'ပြဿနာ', 'တစ်', 'ခု', 'ခု', 'ရှိ', 'တယ်', 'နဲ့', 'တူ', 'တယ်'], 'labels': ['B', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'N', 'N', 'N', 'E']}
INFO:__main__:Dataset sample [seacrowd_seq_label]
{'id': '0', 'tokens': ['ဘာ', 'ရယ်', 'လို့', 'တိတိကျကျ', 'ထောက်မပြ', 'နိုင်', 'ပေမဲ့', 'ပြဿနာ', 'တစ်', 'ခု', 'ခု', 'ရှိ', 'တယ်', 'နဲ့', 'တူ', 'တယ်'], 'labels': [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3]}
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 2414 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
train
==========
id: 40000
tokens: 543541
labels: 543541

test
==========
id: 4712
tokens: 63622
labels: 63622

validation
==========
id: 2414
tokens: 32315
labels: 32315

.
----------------------------------------------------------------------
Ran 1 test in 13.791s

OK
(ani)
# yuz @ null in ~/workspace/seacrowd-datahub on git:mysentence x [21:28:44] 
$ make check_file=seacrowd/sea_datasets/mysentence/mysentence.py                                                
black --line-length 250 --target-version py38 seacrowd/sea_datasets/mysentence/mysentence.py
All done! ✨ 🍰 ✨
1 file left unchanged.
isort seacrowd/sea_datasets/mysentence/mysentence.py
flake8 seacrowd/sea_datasets/mysentence/mysentence.py --max-line-length 250
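
For reviewers comparing the `source` and `seacrowd_seq_label` samples in the logs above: the integer labels appear to be the string tags indexed in the order B, O, N, E. The snippet below is only a minimal sketch of that mapping, not the code in `mysentence.py`. The label order is inferred from the two printed samples, and the names `LABELS`, `encode_labels`, and `is_aligned` are hypothetical, introduced here for illustration.

```python
# Label order inferred from the sample output in this thread
# (B -> 0, O -> 1, N -> 2, E -> 3). This is an assumption: the actual
# label list inside mysentence.py is not shown here.
LABELS = ["B", "O", "N", "E"]
LABEL2ID = {label: idx for idx, label in enumerate(LABELS)}


def encode_labels(labels):
    """Map string tags to integer ids using the assumed label order."""
    return [LABEL2ID[label] for label in labels]


def is_aligned(example):
    """Return True when an example has exactly one label per token."""
    return len(example["tokens"]) == len(example["labels"])


# Tags of the short sample printed for the mysentence+paragraphs subset.
short_sample = ["B", "N", "E"]
print(encode_labels(short_sample))

# Tags of the 16-token sample printed for the mysentence subset.
long_sample = ["B"] + ["O"] * 11 + ["N"] * 3 + ["E"]
print(encode_labels(long_sample))
```

The one-label-per-token check matches what the schema statistics report: `tokens` and `labels` have equal counts in every split of both subsets (for example 543541 and 543541 for the `mysentence` train split).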