SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #115 | Create dataset loader for PhoMT dataset #489

Closed yana-xuyan closed 6 months ago

yana-xuyan commented 7 months ago

Please name your PR title and the first line of PR message after the issue it will close. You can use the following examples:

Title: Closes #{ISSUE_NUMBER} | Add/Update Dataloader {DATALOADER_NAME}

First line PR Message: Closes #{ISSUE_NUMBER}

where you replace the {ISSUE_NUMBER} with the one corresponding to your dataset.
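
As a rough illustration of the naming convention above, the following Python sketch checks whether a title starts with `Closes #{ISSUE_NUMBER} | `. The pattern and the helper name `parse_pr_title` are assumptions made for this example; they are not part of the SEACrowd tooling.

```python
import re

# Assumed pattern for the convention "Closes #{ISSUE_NUMBER} | <description>".
TITLE_PATTERN = re.compile(r"^Closes #(\d+) \| (.+)$")


def parse_pr_title(title):
    """Return (issue_number, description), or None if the title does not match."""
    match = TITLE_PATTERN.match(title)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


# The title of this PR follows the convention.
print(parse_pr_title("Closes #115 | Create dataset loader for PhoMT dataset"))
# A title without the "Closes #N | " prefix is rejected.
print(parse_pr_title("Create dataset loader for PhoMT dataset"))
```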


yana-xuyan commented 7 months ago

Test command:

python -m tests.test_seacrowd seacrowd/sea_datasets/phomt/phomt.py --subset_id phomt_en_vi --data_dir ../PhoMT/

The output is as follows:

INFO:__main__:args: Namespace(data_dir='../PhoMT/', path='seacrowd/sea_datasets/phomt/phomt.py', schema=None, subset_id='phomt_en_vi', use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/phomt/phomt.py
INFO:__main__:self.SUBSET_ID: phomt_en_vi
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: ../PhoMT/
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.phomt.phomt
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.MACHINE_TRANSLATION: 'MT'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'T2T'}
INFO:__main__:schemas_to_check: {'T2T'}
INFO:__main__:Checking load_dataset with config name phomt_en_vi_source
Downloading and preparing dataset phomt/phomt_en_vi_source to /home/xuyan/.cache/huggingface/datasets/phomt/phomt_en_vi_source-data_dir=..%2FPhoMT/1.0.0/48e4c08ed728f934efad0838d4c7731a6a266db308a00276698ab971abeda39c...
Dataset phomt downloaded and prepared to /home/xuyan/.cache/huggingface/datasets/phomt/phomt_en_vi_source-data_dir=..%2FPhoMT/1.0.0/48e4c08ed728f934efad0838d4c7731a6a266db308a00276698ab971abeda39c. Subsequent calls will reuse this data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 65.80it/s]
INFO:__main__:Checking load_dataset with config name phomt_en_vi_seacrowd_t2t
Downloading and preparing dataset phomt/phomt_en_vi_seacrowd_t2t to /home/xuyan/.cache/huggingface/datasets/phomt/phomt_en_vi_seacrowd_t2t-data_dir=..%2FPhoMT/1.0.0/48e4c08ed728f934efad0838d4c7731a6a266db308a00276698ab971abeda39c...
Dataset phomt downloaded and prepared to /home/xuyan/.cache/huggingface/datasets/phomt/phomt_en_vi_seacrowd_t2t-data_dir=..%2FPhoMT/1.0.0/48e4c08ed728f934efad0838d4c7731a6a266db308a00276698ab971abeda39c. Subsequent calls will reuse this data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 65.81it/s]
WARNING:datasets.builder:Found cached dataset phomt (/home/xuyan/.cache/huggingface/datasets/phomt/phomt_en_vi_source-data_dir=..%2FPhoMT/1.0.0/48e4c08ed728f934efad0838d4c7731a6a266db308a00276698ab971abeda39c)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 67.31it/s]
INFO:__main__:Dataset sample [source]
{'id': '0', 'text_1': 'It begins with a countdown.', 'text_2': 'Câu chuyện bắt đầu với buổi lễ đếm ngược.', 'text_1_name': 'en', 'text_2_name': 'vi'}
WARNING:datasets.builder:Found cached dataset phomt (/home/xuyan/.cache/huggingface/datasets/phomt/phomt_en_vi_seacrowd_t2t-data_dir=..%2FPhoMT/1.0.0/48e4c08ed728f934efad0838d4c7731a6a266db308a00276698ab971abeda39c)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 69.18it/s]
INFO:__main__:Dataset sample [seacrowd_t2t]
{'id': '0', 'text_1': 'It begins with a countdown.', 'text_2': 'Câu chuyện bắt đầu với buổi lễ đếm ngược.', 'text_1_name': 'en', 'text_2_name': 'vi'}
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 19151 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
train
==========
id: 2977999
text_1: 2977999
text_2: 2977999
text_1_name: 2977999
text_2_name: 2977999

validation
==========
id: 18719
text_1: 18719
text_2: 18719
text_1_name: 18719
text_2_name: 18719

test
==========
id: 19151
text_1: 19151
text_2: 19151
text_1_name: 19151
text_2_name: 19151

.
----------------------------------------------------------------------
Ran 1 test in 940.637s

OK

holylovenia commented 7 months ago

Hi @yana-xuyan, thanks for your contribution to SEACrowd! Could you please use "eng" and "vie" in the subset names and for the text_1_name and text_2_name variables?

Just a friendly reminder. 🙏
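
For illustration, the requested renaming could look roughly like the sketch below. It assumes the loader currently uses the two-letter codes shown in the test output ("en", "vi") and that subset names are underscore-separated; the mapping and the helper names are hypothetical and are not taken from the PhoMT loader itself.

```python
# Hypothetical mapping from the two-letter codes seen in the test output
# to the three-letter codes requested in the review.
TWO_TO_THREE_LETTER = {"en": "eng", "vi": "vie"}


def to_three_letter(code):
    """Return the three-letter code, leaving unknown values unchanged."""
    return TWO_TO_THREE_LETTER.get(code, code)


def rename_subset(subset_id):
    """Rewrite the language parts of a subset id such as 'phomt_en_vi'."""
    return "_".join(to_three_letter(part) for part in subset_id.split("_"))


def rename_example(example):
    """Return a copy of a text-to-text example with updated language names."""
    updated = dict(example)
    for key in ("text_1_name", "text_2_name"):
        updated[key] = to_three_letter(example[key])
    return updated


sample = {
    "id": "0",
    "text_1": "It begins with a countdown.",
    "text_2": "Câu chuyện bắt đầu với buổi lễ đếm ngược.",
    "text_1_name": "en",
    "text_2_name": "vi",
}
print(rename_subset("phomt_en_vi"))
print(rename_example(sample))
```

With this change, the test command would take the renamed subset id through `--subset_id` in place of `phomt_en_vi`.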