Closes #115 | Create dataset loader for PhoMT dataset

yana-xuyan commented 7 months ago

Please name your PR title and the first line of PR message after the issue it will close. You can use the following examples:

Title: Closes #{ISSUE_NUMBER} | Add/Update Dataloader {DATALOADER_NAME}

First line PR Message: Closes #{ISSUE_NUMBER}

Replace {ISSUE_NUMBER} with the number of the issue that corresponds to your dataset.

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
[x] If my dataset is local, I have provided the output of the unit tests in the PR (please copy and paste it). This is OPTIONAL for public datasets, as we can test these without access to the data files.
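
For illustration, the config names that the test suite checks for a subset follow a visible pattern in the log below (phomt_en_vi_source and phomt_en_vi_seacrowd_t2t). The following is a minimal sketch of that pattern only. The helper name config_names is hypothetical and not part of SEACrowd, and the real BUILDER_CONFIGS in phomt.py may be built differently.

```python
from typing import List


def config_names(subset_id: str, schema: str = "t2t") -> List[str]:
    """Return the source and seacrowd config names for one subset.

    Hypothetical helper: it only mirrors the naming pattern seen in the
    test log, "<subset_id>_source" and "<subset_id>_seacrowd_<schema>".
    """
    return [
        f"{subset_id}_source",
        f"{subset_id}_seacrowd_{schema.lower()}",
    ]


if __name__ == "__main__":
    # Subset id taken from the --subset_id argument used in this PR.
    for name in config_names("phomt_en_vi"):
        print(name)
```

The --subset_id flag takes the name without the _source or _seacrowd_t2t suffix, which is why the helper adds both suffixes itself.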

yana-xuyan commented 7 months ago

Command line:

python -m tests.test_seacrowd seacrowd/sea_datasets/phomt/phomt.py --subset_id phomt_en_vi --data_dir ../PhoMT/

The output is as follows:

INFO:__main__:args: Namespace(data_dir='../PhoMT/', path='seacrowd/sea_datasets/phomt/phomt.py', schema=None, subset_id='phomt_en_vi', use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/phomt/phomt.py
INFO:__main__:self.SUBSET_ID: phomt_en_vi
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: ../PhoMT/
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.phomt.phomt
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.MACHINE_TRANSLATION: 'MT'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'T2T'}
INFO:__main__:schemas_to_check: {'T2T'}
INFO:__main__:Checking load_dataset with config name phomt_en_vi_source
Downloading and preparing dataset phomt/phomt_en_vi_source to /home/xuyan/.cache/huggingface/datasets/phomt/phomt_en_vi_source-data_dir=..%2FPhoMT/1.0.0/48e4c08ed728f934efad0838d4c7731a6a266db308a00276698ab971abeda39c...
Dataset phomt downloaded and prepared to /home/xuyan/.cache/huggingface/datasets/phomt/phomt_en_vi_source-data_dir=..%2FPhoMT/1.0.0/48e4c08ed728f934efad0838d4c7731a6a266db308a00276698ab971abeda39c. Subsequent calls will reuse this data.
100%|██████████| 3/3 [00:00<00:00, 65.80it/s]
INFO:__main__:Checking load_dataset with config name phomt_en_vi_seacrowd_t2t
Downloading and preparing dataset phomt/phomt_en_vi_seacrowd_t2t to /home/xuyan/.cache/huggingface/datasets/phomt/phomt_en_vi_seacrowd_t2t-data_dir=..%2FPhoMT/1.0.0/48e4c08ed728f934efad0838d4c7731a6a266db308a00276698ab971abeda39c...
Dataset phomt downloaded and prepared to /home/xuyan/.cache/huggingface/datasets/phomt/phomt_en_vi_seacrowd_t2t-data_dir=..%2FPhoMT/1.0.0/48e4c08ed728f934efad0838d4c7731a6a266db308a00276698ab971abeda39c. Subsequent calls will reuse this data.
100%|██████████| 3/3 [00:00<00:00, 65.81it/s]
WARNING:datasets.builder:Found cached dataset phomt (/home/xuyan/.cache/huggingface/datasets/phomt/phomt_en_vi_source-data_dir=..%2FPhoMT/1.0.0/48e4c08ed728f934efad0838d4c7731a6a266db308a00276698ab971abeda39c)
100%|██████████| 3/3 [00:00<00:00, 67.31it/s]
INFO:__main__:Dataset sample [source]
{'id': '0', 'text_1': 'It begins with a countdown.', 'text_2': 'Câu chuyện bắt đầu với buổi lễ đếm ngược.', 'text_1_name': 'en', 'text_2_name': 'vi'}
WARNING:datasets.builder:Found cached dataset phomt (/home/xuyan/.cache/huggingface/datasets/phomt/phomt_en_vi_seacrowd_t2t-data_dir=..%2FPhoMT/1.0.0/48e4c08ed728f934efad0838d4c7731a6a266db308a00276698ab971abeda39c)
100%|██████████| 3/3 [00:00<00:00, 69.18it/s]
INFO:__main__:Dataset sample [seacrowd_t2t]
{'id': '0', 'text_1': 'It begins with a countdown.', 'text_2': 'Câu chuyện bắt đầu với buổi lễ đếm ngược.', 'text_1_name': 'en', 'text_2_name': 'vi'}
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 19151 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
train
==========
id: 2977999
text_1: 2977999
text_2: 2977999
text_1_name: 2977999
text_2_name: 2977999

validation
==========
id: 18719
text_1: 18719
text_2: 18719
text_1_name: 18719
text_2_name: 18719

test
==========
id: 19151
text_1: 19151
text_2: 19151
text_1_name: 19151
text_2_name: 19151

.
----------------------------------------------------------------------
Ran 1 test in 940.637s

OK
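
The records in the log above have five string fields (id, text_1, text_2, text_1_name, text_2_name), and the suite also checks that IDs are unique. As a rough sketch of how such records could be produced from two line-aligned text files, consider the following. This is not the code in phomt.py: the function names are hypothetical, and the assumption that the data arrives as parallel lines (one sentence per line on each side) is mine.

```python
from itertools import zip_longest
from typing import Dict, Iterable, Iterator, List, Tuple

_MISSING = object()  # sentinel used to detect files of unequal length


def generate_t2t_examples(
    src_lines: Iterable[str],
    tgt_lines: Iterable[str],
    src_name: str = "en",
    tgt_name: str = "vi",
) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (key, example) pairs shaped like the samples in the log."""
    pairs = zip_longest(src_lines, tgt_lines, fillvalue=_MISSING)
    for idx, (src, tgt) in enumerate(pairs):
        if src is _MISSING or tgt is _MISSING:
            raise ValueError("source and target have different line counts")
        yield idx, {
            "id": str(idx),
            "text_1": src.strip(),
            "text_2": tgt.strip(),
            "text_1_name": src_name,
            "text_2_name": tgt_name,
        }


def ids_are_unique(examples: List[Tuple[int, Dict[str, str]]]) -> bool:
    """Return True when no two examples share the same "id" value."""
    ids = [example["id"] for _, example in examples]
    return len(ids) == len(set(ids))


if __name__ == "__main__":
    # Sentence pair copied from the dataset sample shown in the log.
    english = ["It begins with a countdown.\n"]
    vietnamese = ["Câu chuyện bắt đầu với buổi lễ đếm ngược.\n"]
    examples = list(generate_t2t_examples(english, vietnamese))
    print(examples[0][1])
    print(ids_are_unique(examples))
```

Using zip_longest with a sentinel keeps the iteration lazy while still failing loudly if one side has more lines than the other, which a plain zip would silently truncate.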

holylovenia commented 7 months ago

Hi @yana-xuyan, thanks for your contribution to SEACrowd! Could you please use "eng" and "vie" in the subset names and for the text_1_name and text_2_name variables?

Just a friendly reminder. 🙏
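
To make the requested rename concrete, here is a small sketch that rewrites the two-letter codes in a subset name to the three-letter ones ("eng" and "vie" are the ISO 639-3 codes for English and Vietnamese). The mapping table and the function name to_three_letter_subset are hypothetical and only illustrate the change. How the subset names are actually defined in phomt.py is not shown in this thread.

```python
# Two-letter to three-letter language codes for the languages in this PR.
TWO_TO_THREE_LETTER = {"en": "eng", "vi": "vie"}


def to_three_letter_subset(subset_id: str) -> str:
    """Replace two-letter language codes in an underscore-separated name.

    Parts that are not in the mapping (such as the dataset name) are
    left unchanged.
    """
    parts = subset_id.split("_")
    return "_".join(TWO_TO_THREE_LETTER.get(part, part) for part in parts)


if __name__ == "__main__":
    print(to_three_letter_subset("phomt_en_vi"))
    # The same mapping would apply to the text_1_name / text_2_name values.
    print(TWO_TO_THREE_LETTER["en"], TWO_TO_THREE_LETTER["vi"])
```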

SEACrowd / seacrowd-datahub

Closes #115 | Create dataset loader for PhoMT dataset #489
