Closes #424 | Add Dataloader Bactrian-X

Closes #424

I implemented one config per language subset, so config names look like this: bactrian_x_id_source, bactrian_x_km_seacrowd_t2t, etc. When testing, pass bactrian_x_<subset> to the --subset_id parameter.
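For reference, the naming scheme can be sketched as follows. This is only an illustration: the helper names and the two-language list are hypothetical and not copied from the dataloader script.

```python
# Hypothetical sketch of the config naming scheme described in this PR.
DATASET_NAME = "bactrian_x"
SCHEMAS = ("source", "seacrowd_t2t")


def config_names(subsets):
    """Return one config name per (language subset, schema) pair."""
    return [f"{DATASET_NAME}_{s}_{schema}" for s in subsets for schema in SCHEMAS]


def subset_id(subset):
    """Value to pass to --subset_id (no schema suffix)."""
    return f"{DATASET_NAME}_{subset}"


# Illustrative subset list only; the real dataloader covers more languages.
example_names = config_names(["id", "km"])
print(example_names)
print(subset_id("km"))
```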

The source schema has one extra field, input, alongside instruction. I merged the two manually into text_1 of the seacrowd_t2t schema as "Instruction: {instruction}\nInput: {input}". I don't know whether that is allowed, so let's discuss.
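A minimal sketch of that mapping, to make the discussion concrete. The function name to_seacrowd_t2t is hypothetical; the output keys and the text_1_name / text_2_name values follow the seacrowd_t2t sample shown in this PR.

```python
def to_seacrowd_t2t(row):
    """Map one source-schema row to a seacrowd_t2t-style dict (illustrative only)."""
    return {
        "id": row["id"],
        # instruction and input are merged into a single text_1 field
        "text_1": f"Instruction: {row['instruction']}\nInput: {row['input']}",
        "text_2": row["output"],
        "text_1_name": "instruction + input",
        "text_2_name": "output",
    }


# Made-up row for illustration; real rows come from the dataset.
example = to_seacrowd_t2t(
    {"id": "demo-1", "instruction": "Say hi", "input": "", "output": "hi"}
)
print(example)
```

When input is empty, as in the Khmer sample, text_1 still ends with "Input: ".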

Note that for the Khmer subset, the loaded data looks as follows:

INFO:__main__:Dataset sample [source]
{'instruction': 'តើ\u200bស៊ុម\u200bក្រិត\u200bក្នុង\u200bការ\u200bថត\u200bរូប\u200bតារាសាស្ត្រ\u200bមាន\u200bអ្វីខ្លះ?', 'input': '', 'id': 'dolly-10828', 'output': 'ស៊ុម\u200bក្រិត\u200bក្នុង\u200bការ\u200bថត\u200bរូប\u200bតារាសាស្ត្រ\u200bវាប្រកាស\u200bពី\u200bប្រភេទ\u200bនិង\u200bលក្ខណៈ\u200bរបស់\u200bរូបភាព\u200b។ វាតម្រូវ\u200bឲ្យ\u200bទៅ\u200bដល់\u200bលក្ខណៈ\u200bប្រូបាបនៃ\u200bរូបភាព\u200bដែល\u200bប្រើ\u200bដើម្បី\u200bថត\u200bបន្ទាត់\u200bស្មើ\u200bនឹង\u200bតម្លៃ\u200bមួយ\u200bចំនួន\u200b។ ក្នុង\u200bបំណង\u200bនេះ\u200bនៃ\u200bការថត\u200bរូប\u200bតារាសាស្ត្រ\u200bមាន\u200bករណី\u200bដូចជា\u200bប្រូបាប\u200bពណ៌\u200bរូបភាព\u200bក្បែរ\u200bតែ\u200bមិន\u200bត្រូវ\u200bបាន\u200bប្រាកដ\u200bនៅពេលដែល\u200bវា\u200bល្អបំផុត\u200bទេ\u200b។ គុណភាព\u200bរូបភាព\u200bមាន\u200bតម្លៃ\u200bបំផុត\u200bនៅ\u200bពេល\u200bដែល\u200bវា\u200bអាច\u200bប្រើ\u200bបាន\u200bទៅ\u200bនឹង\u200bប្រិតប្រយ័ត្ន\u200bនៃ\u200bវាល\u200bផ្សេងទៀត\u200bដែល\u200bវា\u200bត្រូវ\u200bបានផ្ដល់។'}
INFO:__main__:Dataset sample [seacrowd_t2t]
{'id': 'dolly-10828', 'text_1': 'Instruction: តើ\u200bស៊ុម\u200bក្រិត\u200bក្នុង\u200bការ\u200bថត\u200bរូប\u200bតារាសាស្ត្រ\u200bមាន\u200bអ្វីខ្លះ?\nInput: ', 'text_2': 'ស៊ុម\u200bក្រិត\u200bក្នុង\u200bការ\u200bថត\u200bរូប\u200bតារាសាស្ត្រ\u200bវាប្រកាស\u200bពី\u200bប្រភេទ\u200bនិង\u200bលក្ខណៈ\u200bរបស់\u200bរូបភាព\u200b។ វាតម្រូវ\u200bឲ្យ\u200bទៅ\u200bដល់\u200bលក្ខណៈ\u200bប្រូបាបនៃ\u200bរូបភាព\u200bដែល\u200bប្រើ\u200bដើម្បី\u200bថត\u200bបន្ទាត់\u200bស្មើ\u200bនឹង\u200bតម្លៃ\u200bមួយ\u200bចំនួន\u200b។ ក្នុង\u200bបំណង\u200bនេះ\u200bនៃ\u200bការថត\u200bរូប\u200bតារាសាស្ត្រ\u200bមាន\u200bករណី\u200bដូចជា\u200bប្រូបាប\u200bពណ៌\u200bរូបភាព\u200bក្បែរ\u200bតែ\u200bមិន\u200bត្រូវ\u200bបាន\u200bប្រាកដ\u200bនៅពេលដែល\u200bវា\u200bល្អបំផុត\u200bទេ\u200b។ គុណភាព\u200bរូបភាព\u200bមាន\u200bតម្លៃ\u200bបំផុត\u200bនៅ\u200bពេល\u200bដែល\u200bវា\u200bអាច\u200bប្រើ\u200bបាន\u200bទៅ\u200bនឹង\u200bប្រិតប្រយ័ត្ន\u200bនៃ\u200bវាល\u200bផ្សេងទៀត\u200bដែល\u200bវា\u200bត្រូវ\u200bបានផ្ដល់។', 'text_1_name': 'instruction + input', 'text_2_name': 'output'}

At first, I thought this was an encoding problem that needed to be fixed. But it turns out I get the same result when loading from HF directly, as follows:

from datasets import load_dataset

data = load_dataset("MBZUAI/Bactrian-X", "km")
print(data['train'][0])
# {'instruction': 'តើ\u200bស៊ុម\u200bក្រិត\u200bក្នុង\u200bការ\u200bថត\u200bរូប\u200bតារាសាស្ត្រ\u200bមាន\u200bអ្វីខ្លះ?', 'input': '', 'id': 'dolly-10828', 'output': 'ស៊ុម\u200bក្រិត\u200bក្នុង\u200bការ\u200bថត\u200bរូប\u200bតារាសាស្ត្រ\u200bវាប្រកាស\u200bពី\u200bប្រភេទ\u200bនិង\u200bលក្ខណៈ\u200bរបស់\u200bរូបភាព\u200b។ វាតម្រូវ\u200bឲ្យ\u200bទៅ\u200bដល់\u200bលក្ខណៈ\u200bប្រូបាបនៃ\u200bរូបភាព\u200bដែល\u200bប្រើ\u200bដើម្បី\u200bថត\u200bបន្ទាត់\u200bស្មើ\u200bនឹង\u200bតម្លៃ\u200bមួយ\u200bចំនួន\u200b។ ក្នុង\u200bបំណង\u200bនេះ\u200bនៃ\u200bការថត\u200bរូប\u200bតារាសាស្ត្រ\u200bមាន\u200bករណី\u200bដូចជា\u200bប្រូបាប\u200bពណ៌\u200bរូបភាព\u200bក្បែរ\u200bតែ\u200bមិន\u200bត្រូវ\u200bបាន\u200bប្រាកដ\u200bនៅពេលដែល\u200bវា\u200bល្អបំផុត\u200bទេ\u200b។ គុណភាព\u200bរូបភាព\u200bមាន\u200bតម្លៃ\u200bបំផុត\u200bនៅ\u200bពេល\u200bដែល\u200bវា\u200bអាច\u200bប្រើ\u200bបាន\u200bទៅ\u200bនឹង\u200bប្រិតប្រយ័ត្ន\u200bនៃ\u200bវាល\u200bផ្សេងទៀត\u200bដែល\u200bវា\u200bត្រូវ\u200bបានផ្ដល់។'}
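A likely explanation, offered as a sketch rather than a confirmed diagnosis: \u200b is U+200B (ZERO WIDTH SPACE), which appears between words in this Khmer text. Python escapes non-printable characters when it shows the repr of a string, which is what happens when a dict is printed, so the escapes come from the display and not from the stored data. The short sample string below is taken from the output above.

```python
import unicodedata

ZWSP = "\u200b"
sample = "ថត\u200bរូប"  # fragment of the Khmer sample above

# U+200B is a real Unicode character, not encoding debris.
zwsp_name = unicodedata.name(ZWSP)
print(zwsp_name)

# It is non-printable, so repr() (used when printing a dict) escapes it.
print(ZWSP.isprintable())
print(repr(sample))

# Printing the string itself emits the character unescaped.
print(sample)

# The character can be stripped if a downstream task does not want it.
stripped = sample.replace(ZWSP, "")
```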

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

SEACrowd / seacrowd-datahub

Closes #424 | Add Dataloader Bactrian-X #552