Closes #513 | Add Dataloader GNOME

Closes #513

Based on discussion #456, I implemented all possible language pairs listed on the datasheet.

For parallel MT dataloaders, we agreed to provide a subset for every translation direction that involves at least one SEA language.

Thus, config names will look like `gnome_en-id_source`, `gnome_tl-vi_seacrowd_t2t`, etc. When testing, pass `gnome_<subset>` (for example `gnome_en-id`) to the `--subset_id` parameter.
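For reviewers, here is a minimal sketch of how the subset and config names described above can be enumerated. It is not the code in `gnome.py`; the helper names (`subset_ids`, `config_names`) are hypothetical, and the language list is copied from the test script in this PR. With this list the "at least one SEA language" filter never excludes anything, because `en` is the only non-SEA code.

```python
from itertools import permutations

# Language codes copied from the test script in this PR.
LANGS = ["en", "vi", "my", "id", "th", "tl", "ms", "lo"]
# ASSUMPTION: "en" is the only non-SEA language in the list.
SEA_LANGS = set(LANGS) - {"en"}
SCHEMAS = ["source", "seacrowd_t2t"]


def subset_ids(langs=LANGS):
    """Return every directed pair "src-tgt" that involves at least one SEA language."""
    return [
        f"{src}-{tgt}"
        for src, tgt in permutations(langs, 2)
        if src in SEA_LANGS or tgt in SEA_LANGS
    ]


def config_names(langs=LANGS):
    """Return the config name for each schema of every subset."""
    return [
        f"gnome_{pair}_{schema}"
        for pair in subset_ids(langs)
        for schema in SCHEMAS
    ]


if __name__ == "__main__":
    print(len(subset_ids()), "subsets,", len(config_names()), "configs")
```

Eight languages give 8 x 7 = 56 directed pairs, which matches the number of tests the script below runs.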

Here is a useful script to test all possible language pairs:

To run this script, save it to a file (e.g., `gnome_tests.sh`), make it executable with `chmod +x gnome_tests.sh`, and execute it with `./gnome_tests.sh`. Ensure you run the script from the seacrowd root directory.

```bash
#!/bin/bash

mkdir -p data/gnome

SUBSETS=("en" "vi" "my" "id" "th" "tl" "ms" "lo")

success_count=0
fail_count=0
declare -a failed_tests

for src_lang in "${SUBSETS[@]}"; do
    for tgt_lang in "${SUBSETS[@]}"; do
        if [ "$src_lang" != "$tgt_lang" ]; then
            lang_pair="${src_lang}-${tgt_lang}"
            python_command="python -m tests.test_seacrowd seacrowd/sea_datasets/gnome/gnome.py --subset_id=gnome_${lang_pair}"
            output_file="data/gnome/${lang_pair}.txt"
            temp_output_file="data/gnome/${lang_pair}_temp.txt" # for cleaner cli output

            echo "Testing language pair: $lang_pair"

            # run the test, save the output, and redirect verbose output to a temporary file
            script -q -c "$python_command" "$temp_output_file" > /dev/null
            cat "$temp_output_file" > "$output_file"
            rm "$temp_output_file"

            # check if the test was successful
            if grep -q "OK" "$output_file"; then
                echo "Test for $lang_pair: SUCCESS"
                ((success_count++))
            else
                echo "Test for $lang_pair: FAILURE"
                failed_tests+=("$lang_pair")
                ((fail_count++))
            fi
        fi
    done
done

echo "-----------------------"
echo "SUMMARY: $((success_count + fail_count)) tests total"
echo "Success: $success_count"
echo "Failure: $fail_count"

if [ ${#failed_tests[@]} -gt 0 ]; then
    echo "Failed tests:"
    for test in "${failed_tests[@]}"; do
        echo "- $test"
    done
fi
```

This is the expected test result summary:

SUMMARY: 56 tests total
Success: 50
Failure: 6
Failed tests:
- my-tl
- my-lo
- tl-my
- tl-lo
- lo-my
- lo-tl
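As a cross-check of the numbers above, the following sketch derives the summary from the language list and the three unordered pairs that appear in the failure list. The function name `expected_summary` is hypothetical and exists only for this illustration.

```python
from itertools import permutations

# Language codes copied from the test script in this PR.
LANGS = ["en", "vi", "my", "id", "th", "tl", "ms", "lo"]

# Unordered pairs behind the six failed tests listed above.
MISSING_PAIRS = [("my", "tl"), ("my", "lo"), ("tl", "lo")]


def expected_summary(langs=LANGS, missing=MISSING_PAIRS):
    """Count directed pairs and mark those whose unordered pair is missing."""
    all_pairs = [f"{src}-{tgt}" for src, tgt in permutations(langs, 2)]
    missing_set = {frozenset(pair) for pair in missing}
    failed = [p for p in all_pairs if frozenset(p.split("-")) in missing_set]
    return {
        "total": len(all_pairs),
        "success": len(all_pairs) - len(failed),
        "failure": len(failed),
        "failed": failed,
    }


if __name__ == "__main__":
    print(expected_summary())
```

Each missing unordered pair accounts for two directed subsets, so three missing pairs give the six failures and 56 - 6 = 50 successes.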

These six failures occur because no suitable corpus exists for those pairs, as determined by querying the OPUS API (see the dataloader implementation). I kept the failing subsets in the dataloader rather than listing them as exceptions in the datasheet later, which I think is the cleaner option.
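For context, an availability check against OPUS might look roughly like the sketch below. This is not the code in `gnome.py`: the endpoint URL, the query parameters, and the `corpora` key in the JSON response are assumptions about the public OPUS API, and the helper names are hypothetical. The sketch only builds the query URL and interprets an already decoded response, so it makes no network call.

```python
from urllib.parse import urlencode

# ASSUMPTION: endpoint of the public OPUS API; not taken from this PR.
OPUS_API = "https://opus.nlpl.eu/opusapi/"


def build_query_url(src, tgt, corpus="GNOME"):
    """Build a query URL for one language pair (parameter names are assumed)."""
    params = {
        "corpus": corpus,
        "source": src,
        "target": tgt,
        "preprocessing": "moses",
        "version": "latest",
    }
    return f"{OPUS_API}?{urlencode(params)}"


def has_corpus(response):
    """Return True if the decoded JSON response lists at least one corpus.

    ASSUMPTION: matches are returned as a list under the "corpora" key.
    """
    return bool(response.get("corpora"))


if __name__ == "__main__":
    print(build_query_url("my", "tl"))
```

Under these assumptions, a pair such as `my-tl` would come back with an empty list, which is why the corresponding subsets fail the test.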

Checkbox

- [x] Confirm that this PR is linked to the dataset issue.
- [x] Create the dataloader script `seacrowd/sea_datasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- [x] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- [x] Implement `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- [x] Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- [x] Confirm dataloader script works with `datasets.load_dataset` function.
- [x] Confirm that your dataloader script passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py`.
- [ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

