SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #513 | Add Dataloader GNOME #563

Closed akhdanfadh closed 5 months ago

akhdanfadh commented 6 months ago

Closes #513

Based on discussion #456, I implemented all possible language pairs listed on the datasheet.

For parallel MT dataloaders, we agreed upon having a subset for every possible direction with at least 1 SEA language.

Thus, configs will look like this: `gnome_en-id_source`, `gnome_tl-vi_seacrowd_t2t`, etc. When testing, pass `gnome_<subset>` (e.g., `gnome_en-id`) to the `--subset_id` parameter.
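To make the naming scheme concrete, here is a minimal sketch that enumerates the subset IDs and config names. It is not the code in `gnome.py`: the helper names `subset_ids` and `config_names` are hypothetical, the language list is copied from the test script in this PR, and treating `en` as the only non-SEA language is an assumption.

```python
from itertools import permutations

# Language codes copied from the test script in this PR.
LANGS = ["en", "vi", "my", "id", "th", "tl", "ms", "lo"]
# Assumption: English is the only non-SEA language in the list.
NON_SEA = {"en"}


def subset_ids(langs=LANGS):
    """Hypothetical helper: one subset ID per direction with at least one SEA language."""
    return [
        f"gnome_{src}-{tgt}"
        for src, tgt in permutations(langs, 2)
        if src not in NON_SEA or tgt not in NON_SEA
    ]


def config_names(subset_id):
    """Hypothetical helper: expand a subset ID into its two config names."""
    return [f"{subset_id}_source", f"{subset_id}_seacrowd_t2t"]


ids = subset_ids()
print(len(ids))                     # number of directed language pairs
print(config_names("gnome_en-id"))  # source and seacrowd_t2t config names
```

Since only one of the eight languages is non-SEA, every ordered pair of distinct languages qualifies, so the filter removes nothing here. It is kept only to mirror the "at least 1 SEA language" rule.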

Here is a useful script to test all possible language pairs.

To run this script, save it to a file (e.g., `gnome_tests.sh`), make it executable with `chmod +x gnome_tests.sh`, and execute it with `./gnome_tests.sh`. Ensure you run the script from the seacrowd root directory.

```bash
#!/bin/bash

mkdir -p data/gnome

SUBSETS=("en" "vi" "my" "id" "th" "tl" "ms" "lo")
success_count=0
fail_count=0
declare -a failed_tests

for src_lang in "${SUBSETS[@]}"; do
    for tgt_lang in "${SUBSETS[@]}"; do
        if [ "$src_lang" != "$tgt_lang" ]; then
            lang_pair="${src_lang}-${tgt_lang}"
            python_command="python -m tests.test_seacrowd seacrowd/sea_datasets/gnome/gnome.py --subset_id=gnome_${lang_pair}"
            output_file="data/gnome/${lang_pair}.txt"
            temp_output_file="data/gnome/${lang_pair}_temp.txt" # for cleaner cli output

            echo "Testing language pair: $lang_pair"

            # run the test, save the output, and redirect verbose output to a temporary file
            script -q -c "$python_command" "$temp_output_file" > /dev/null
            cat "$temp_output_file" > "$output_file"
            rm "$temp_output_file"

            # check if the test was successful
            if grep -q "OK" "$output_file"; then
                echo "Test for $lang_pair: SUCCESS"
                ((success_count++))
            else
                echo "Test for $lang_pair: FAILURE"
                failed_tests+=("$lang_pair")
                ((fail_count++))
            fi
        fi
    done
done

echo "-----------------------"
echo "SUMMARY: $((success_count + fail_count)) tests total"
echo "Success: $success_count"
echo "Failure: $fail_count"

if [ ${#failed_tests[@]} -gt 0 ]; then
    echo "Failed tests:"
    for test in "${failed_tests[@]}"; do
        echo "- $test"
    done
fi
```

This is the expected test result summary:

SUMMARY: 56 tests total
Success: 50
Failure: 6
Failed tests:
- my-tl
- my-lo
- tl-my
- tl-lo
- lo-my
- lo-tl

The six failed tests above are expected: no suitable corpus exists for those language pairs, as determined by checking the OPUS API (see the dataloader implementation). I have kept the failing subsets in the dataloader instead of listing them as subset exceptions in the datasheet later on, which I think is the better option.
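As a sanity check on the numbers in the summary above, the following sketch recomputes the totals from the language list in the test script. The variable names are illustrative only, and the observation that the failures are exactly the pairs among `my`, `tl`, and `lo` comes from the reported list, not from querying OPUS.

```python
from itertools import permutations

# Language codes copied from the test script in this PR.
LANGS = ["en", "vi", "my", "id", "th", "tl", "ms", "lo"]
# Failing pairs as reported in the test summary.
FAILED = {"my-tl", "my-lo", "tl-my", "tl-lo", "lo-my", "lo-tl"}

# Every ordered pair of distinct languages: 8 * 7 directions.
all_pairs = [f"{src}-{tgt}" for src, tgt in permutations(LANGS, 2)]

# Every ordered pair among the three languages that appear in the failures.
pairs_among_my_tl_lo = {f"{src}-{tgt}" for src, tgt in permutations(["my", "tl", "lo"], 2)}

total = len(all_pairs)
expected_success = total - len(FAILED)

print(f"SUMMARY: {total} tests total")
print(f"Success: {expected_success}")
print(f"Failure: {len(FAILED)}")
print("Failures are exactly the my/tl/lo pairs:", FAILED == pairs_among_my_tl_lo)
```

If the summary from a real run differs from these counts, the extra failures are worth investigating, since they would not be explained by the missing my/tl/lo corpora.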
