Based on discussion #456, I implemented all possible language pairs listed on the datasheet.
For parallel MT dataloaders, we agreed to provide a subset for every possible direction that includes at least one SEA language.
Configs therefore look like this: `gnome_en-id_source`, `gnome_tl-vi_seacrowd_t2t`, etc. When testing, pass `gnome_<subset>` (for example, `gnome_en-id`) to the `--subset_id` parameter.
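The naming scheme can be sketched as a short enumeration. This is only an illustration, assuming the language codes used by the test script in this PR and assuming each direction gets exactly the two schema suffixes shown in the examples (`source` and `seacrowd_t2t`); the actual config list is defined in the dataloader and may differ.

```bash
#!/bin/bash
# Language codes as used by the test script in this PR.
LANGS=("en" "vi" "my" "id" "th" "tl" "ms" "lo")
# Assumption: one config per schema suffix seen in the example names.
SCHEMAS=("source" "seacrowd_t2t")

configs=()
for src in "${LANGS[@]}"; do
    for tgt in "${LANGS[@]}"; do
        # "en" is the only non-SEA code here, so skipping identical pairs
        # already guarantees at least one SEA language per direction.
        [ "$src" = "$tgt" ] && continue
        for schema in "${SCHEMAS[@]}"; do
            configs+=("gnome_${src}-${tgt}_${schema}")
        done
    done
done

# Show a few names and the total count.
printf '%s\n' "${configs[@]}" | head -n 4
echo "Total configs: ${#configs[@]}"
```

With 8 language codes there are 56 ordered pairs, so under these assumptions the sketch yields 112 config names.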
Here is a useful script to test all possible language pairs.
To run it, save it to a file (e.g., `gnome_tests.sh`), make it executable with `chmod +x gnome_tests.sh`, and execute it with `./gnome_tests.sh` from the seacrowd root directory.
```bash
#!/bin/bash
mkdir -p data/gnome
SUBSETS=("en" "vi" "my" "id" "th" "tl" "ms" "lo")
success_count=0
fail_count=0
declare -a failed_tests

for src_lang in "${SUBSETS[@]}"; do
    for tgt_lang in "${SUBSETS[@]}"; do
        if [ "$src_lang" != "$tgt_lang" ]; then
            lang_pair="${src_lang}-${tgt_lang}"
            python_command="python -m tests.test_seacrowd seacrowd/sea_datasets/gnome/gnome.py --subset_id=gnome_${lang_pair}"
            output_file="data/gnome/${lang_pair}.txt"
            temp_output_file="data/gnome/${lang_pair}_temp.txt" # for cleaner cli output
            echo "Testing language pair: $lang_pair"

            # run the test, save the output, and redirect verbose output to a temporary file
            script -q -c "$python_command" "$temp_output_file" > /dev/null
            cat "$temp_output_file" > "$output_file"
            rm "$temp_output_file"

            # check if the test was successful
            if grep -q "OK" "$output_file"; then
                echo "Test for $lang_pair: SUCCESS"
                ((success_count++))
            else
                echo "Test for $lang_pair: FAILURE"
                failed_tests+=("$lang_pair")
                ((fail_count++))
            fi
        fi
    done
done

echo "-----------------------"
echo "SUMMARY: $((success_count + fail_count)) tests total"
echo "Success: $success_count"
echo "Failure: $fail_count"
if [ ${#failed_tests[@]} -gt 0 ]; then
    echo "Failed tests:"
    for test in "${failed_tests[@]}"; do
        echo "- $test"
    done
fi
```
The failed tests reported in the script's summary are expected: they are due to the absence of a suitable corpus for those pairs, as determined by checking the OPUS API (see the dataloader implementation). I kept the failing subsets rather than listing subset exceptions in the datasheet later on, which I think is the better option.
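Because the test script keeps one log per language pair under `data/gnome`, the failing pairs can be listed again later without re-running anything. The following is a minimal sketch, assuming the same file layout and the same `OK` marker as the script above; `list_failed_pairs` is a hypothetical helper name, not part of the repository.

```bash
#!/bin/bash
# Hypothetical helper: print every language pair whose saved log lacks "OK".
list_failed_pairs() {
    local log_dir="$1"
    local log_file
    for log_file in "$log_dir"/*.txt; do
        # Skip the unexpanded glob when the directory has no logs.
        [ -e "$log_file" ] || continue
        if ! grep -q "OK" "$log_file"; then
            # Log files are named <src>-<tgt>.txt, so strip the extension.
            basename "$log_file" .txt
        fi
    done
}

list_failed_pairs "data/gnome"
```

This reuses the script's own success criterion, so it inherits the same limitation: any log containing the string `OK` anywhere counts as a pass.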
Checkbox
[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script `seacrowd/sea_datasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
[x] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
[x] Implement `_info()`, `_split_generators()` and `_generate_examples()` in the dataloader script.
[x] Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
[x] Confirm the dataloader script works with the `datasets.load_dataset` function.
[x] Confirm that your dataloader script passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py`.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
Closes #513