Closed patrickamadeus closed 2 months ago
will add `ssp` schema within this week after discussing with @SamuelCahyawijaya
Updated `SSP` approach with (some) retests!
cc @ljvmiranda921 @holylovenia @SamuelCahyawijaya
jv-ms
jv
vi
tl
Checking the data source, we can add the ff. SEA languages to the dataloader:
- Cebuano
- Filipino (fil): Normally we consider `tl` and `fil` to be the same, but the monolingual documents of `fil` and `tl` here are different.
- Khmer
- Lao
- Madurese
- Pampanga
@holylovenia Please also update the datasheet accordingly, thanks!
I'd like to confirm before I rectify the language list.
So the complete list of the languages is `["eng", "vie", "tha", "mya", "jav", "ind", "tgl", "zlm", "ceb", "fil", "khm", "lao", "mad", "pam"]`, right? I'm assuming Pampanga == Kapampangan (pam).
cc: @elyanah-aco @ljvmiranda921
Yes this is correct, thanks!
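A minimal sketch of how the confirmed list could be sanity-checked before it goes into the dataloader. The names `LANGUAGES`, `NEWLY_ADDED`, and `check_language_list` are hypothetical and are not taken from the PR; the codes themselves are the ones confirmed above.

```python
# Language list as confirmed in this thread (three-letter codes).
LANGUAGES = [
    "eng", "vie", "tha", "mya", "jav", "ind", "tgl", "zlm",
    "ceb", "fil", "khm", "lao", "mad", "pam",
]

# Languages proposed in the review comment, mapped to their codes.
# "Pampanga" is treated as Kapampangan (pam), as assumed in the thread.
NEWLY_ADDED = {
    "Cebuano": "ceb",
    "Filipino": "fil",
    "Khmer": "khm",
    "Lao": "lao",
    "Madurese": "mad",
    "Pampanga": "pam",
}


def check_language_list(languages, required):
    """Return a list of problems found; an empty list means the list looks fine."""
    problems = []
    seen = set()
    for code in languages:
        # Every entry is expected to be a lowercase three-letter code.
        if len(code) != 3 or not code.isalpha() or not code.islower():
            problems.append(f"not a lowercase three-letter code: {code!r}")
        if code in seen:
            problems.append(f"duplicate code: {code!r}")
        seen.add(code)
    # Every language requested in review must be present.
    for name, code in required.items():
        if code not in seen:
            problems.append(f"missing {name} ({code})")
    return problems


if __name__ == "__main__":
    print(check_language_list(LANGUAGES, NEWLY_ADDED))
```

Note that `tgl` and `fil` are both kept on purpose, since the monolingual documents for the two differ in this source.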
Done!
Hi @patrickamadeus, a friendly reminder to address @elyanah-aco's suggestions.
Thank you for the review! Updated the lang list. @elyanah-aco @holylovenia
Thank you for the nitpick! It's done in the latest commit! @elyanah-aco
Hi @elyanah-aco, may I know if you're ready to approve this PR? If all your concerns have been addressed by @patrickamadeus, I'm inclined to get this PR merged.
Closes #512
Checkbox
- `seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py` (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its `__init__.py` within `{my_dataset}` folder.
- `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_LOCAL`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- `datasets.load_dataset` function.
- `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py` or `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}`
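The path and test-command conventions in the checklist above can be sketched as a small helper. This is only an illustration: `dataloader_path` and `build_test_command` are hypothetical names, and allowing digits in the folder name is an assumption (the checklist itself only mentions lowercase and underscore).

```python
import re
from typing import List, Optional

# Assumption: digits are accepted in addition to lowercase letters and underscores.
_NAME_PATTERN = re.compile(r"[a-z0-9_]+")


def dataloader_path(dataset: str) -> str:
    """Build the dataloader script path for a dataset folder name."""
    if not _NAME_PATTERN.fullmatch(dataset):
        raise ValueError(
            f"dataset folder name must use only lowercase and underscore: {dataset!r}"
        )
    return f"seacrowd/sea_datasets/{dataset}/{dataset}.py"


def build_test_command(dataset: str, subset_id: Optional[str] = None) -> List[str]:
    """Build the test-suite command from the checklist as an argument list."""
    command = ["python", "-m", "tests.test_seacrowd", dataloader_path(dataset)]
    if subset_id is not None:
        # subset_id is the subset name without the source/seacrowd suffix.
        command += ["--subset_id", subset_id]
    return command


if __name__ == "__main__":
    print(" ".join(build_test_command("my_dataset")))
    print(" ".join(build_test_command("my_dataset", "my_subset")))
```

The argument list can be passed to `subprocess.run` from the repository root; returning a list rather than a single string avoids shell quoting issues when a subset name contains unusual characters.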
T2T
malay-thai
malay-burmese
english-javanese
indonesian-javanese
SSP
malay-burmese
indonesian-javanese
NB:
- `en` or `id`, first line is outlier because it shows the document title (might be misinterpreted as an error, as depicted in the last 2 examples).
- ~`Tasks.MACHINE_TRANSLATION`, will add a task if an addition is needed for language modeling.~
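For readers who want to handle the title line noted above on the consumer side, here is a rough sketch. It assumes the first non-empty line of a document is its title, which this thread does not establish for every language pair; `split_title` is a hypothetical helper and is not part of the dataloader.

```python
from typing import List, Tuple


def split_title(document: str) -> Tuple[str, List[str]]:
    """Separate the first non-empty line (assumed to be the title) from the rest."""
    # Drop blank lines and surrounding whitespace before splitting.
    lines = [line.strip() for line in document.splitlines() if line.strip()]
    if not lines:
        return "", []
    return lines[0], lines[1:]


if __name__ == "__main__":
    title, body = split_title("Document title\nFirst sentence.\nSecond sentence.")
    print(title)
    print(body)
```

Keeping the title as a separate return value, rather than discarding it, leaves the decision about whether it is an outlier to the caller.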