SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #512 | Add/Update Dataloader QED #549

Closed. patrickamadeus closed this 2 months ago

patrickamadeus commented 3 months ago

Closes #512

T2T

malay-thai

image

malay-burmese

image

english-javanese

image

indonesian-javanese

image

SSP

malay-burmese

image

indonesian-javanese

image

patrickamadeus commented 3 months ago

I will add the SSP schema this week, after discussing it with @SamuelCahyawijaya.

patrickamadeus commented 3 months ago

Updated SSP approach with (some) retests!

cc @ljvmiranda921 @holylovenia @SamuelCahyawijaya

MT jv-ms

image

SSP jv

image

SSP vi

image

SSP tl

image

holylovenia commented 3 months ago

Checking the data source, we can add the following SEA languages to the dataloader:

  • Cebuano
  • Filipino (fil): Normally we consider tl and fil to be the same, but the monolingual documents of fil and tl here are different.
  • Khmer
  • Lao
  • Madurese
  • Pampanga

@holylovenia Please also update the datasheet accordingly, thanks!

I'd like to confirm before I rectify the language list.

So the complete list of the languages is ["eng", "vie", "tha", "mya", "jav", "ind", "tgl", "zlm", "ceb", "fil", "khm", "lao", "mad", "pam"], right? I'm assuming Pampanga == Kapampangan (pam).

cc: @elyanah-aco @ljvmiranda921

elyanah-aco commented 3 months ago

So the complete list of the languages is ["eng", "vie", "tha", "mya", "jav", "ind", "tgl", "zlm", "ceb", "fil", "khm", "lao", "mad", "pam"], right? I'm assuming Pampanga == Kapampangan (pam).

Yes, this is correct, thanks!

holylovenia commented 3 months ago

Done!

holylovenia commented 2 months ago

Hi @patrickamadeus, a friendly reminder to address @elyanah-aco's suggestions.

patrickamadeus commented 2 months ago

Thank you for the review! Updated the lang list. @elyanah-aco @holylovenia

patrickamadeus commented 2 months ago

Thank you for the nitpick! It's done in the latest commit! @elyanah-aco

holylovenia commented 2 months ago

Hi @elyanah-aco, may I know if you're ready to approve this PR? If all your concerns have been addressed by @patrickamadeus, I'm inclined to get this PR merged.