Create tok.js, update eo.js

common-voice / sentence-collector

Tool to collect and review sentences for Common Voice

https://commonvoice.mozilla.org/sentence-collector/

Mozilla Public License 2.0


Create tok.js, update eo.js #610

Closed. janPensa closed this 2 years ago

janPensa commented 2 years ago

Wrote a new validation script for Toki Pona.

Changes compared to default:

changed sentence limit from 14 words to 90 characters
added rules to enforce Toki Pona's phonotactics, which should eliminate words and names with ambiguous or impossible pronunciations, and also catch a lot of typos
added "capital letters at start of word only" rule, which can replace the default "no abbreviations" one
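
A minimal sketch of how rules like these could be expressed, not the actual contents of tok.js. The names `MAX_CHARACTERS`, `isValidWord` and `findProblems` are hypothetical, and the phonotactic rules are the commonly described ones, assumed here rather than taken from the PR: syllables of the form (C)V(n), a consonant required on every non-initial syllable, and the sequences ji, wu, wo, ti, nn and nm disallowed.

```javascript
// Hypothetical sketch; the real tok.js may structure its rules differently.
const MAX_CHARACTERS = 90;

// A word is one syllable with an optional consonant, followed by any number
// of consonant-initial syllables; each syllable may end in "n".
const WORD_SHAPE = /^[ptksmnljw]?[aeiou]n?(?:[ptksmnljw][aeiou]n?)*$/;

// Sequences assumed to be disallowed in Toki Pona.
const FORBIDDEN_SEQUENCES = /ji|wu|wo|ti|nn|nm/;

function isValidWord(word) {
  // Capital letters are only accepted as the first character of a word.
  if (/[A-Z]/.test(word.slice(1))) return false;
  const lower = word.toLowerCase();
  return WORD_SHAPE.test(lower) && !FORBIDDEN_SEQUENCES.test(lower);
}

function findProblems(sentence) {
  const problems = [];
  if (sentence.length > MAX_CHARACTERS) problems.push('too long');
  // Only runs of basic Latin letters are treated as words in this sketch.
  const words = sentence.match(/[A-Za-z]+/g) || [];
  for (const word of words) {
    if (!isValidWord(word)) problems.push(`invalid word: ${word}`);
  }
  return problems;
}

console.log(findProblems('jan Sonja li pona.'));          // → []
console.log(findProblems('jan lilii o tawa supa lape.')); // → ['invalid word: lilii']
```

Checking the capitalisation per word is what lets a single rule cover both abbreviations (such as an all-caps word) and stray capitals in the middle of a name.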

Also updated Esperanto's validation script with translations and minor improvements.

janPensa commented 2 years ago

@MichaelKohler By the way, at the moment there are 11 sentences among the validated sentences at https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt that violate these new validation rules, because of typos or invalid words and names. Do you know what would be the best way to get these and any associated recordings removed?

MichaelKohler commented 2 years ago

You can give me a list of these sentences and I can remove them from Sentence Collector. However, removing the recordings themselves is probably too much work for the benefit of removing only 11 sentences. I would suggest just reporting them through the UI when you encounter them.

MichaelKohler commented 2 years ago

:tada: This PR is included in version 2.17.2 :tada:

The release is available as a GitHub release

Your semantic-release bot :package::rocket:

janPensa commented 2 years ago

@MichaelKohler Thank you for the help!

As for the 11 sentences I mentioned, they are:

jan lilii o tawa supa lape. jan Timi li sona ala e sona pi pali ni. jan Timi li wile ala lon jan pi kulupu ni. mi ken ala pali tawa pona nim mi tawa ma Sanai. ona li jo e nimi mute pi toki Inl sike pan pi ma Italia li kama ala lon insa mi. sina sutopatikuna. tenpo mute la jan Timi li ante e tomo ona. tenpo pini la jan Sonja li yupekosi lili. toki ni la mi yupekosi e kulupu ni pi toki kalama, a a a!

I've also run the corpus through a word frequency counter and found these 7 typos at the bottom of the list:

ale li pona. tan same a la mi pilin monsuta? jan lawa li lona poka mi. kupupu musi ni li pakala e tomo sina. ma seli la ko lejo li tawa sama kiwen telo lete. mi sone e ni: tomo sina li lon nasin seme. ona mute li loki e ni: mi toki mute. tenpo ni la mi lukine ijo sin pi pakala mute.
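
The word frequency approach can be sketched roughly as follows. This is not the counter actually used in the thread; the function name `rareWords` and the three-sentence sample are made up for illustration, with only the first sample sentence taken from the list above.

```javascript
// Hypothetical helper: returns the words that occur at most `maxCount`
// times across all sentences, sorted alphabetically.
function rareWords(sentences, maxCount = 1) {
  const counts = new Map();
  for (const sentence of sentences) {
    // \p{L}+ matches runs of letters; the "u" flag enables Unicode properties.
    const words = sentence.toLowerCase().match(/\p{L}+/gu) || [];
    for (const word of words) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return [...counts]
    .filter(([, count]) => count <= maxCount)
    .map(([word]) => word)
    .sort();
}

// Tiny illustrative sample, not the real corpus.
const sample = [
  'jan lawa li lona poka mi.',
  'jan lawa li lon poka mi.',
  'mi lon.',
];
console.log(rareWords(sample)); // → ['lona']
```

In a language with a small vocabulary such as Toki Pona, words that appear only once or twice in a large corpus are good candidates for typos, though rare proper names will show up in the same list and still need a manual look.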

janPensa commented 2 years ago

Oh crap, for some reason [\u00C0-\u02BF] and [\u1E00-\u1EFF] match every letter in the Latin alphabet, which makes the Sentence Collector reject every submission.

I'll write a hotfix...

Edit: Here it is https://github.com/common-voice/sentence-collector/pull/616
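
One way to guard against this kind of regression is a small check that a character class meant for extended Latin letters matches no basic Latin letter. The sketch below does not reproduce the bug, since the thread doesn't state its cause, and the helper name `matchedAsciiLetters` is hypothetical.

```javascript
// Hypothetical helper: lists the basic Latin letters a regex matches.
function matchedAsciiLetters(regex) {
  const letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
  return [...letters].filter((letter) => {
    regex.lastIndex = 0; // only matters for regexes with the g or y flag
    return regex.test(letter);
  });
}

// The two ranges from the comment above, written as a plain regex literal.
const extendedLatin = /[\u00C0-\u02BF]|[\u1E00-\u1EFF]/;

console.log(matchedAsciiLetters(extendedLatin)); // → []
```

Running a check like this against each rule before release would flag a rule that unexpectedly matches ordinary letters.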