Closed by janPensa 2 years ago
@MichaelKohler By the way, at the moment there are 11 sentences among the validated sentences at https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt that violate these new validation rules, because of typos or invalid words and names. Do you know what would be the best way to get these and any associated recordings removed?
You can give me a list of these sentences and I can remove them from Sentence Collector. However, removing the recordings themselves is probably too much work for the benefit of removing only 11 sentences. I would suggest just reporting them through the UI when you encounter them.
:tada: This PR is included in version 2.17.2 :tada:
The release is available on GitHub release
Your semantic-release bot :package::rocket:
@MichaelKohler Thank you for the help!
As for the 11 sentences I mentioned, they are:
- jan lilii o tawa supa lape.
- jan Timi li sona ala e sona pi pali ni.
- jan Timi li wile ala lon jan pi kulupu ni.
- mi ken ala pali tawa pona nim
- mi tawa ma Sanai.
- ona li jo e nimi mute pi toki Inl
- sike pan pi ma Italia li kama ala lon insa mi.
- sina sutopatikuna.
- tenpo mute la jan Timi li ante e tomo ona.
- tenpo pini la jan Sonja li yupekosi lili.
- toki ni la mi yupekosi e kulupu ni pi toki kalama, a a a!
I've also run the corpus through a word frequency counter and found these 7 typos at the bottom of the list:
- ale li pona.
- tan same a la mi pilin monsuta?
- jan lawa li lona poka mi.
- kupupu musi ni li pakala e tomo sina.
- ma seli la ko lejo li tawa sama kiwen telo lete.
- mi sone e ni: tomo sina li lon nasin seme.
- ona mute li loki e ni: mi toki mute.
- tenpo ni la mi lukine ijo sin pi pakala mute.
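For anyone who wants to repeat the frequency check, here is a minimal sketch of the idea in Python. It is not the tool used above; the `rare_words` helper, the `max_count` threshold, and the four-sentence sample corpus are all made up for illustration. Words that occur only once end up at the bottom of a frequency list, and in a language with a small vocabulary those are good typo candidates.

```python
import re
from collections import Counter

# Runs of letters only; punctuation and digits act as separators.
WORD_RE = re.compile(r"[^\W\d_]+")


def rare_words(sentences, max_count=1):
    """Return (word, count) pairs seen at most max_count times, rarest first."""
    counts = Counter(
        word
        for sentence in sentences
        for word in WORD_RE.findall(sentence)
    )
    rare = [(word, n) for word, n in counts.items() if n <= max_count]
    # Sort by count, then alphabetically, so the output is stable.
    return sorted(rare, key=lambda item: (item[1], item[0]))


# Hypothetical mini-corpus: two of the misspellings from this thread
# ("lona", "kupupu") mixed into invented sentences.
corpus = [
    "jan lawa li lon poka mi.",
    "jan lawa li lona poka mi.",
    "kulupu musi ni li lon poka mi.",
    "kupupu musi ni li lon poka kulupu.",
]

print(rare_words(corpus))  # [('kupupu', 1), ('lona', 1)]
```

On a real corpus the rare list will also contain legitimate proper names, so it still needs a manual pass.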
Oh crap, for some reason `[\u00C0-\u02BF]` and `[\u1E00-\u1EFF]` match every letter in the Latin alphabet, making the Sentence Collector reject every submission.
I'll write a hotfix...
Edit: Here it is https://github.com/common-voice/sentence-collector/pull/616
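A quick sanity check along these lines might catch this kind of regression before it ships. This is only a sketch in Python, not the Sentence Collector's own code (which is JavaScript), and it does not reproduce the bug: the cause is not stated in this thread, and in Python's `re` module the two ranges behave as intended. The helper name `ascii_letters_matched` is hypothetical.

```python
import re
import string

# The two character ranges quoted above, written as Python patterns.
RANGES = [r"[\u00C0-\u02BF]", r"[\u1E00-\u1EFF]"]


def ascii_letters_matched(pattern, flags=0):
    """Return the ASCII letters (a-z, A-Z) that the pattern matches."""
    regex = re.compile(pattern, flags)
    return [ch for ch in string.ascii_letters if regex.search(ch)]


for pattern in RANGES:
    # A rule meant to flag accented letters should leave plain ones alone.
    assert ascii_letters_matched(pattern) == [], pattern

# The ranges should still catch what they are meant to catch.
assert re.search(RANGES[0], "\u00C0")  # LATIN CAPITAL LETTER A WITH GRAVE
assert re.search(RANGES[1], "\u1EA1")  # LATIN SMALL LETTER A WITH DOT BELOW
```

The same assertion, ported to the project's own test suite and regex engine, would be the meaningful version of this check.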
Wrote a new validation script for Toki Pona.
Changes compared to default:
Also updated Esperanto's validation script with translations and minor improvements.