common-voice / sentence-collector

Tool to collect and review sentences for Common Voice
https://commonvoice.mozilla.org/sentence-collector/
Mozilla Public License 2.0

Investigation: Armenian/Armenian Western not well reflected in Sentence Collector #454

Closed MichaelKohler closed 1 year ago

MichaelKohler commented 3 years ago

The Sentence Collector currently has a hy locale, which contains 2132 sentences. These sentences do not get exported, because Common Voice knows the codes hy-AM and hyw, not hy. We need to adjust this.

Sentences in Sentence Collector: https://commonvoice.mozilla.org/sentence-collector/sentences/text/hy

Questions to be answered

Migration

| Code | English | Native | Action |
| --- | --- | --- | --- |
| hy | ??? | ??? | Figure out where to migrate to |
| hy-AM | Armenian | Հայերեն | Create in Sentence Collector |
| hyw | Armenian Western | ??? | Create in Sentence Collector |

MichaelKohler commented 3 years ago

@ekeleshian given that you have added the Armenian sentences from Wikipedia, could you help out here?

FYI @ftyers

ekeleshian commented 3 years ago

Thank you for the ping, @MichaelKohler. Your question relates to a bigger one that the Armenian Common Voice community has discussed extensively in posts and meetups: should we maintain one or multiple voice corpora for the Armenian language?

The TL;DR answer: let's maintain only one voice corpus that merges all dialects and both forms of Armenian orthography. We ultimately think the Wikipedia language split is an anti-pattern that we want to keep from bleeding into other systems. As a result, we merged the hyw.wikipedia and hy.wikipedia sentences in this PR.

The longer version is explained in this blogpost.

ekeleshian commented 3 years ago

So I think the quick fix here is to migrate hy and hyw to hy-AM.

The ideal fix is to merge hy-AM into hy and get rid of hy-AM and hyw altogether.
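
To make the quick fix concrete, here is a minimal sketch of remapping locale codes before export. It is an illustration only: the `Sentence` shape, the `LOCALE_MIGRATION` table, and the `migrateLocale`/`migrateAll` helpers are hypothetical names and do not reflect the actual Sentence Collector schema or code.

```typescript
// Hypothetical sentence record; the real Sentence Collector schema may differ.
interface Sentence {
  text: string;
  locale: string;
}

// Quick fix from this thread: fold hy and hyw into hy-AM.
// (Assumption: a plain code-to-code mapping is sufficient.)
const LOCALE_MIGRATION: Record<string, string> = {
  hy: "hy-AM",
  hyw: "hy-AM",
};

// Return a copy of the sentence with its locale remapped, if a mapping exists.
function migrateLocale(sentence: Sentence): Sentence {
  const target = LOCALE_MIGRATION[sentence.locale];
  return target === undefined ? sentence : { ...sentence, locale: target };
}

// Remap all sentences and drop duplicates that appear once locales are merged.
function migrateAll(sentences: Sentence[]): Sentence[] {
  const seen = new Set<string>();
  const result: Sentence[] = [];
  for (const s of sentences.map(migrateLocale)) {
    const key = `${s.locale}\u0000${s.text}`;
    if (!seen.has(key)) {
      seen.add(key);
      result.push(s);
    }
  }
  return result;
}
```

The deduplication step is there because the same sentence could exist under both hy and hyw before the merge. The ideal fix would use the same approach with the mapping pointing at hy instead.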

MichaelKohler commented 3 years ago

Thanks for the quick response! In that case this is bigger than just the Sentence Collector. Pulling in @phirework (I also sent it to Hillary, as I didn't know her GitHub handle).

MichaelKohler commented 1 year ago

@jessicarose I'm closing this issue, but it might be worth having a look at. This applies not only to Sentence Collector but also to Common Voice itself.