Investigation: Armenian/Armenian Western not well reflected in Sentence Collector

common-voice / sentence-collector

Tool to collect and review sentences for Common Voice

https://commonvoice.mozilla.org/sentence-collector/

Mozilla Public License 2.0


Investigation: Armenian/Armenian Western not well reflected in Sentence Collector #454

Closed by MichaelKohler 1 year ago

MichaelKohler commented 3 years ago

The Sentence Collector currently has the language code hy, which contains 2132 sentences. These sentences do not get exported, because Common Voice knows hy-AM and hyw, not hy. We need to adjust this.
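
To illustrate the mismatch described above, here is a minimal sketch. It is not the Sentence Collector's actual export code: the function name `exportable`, the constant `COMMON_VOICE_LOCALES`, and the dictionary shape are assumptions made up for this example. It only shows why sentences filed under a code the target does not know would be dropped.

```python
# Hypothetical illustration only; not the real export logic.
# Armenian locale codes that Common Voice knows, according to this issue.
COMMON_VOICE_LOCALES = {"hy-AM", "hyw"}


def exportable(sentences_by_locale):
    """Keep only the locales that the target (Common Voice) recognizes."""
    return {
        locale: sentences
        for locale, sentences in sentences_by_locale.items()
        if locale in COMMON_VOICE_LOCALES
    }


# Sentences stored under "hy" have no matching target locale,
# so nothing is left to export.
collected = {"hy": ["placeholder sentence 1", "placeholder sentence 2"]}
print(exportable(collected))
```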

Sentences in Sentence Collector: https://commonvoice.mozilla.org/sentence-collector/sentences/text/hy

Questions to be answered

Are the current sentences Armenian or Armenian Western?
What's the correct native name for Armenian Western?

Migration

Code	English	Native	Action
hy	???	???	Figure out where to migrate to
hy-AM	Armenian	Հայերեն	Create in Sentence Collector
hyw	Armenian Western	???	Create in Sentence Collector

MichaelKohler commented 3 years ago

@ekeleshian given that you have added the Armenian sentences from Wikipedia, could you help out here?

FYI @ftyers

ekeleshian commented 3 years ago

Thank you for the ping, @MichaelKohler. Your question relates to a bigger question that the Armenian Common Voice community has discussed extensively in posts and meetups: should we maintain one or multiple voice corpora for the Armenian language? The TL;DR answer is: let's maintain only one voice corpus that merges all dialects and both forms of Armenian orthography. We ultimately think that the Wikipedia language split is an anti-pattern that we want to prevent from bleeding into other systems. As a result, we merged the hyw.wikipedia and hy.wikipedia sentences in this PR. The longer version is explained in this blogpost.

ekeleshian commented 3 years ago

So I think the quick fix here is to migrate hy and hyw to hy-AM.

The ideal fix is to merge hy-AM into hy and get rid of hy-AM and hyw altogether.
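
As a rough sketch of the two options, assuming sentences are held as lists keyed by locale code (the names `QUICK_FIX`, `IDEAL_FIX`, and `migrate` are hypothetical and not part of the Sentence Collector), the difference comes down to which code is the merge target:

```python
# Hypothetical sketch; the data layout and names are assumptions.
QUICK_FIX = {"hy": "hy-AM", "hyw": "hy-AM"}  # migrate hy and hyw to hy-AM
IDEAL_FIX = {"hy-AM": "hy", "hyw": "hy"}     # merge everything into hy


def migrate(sentences_by_locale, mapping):
    """Move sentences to their target locale, skipping duplicates."""
    merged = {}
    for locale, sentences in sentences_by_locale.items():
        # Locales without a mapping entry stay where they are.
        target = mapping.get(locale, locale)
        bucket = merged.setdefault(target, [])
        for sentence in sentences:
            if sentence not in bucket:
                bucket.append(sentence)
    return merged


before = {"hy": ["s1", "s2"], "hyw": ["s2", "s3"]}
print(migrate(before, QUICK_FIX))
print(migrate(before, IDEAL_FIX))
```

Either way the result is a single corpus; only the surviving code differs.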

MichaelKohler commented 3 years ago

Thanks for the quick response! In that case this is bigger than just the Sentence Collector. Pulling in @phirework (I also sent it to Hillary, as I didn't know her GitHub handle).

MichaelKohler commented 1 year ago

@jessicarose I'm closing this issue, but it might be worth having a look at. This applies not only to the Sentence Collector but also to Common Voice itself.