MichaelKohler closed this issue 1 year ago
@ekeleshian given that you have added the Armenian sentences from Wikipedia, could you help out here?
FYI @ftyers
Thank you for the ping, @MichaelKohler. Your question relates to a bigger one that the Armenian Common Voice community has discussed extensively in posts and meetups: should we maintain one or multiple voice corpora for the Armenian language? The TLDR version of the answer is: let's maintain only one voice corpus that merges all dialects and both forms of Armenian orthography. We ultimately think that the Wikipedia language split is an anti-pattern that we want to prevent from bleeding into other systems. As a result, we merged the hyw.wikipedia and hy.wikipedia sentences together in this PR. The non-TLDR version is explained in this blogpost.
So I think the quick fix here is to migrate hy and hyw to hy-AM.
The ideal fix is to merge hy-AM into hy and get rid of hy-AM and hyw altogether.
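To make the quick fix concrete, here is a minimal sketch of remapping sentences from hy and hyw to hy-AM. It is only an illustration: the names LOCALE_MIGRATION, migrate_locale and migrate_sentences, and the sentence record shape (a dict with "text" and "locale"), are assumptions and not the actual Sentence Collector or Common Voice schema. Dropping duplicate sentences after the merge is also an assumed design choice.

```python
# Hypothetical mapping for the quick fix: both Armenian codes go to hy-AM.
LOCALE_MIGRATION = {"hy": "hy-AM", "hyw": "hy-AM"}


def migrate_locale(code):
    """Return the target locale for a code, or the code itself if unmapped."""
    return LOCALE_MIGRATION.get(code, code)


def migrate_sentences(sentences):
    """Remap each sentence's locale and drop duplicates created by the merge.

    `sentences` is an assumed shape: a list of dicts with "text" and "locale".
    The input list and its dicts are left unmodified.
    """
    seen = set()
    migrated = []
    for sentence in sentences:
        target = migrate_locale(sentence["locale"])
        key = (target, sentence["text"])
        if key in seen:
            # The same text existed under both hy and hyw; keep one copy.
            continue
        seen.add(key)
        migrated.append({**sentence, "locale": target})
    return migrated
```

The ideal fix would use the same approach with the mapping pointing hy-AM and hyw at hy instead.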
Thanks for the quick response! In that case this is bigger than just the Sentence Collector. Pulling in @phirework (and sent it to Hillary as well, didn't know her GitHub handle).
@jessicarose I'm closing this issue, but it might be worth having a look at. This applies not only to the Sentence Collector but also to Common Voice itself.
Currently in the Sentence Collector we have hy, which contains 2132 sentences. These do not get exported, as Common Voice knows hy-AM and hyw. We need to adjust this.

Sentences in Sentence Collector: https://commonvoice.mozilla.org/sentence-collector/sentences/text/hy
Questions to be answered
Armenian or Armenian Western?
Armenian Western?

Migration