ftyers / commonvoice-utils

Linguistic processing for Common Voice
GNU Affero General Public License v3.0

missing Hausa characters #20

Closed by JRMeyer 2 years ago

JRMeyer commented 2 years ago

These look like valid Hausa characters, but covo validate ha will either remove them or fail on them:

ā ă

ftyers commented 2 years ago

In general, according to the Wikipedia page:

In standard written Hausa, tone is not marked. In recent linguistic and pedagogical materials, tone is marked by means of diacritics.

Does the data you have mark those?

There is ʼ (U+02BC MODIFIER LETTER APOSTROPHE) in the alphabet, which is probably equivalent to ’ (U+2019 RIGHT SINGLE QUOTATION MARK). I added the normalisation in 2ed8882.
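
For illustration only, an apostrophe normalisation of this kind could be sketched as below. This is not the code from 2ed8882. The function name is hypothetical, and the direction of the mapping (U+2019 rewritten to U+02BC, the character said to be in the alphabet) is an assumption.

```python
# Hypothetical sketch, not the commonvoice-utils implementation.
MODIFIER_APOSTROPHE = "\u02BC"  # ʼ MODIFIER LETTER APOSTROPHE
RIGHT_SINGLE_QUOTE = "\u2019"   # ’ RIGHT SINGLE QUOTATION MARK


def normalise_apostrophes(text: str) -> str:
    """Rewrite U+2019 as U+02BC so both spellings compare equal.

    The mapping direction is an assumption made for this example.
    """
    return text.replace(RIGHT_SINGLE_QUOTE, MODIFIER_APOSTROPHE)


if __name__ == "__main__":
    sample = "\u2019ya"  # a word typed with the quotation mark
    print(normalise_apostrophes(sample) == "\u02BCya")
```

Mapping to a single code point before validation means text typed with either character is treated the same way.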

JRMeyer commented 2 years ago

Yes, those two a's are quite common in the Bible text.

ftyers commented 2 years ago

I added them in 38b92b5. It seems those are non-standard diacritic marks, but we can update this if we get a different dataset.
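
As a rough sketch of why such characters get flagged, the check below reports any character that falls outside a given alphabet. The alphabet here is a hypothetical, partial set chosen for the example, not the Hausa alphabet shipped with the project, and the function name is made up. The text is composed to NFC first, on the assumption that the alphabet stores precomposed letters such as ā (U+0101) and ă (U+0103).

```python
import unicodedata

# Hypothetical, partial alphabet for illustration only:
# basic Latin lowercase plus the two letters discussed in this issue.
ALPHABET = set("abcdefghijklmnopqrstuvwxyz") | {"\u0101", "\u0103"}  # ā, ă


def out_of_alphabet(text: str, alphabet: set) -> list:
    """Return the sorted characters in `text` that are not in `alphabet`.

    The text is normalised to NFC first, so a base letter followed by a
    combining macron or breve is compared as one precomposed character.
    Whitespace is ignored.
    """
    composed = unicodedata.normalize("NFC", text)
    return sorted({ch for ch in composed if not ch.isspace() and ch not in alphabet})


if __name__ == "__main__":
    # "a" + combining macron and "a" + combining breve compose to ā and ă.
    print(out_of_alphabet("ba\u0304 ba\u0306", ALPHABET))
    # A letter outside the example alphabet is reported.
    print(out_of_alphabet("b\u00e9", ALPHABET))
```

Running a check like this over a new dataset before validation shows which characters would need to be added to the alphabet or normalised away.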