HaraldBerthelsen / marytts-lang-ar

Basic marytts language support for Arabic (very much work in progress!)
The Unlicense

Issues for native speakers to help #2

Open assem-ch opened 8 years ago

assem-ch commented 8 years ago

I am a coder and a native Arabic speaker, and I have previously worked on Arabic language processing. Could you list some issues that I can help with?

HaraldBerthelsen commented 8 years ago

Hi Assem, that would be great, if you have time to help out! You could start by testing things a bit and see for yourself what you think works and what doesn't? The marytts server at https://demo.morf.se/marytts/ usually works for testing (although sometimes I or somebody else crashes it..)

Then what I think is most important and easiest to start with is the phonetiser code in https://github.com/HaraldBerthelsen/marytts-lang-ar/blob/master/src/main/java/marytts/language/ar/JPhonemiser.java. It is meant to be a simple Java adaptation of https://github.com/nawarhalabi/Arabic-Phonetiser/blob/master/phonetise-Buckwalter.py. But there are clear errors, where I didn't have the knowledge and/or the time to fix it properly.

An example: ستوكهولم is (I believe correctly..) diacritised as ستُوكهُولم (Buckwalter: stuwkhuwlm) and comes out through the phonetiser as "s t u0 - uu0 - k h u0 - uu0 l m". Of course it's a foreign name and has a special pronunciation, but ignore that and just look at the character sequence. The phoneme sequence should be "s t uu0 k h uu0 l m", I believe, and the syllable boundary should probably be between k and h.

Errors such as this come out fairly OK through the synthesis, but that's more by good fortune.

Perhaps the first thing would be to set up a test set with some words and their correct phonetisation, and use them in a unit test?
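
A minimal sketch of what such a test set could look like, written in Python like the reference phonetiser. The names `CASES`, `phonemes_only`, `failing_cases` and `current_output` are made up for illustration, and `current_output` is only a stand-in that replays the erroneous output quoted above; it is not the real phonetiser. The expected transcription is the one proposed in this thread, not a verified reference. Syllable boundaries ("-") are ignored in the comparison, since their correct placement is still an open question.

```python
# Test cases: (Buckwalter input, expected phoneme string).
# The expected value is the one proposed in this thread, not a verified reference.
CASES = [
    ("stuwkhuwlm", "s t uu0 k h uu0 l m"),
]


def phonemes_only(transcription):
    """Split a transcription into phonemes, dropping '-' syllable boundaries."""
    return [tok for tok in transcription.split() if tok != "-"]


def failing_cases(phonetise, cases=CASES):
    """Return (word, expected, actual) for every case the phonetiser gets wrong."""
    failures = []
    for word, expected in cases:
        actual = phonetise(word)
        if phonemes_only(actual) != phonemes_only(expected):
            failures.append((word, expected, actual))
    return failures


# Hypothetical stand-in for the real phonetiser: it just replays the
# erroneous output quoted above so the harness has something to catch.
def current_output(word):
    return {"stuwkhuwlm": "s t u0 - uu0 - k h u0 - uu0 l m"}[word]


for word, expected, actual in failing_cases(current_output):
    print(f"{word}: expected {expected!r}, got {actual!r}")
```

The same word/expected table could presumably drive a unit test against JPhonemiser.java in the Java project; native speakers would then only need to add rows.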

linuxscout commented 8 years ago

Hi Harald, I have tested the output; it's great work. Some errors come from the diacritization, which means future development of Mishkal will improve TTS operation. For the Stockholm word, I think we should correct it in the diacritizer. I will work on it, to improve named-entity recognition. Thanks

HaraldBerthelsen commented 8 years ago

Thank you for your kind words!

Yes the Stockholm example was poor, because it's a foreign word. I think the diacritisation of it is actually correct, though, but the problem with it is that the two long u-s are actually pronounced short? Many foreign words like that will need to go in a separate pronunciation dictionary. Anyway here's perhaps a better example of what I mean: لُوْز is (I think) the correct diacritisation of لوز . It comes out through the phonetisation rules as "l u0 - u0 - uu0 z", even worse than the Stockholm example, with two short u-s that shouldn't be there. The correct transcription should be "l uu0 z".

The cause of this is my own bad code in JPhonemiser.java. I should fix it - and I will, sometime, but I don't know when I'll have time for it.. So, Assem, if you think it's a good idea, you're very welcome to have a go at it!
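
To make the expected behaviour concrete, here is a small Python sketch of the collapsing step: a short vowel that is immediately followed by its own long form is dropped. This is a post-processing illustration under stated assumptions, not the actual fix in JPhonemiser.java. The function name and the `LONG_FORM` table are made up for this example, only the u0/uu0 pair mentioned in this thread is listed, and syllable boundaries are simply discarded rather than recomputed.

```python
# Assumed short -> long vowel pairs. Only the u pair appears in this thread;
# any other pairs would have to be added from the phonetiser's phoneme set.
LONG_FORM = {"u0": "uu0"}


def collapse_duplicate_vowels(transcription):
    """Drop short vowels that directly precede their own long form.

    Syllable boundaries ('-') are removed, so the result would still
    need to be re-syllabified afterwards.
    """
    tokens = [tok for tok in transcription.split() if tok != "-"]
    kept = []
    # Walk right to left so that a run like "u0 u0 uu0" collapses fully:
    # each short vowel is compared with the token kept just after it.
    for tok in reversed(tokens):
        if kept and LONG_FORM.get(tok) == kept[-1]:
            continue
        kept.append(tok)
    return " ".join(reversed(kept))


print(collapse_duplicate_vowels("l u0 - u0 - uu0 z"))                # l uu0 z
print(collapse_duplicate_vowels("s t u0 - uu0 - k h u0 - uu0 l m"))  # s t uu0 k h uu0 l m
```

Fixing the rules that emit the extra short vowels in the first place would be cleaner than patching the output like this, but the two printed lines show the transcriptions the phonetiser should arrive at.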

linuxscout commented 8 years ago

OK, I think the لوز example is not correct, because in Arabic we pronounce it LAWZ (لَوْز). I will look at JPhonemiser.

HaraldBerthelsen commented 8 years ago

Haha, well you see how my lack of knowledge confuses things! Actually I used the example because I remember eating fresh almonds from a tree outside Nablus in Palestine, and my friend there said "luuz". Perhaps a dialect difference, or I heard it wrong, or I remember wrong. Anyway the problem is still the same, that the JPhonemiser produces the wrong phoneme sequence sometimes.

But I'm a little bit confused also over this comment thread. It was started by assem-ch, and clearly you (linuxscout) also saw his question and my attempt to answer it. I think perhaps you (linuxscout) have more important things to do with your own great work with Mishkal, rather than look at my bad code! ;-)

Although of course you are perfectly welcome to do it if you want. I am naturally more than grateful for all help!

linuxscout commented 8 years ago

OK. Yes, the word لوز is pronounced Luuz in Arabic dialect, but in standard Arabic it is pronounced Lawz. We can imagine that Mishkal could give us text in a form well prepared for speech synthesis, by removing letters that are not pronounced: for example the lam of the definite article ال-, as in السماء => اسّماء (al-sama' => assama'), or the added Alef in ذهبوا, which gives Dhahabuu instead of Dhahabuuwa. Assem and I are friends; we work together on Arabic open source tools.
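
The two rewrites mentioned here can be sketched in a few lines of Python. This is only an illustration of the idea and not Mishkal's API: the function names are made up, the input is assumed to carry no diacritics on the article, and the Alef rule is naive (it drops every final Alef after Waw, which will be wrong for some words).

```python
# The 14 "sun letters" that assimilate the lam of the definite article.
SUN_LETTERS = set("تثدذرزسشصضطظلن")
ALEF, LAM, WAW, SHADDA = "\u0627", "\u0644", "\u0648", "\u0651"


def assimilate_article(word):
    """Rewrite ال + sun letter as ا + doubled sun letter (السماء -> اسّماء)."""
    if len(word) > 2 and word[0] == ALEF and word[1] == LAM and word[2] in SUN_LETTERS:
        rest = word[3:]
        if rest.startswith(SHADDA):
            # The letter is already marked as doubled; do not add a second shadda.
            rest = rest[1:]
        return ALEF + word[2] + SHADDA + rest
    return word


def drop_silent_alef(word):
    """Drop a final Alef written after Waw (ذهبوا -> ذهبو). Naive rule."""
    if len(word) > 2 and word.endswith(WAW + ALEF):
        return word[:-1]
    return word


for w in ["السماء", "القمر", "ذهبوا"]:
    print(w, "=>", drop_silent_alef(assimilate_article(w)))
```

Words that begin with a "moon letter" after the article, such as القمر, pass through unchanged.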

HaraldBerthelsen commented 8 years ago

Great! Yes, that could be a good thing, preparing for synthesis with Mishkal. Sorry for my confusion there, I didn't realise you worked together!

atefBB commented 7 years ago

I'm a 'native' Arabic speaker too! Can I help?

HaraldBerthelsen commented 7 years ago

Hi atefBB, yes, absolutely, that would be wonderful. I will repeat here most of my answer to assem-ch, which I think is still valid:

that would be great, if you have time to help out! You could start by testing things a bit and see for yourself what you think works and what doesn't? The marytts server at https://demo.morf.se/marytts/ usually works for testing (although sometimes I or somebody else crashes it..) And of course install and run marytts, marytts-lang-ar, and the voice from https://github.com/HaraldBerthelsen/voice-ar-nah-hsmm. That may be a bit tricky, I'll do my best to help out if needed.

Then what I think is most important and easiest to start with is the phonetiser code in https://github.com/HaraldBerthelsen/marytts-lang-ar/blob/master/src/main/java/marytts/language/ar/JPhonemiser.java. It is meant to be a simple Java adaptation of https://github.com/nawarhalabi/Arabic-Phonetiser/blob/master/phonetise-Buckwalter.py. But there are clear errors, where I didn't have the knowledge and/or the time to fix it properly.

Perhaps the first thing would be to set up a test set with some words and their correct phonetisation, and use them in a unit test?

atefBB commented 7 years ago

I'll see what I can do!