How to fix pronunciation issues

fquirin commented 2 years ago

I've been experimenting with txt2pho and MBROLA and noticed something odd:

The sentence "Heute ist der 19.12.2021" is cut off after "19." (pronounced "neunzehnte") :-/.

I'm using this command:

echo "Heute ist der 19.12.2021" | iconv -cs -f UTF-8 -t ISO-8859-1 | ./txt2pho -m | mbrola /usr/share/mbrola/de2/de2 - test.wav

I was wondering where one can edit the rule that is responsible for this behavior.

Besides that there is a minor problem with numbers at the end of a sentence, because they will always be spoken as ordinals: "Er wurde Heute 40.". Though without context it is indeed unclear if "Er wurde Heute vierzig" or "Er wurde Heute Vierzigster" is the right version ^^.

Btw I'm avoiding 'preproc' because it has its own set of issues :sweat_smile: :see_no_evil:

GHPS commented 2 years ago

> I was wondering where one can edit the rule that is responsible for this behavior.

Well - this is not a bug in txt2pho...

The programs in this repo serve very different purposes. txt2pho is responsible for converting text to phonemes - just plain text, no complex numbers, fractions, times, dates or currencies, since that is a completely different kind of problem. This latter problem is tackled by the preprocessor preproc.

At first glance both problems seem to be quite easy to solve. As always with natural languages, both become much harder the closer one looks at them.

Take your example

echo "Heute ist der 19.12.2021" | ./preproc data/PPRules/rules.lst data/hadifix.abk

gets translated to

Heute ist der 19 12 zwei tausend einundzwanzig

which means that the ordinals are not spoken correctly. OK, let's fix rules.lst so we get

Heute ist der 19n 12n zwei tausend einundzwanzig

which is better but still incorrect, since German ordinals are declined according to their role in the sentence. So we need an algorithm that understands the parts of speech in a sentence and translates "19." into "19n", "19te" or "19ter" accordingly.

> Besides that there is a minor problem with numbers at the end of a sentence, because they will always be spoken as ordinals:

Yes, that is basically a similar problem. The preprocessor must be able to understand the meaning of "40." at the end of the sentence.

At the moment no one has volunteered to write such an elaborate version of preproc...

PS: If your text output comes from another program, it's rather easy to ensure the correctness of the spoken output:

Heute ist der 19te 12te 2021.
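If the text is generated by another program, the rewrite shown in the PS can be automated before the text reaches txt2pho. The following is only a minimal sketch and not part of txt2pho or preproc: the function name `date_to_ordinals` is made up for illustration, it assumes dates always look like DD.MM.YYYY, and it always appends "te", which is only correct after "der" (after "am" the "19n" form would be needed instead).

```shell
#!/bin/sh
# Hypothetical helper, not part of txt2pho or preproc.
# Rewrites DD.MM.YYYY as "DDte MMte YYYY" and drops leading zeros
# from day and month.
# Assumption: the date follows "der", so the "te" ending fits.
date_to_ordinals() {
    printf '%s\n' "$1" |
        sed -E 's/(^|[^0-9])0?([1-9][0-9]?)\.0?([1-9][0-9]?)\.([0-9]{4})/\1\2te \3te \4/g'
}

date_to_ordinals "Heute ist der 19.12.2021"
# prints: Heute ist der 19te 12te 2021
```

Leading zeros are dropped so that a month like "05" reaches the next stage as the plain number 5. A sentence-final dot after the year is left untouched by this sketch.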

fquirin commented 2 years ago

I'm aware of the complexity but I'm confused why 19.12.2021 becomes neunzehnte and that's it. The rest is removed completely! And I'm not even using preproc (see example above). I'm catching some edge cases already in the assistant TTS preprocessor (e.g. 10:30 Uhr -> 10 Uhr 30), but as you know dates are extremely messy in German :grimacing: ... so I was hoping to get "neunzehn punkt zwölf punkt zweitausendeinundzwanzig" for now from txt2pho as in espeak for example.
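An edge-case rewrite like the one mentioned above (10:30 Uhr -> 10 Uhr 30) can be done with a single substitution in front of txt2pho. This is a rough sketch of the idea, not SEPIA's actual pre-processor code; the function name `time_to_spoken` is hypothetical, and only HH:MM directly followed by " Uhr" is touched.

```shell
#!/bin/sh
# Hypothetical helper illustrating the "10:30 Uhr -> 10 Uhr 30" rewrite.
# Only matches H:MM or HH:MM that is directly followed by " Uhr".
time_to_spoken() {
    printf '%s\n' "$1" |
        sed -E 's/(^|[^0-9])([0-9]{1,2}):([0-9]{2}) Uhr/\1\2 Uhr \3/g'
}

time_to_spoken "Der Termin ist um 10:30 Uhr"
# prints: Der Termin ist um 10 Uhr 30
```

Minutes are copied unchanged, so values such as "05" or "00" would still need their own handling.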

GHPS commented 2 years ago

> I'm aware of the complexity but I'm confused why 19.12.2021 becomes neunzehnte and that's it.

OK, let's focus on txt2pho...

echo "Heute ist der 19.12.2021"|./txt2pho | mbrola -e /usr/share/mbrola/de2/de2 - test.wav

becomes

Heute ist der 19 Punkt 12 Punkt 2021

or

_ 10   0  86 
h 81  23  88  48  89  73  91  98  92 
OY 121  15  94  31  96  48  98  64 100  81 101  98 102 
t 83  14 104  39 104 
@ 58  24 104  59 104  93 104 
_ 41  39 103  88 103 
I 46  33 103  76 102 
s 69  13 102  42 101  71 101 100 100 
t 70  29  98 
d 48 
e: 57   4  96  39  95  74  95 
6 70   7  94  36  96  64  98  93 100 
n 53  28 101  66 102 
OY 109   2 103  20 103  39 103  57 103  75 103  94 103 
n 56  23 102  59 102  95 102 
t 66   5 101  35 100 
s 63  17 100  49  99  81  99 
e: 56  14  98  50  97  86  97 
n 57  21  96  56  95  91  94 
p 92  15  92  37  91 
U 66   5  91  35  93  65  95  95  97 
N 60  28  98  62 100  95 101 
k 63   3 102  35 103 
t 52  17 102 
s 57  16 101  51 101  86 100 
v 37  32 100  86  99 
9 64  23  98  55  98  86  97 
l 61  18  97  51  99  84 100 
f 57  18 101  53 102  88 102 
p 98   7 102  28 102  48 101 
U 66  23 101  53 101  83 100 
N 61  15 100  48  99  80 100 
k 64  20 102 
t 54  33 102 
s 58  29 100  64 100  98  99 
v 33  58  99 
aI 119   5  98  22  97  39  97  55  96  72  95  89  95 
t 68  15  98  44  99 
aU 84  23  98  46  98  70  98  94  98 
z 34  44  97 
E 57   2  97  37  96  72  96 
n 49   8  95  49  95  90  94 
d 49   2  92 
aI 100   6  92  26  91  46  90  66  90  86  89 
n 20  30  88 
U 51  12  88  51  87  90  87 
n 52  29  86  67  86 
t 46  37  85 
s 46  13  85  57  85 100  85 
v 15 
a 75   7  85  33  85  60  84  87  84 
n 48  21  83  62  83 
t 39   5  81 
s 49   6  81  47  80  88  79 
I 43  33  79  79  78 
C 63  17  77  49  77  81  76 
s 60  13  76  47  76  80  76 
_ 483   2  85   6  85  10  85  14  85  18  85  22  85  27  85  31  85  35  85  39  85

fquirin commented 2 years ago

Ah sorry, there was a dot missing. Try this: echo "Heute ist der 19.12.2021."|./txt2pho | mbrola -e /usr/share/mbrola/de2/de2 - test.wav

GHPS commented 2 years ago

Yes - can see the problem now.

I'm wondering what the cause of this strange behaviour is.

fquirin commented 2 years ago

I thought it was a rule defined somewhere since it actually transforms 19. to ordinal. Maybe it fails to handle 2021. but then I'd expect to hear 12. at least. Any files I could check for ordinal transformation?

fquirin commented 2 years ago

> Yes - can see the problem now. I'm wondering what the cause for this strange behaviour is.

Did you find out anything new about this? I've seen there were some recent commits related to dates :slightly_smiling_face:

fquirin commented 2 years ago

@GHPS I've integrated txt2pho in the latest SEPIA-Home release :slightly_smiling_face: . Here are instructions to install it.

I really like the voices but from time to time I find some strange artifacts (that don't appear in espeak or default MBROLA). For example if you ask SEPIA for the date you will get the answer "Heute ist der 12.05.2022" but what you hear is really weird: "Heute ist der zwölft punkt null fünf null zwei null zwei zwei punkt zweitausendzweiundzwanzig" :sweat_smile: :see_no_evil: .

My "speak" script looks like this (arguments: gender, voice, text): echo "$3" | iconv -cs -f UTF-8 -t ISO-8859-1 | ./txt2pho "-$1" | mbrola /usr/share/mbrola/"$2"

GHPS commented 2 years ago

> @GHPS I've integrated txt2pho in the latest SEPIA-Home release :slightly_smiling_face: . Here are instructions to install it.

Great - SEPIA is a very promising project. I'll link to the instructions in the readme of this project.

Concerning the pronunciation issue: The log files should give some insight into what is going wrong.

I'll take a deeper look into the code next week...

fquirin commented 2 years ago

Great, thanks! :slightly_smiling_face: I'll try to fix the problem with dates in SEPIA's own TTS pre-processor in the meantime. It seems German dates have been a pain for TTS since the dawn of time =)

fquirin commented 2 years ago

Hi @GHPS

I found another issue with the pronunciation, again related to "." after numbers 😢.

echo "Licht steht auf 70." | iconv -cs -f UTF-8 -t ISO-8859-1 | ./txt2pho -m | mbrola /usr/share/mbrola/de3/de3 - test.wav

The "70" will not be spoken at all. It works when I remove the "." at the end.
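Since the number is spoken once the trailing "." is gone, one workaround is to strip a sentence-final dot that directly follows a digit before the text is piped into txt2pho. A minimal sketch, assuming one sentence per call; the function name `strip_final_dot` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical workaround: remove a final "." that directly follows a
# digit, so the number is not dropped. Other sentence endings are kept.
strip_final_dot() {
    printf '%s\n' "$1" | sed -E 's/([0-9])\.[[:space:]]*$/\1/'
}

strip_final_dot "Licht steht auf 70."
# prints: Licht steht auf 70
```

The result can then be fed into the same iconv, txt2pho and mbrola pipeline as above. The trade-off is that a number at the end of a sentence is then always treated as a cardinal, never as an ordinal.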

GHPS commented 2 years ago

Thanks for the information.

> The "70" will not be spoken at all. It works when I remove the "." at the end.

That is in principle the same problem as discussed above: txt2pho converts a stream of text to phonemes - but has no concept of parts of speech or even complete sentences. In this context the character string "70." has no meaning, since it is neither a word nor a valid German number. It is therefore ignored.

That is why the preprocessor is necessary. It uses a number of heuristics to decide whether "70." means "siebzigster" or "siebzig" at the end of the sentence. It even understands constructs like "70.000".

In short: Use preproc to convert numbers or whole sentences before sending the stream to txt2pho.

fquirin commented 2 years ago

> In short: Use preproc to convert numbers or whole sentences before sending the stream to txt2pho

Unfortunately, the preprocessor shows some weird behavior as well :-/ For example:

echo "Der 70. Geburtstag ist am 01.01.2023" | iconv -cs -f UTF-8 -t ISO-8859-1 | ./preproc -r data/preproc.rls -a data/preproc.abk

gets translated to

Der siebzigste Geburtstag ist am 01n 01n zwei tausend dreiundzwanzig

I initially removed the preprocessor because I'm doing my own processing first, but it may still be the better option compared to losing numbers completely 😅

GHPS / txt2pho

How to fix pronunciation issues #5