espeak-ng-mbrola sound is worse than espeak-ng-mbrola-generic (or espeak-mbrola-generic)

brailcom / speechd

Common high-level interface to speech synthesis

GNU General Public License v2.0

226 stars, 65 forks

espeak-ng-mbrola sound is worse than espeak-ng-mbrola-generic (or espeak-mbrola-generic) #949

Closed cwendling closed 1 month ago

cwendling commented 3 months ago

Steps to reproduce

Compare the sound output of espeak-ng-mbrola (beware of #902) and espeak-ng-mbrola-generic: the espeak-ng-mbrola one is a lot less human-like.

I used the following command to capture a sample (using French mbrola voices):

parecord -d @DEFAULT_MONITOR@ /tmp/sample.flac & pid=$!; spd-say -w -o espeak-ng-mbrola "Voici quelques mots pour tester." -y french-mbrola-1; sleep 1; spd-say -w -o espeak-ng-mbrola-generic "Voici quelques mots pour tester." -y fr1; kill "$pid"

Obtained behavior

The espeak-ng-mbrola one is a lot less human-like; the espeak-ng-mbrola-generic one sounds "better".

Expected behavior

This is actually OK if it's not a bug (it might just be that the mbrola synthesizer is better at this, which is fine); but speech-dispatcher lists espeak-ng-mbrola as "better" than espeak-ng-mbrola-generic (in module_compare() from src/server/speechd.c). It might be true for the feature set, but it's not for (my) ears.

IMO the sorting should take into account the perceived voice quality as well as other factors, especially when two modules otherwise look so similar to the user.

sthibaul commented 3 months ago

espeak-ng-mbrola is not supposed to sound worse than espeak-ng-mbrola-generic. They're supposed to sound exactly the same, since the code is actually the same: libespeak-ng is the same, and in the non-generic case it calls the external mbrola tool, which is essentially the same as the pipeline in the -generic case. If a difference makes the non-generic module worse, it should be tracked down and fixed; it's probably something dumb, such as some default parameters that for whatever reason don't end up being the same.

In the end, espeak-ng-mbrola is supposed to be better, not in terms of audio quality (since they're expected to be exactly the same) but in terms of flexibility (audio pipelining, stopping, etc.)

sthibaul commented 1 month ago

Apparently espeak-ng doesn't report the proper audio rate (22 kHz instead of 16 kHz).
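
To put numbers on that mismatch, here is a minimal sketch in Python (not code from either project) of what happens when audio synthesized at 16000 Hz is played as if it were 22050 Hz. The helper name `rate_mismatch` is made up for illustration, and the two rates are assumed to be exactly 16000 and 22050, as in the module logs later in this thread.

```python
import math

def rate_mismatch(true_rate_hz, assumed_rate_hz, n_samples):
    """Describe the effect of playing audio synthesized at true_rate_hz
    as if it had been synthesized at assumed_rate_hz."""
    speed = assumed_rate_hz / true_rate_hz      # > 1 means faster playback
    semitones = 12 * math.log2(speed)           # resulting pitch shift
    intended_s = n_samples / true_rate_hz       # how long it should last
    actual_s = n_samples / assumed_rate_hz      # how long it actually lasts
    return speed, semitones, intended_s, actual_s

# One second of 16000 Hz speech, played back at 22050 Hz.
speed, semitones, intended_s, actual_s = rate_mismatch(16000, 22050, 16000)
print(f"speed x{speed:.3f}, pitch shift {semitones:+.2f} semitones")
print(f"{intended_s:.3f} s of speech is played in {actual_s:.3f} s")
```

The ratio 22050/16000 is about 1.38, so the speech would be roughly 38% faster and noticeably higher-pitched, which would be consistent with a voice that sounds less human-like.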

sthibaul commented 1 month ago

I believe this is now fixed

cwendling commented 1 month ago

@sthibaul indeed, thanks! However, (trying a patched 0.11.4 ATM, I'll try testing true master at some point) now the sentence end is cut off. I don't know if it's a direct consequence of this or it just reveals a side effect, but it's affecting both spd-say and Orca.

sthibaul commented 1 month ago

> now the sentence end is cut off

Was it not the case before patching?

cwendling commented 1 month ago

No, the sound was weird and fast but not cut off at the end, at least not that I can hear.

I didn't look into it, but maybe there's another discrepancy with the sample rate that leads to an incorrect timing computation or something? Or maybe a bug that drops the last sample now has more impact because that sample spans more time?

sthibaul commented 1 month ago

That would completely depend on your configuration. Here with master and the pulse backend, I'm not noticing anything.

sthibaul commented 1 month ago

Does it also cut off with french-mbrola-2?

sthibaul commented 1 month ago

Does the cut-off show up in parecord too?

cwendling commented 1 month ago

> Does it also cut off with french-mbrola-2?

I don't have -2, but it doesn't happen with -4. However, that voice always sounds a bit weird (with the generic module or not), and it didn't change with the patch.

> Does the cut-off show up in parecord too?

Yes.

I tried debugging this a bit, and the issue seems to be that espeak sends a spurious sample rate change event, or that it's not handled in the correct order relative to the sample collection. With french-mbrola-4, I get 22050 all the way:

[…]
 Thu Sep 19 11:03:10 2024 [390224]: Espeak-ng: Successfully set synthesis voice to french-mbrola-4.
 Thu Sep 19 11:03:10 2024 [391570]: Espeak-ng: Got sample rate 22050
 Thu Sep 19 11:03:10 2024 [391624]: Espeak-ng: Got sample rate 22050
 Thu Sep 19 11:03:10 2024 [391666]: Espeak-ng: Got sample rate 22050
 Thu Sep 19 11:03:10 2024 [391688]: Espeak-ng: pushing 6616 samples
 Thu Sep 19 11:03:10 2024 [392130]: Espeak-ng: pushing 6616 samples
 Thu Sep 19 11:03:10 2024 [392489]: Espeak-ng: pushing 6616 samples
 Thu Sep 19 11:03:10 2024 [392790]: Espeak-ng: pushing 6616 samples
 Thu Sep 19 11:03:10 2024 [393152]: Espeak-ng: Got sample rate 22050
 Thu Sep 19 11:03:10 2024 [393189]: Espeak-ng: pushing 5867 samples
 Thu Sep 19 11:03:10 2024 [492627]: Espeak-ng: pushing 418 samples
 Thu Sep 19 11:03:10 2024 [529520]: Espeak-ng: Leaving module_speak() normally.

Whereas with french-mbrola-1, I get 16000 temporarily, and it reverts to 22050 for the last couple of sample batches:

[…]
 Thu Sep 19 11:03:13 2024 [483213]: Espeak-ng: Successfully set synthesis voice to french-mbrola-1.
 Thu Sep 19 11:03:13 2024 [486501]: Espeak-ng: Got sample rate 22050
 Thu Sep 19 11:03:13 2024 [486709]: Espeak-ng: Got sample rate 22050
 Thu Sep 19 11:03:13 2024 [486811]: Espeak-ng: Got sample rate 16000
 Thu Sep 19 11:03:13 2024 [486917]: Espeak-ng: pushing 6616 samples
 Thu Sep 19 11:03:13 2024 [487246]: Espeak-ng: pushing 6616 samples
 Thu Sep 19 11:03:13 2024 [487549]: Espeak-ng: pushing 6616 samples
 Thu Sep 19 11:03:13 2024 [487883]: Espeak-ng: Got sample rate 22050
 Thu Sep 19 11:03:13 2024 [487992]: Espeak-ng: pushing 4787 samples
 Thu Sep 19 11:03:13 2024 [488158]: Espeak-ng: pushing 418 samples
 Thu Sep 19 11:03:13 2024 [488282]: Espeak-ng: Leaving module_speak() normally.

I believe this likely explains the issue if the last samples are not actually at rate 22050.
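
As a rough cross-check of that reading, the following Python sketch pairs each "pushing N samples" line of the french-mbrola-1 log above with the most recently reported sample rate. It assumes the module applies the latest "Got sample rate" value to every batch pushed after it, which is an interpretation of the log, not something confirmed from the module source. The function name `batches_with_rate` is made up.

```python
import re

# Log excerpt copied from the french-mbrola-1 run above.
LOG = """\
 Thu Sep 19 11:03:13 2024 [486501]: Espeak-ng: Got sample rate 22050
 Thu Sep 19 11:03:13 2024 [486709]: Espeak-ng: Got sample rate 22050
 Thu Sep 19 11:03:13 2024 [486811]: Espeak-ng: Got sample rate 16000
 Thu Sep 19 11:03:13 2024 [486917]: Espeak-ng: pushing 6616 samples
 Thu Sep 19 11:03:13 2024 [487246]: Espeak-ng: pushing 6616 samples
 Thu Sep 19 11:03:13 2024 [487549]: Espeak-ng: pushing 6616 samples
 Thu Sep 19 11:03:13 2024 [487883]: Espeak-ng: Got sample rate 22050
 Thu Sep 19 11:03:13 2024 [487992]: Espeak-ng: pushing 4787 samples
 Thu Sep 19 11:03:13 2024 [488158]: Espeak-ng: pushing 418 samples
"""

def batches_with_rate(log):
    """Return (sample_count, rate_in_effect) for every pushed batch."""
    rate = None
    batches = []
    for line in log.splitlines():
        m = re.search(r"Got sample rate (\d+)", line)
        if m:
            rate = int(m.group(1))
            continue
        m = re.search(r"pushing (\d+) samples", line)
        if m:
            batches.append((int(m.group(1)), rate))
    return batches

batches = batches_with_rate(LOG)
for count, rate in batches:
    print(f"{count:5d} samples tagged {rate} Hz")

# Samples tagged 22050 Hz that are presumably 16000 Hz audio.
tail = sum(count for count, rate in batches if rate == 22050)
print(f"tail: {tail} samples, {tail / 16000:.3f} s at 16000 Hz "
      f"vs {tail / 22050:.3f} s at 22050 Hz")
```

Under that assumption, the first three batches are tagged 16000 Hz and the last two are tagged 22050 Hz, so the end of the sentence would be timed as shorter than it really is.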

cwendling commented 1 month ago

If I hack the module to force the sample rate to 16000, the sound is good, with no cut-off, using french-mbrola-1.

cwendling commented 1 month ago

I took a moment to look into this a bit, and I don't know the solution, but possibly espeak-ng (1.51) is the issue. Its own code (in speech.c's dispatch_audio()) only looks for espeakEVENT_SAMPLERATE if it's the first event in the list, which looks like a bug to me. Doing the same in speech-dispatcher's module leads to the first sentence being properly spoken with french-mbrola-1 (right sample rate, not cut off), but all subsequent ones use the 22050 sample rate again.

The espeak-ng CLI tool seems to work at first, but only until you try to mix voices with different sample rates. Basically, if you use only 22050 voices it's all good, but mixing them seems to deadlock it. For example, this works:

$ espeak-ng -m '<ssml><p><voice name="french-mbrola-4">Voici quelques mots pour tester.</voice> <voice name="English (Great Britain)">This is</voice> <voice name="french-mbrola-4">un test, tout va bien ?</voice>  <voice name="English (American)">Dunno, whatcha thinkin?</voice></p></ssml>'

but only until you replace any french-mbrola-4 with french-mbrola-1, in which case it stops in the middle of the latter voice's part.

sthibaul commented 5 days ago

Ok, leaving the espeak-ng bug aside for now and compensating here: we assume that the event list starts with the proper sample rate change, and ignore the others.
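
For illustration only, and not the actual patch: the "trust only the first event" rule can be modelled in a few lines of Python. Events are represented here as hypothetical `(kind, value)` tuples, with the string `"samplerate"` standing in for espeakEVENT_SAMPLERATE. How espeak-ng really groups events per callback is not shown in this thread, so the example lists are made up.

```python
SAMPLERATE = "samplerate"  # stand-in for espeakEVENT_SAMPLERATE (hypothetical model)

def rate_first_event_only(events, current_rate):
    """Use a sample rate event only if it is the first event in the list;
    later sample rate events in the same list are ignored."""
    if events and events[0][0] == SAMPLERATE:
        return events[0][1]
    return current_rate

def rate_every_event(events, current_rate):
    """Naive variant for comparison: the last sample rate event wins."""
    for kind, value in events:
        if kind == SAMPLERATE:
            current_rate = value
    return current_rate

# Made-up event list: a proper rate change first, then a spurious one.
events = [(SAMPLERATE, 16000), ("word", 0), (SAMPLERATE, 22050)]

print(rate_first_event_only(events, 22050))
print(rate_every_event(events, 22050))
```

With this made-up list, the first function keeps 16000 while the naive one ends up at 22050, which is the kind of late reversion visible in the french-mbrola-1 log.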