Closed cwendling closed 1 month ago
espeak-ng-mbrola is not supposed to sound worse than espeak-ng-mbrola-generic; they're supposed to sound exactly the same, since the code is actually the same: libespeak-ng is the same, and in the non-generic case it calls the external mbrola tool, which is essentially the same pipeline as in the -generic case. If a difference makes the non-generic module worse, it should be tracked down and fixed; it's probably something dumb, such as some default parameters that for whatever reason don't end up being the same.
In the end, espeak-ng-mbrola is supposed to be better, not in terms of audio quality (since they're expected to be exactly the same) but in terms of flexibility (audio pipelining, stopping, etc.)
Apparently espeak-ng doesn't report the proper audio rate (22 kHz instead of 16 kHz).
I believe this is now fixed
@sthibaul indeed, thanks! However, (trying a patched 0.11.4 ATM, I'll try testing true master at some point) now the sentence end is cut off. I don't know if it's a direct consequence of this or it just reveals a side effect, but it's affecting both spd-say and Orca.
now the sentence end is cut off
Was it not the case before patching?
No, the sound was weird and fast but not cut off at the end, at least not that I can hear.
I didn't look into it, but maybe there's another discrepancy with the sample rate leading to incorrect timing computation or something? Or a bug dropping the last sample could have more impact maybe as it spans more?
That would completely depend on your configuration. Here with master and the pulse backend, I'm not noticing anything.
Does it also cut off with french-mbrola-2?
Does the cut-off show up in parecord too?
Does it also cut off with french-mbrola-2?
I don't have -2, but it doesn't happen with -4. However, this voice always sounds a bit weird (with the generic or not), and didn't change with the patching.
Does the cut-off show up in parecord too?
Yes.
I tried debugging this a bit, and the issue seems to be that espeak sends a spurious sample rate change event, or that the event is not handled in the correct order relative to the sample collection. With french-mbrola-4, I get 22050 all the way:
[…]
Thu Sep 19 11:03:10 2024 [390224]: Espeak-ng: Successfully set synthesis voice to french-mbrola-4.
Thu Sep 19 11:03:10 2024 [391570]: Espeak-ng: Got sample rate 22050
Thu Sep 19 11:03:10 2024 [391624]: Espeak-ng: Got sample rate 22050
Thu Sep 19 11:03:10 2024 [391666]: Espeak-ng: Got sample rate 22050
Thu Sep 19 11:03:10 2024 [391688]: Espeak-ng: pushing 6616 samples
Thu Sep 19 11:03:10 2024 [392130]: Espeak-ng: pushing 6616 samples
Thu Sep 19 11:03:10 2024 [392489]: Espeak-ng: pushing 6616 samples
Thu Sep 19 11:03:10 2024 [392790]: Espeak-ng: pushing 6616 samples
Thu Sep 19 11:03:10 2024 [393152]: Espeak-ng: Got sample rate 22050
Thu Sep 19 11:03:10 2024 [393189]: Espeak-ng: pushing 5867 samples
Thu Sep 19 11:03:10 2024 [492627]: Espeak-ng: pushing 418 samples
Thu Sep 19 11:03:10 2024 [529520]: Espeak-ng: Leaving module_speak() normally.
While with french-mbrola-1, I get 16000 temporarily, and it reverts back to 22050 for the last couple of sample batches:
[…]
Thu Sep 19 11:03:13 2024 [483213]: Espeak-ng: Successfully set synthesis voice to french-mbrola-1.
Thu Sep 19 11:03:13 2024 [486501]: Espeak-ng: Got sample rate 22050
Thu Sep 19 11:03:13 2024 [486709]: Espeak-ng: Got sample rate 22050
Thu Sep 19 11:03:13 2024 [486811]: Espeak-ng: Got sample rate 16000
Thu Sep 19 11:03:13 2024 [486917]: Espeak-ng: pushing 6616 samples
Thu Sep 19 11:03:13 2024 [487246]: Espeak-ng: pushing 6616 samples
Thu Sep 19 11:03:13 2024 [487549]: Espeak-ng: pushing 6616 samples
Thu Sep 19 11:03:13 2024 [487883]: Espeak-ng: Got sample rate 22050
Thu Sep 19 11:03:13 2024 [487992]: Espeak-ng: pushing 4787 samples
Thu Sep 19 11:03:13 2024 [488158]: Espeak-ng: pushing 418 samples
Thu Sep 19 11:03:13 2024 [488282]: Espeak-ng: Leaving module_speak() normally.
I believe this likely explains the issue if the last samples are not actually at rate 22050.
If I hack to force sample rate to 16000 the sound is good with no cutoff using french-mbrola-1.
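To put numbers on the rate mismatch, here is a small sketch of the arithmetic. The helper names are made up for illustration and are not part of speech-dispatcher or espeak-ng; the only figures taken from the logs above are the 6616-sample batch size and the two rates.

```c
#include <assert.h>

/* Duration in milliseconds of `samples` mono samples played at `rate_hz`. */
double duration_ms(long samples, long rate_hz)
{
    assert(rate_hz > 0);
    return 1000.0 * (double)samples / (double)rate_hz;
}

/* Speed-up factor heard when audio produced at `true_hz`
 * is played back at `played_hz`. */
double speedup(long true_hz, long played_hz)
{
    assert(true_hz > 0);
    return (double)played_hz / (double)true_hz;
}
```

A 6616-sample batch lasts 413.5 ms at 16000 Hz but only about 300 ms at 22050 Hz, so 16000 Hz audio played at 22050 Hz runs roughly 1.38 times too fast. That matches the "weird and fast" sound; whether it also fully explains the cut-off is not something this arithmetic shows.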
I took a moment to look into this a bit, and I don't know the solution, but possibly espeak-ng (1.51) is the issue. Its own code (in speech.c's dispatch_audio()) only looks for espeakEVENT_SAMPLERATE if it's the first event in the list, which looks like a bug to me. Doing the same in speech-dispatcher's module leads to the first sentence being properly spoken with french-mbrola-1 (right sample rate, not cut off), but all subsequent ones use the 22050 sample rate again.
The espeak-ng CLI tool seems to work at first, but only until you try to mix voices with different sample rates. Basically, if you use only 22050 voices it's all good, but mixing them seems to deadlock it. For example, this works:
$ espeak-ng -m '<ssml><p><voice name="french-mbrola-4">Voici quelques mots pour tester.</voice> <voice name="English (Great Britain)">This is</voice> <voice name="french-mbrola-4">un test, tout va bien ?</voice> <voice name="English (American)">Dunno, whatcha thinkin?</voice></p></ssml>'
but only until you replace any french-mbrola-4 with french-mbrola-1, in which case it stops in the middle of the latter voice's part.
Ok, leaving the espeak-ng bug for now, and compensating here, assuming that the event list starts with the proper sample rate change, and we ignore the others.
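Roughly, the compensation amounts to something like the sketch below. This is not the actual module code: `struct ev`, `enum ev_type` and `rate_for_batch()` are hypothetical stand-ins so the example compiles on its own. A real module would work on espeak-ng's `espeak_EVENT` list and `espeakEVENT_SAMPLERATE`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the espeak-ng event types (assumption:
 * a real module would use espeak_EVENT and the espeakEVENT_* constants). */
enum ev_type { EV_LIST_TERMINATED = 0, EV_WORD, EV_SAMPLERATE };

struct ev {
    enum ev_type type;
    int number; /* for EV_SAMPLERATE: the announced rate in Hz */
};

/* Pick the sample rate for one batch of samples.
 * Only the first event of the list is trusted: if it is a sample rate
 * event, its rate is used; any later sample rate event in the same list
 * is ignored and the current rate is kept. */
int rate_for_batch(const struct ev *events, int current_hz)
{
    assert(current_hz > 0);
    if (events != NULL &&
        events[0].type == EV_SAMPLERATE &&
        events[0].number > 0)
        return events[0].number;
    return current_hz;
}
```

With a list that starts with a 16000 event and later contains a 22050 event, this returns 16000. A list that does not start with a sample rate event leaves the current rate untouched.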
Steps to reproduce
Compare sound output between espeak-ng-mbrola (beware of #902) and espeak-ng-mbrola-generic: the espeak-ng-mbrola one is a lot less human-like.
I used the following command to capture a sample (using French mbrola voices):
Obtained behavior
The espeak-ng-mbrola one is a lot less human-like; the espeak-ng-mbrola-generic one sounds "better".
Expected behavior
This is actually OK if it's not a bug (it might just be that the mbrola synthesizer is better at this, which is fine); but speech-dispatcher lists espeak-ng-mbrola as "better" than espeak-ng-mbrola-generic (in module_compare() from src/server/speechd.c). It might be true for the feature set, but it's not for (my) ears. IMO the sorting should take into account the perceived voice quality as well as other factors, especially when two modules otherwise look so similar to the user.