Any way to manually force phonemes? Issue with incorrect utterances of common words

Camb-ai / MARS5-TTS

MARS5 speech model (TTS) from CAMB.AI

https://www.camb.ai

GNU Affero General Public License v3.0

Any way to manually force phonemes? Issue with incorrect utterances of common words #29

Closed (platform-kit closed this issue 2 months ago)

platform-kit commented 5 months ago

I tried to generate some outputs using this sentence from the demo's instructions:

We provide several generation candidates when you synthesize text, and attempt to pick the best one on the right.

The word "several" simply WILL NOT come out correctly. It comes out as "seeval," "seeral," "seel," etc.

I am sure this is a byproduct of being an early release, but I want to flag it now. In addition to further training, I think there ought to be a way to manually pass in pronunciation data using SSML.

Example:

 <phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">manitoba</phoneme>
 <phoneme alphabet="x-sampa" ph='m@"hA:g@%ni:'>mahogany</phoneme>

This way, if the autoregressive model repeatedly guesses incorrectly (e.g., on an unusual name), there is a way to force the right result.
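To make the request concrete, here is a minimal sketch of the preprocessing step such a feature would need: pulling `<phoneme>` overrides out of an SSML string and returning the plain text alongside them. This is only an illustration under stated assumptions. Nothing in this thread indicates that MARS5 accepts SSML or phoneme input, the function name `extract_phoneme_overrides` is hypothetical, and the sketch assumes the fragment is wrapped in a single root element such as `<speak>`.

```python
import xml.etree.ElementTree as ET


def extract_phoneme_overrides(ssml: str):
    """Split an SSML string into plain text and a list of phoneme overrides.

    Hypothetical helper, not part of MARS5. Returns (text, overrides), where
    overrides is a list of (word, alphabet, ph) tuples in document order.
    """
    root = ET.fromstring(ssml)
    parts = []
    overrides = []

    def walk(node):
        # Drop any XML namespace prefix so "{ns}phoneme" matches "phoneme".
        local_tag = node.tag.rsplit("}", 1)[-1]
        if local_tag == "phoneme":
            # Keep the written word in the text and record the override.
            word = "".join(node.itertext()).strip()
            parts.append(word)
            overrides.append((word, node.get("alphabet"), node.get("ph")))
            return
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            # Text that follows the child's closing tag belongs to the parent.
            if child.tail:
                parts.append(child.tail)

    walk(root)
    # Collapse runs of whitespace left over from the markup.
    text = " ".join("".join(parts).split())
    return text, overrides


if __name__ == "__main__":
    example = (
        '<speak>Welcome to '
        '<phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">manitoba</phoneme>'
        ' today.</speak>'
    )
    text, overrides = extract_phoneme_overrides(example)
    print(text)
    print(overrides)
```

How the overrides would then be used is an open question: the model would have to accept phoneme sequences in place of text for the affected words, which is the capability this issue asks about.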

akshhack commented 5 months ago

Hey @platform-kit, just to be sure: did you try out the new checkpoints?

platform-kit commented 5 months ago

@akshhack I'm not sure. I used the demo linked in the README yesterday, so if you have updated the demo with the new checkpoints, then yes.

akshhack commented 5 months ago

Let me know if you can confirm, and always drop some samples. @NourAlMerey, it would be useful for us to create a report/issue template. Thanks!

NourAlMerey commented 5 months ago

@akshhack Good idea. Will do.

platform-kit commented 4 months ago

@akshhack Here's a sample from the Replicate demo:

Input: We provide several generation candidates when you synthesize text, and attempt to pick the best one on the right.

Ref Audio: https://files.catbox.moe/be6df3.wav

Transcript: We actually haven't managed to meet demand.

Output: https://files.catbox.moe/07ru0x.mp3

API version: https://replicate.com/camb-ai/mars5-tts/versions/097744a80bc07de9293fd35f9997bb86dbbf68a11a1d98c3e1c2295ee5bb89ab

platform-kit commented 4 months ago

Hi guys, just bumping this, as I was able to ship a demo on Replicate that uses the new weights and returns audio in the browser-based UI.

Here's the sample. As you can see, it is still not producing correct prosody (notice the mispronunciation of "quality audio"): https://replicate.delivery/pbxt/7XRjuEf1b2QdSSFeTsbFRblEG8ft4nWV3cdfXuK7GeNfFHegJA/output.mp3

replicate demo

predict.py

hubconf.py with latest weights

arnavmehta7 commented 4 months ago

Hey @platform-kit, would you like to open a PR against this file? https://github.com/Camb-ai/MARS5-TTS/blob/master/cog/predict.py Otherwise, I will take it up :)

platform-kit commented 4 months ago

@arnavmehta7 NP, I'll submit it in the next couple of days.