Camb-ai / MARS5-TTS

MARS5 speech model (TTS) from CAMB.AI
https://www.camb.ai
GNU Affero General Public License v3.0
1.37k stars, 95 forks

Any way to manually force phonemes? Issue with incorrect utterances of common words #29

Open platform-kit opened 1 week ago

platform-kit commented 1 week ago

Tried to generate some outputs using this sentence from the demo's instructions:

We provide several generation candidates when you synthesize text, and attempt to pick the best one on the right.

The word "several" simply WILL NOT come out correctly. It comes out as "seeval," "seeral," "seel," etc.

I am sure this is a byproduct of being an early release, but I want to flag it now. In addition to further training, I think there ought to be a way to manually pass in pronunciation data using SSML.

Example:

 <phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">manitoba</phoneme>
 <phoneme alphabet="x-sampa" ph='m@"hA:g@%ni:'>mahogany</phoneme>

This way, if the autoregressive model repeatedly guesses incorrectly (e.g. on an unusual name), there is a way to force the right result.
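
As a rough sketch of the input side of this idea, the snippet below splits an SSML string into plain-text segments and forced-phoneme segments using only the Python standard library. It is not part of MARS5 and makes no claim about how the model would consume phonemes; `Segment` and `parse_ssml` are hypothetical names, and the `<speak>` wrapper is assumed because the fragments above have no root element.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One chunk of input text (hypothetical helper type, not a MARS5 API)."""
    text: str                        # surface text as written
    alphabet: Optional[str] = None   # e.g. "ipa" or "x-sampa"; None for plain text
    phonemes: Optional[str] = None   # forced pronunciation, if one was given


def parse_ssml(ssml: str) -> List[Segment]:
    """Split an SSML string into plain-text and forced-phoneme segments.

    Only direct <phoneme> children of the root element are treated specially;
    any other tag is flattened to its text content.
    """
    root = ET.fromstring(ssml)
    segments: List[Segment] = []

    def add_text(text: Optional[str]) -> None:
        # Skip empty or whitespace-only runs between tags.
        if text and text.strip():
            segments.append(Segment(text.strip()))

    add_text(root.text)
    for child in root:
        if child.tag == "phoneme":
            segments.append(Segment(
                text=(child.text or "").strip(),
                alphabet=child.get("alphabet"),
                phonemes=child.get("ph"),
            ))
        else:
            # Unknown tag: keep its text, ignore the markup.
            add_text("".join(child.itertext()))
        add_text(child.tail)
    return segments


if __name__ == "__main__":
    demo = ('<speak>She moved to '
            '<phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">manitoba</phoneme>'
            ' last year.</speak>')
    for seg in parse_ssml(demo):
        print(seg)
```

A front end built this way could pass the plain segments through the usual text pipeline and hand the `phonemes` strings to whatever pronunciation override the model eventually exposes, which is the part this issue is asking for.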

akshhack commented 1 week ago

Hey @platform-kit, just to be sure: did you try out the new checkpoints?

platform-kit commented 1 week ago

@akshhack I'm not sure. I used the demo linked on the readme, yesterday. So if you updated the demo with the new checkpoints, then, yes.

akshhack commented 1 week ago

Let me know if you can confirm, and always drop some samples. @NourAlMerey, it'd be useful for us to maybe create a report / issue template. Thanks!

NourAlMerey commented 1 week ago

@akshhack good idea. Will do that.

platform-kit commented 6 days ago

@akshhack Here's a sample from the replicate demo

Input: We provide several generation candidates when you synthesize text, and attempt to pick the best one on the right.

Ref Audio: https://files.catbox.moe/be6df3.wav

Transcript: We actually haven't managed to meet demand.

Output: https://files.catbox.moe/07ru0x.mp3

API version: https://replicate.com/camb-ai/mars5-tts/versions/097744a80bc07de9293fd35f9997bb86dbbf68a11a1d98c3e1c2295ee5bb89ab

platform-kit commented 1 day ago

Hi guys, just bumping this, as I was able to ship a demo on Replicate that uses the new weights and returns audio in the browser-based UI.

Here's the sample. As you can see, it is still not producing correct prosody (notice the mispronunciation of "quality audio"): https://replicate.delivery/pbxt/7XRjuEf1b2QdSSFeTsbFRblEG8ft4nWV3cdfXuK7GeNfFHegJA/output.mp3

replicate demo

predict.py

hubconf.py with latest weights