Open DeceptiveMagic opened 3 days ago
I'm not affiliated with ElevenLabs, but I think it means you have to open the generated audio in an audio editor (such as Audacity) and trim it: select the part containing the text inside the quotes and delete everything else.
You might be able to automate this by using the text-to-speech-with-timestamps API endpoint.
With this you'll receive an array pairing each character of your prompt with a timestamp. You'd then look for the first and last occurrence of the quote character, read those timestamps, and use them to trim the audio sample.
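A minimal sketch of that idea in Python. The exact response shape of the with-timestamps endpoint is an assumption here (parallel lists of characters and per-character start/end times in seconds); check the ElevenLabs API reference for the real field names before relying on this.

```python
# Sketch: locate the audio span between the first and last quote character,
# given a character-level timestamp alignment from a timestamped TTS call.
# The alignment format (parallel lists) is an assumption, not the confirmed
# ElevenLabs response schema.

def quoted_span_ms(characters, start_times_s, end_times_s, quote='"'):
    """Return (start_ms, end_ms) covering the text inside the outermost quotes."""
    quote_positions = [i for i, ch in enumerate(characters) if ch == quote]
    if len(quote_positions) < 2:
        raise ValueError("need at least an opening and a closing quote")
    first, last = quote_positions[0], quote_positions[-1]
    # Start just after the opening quote ends, stop where the closing quote begins.
    start_ms = int(end_times_s[first] * 1000)
    end_ms = int(start_times_s[last] * 1000)
    return start_ms, end_ms

# Hypothetical alignment for the prompt: she said "Hello!"
# (here each character is simply assumed to take 0.1 s)
chars = list('she said "Hello!"')
starts = [i * 0.1 for i in range(len(chars))]
ends = [(i + 1) * 0.1 for i in range(len(chars))]
print(quoted_span_ms(chars, starts, ends))  # (1000, 1600)
```

With the span in hand, a library like pydub can do the actual cut, since `AudioSegment` supports millisecond slicing: `AudioSegment.from_file("out.mp3")[start_ms:end_ms]`.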
Of course I'd be delighted to learn about a more practical approach to achieve this ;-)
Path: /speech-synthesis/prompting
The examples given for using dialogue tags make the process look simple: just add quotes around the verbal portions and describe the dialogue outside of the quotes. However, this does not work for voice clip generation, as the characters will read the entire script, including the parts outside the quotation marks.
The documentation acknowledges this issue, stating: "You will also have to somehow remove the prompt as the AI will read exactly what you give it."
What does this mean? There is no example of how to do this, and removing the prompt also removes the context. While the feature would be extremely useful for context and inflection, the current implementation forces the program to rely entirely on the AI's inference of the intended emotion.
The AI generally does a decent job of picking up on and conveying emotion. However, inflection and emotion go wrong often enough to cause real problems.
If an example could be posted of how to use the dialogue tags correctly, rather than only how not to prompt the AI, that would really help clear up some of these issues.