aedocw / epub2tts

Turn an epub or text file into an audiobook
Apache License 2.0
Speech cut off mid sentence and punctuation error #57

Closed Aamir3d closed 10 months ago

Aamir3d commented 10 months ago

Not sure if I raised this issue here before.

Issue 1: When using the epub2tts script with the default settings, I noticed many sentences were cut off in the middle, and the speech got muddled and skipped to the next sentence. This happens across models.

For example, this text from the epub: _“I’m sorry we’re late,” Captain Malloy said to Yousif’s father.

“You’re welcome any time,” the doctor answered, shaking his hand.

“The District Commissioner planned to be here,” Malloy explained. “But at the last minute something came up and he couldn’t make it. He asked me to convey to you his regrets and his congratulations. Mabrook.”

“Thank you,” the doctor said.

“The house is truly magnificent.”

“You’re very kind.”_

The audio:

https://github.com/aedocw/epub2tts/assets/122131887/5b63d1d6-d970-4815-8527-b4ae372e1dd5

Issue 2: There are punctuation errors where the letters after an apostrophe (') are vocalized separately. E.g., "They're" is spoken as "They Re".

aedocw commented 10 months ago

Regarding sentences being cut off, I am able to reproduce this. For instance, with this phrase: '“I always do my best to treat people, including those I disagree with, respectfully and will continue to do so.”' When I extract ONLY the paragraph that sentence appears in and run just that through epub2tts, it reads the whole sentence. However, when I process the entire chapter of that book, it only reads “I always do my best to treat people, including those I disagree with".

This, and the punctuation issues, are coming from Coqui-TTS. Dropping part of the longer sentences has something to do with the overall size of the text being read, but I don't know what could be done about it. It would be interesting to try dramatically shortening what is sent for TTS (i.e. instead of entire chapters, send only paragraphs). This would result in WAY more WAV files being created and then concatenated, so it would likely have a performance penalty. I haven't tried it yet because I'm not sure (programmatically) how best to break only on paragraphs while still maintaining grouping so that ultimately you end up with chapter/section breaks in the m4b file (and those breaks are based on each individual WAV file).
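One possible way to break on paragraphs without losing the chapter grouping is to tag every paragraph with the index of the chapter it came from. The snippet below is only a sketch under assumptions: it assumes each chapter is already available as a plain string with blank lines between paragraphs, and the names `split_into_paragraphs` and `plan_chunks` are made up for illustration, not part of epub2tts.

```python
import re


def split_into_paragraphs(chapter_text):
    """Split one chapter's text on blank lines, dropping empty pieces."""
    parts = re.split(r"\n\s*\n", chapter_text)
    return [p.strip() for p in parts if p.strip()]


def plan_chunks(chapters):
    """Return (chapter_index, paragraph_index, text) for every paragraph.

    Keeping the chapter index next to each chunk means the per-paragraph
    audio can later be regrouped into one file per chapter.
    """
    plan = []
    for c_idx, chapter in enumerate(chapters):
        for p_idx, para in enumerate(split_into_paragraphs(chapter)):
            plan.append((c_idx, p_idx, para))
    return plan


# Hypothetical input: two chapters, the first with two paragraphs.
chapters = [
    "First paragraph.\n\nSecond paragraph.",
    "Only paragraph of chapter two.",
]
for c_idx, p_idx, text in plan_chunks(chapters):
    print(c_idx, p_idx, text)
```

Each tuple could then be sent to TTS on its own, and the resulting files grouped by `chapter_index` before the chapter-level WAV files are written.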

I'll leave this open because it is a legitimate issue even if it's caused by a dependency (Coqui-TTS). Maybe someone else will take a shot at fixing this :)

aedocw commented 10 months ago

Ugh, now I wonder if this problem is larger than I realized :(. I'm reading along and caught another sentence in the same chapter that is cut off. Similarly, it's one with multiple commas, and it drops everything after the last comma. This is going to inspire me to try smaller chunks for reading after all.

Thank you @Aamir3d for bringing this up!

Aamir3d commented 10 months ago

You're welcome @aedocw ! Hope this gets sorted. Yours is an excellent project for audiobook conversion.

I saw another interesting project that I would like to bring to your attention https://github.com/bnsantoso/sub-to-audio . Essentially your project and this one are both using Coqui TTS as the backend. I wonder if you can extend the project to include TXT, Epub, PDF and other common text formats?

Another request would be to create a GUI for ease of use. PS: Do you use LinkedIn?

aedocw commented 10 months ago

sub-to-audio looks really interesting, I'll take a closer look. It's nice to see something properly written vs. this hack job haha.

GUI for ease of use would be nice, and is something I have had in mind for a while. What I plan to do first is make this run as a daemon with an API, then put a web interface in front of that. That could easily be run with docker/docker-compose so you could get a relatively easy path to GUI.

I am on LinkedIn at https://linkedin.com/in/christopheraedo

Aamir3d commented 10 months ago

Connected with you. Thanks for your effort and interaction on this one! Here's another superb project if you've not heard of it before by @rsxdalv - I've found it incredibly useful for short audio content/training/music. https://github.com/rsxdalv/tts-generation-webui

aedocw commented 10 months ago

That looks cool, thanks for sharing. I have played a little bit with Bark but it's not usable without a decent GPU, and I have not played with it enough to see if it's remotely usable for long-form stuff.

I am eagerly awaiting a pre-trained model here https://github.com/yl4579/StyleTTS2 - it's got a lot going for it and seems to do well with long segments. Probably still going to absolutely require a GPU but if it works well I'll definitely build that in as an option once it stabilizes some.

Aamir3d commented 10 months ago

Thanks, I'll take a look at this one. You're correct, a GPU (and a lot of VRAM) definitely helps! Although, I've found you don't need a very beefy GPU. An Nvidia 3060 with 12GB is enough to run local LLMs (7B), Stable Diffusion XL and most TTS.

aedocw commented 10 months ago

Try the branch "chunky", which creates an individual wave file for each sentence, but still rolls them up into chapter wave files so the current chapter splits work.
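A rough sketch of the roll-up idea, not the actual code in the "chunky" branch: per-sentence WAV files can be appended into one chapter WAV with Python's standard `wave` module. The helper names (`write_silence`, `concat_wavs`, `frame_count`) and the file names are hypothetical, and the silent clips only stand in for real TTS output. It assumes every part has the same channel count, sample width and sample rate.

```python
import os
import tempfile
import wave


def write_silence(path, n_frames, rate=22050):
    """Write a mono 16-bit WAV of silence; stands in for one spoken sentence."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * n_frames)


def concat_wavs(part_paths, out_path):
    """Append the frames of each part, in order, into a single WAV file.

    Assumes all parts share the same channels, sample width and rate.
    """
    with wave.open(out_path, "wb") as out:
        for i, path in enumerate(part_paths):
            with wave.open(path, "rb") as part:
                if i == 0:
                    # Copy the audio format from the first part.
                    out.setparams(part.getparams())
                out.writeframes(part.readframes(part.getnframes()))


def frame_count(path):
    """Return the number of audio frames stored in a WAV file."""
    with wave.open(path, "rb") as w:
        return w.getnframes()


# Demo: three "sentence" files rolled up into one "chapter" file.
tmp = tempfile.mkdtemp()
parts = []
for i, n in enumerate([100, 250, 50]):
    p = os.path.join(tmp, f"sentence{i}.wav")
    write_silence(p, n)
    parts.append(p)

chapter_path = os.path.join(tmp, "chapter1.wav")
concat_wavs(parts, chapter_path)
print(frame_count(chapter_path))
```

Because the chapter file is still a single WAV, chapter splits that are based on one WAV per chapter would keep working.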

VERY distressing, it did not fix the problem :(

Processing time: 0.1843724250793457
Real-time factor: 0.015357403947565627
Text splitted to sentences.
['“I always do my best to treat people, including those I disagree with, respectfully and will continue to do so.”']

It also skipped this part, probably because of the smart quotes: ['“I feel like we have lost you”'] (that just sounded like "uh").

I have noticed this in books I've listened to, but never dug into it. I'm going to have to spend some time seeing what I can do about removing special characters like those smart-quotes (and maybe all quotes in general because that should not impact the reading). As far as I can tell, Coqui-TTS does notice commas and it introduces a slight pause, so I do not want to remove them (but that's the next thing I'll do to see if that makes any difference, just for fun).

Aamir3d commented 10 months ago

Yeah, epubs do have a lot of quotes and extended punctuation, and this would be problematic.

OT: Just had a thought. Have you looked at Piper TTS, which comes with an MIT Licence? I asked the author of the TTS Gen WebUI to look into integrating it.

https://github.com/rsxdalv/tts-generation-webui/issues/191#issue-1938804196

Maybe the issue is with Coqui overall in these complex cases.

aedocw commented 10 months ago

I think I found the issue, at least with the book I was testing with. The problem was quotes and smart quotes. I'm stripping them now before sending to Coqui-TTS and it's working great for me.
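For anyone curious what that kind of cleanup might look like, here is a minimal sketch, not the exact change in the repo. The set of characters is an assumption for illustration, and the name `clean_for_tts` is made up. It also makes one choice that goes beyond what is described above: curly single quotes are turned into a plain apostrophe instead of being removed, so contractions stay in one piece.

```python
# Straight and curly double quotes to drop before sending text to TTS.
# This list is an assumption for illustration, not the project's actual list.
DOUBLE_QUOTES = '"\u201c\u201d'

# Curly single quotes are mapped to a plain apostrophe rather than removed,
# so a contraction such as They’re stays a single word.
TABLE = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    **{ch: None for ch in DOUBLE_QUOTES},
})


def clean_for_tts(text):
    """Return text with double quotes removed and curly apostrophes normalized."""
    return text.translate(TABLE)


sample = "\u201cI feel like we have lost you,\u201d she said. They\u2019re late."
print(clean_for_tts(sample))
```

Commas and periods are left alone on purpose, since Coqui-TTS appears to use them for pauses.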

Please test with your known-bad book and see if it works properly now, let me know, thanks!

(BTW I have not looked at Piper TTS but I will add it to my list of things to check out :) )

Aamir3d commented 10 months ago

Thanks! I'm going to try this out later today and see how it works with another couple ebooks! I'll share an update.

Aamir3d commented 10 months ago

@aedocw This worked really well! Thank you for your assistance with this. Both the punctuation and the 'missing speech' are now correct in the book I tried earlier.

aedocw commented 10 months ago

EXCELLENT! Thank you so much for finding this issue and noting it here, I really appreciate it, and finding and fixing this is a big improvement!