aedocw / epub2tts

Turn an epub or text file into an audiobook
Apache License 2.0

Conversion to text file adds spaces inside words and adds extra punctuation where it is not supposed to be. #216

Closed Nikanoru closed 4 months ago

Nikanoru commented 4 months ago

Hello!

A friend asked me if I could do another epub to audio with your program for him and I encountered some issues with the epub -> txt conversion.

It seems that in the first sentence of every new part, a space is added inside a word. For example, instead of "Word" it will be "W ord". Also, the "-" or "--" characters sometimes seem to turn into ",".

I wasn't sure whether something was wrong with the epub or with the conversion process, and it seems it is indeed the latter. I tried a different program for the conversion, and the extra spaces and extra punctuation were not there, but unfortunately the nice "Part Feature" is not available there.

I included a few screenshots to highlight what I mean. The left side of each picture is from epub2tts and the right side is from the other program (https://github.com/kevinxiong/epub2txt).

[Screenshots: 2024-03-01 140229, 2024-03-01 140417, 2024-03-01 140544]

aedocw commented 4 months ago

This is expected behavior, and I'm not sure of an easy way to address these two things.

The first one, where the first word of a new chapter usually has a space after its first letter, is due to the ebook formatting. Many (most?) books will have the first letter of the first word much larger, sometimes in a different font. When extracting the text, the letter and the rest of the word get pulled out as two individual elements. Now that I think about it, though, maybe I could check whether the first element is a single character followed by a space, and drop that space. The problem is that this is not super consistent; some books I have converted to text did not have this issue.
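As a rough sketch of the check described above (not code from epub2tts; the name `join_drop_cap` and the regex are assumptions made for illustration), something like this could rejoin a drop-cap letter at the start of a paragraph:

```python
import re

# Hypothetical helper, not part of epub2tts: rejoin a drop-cap letter that
# was extracted as its own element, e.g. "W ord" -> "Word".
# Matches one capital letter, one space, then a lowercase letter.
_DROP_CAP = re.compile(r"^([A-Z]) (?=[a-z])")


def join_drop_cap(paragraph: str) -> str:
    """Drop the space after a leading single capital letter."""
    # "A" and "I" are real one-letter words, so leave them untouched.
    # Trade-off: a split word such as "A fter" will not be repaired.
    if paragraph.startswith(("A ", "I ")):
        return paragraph
    return _DROP_CAP.sub(r"\1", paragraph, count=1)


print(join_drop_cap("W ord spoken softly"))
print(join_drop_cap("A friend asked me"))
```

This only looks at the very start of a paragraph, so it would have to be applied to the first paragraph of each chapter. Books that do not split the first letter pass through unchanged.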

The second problem (where "-" is converted to ",") is something I put in when I found XTTS would sometimes make weird sounds when there was punctuation in a word. During conversion, a bunch of "special characters" are replaced with commas. It might be worth adding a flag that allows the user to skip this part if so desired; any audio issues or other errors that come up would then just be a side effect of choosing to "skip sanitize".
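To make the idea concrete, here is a minimal sketch of an optional cleanup step. It is not the actual epub2tts code: the function name `sanitize`, the `skip_cleanup` parameter, and the choice of handling only "-" and "--" are assumptions for illustration, and the real list of special characters may differ.

```python
import re

# Illustrative only: treat a single or double hyphen as a "special character".
_SPECIAL = re.compile(r"-{1,2}")


def sanitize(text: str, skip_cleanup: bool = False) -> str:
    """Replace special characters with commas unless the user opts out."""
    if skip_cleanup:
        # Leave the text untouched; any TTS oddities are the user's choice.
        return text
    return _SPECIAL.sub(",", text)


print(sanitize("wait--what"))
print(sanitize("well-known", skip_cleanup=True))
```

Keeping the opt-out as a single boolean means the default behavior stays the same for existing users.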

I'll keep this issue open, and will add --skip-sanitize as an option (but it might be a little while before I get to it).

aedocw commented 4 months ago

I'm adding --skip-cleanup as an option that, if specified, will not remove special characters. It should help with what you ran into.

Nikanoru commented 4 months ago

I hope it's not a problem if I respond here; I am not sure how GitHub handles comments on closed issues.

Thank you for your swift action on this issue :)

I wonder, since the two different programs I tested for epub -> txt did not create those extra spaces, would it be possible to get inspiration from them? I am not sure how all this works, but both programs are open source here on GitHub.

Anyways, great to see you are still working on your program and helping folks out. :) If there will ever be a possibility of providing you some financial thanksgiving :P please let me know.

aedocw commented 4 months ago

I will take a look when I get a chance and see if I find any inspiration :)