Closed: Nikanoru closed this issue 4 months ago
This is expected behavior, and I'm not sure of an easy way to address these two things.
The first one, where the first word of a new chapter usually gets a space after its first letter, is due to the ebook formatting. Many (most?) books set the first letter of the first word much larger, sometimes in a different font. When the text is extracted, the letter and the rest of the word get pulled out as two individual elements. Now that I think about it, though, maybe I could check whether the first element is a single character followed by a space, and drop that space. The problem is that this is not super consistent: some books I have converted to text did not have this issue.
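As a rough illustration of the "single character followed by a space" check described above, here is a minimal sketch. The helper name `join_drop_cap` is hypothetical and is not part of epub2tts. It assumes the drop-cap letter is uppercase and comes first in the paragraph. It deliberately skips "A" and "I", since those are ordinary one-letter words, which also means a drop-capped "A fter" would not be repaired.

```python
import re

# Hypothetical helper, not taken from epub2tts.
# Matches an uppercase letter at the start of the text (after optional
# whitespace), followed by one space and then another letter.
# "A" and "I" are excluded because they are real one-letter words.
_DROP_CAP = re.compile(r"^(\s*)([B-HJ-Z]) (?=[A-Za-z])")


def join_drop_cap(paragraph: str) -> str:
    """Rejoin a detached drop-cap letter with the rest of its word."""
    # Keep the leading whitespace and the letter, drop the single space.
    return _DROP_CAP.sub(r"\1\2", paragraph, count=1)


if __name__ == "__main__":
    print(join_drop_cap("W ord of the day"))
    print(join_drop_cap("A cat sat on the mat"))
```

Because the check only looks at the very start of a paragraph, it should leave books without the drop-cap issue untouched, but it is a heuristic and may still misfire on unusual openings.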
The second problem (where "-" is converted to ",") is something I put in when I found that XTTS would sometimes make weird sounds when there was punctuation inside a word. During conversion, a bunch of "special characters" are replaced with commas. It might be worth adding a flag that lets the user skip this step if desired; any audio issues or other errors that come up would then just be a side effect of choosing to "skip sanitize".
I'll keep this issue open, and will add --skip-sanitize as an option (but it might be a little while before I get to it).
I'm adding --skip-cleanup as an option that, if specified, will not remove special characters. It should help with what you ran into.
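For anyone curious how an opt-out like this can be wired up, here is a minimal sketch. Only the flag name --skip-cleanup comes from this thread. The function names, the set of characters replaced, and the parser layout are assumptions for illustration and may not match what epub2tts actually does.

```python
import argparse
import re

# Assumed character set: runs of hyphens, en dashes and em dashes.
# The real list of "special characters" in epub2tts may differ.
_SPECIAL = re.compile(r"[-\u2013\u2014]+")


def cleanup(text: str, skip_cleanup: bool = False) -> str:
    """Replace runs of special characters with a comma unless skipped."""
    if skip_cleanup:
        return text
    return _SPECIAL.sub(",", text)


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the flag discussed here."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--skip-cleanup",
        action="store_true",
        help="do not replace special characters with commas",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(cleanup("well-known -- mostly", skip_cleanup=args.skip_cleanup))
```

With the flag off, the text is cleaned as before; with it on, the text passes through unchanged, so any odd sounds from the TTS model are the user's trade-off.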
I hope it's not an issue if I respond here; I am not sure how GitHub works with closed issues.
Thank you for your swift action on this issue :)
I wonder, since the two different programs I tested for epub -> txt did not create those extra spaces, would it be possible to get inspiration from them? I am not sure how all this works, but both programs are open source here on GitHub.
Anyways, great to see you are still working on your program and helping folks out. :) If there will ever be a possibility of providing you some financial thanksgiving :P please let me know.
I will take a look when I get a chance and see if I find any inspiration :)
Hello!
A friend asked me if I could do another epub to audio with your program for him and I encountered some issues with the epub -> txt conversion.
It seems that in the first sentence of every new part, a space is added inside a word. For example, instead of "Word" it will be "W ord". Also, the "-" or "--" characters sometimes seem to turn into ",".
I wasn't sure whether something was wrong with the epub or with the conversion process, and it seems it is indeed the latter. I tried a different program to do the conversion, and the extra spaces and extra punctuation were not there, but unfortunately the nice "Part Feature" is not available there.
I included a few screenshots to highlight what I mean. Left side of the picture is from epub2tts and right side is the other program (https://github.com/kevinxiong/epub2txt).