aedocw closed this 6 months ago
Not sure if this is any help, but I believe it's the repeated encoding to MP3 and other lossy formats that's creating the audible artifacts. Each time you encode to a lossy format such as MP3 or AAC, the encoder throws away parts of the audio based on its compression algorithm. Since the original files, not to mention the temp files, are encoded as MP3 at a very low bitrate, more audio is discarded on every pass, and it can't be recovered. I believe that is what creates the audible artifacts. The advantage of the epub2tts implementation is that the temporary files are at least in a lossless format: as long as you stay in a lossless format like WAV or FLAC, you can re-encode between formats as many times as you want without any loss of quality. You don't get back the audio that was already discarded, but at least there's no further degradation. The problem comes when you switch to formats like MP3, especially at a bitrate as low as 32 kbps, which is extremely low by today's standards. I'm happy to give a more in-depth explanation if that helps, although this is just what I have observed.
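Python's standard library has no MP3 encoder, so the sketch below only illustrates the lossless half of that argument: writing 16-bit PCM samples to a WAV file and reading them back ten times in a row reproduces every sample bit for bit. The helper names, the 24 kHz rate and the test tone are illustrative assumptions, not anything taken from epub2tts or this script.

```python
import io
import math
import struct
import wave

def make_tone(freq_hz=440.0, rate=24000, seconds=0.1):
    """Generate a 16-bit mono sine tone as a list of integer samples."""
    n = int(rate * seconds)
    return [int(12000 * math.sin(2 * math.pi * freq_hz * i / rate)) for i in range(n)]

def wav_roundtrip(samples, rate=24000):
    """Write samples to an in-memory 16-bit PCM WAV, then read them back."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes = 16-bit samples
        w.setframerate(rate)
        w.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    buf.seek(0)
    with wave.open(buf, "rb") as r:
        frames = r.readframes(r.getnframes())
    return list(struct.unpack(f"<{len(frames) // 2}h", frames))

original = make_tone()
current = original
for _ in range(10):  # ten lossless "re-encodes"
    current = wav_roundtrip(current)

print(current == original)  # True: PCM WAV loses nothing between passes
```

A lossy codec has no such guarantee: each MP3 pass decodes to PCM and re-encodes, so whatever the encoder discards in one generation is gone for the next.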
I ran the script a couple of times today, trying both the default voice and AvaMultilingualNeural. Both produced some very squeaky, high-pitched sounds, either in the background or when certain sounds are spoken.
Using Edge TTS, the audio produced with this script is not as good as the audio produced by epub2tts. It SHOULD be the same, since it follows the same functional steps, but maybe something is missing. Need to compare the two and figure out why.
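One cheap first step in comparing the two is to check whether the outputs even share the same format. This is a minimal sketch assuming both outputs for the same passage have already been decoded to WAV (for example with ffmpeg); the function names and file names are hypothetical. A difference in sample rate or sample width would be worth ruling out, but it is not confirmed as the cause here.

```python
import wave

def wav_params(path):
    """Return (channels, sample width in bytes, frame rate, frame count) for a WAV file."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels(), w.getsampwidth(), w.getframerate(), w.getnframes())

def param_differences(path_a, path_b):
    """List (field, value_a, value_b) for every format field that differs."""
    names = ("channels", "sample_width", "frame_rate", "frames")
    a, b = wav_params(path_a), wav_params(path_b)
    return [(name, x, y) for name, x, y in zip(names, a, b) if x != y]
```

Usage, with hypothetical file names: `param_differences("script_output.wav", "epub2tts_output.wav")` returns an empty list when the two files match in channels, sample width, rate and length.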