aedocw closed this issue 9 months ago
Good solution: use whisper (https://github.com/openai/whisper) to transcribe each audio chunk after it's done, then compare the transcript to the original text. The comparison should be somewhat fuzzy, since things like names may be spelled differently but pronounced the same way; fuzzywuzzy would be a good library for this. If the similarity between the original and the transcript falls below some threshold, re-encode that chunk.
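A minimal sketch of that check might look like the following. The function names and the threshold of 80 are my own illustrative choices, and the scoring here uses `difflib` from the standard library (which fuzzywuzzy's `fuzz.ratio` wraps) so the snippet runs standalone; the whisper calls are shown in comments.

```python
# Sketch: score a whisper transcript against the original chunk text
# and decide whether the chunk should be re-encoded.
from difflib import SequenceMatcher


def similarity(original: str, transcript: str) -> int:
    """Return a 0-100 similarity score, comparable to fuzzywuzzy's fuzz.ratio."""
    a, b = original.strip().lower(), transcript.strip().lower()
    return round(SequenceMatcher(None, a, b).ratio() * 100)


def chunk_needs_retry(original: str, transcript: str, threshold: int = 80) -> bool:
    """True when the transcript diverges enough that the chunk looks garbled."""
    return similarity(original, transcript) < threshold


# With whisper installed, the transcript would come from something like:
#   import whisper
#   model = whisper.load_model("base")
#   transcript = model.transcribe("chunk_0001.wav")["text"]

print(chunk_needs_retry("Call me Ishmael.", "Call me Ishmael."))  # False
print(chunk_needs_retry("Call me Ishmael.", "aurgh blub fnord"))  # True
```

Lowercasing before comparison keeps the check tolerant of capitalization differences, which matters since whisper's casing won't always match the source text.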
Sometimes when using XTTS, one sentence group/chunk will sound like nonsense. I can't reproduce it at will, but it came up a few times in one of the first long books I created.