RVC-Boss / GPT-SoVITS

1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
MIT License

Incorrect readings [Japanese] #99

Open Kamikadashi opened 9 months ago

Kamikadashi commented 9 months ago

There is currently a problem with incorrect readings that plagues all available text-to-speech (TTS) solutions. It’s essentially impossible to fully rely on a TTS that makes these errors. For instance, GPT-SoVITS doesn’t handle uncommon kanji: it skips 嗤 entirely in “自分の勘の良さに嗤ってしまう.” It also misreads 呆れる in “呆れる青子だが、半分は本気で感心していた.” and 彼女 in “そんな事を訊くなんて、まったく彼女らしくない.” And it consistently reads 風 as かぜ even where it should be ふう, for example in “久遠寺有珠はそういう風に育てられている.”

In the past, I’ve explored ways to fix this for a different TTS and found Yomikata. It can solve the problem for some words, but unfortunately, it’s far from perfect and its development seems to have stalled.

I’ve been considering whether a new model could be trained for this purpose using data generated via ChatGPT. In the tests I’ve run so far, with the right prompt ChatGPT seems to pick the correct reading about 98% of the time.

Is this a viable idea? If yes, how large should the dataset be and what should its structure be to successfully solve the main problem in question?
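
To make the structure question concrete, here is a minimal sketch of what one record in such a dataset could look like. Everything in it is hypothetical: the `ReadingExample` class, its field names, and the JSONL layout are illustrative assumptions, not a format used by Yomikata or GPT-SoVITS. The second sentence is made up for contrast and is not taken from the examples above.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ReadingExample:
    """One labeled occurrence of an ambiguous word (hypothetical schema)."""
    sentence: str  # full sentence that provides the context
    start: int     # character offset where the ambiguous word begins
    end: int       # exclusive end offset
    reading: str   # intended reading, in hiragana

    @property
    def surface(self) -> str:
        # The ambiguous word as written in the sentence.
        return self.sentence[self.start:self.end]


def make_example(sentence: str, word: str, reading: str) -> ReadingExample:
    # Label the first occurrence of `word`; raises ValueError if it is absent.
    start = sentence.index(word)
    return ReadingExample(sentence, start, start + len(word), reading)


def to_jsonl(examples) -> str:
    # One JSON object per line; ensure_ascii=False keeps the kana readable.
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in examples)


examples = [
    make_example("久遠寺有珠はそういう風に育てられている", "風", "ふう"),
    make_example("今日は風が強い", "風", "かぜ"),  # illustrative contrast sentence
]
print(to_jsonl(examples))
```

Storing character offsets rather than just the word would let a sentence with two occurrences of the same kanji carry a different label for each one.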

RVC-Boss commented 9 months ago

@Kamikadashi Here is the Japanese text frontend: https://github.com/RVC-Boss/GPT-SoVITS/blob/main/GPT_SoVITS/text/japanese.py My knowledge of Japanese is limited, so others may need to help improve this code.
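
As a stopgap until the frontend itself is improved, one possible approach is to rewrite known-problem phrases into kana before the text reaches the frontend. The sketch below is only an assumption about how that could be done: `READING_OVERRIDES` and `apply_reading_overrides` are hypothetical names, nothing here exists in `japanese.py`, and a fixed phrase table covers only the cases someone has listed by hand.

```python
# Hypothetical phrase-level overrides: (phrase as written, phrase with the
# ambiguous kanji spelled out in kana). Only unambiguous phrases belong here.
READING_OVERRIDES = [
    ("そういう風", "そういうふう"),
    ("こういう風", "こういうふう"),
    ("どういう風", "どういうふう"),
]


def apply_reading_overrides(text: str, overrides=READING_OVERRIDES) -> str:
    """Replace each listed phrase with its kana spelling.

    Longer phrases are applied first so that a short entry cannot
    clobber part of a longer one.
    """
    for phrase, replacement in sorted(overrides, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(phrase, replacement)
    return text


if __name__ == "__main__":
    # 風 inside the listed phrase is rewritten; a bare 風 is left alone.
    print(apply_reading_overrides("久遠寺有珠はそういう風に育てられている"))
    print(apply_reading_overrides("今日は風が強い"))
```

The output of `apply_reading_overrides` would then be passed to the existing frontend in place of the raw text. This does not solve the general disambiguation problem, since 風 outside the listed phrases is untouched.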