JarodMica / ai-voice-cloning

Improved generation of non-English texts #13

Closed SAnsAN-9119 closed 9 months ago

SAnsAN-9119 commented 10 months ago

Hi there! Hi Jarod!

I'm a beginner in voice generation and I'm trying to train a model on a Slavic language. Could you give me some advice on improving the quality of my training dataset?

For my first attempt, I trained a model on 20 minutes of voice. I used the base Whisper model with openai/whisper as the Whisper backend and ran Transcribe and Process with Slice Segments enabled. I then neglected to check the dataset, and the resulting generation did not differ much from what the stock autoregressive.pth produces, but it had strange sounds at the end, resembling sighs and other odd noises.
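For context, the transcription step behind Transcribe and Process can be reproduced standalone. Here is a minimal sketch with openai/whisper; the file path and language code are illustrative, and the repo may pass different options:

```python
# pip install openai-whisper
import whisper

# "base" is the model from this first attempt; later runs used large-v2.
model = whisper.load_model("base")

# The language hint is optional; Whisper auto-detects it if omitted.
result = model.transcribe("dataset/voice.wav", language="ru", verbose=False)

# Each segment carries the start/end times (seconds) and text that
# end up in whisper.json and drive the slicing step.
for seg in result["segments"]:
    print(f'{seg["start"]:7.2f} -> {seg["end"]:7.2f} {seg["text"]}')
```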

For the second attempt, I had already watched all your YouTube videos on voice generation and followed this algorithm:
1) I used 2 hours of the original voice, the large-v2 Whisper model, and openai/whisper as the Whisper backend, and ran Transcribe and Process with Slice Segments enabled.
2) I listened to the audio and corrected the start and end times of the segments in whisper.json (see the sketch after this list). Quite often, when slicing segments, sounds from a neighboring segment bleed in, or pairs of words at the junction of two segments sound almost like one word and are hard to separate even manually. Sometimes you can also hear noises at the edges of a segment, such as inhaling.
3) I corrected the text in train.txt, validation.txt, and whisper.json so that it matches the audio. (In some cases Whisper misrecognizes similar-sounding words, or misrecognizes/skips words at the beginning or end of a segment.)
4) For reliability, I also replaced all numeric values with words, and characters such as "%" likewise.
5) In the settings, I specified the autoregressive model produced by my first attempt (most likely a big mistake, but at the time I thought I would cover that model's shortcomings with plenty of correct data in further training) and started training with the settings below:

(Screenshot 2024-01-09 201516: training settings)
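Hand-editing those boundaries is tedious, so a small helper can at least make the edits scriptable. The whisper.json layout here is an assumption (a mapping from audio filename to a Whisper result with a "segments" list); adjust the keys to whatever your backend actually wrote:

```python
import json

def shift_segment(path, audio_name, index, new_start=None, new_end=None):
    """Nudge one segment's boundaries (in seconds) inside whisper.json.

    ASSUMPTION: the file maps each audio filename to a whisper result
    containing a "segments" list with "start"/"end" fields.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    seg = data[audio_name]["segments"][index]
    if new_start is not None:
        seg["start"] = new_start
    if new_end is not None:
        seg["end"] = new_end
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Example: pull segment 12's start later so a stray leading word is cut off.
shift_segment("training/myvoice/whisper.json", "voice.wav", 12, new_start=34.56)
```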

I trained the model for a week, until the progress on the charts in the web UI stopped updating. I stopped training because I was tired of waiting; the web UI said I needed to train for two more weeks, i.e. I had completed only 33% of the training. I tried generating with the intermediate model that resulted. Yes, the voice sounded better, but it still spoke with a heavy English accent, and the strange sounds at the end were still present, maybe even more so than before.

For the third attempt, I decided to start training from scratch with the stock autoregressive.pth, but to avoid a long wait I took only 30 minutes of voice. It trained quite quickly, in about half a day, but as I understood from the library's description, this kind of curve could not lead to a good result (loss_text dropped below loss_mel):

(Screenshot 2024-01-09 201502: training loss curves)

As a result, I got the scariest generation I've ever heard! Strange sounds could make up more than half of the audio file (sometimes they sounded like duplicates of the words I generated, and sometimes like random fragments of English words).

Now I'm looking for a way to avoid manually editing the text and the segment start/end times in whisper.json. I tried the large-v3 Whisper model for transcription, and to me the results were worse than with large-v2. With openai/whisper as the Whisper backend I still get quite a lot of inaccuracies at segment boundaries, but this is also the first time I've seen something resembling hallucinations: at the end of one segment there was the beginning of a word from the next segment, but several words were added to the recognized text that did not match the next segment's text (except for the first word, which matched), even though the next segment itself was recognized correctly (apart from the audio being cut off at the beginning). I also found that one of the segments is missing the conjunction "and".

(screenshot)

With m-bain/whisperx as the Whisper backend, I also found that one of the segments was missing the conjunction "and", in the same place as with openai/whisper. There was also a segment containing two words with a big chunk of silence between them, and only the first of the two words ended up in whisper.json.

(screenshot)

I was also surprised to see that openai/whisper and m-bain/whisperx write whisper.json in different formats.
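They do indeed serialize differently. If it helps, a small adapter that reduces either file to plain (start, end, text) triples makes side-by-side comparison easier; the key names below reflect the usual output shapes of the two backends and may need adjusting for your files:

```python
import json

def load_segments(path):
    """Reduce either backend's whisper.json to (start, end, text) triples.

    ASSUMPTION: the JSON is either a whisper-style dict with a "segments"
    list or a bare list of segments; whisperx segments additionally carry
    a per-word "words" list, which this flat view simply ignores.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    segments = data["segments"] if isinstance(data, dict) else data
    return [(s["start"], s["end"], s["text"].strip()) for s in segments]

for start, end, text in load_segments("whisper.json"):
    print(f"{start:7.2f} -> {end:7.2f}  {text}")
```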

All of the above left me even more confused. I have a few questions I couldn't resolve on my own:
1) Is there any way to speed up the model's training? Do I understand correctly that this depends only on the amount of VRAM, or is there something else that can speed up training (without compromising quality)?
2) Which Whisper backend is better for producing a good model? So far whisperx looks stronger, because every word is marked up there. The only question is how correct that markup is, because it can also be inaccurate; as you can see from the screenshot above, one word stretched over more than 4 seconds (a sanity-check sketch for catching this follows this list).
3) If I use m-bain/whisperx, how do I insert missing words correctly, and what should I write in the "score" parameter (as I understand it, that is how confident whisperx is in the recognition)? 1? Does it even need to be specified?
4) If I use openai/whisper, do I also need to edit "tokens" when I change a segment's text? And how is that even done?
5) Can I use the built-in tokenizer.json at all, or do I need to find/create a tokenizer for a Slavic language?
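Regarding the 4-second word in question 2: implausible alignments can at least be flagged automatically. A rough sanity check over whisperx-style output, assuming each segment has a "words" list with "word"/"start"/"end"/"score" entries (the thresholds are arbitrary):

```python
import json

MAX_WORD_SECONDS = 2.0   # arbitrary: flag words "spoken" longer than this
MIN_SCORE = 0.5          # arbitrary: flag low-confidence alignments

with open("whisper.json", encoding="utf-8") as f:
    data = json.load(f)

for seg in data["segments"]:
    for w in seg.get("words", []):
        if "start" not in w:   # whisperx leaves some tokens unaligned
            continue
        duration = w["end"] - w["start"]
        if duration > MAX_WORD_SECONDS or w.get("score", 1.0) < MIN_SCORE:
            print(f'{w["start"]:7.2f}s len={duration:4.2f}s '
                  f'score={w.get("score", 0.0):.2f} {w["word"]!r}')
```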

My PC Settings: Windows 10 pro Ryzen 5 3600 Nvidia RTX 3060 Ti 32 RAM

P.S. Thank you for your time!

JarodMica commented 10 months ago

As a quick note, this repo is only optimized for English. My knowledge on training other languages is extremely limited atm, so sorry about that.

I'll do my best below to answer your questions:

  1. Training takes a long time. VRAM is a big part of it, though it's surprising it was going to take weeks for 2 hours of audio. I have trained on a 3060 12GB and yes, it is quite slow. There's no way that I know of to speed it up.
  2. Whisperx is generally better than whisper (a minimal whisperx example follows this list). However, whisperx uses an "alignment" model and I'm not sure how accurate it is for Slavic languages. It may be much worse than the English one, but that's something to check on the whisperx GitHub, as whisperx is its own giant project.
  3. This I'm not sure of. To my knowledge it is done automatically, and there wouldn't be a way to change the score of a word, as I'm guessing it's based on the whisper model itself, though there may be other parameters in whisper you can adjust. That is beyond this project.
  4. Not to my knowledge.
  5. You cannot use the built-in tokenizer for other languages; you need to create a custom tokenizer for it (a rough tokenizer-training sketch also follows below). How one does that is still a mystery to me personally and I still need to figure it out. A YouTuber you might want to look at is Nanonomad, who has trained on French and Spanish using custom tokenizers.
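As referenced in point 2, a minimal standalone whisperx pass looks roughly like this, following the whisperx README; the model size, batch size, and paths are illustrative:

```python
# pip install whisperx
import whisperx

device = "cuda"
audio = whisperx.load_audio("dataset/voice.wav")

# 1. Transcribe with the batched large-v2 model.
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=8)

# 2. Word-level alignment; quality hinges on whether a good wav2vec2
#    alignment model exists for the detected language.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata,
                        audio, device)

for seg in result["segments"]:
    for w in seg["words"]:   # each word: {"word", "start", "end", "score"}
        print(w)
```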
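On point 5, the general recipe for training a custom BPE tokenizer with HuggingFace's tokenizers library looks roughly like the sketch below. This is not a procedure this repo endorses; the vocabulary size and special tokens are assumptions that would have to match what the tortoise autoregressive model expects (the stock English tokenizer.json uses a vocabulary of roughly 255 entries):

```python
# pip install tokenizers
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# ASSUMPTIONS: vocab size and special tokens mirror the stock tortoise
# tokenizer; verify them against the tokenizer.json your checkpoint was
# trained with before using this for real.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=255,
                     special_tokens=["[STOP]", "[UNK]", "[SPACE]"])

# corpus.txt: plain transcript text in the target language; if you reuse
# train.txt, strip the "audio_path|" column from each line first.
tokenizer.train(["corpus.txt"], trainer)
tokenizer.save("slavic_tokenizer.json")
```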
SAnsAN-9119 commented 10 months ago

> You cannot use the built-in tokenizer for other languages; you need to create a custom tokenizer for it. How one does that is still a mystery to me personally and I still need to figure it out. A YouTuber you might want to look at is Nanonomad, who has trained on French and Spanish using custom tokenizers.

If I'm going to use Whisper, do I need a tokenizer? Is the tokenizer used when training the model? I also tried to find a way to get an alternative tokenizer, but nothing worked out for me.
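For context on what this question is asking: the tokenizer is used at training time to turn each transcript line into the integer token IDs the autoregressive model consumes, so its vocabulary has to cover the target language's characters. A quick round-trip test, with an illustrative path to whatever tokenizer.json your setup ships:

```python
from tokenizers import Tokenizer

# Path is illustrative -- point it at the tokenizer.json your setup uses.
tok = Tokenizer.from_file("models/tokenizer.json")

text = "привет, мир"            # sample Cyrillic text
ids = tok.encode(text).ids
print(ids)
print(tok.decode(ids))          # [UNK] runs here mean no Cyrillic coverage
```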

JarodMica commented 9 months ago

I think I have found a way to train using the built-in tokenizer, but I'll be sharing the details on my YouTube channel.

Closing for now as this is more of a discussion topic rather than an issue.