Plachtaa / VITS-fast-fine-tuning

This repo is a pipeline for VITS fine-tuning for fast speaker-adaptation TTS and many-to-many voice conversion
Apache License 2.0

Can I train an English 1-Speaker Voice? #427

Open CypherpunkSamurai opened 1 year ago

CypherpunkSamurai commented 1 year ago

Hello 👋🏼

I want to fine-tune a VITS model from a pretrained model using English voice data I have. How do I prepare the data, and which model should I use?

My current data is "metadata.csv", "voice_1.wav", ... metadata.csv contains:

voice_1|Transcript
voice_2|Transcript...

Also, do I need to use the official LJSpeech model for English?

AnyaCoder commented 1 year ago

You can train using only your prepared wav files; follow the instructions in DATA.md and LOCAL.md.
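
For reference, a rough sketch of the folder layout DATA.md expects — the custom_character_voice folder name is from this repo's docs, but this outline is from memory, so double-check DATA.md:

custom_character_voice/
├── your_speaker_name/
│   ├── voice_1.wav
│   ├── voice_2.wav
│   └── ...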

CypherpunkSamurai commented 1 year ago

Which model for English?

AnyaCoder commented 1 year ago

> Which model for English?

The CJE model (the trilingual Chinese-Japanese-English pretrained model).

CypherpunkSamurai commented 1 year ago

Just a note: the file https://drive.google.com/file/d/132l97zjanpoPY4daLgqXoM7HKXPRbS84/view?usp=sharing is not found. Can you please reupload it to GitHub Releases / Hugging Face / MediaFire, etc.?

CypherpunkSamurai commented 1 year ago

What is sampled_audio4ft_v2.zip?

AnyaCoder commented 1 year ago

> What is sampled_audio4ft_v2.zip?

Auxiliary training data, for when you don't have enough qualified audio files of your own.

CypherpunkSamurai commented 1 year ago

@AnyaCoder how can I use manual transcripts like LJSpeech's instead of running Whisper?

CypherpunkSamurai commented 1 year ago

I checked final_annotation_train but it's all unknown ASCII.

AnyaCoder commented 1 year ago

> I checked final_annotation_train but it's all unknown ASCII.

The latest transcriptions are the files named short_character_anno.txt and long_character_anno.txt, which are written in UTF-8. You can run Whisper first to get those two files, then manually replace the transcriptions with your own (e.g. from LJSpeech or VCTK). The file final_annotation_train.txt contains only phonetic symbols, which perplex humans. If you'd rather not run Whisper at all, you can write the two files above by hand, following the format wav_path|speaker_id|[lang]transcription[lang]. Then run preprocess_v2.py to get final_annotation_train.txt and final_annotation_val.txt.
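
To make the format concrete, here are a couple of hypothetical short_character_anno.txt lines (paths, speaker name, and text invented for illustration; the format follows the description above and the example later in this thread):

./custom_voices/alice/voice_1.wav|alice|[EN]Hello there, this is the first sample.[EN]
./custom_voices/alice/voice_2.wav|alice|[EN]And this is the second one![EN]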

CypherpunkSamurai commented 1 year ago

> If you'd rather not run Whisper at all, you can write the two files above by hand, following the format

Like...

import os

list_of_folders = os.listdir("custom_voices")

# Append one annotation line per metadata entry, using the folder index as the speaker id.
with open("short_character_anno.txt", "a", encoding="utf-8") as anno:
    for voice_n, voice in enumerate(list_of_folders):
        meta_path = os.path.join("custom_voices", voice, "metadata.csv")
        for metadata_line in open(meta_path, encoding="utf-8"):
            line = metadata_line.strip().split("|")
            transcript = line[-1]
            wav_file_p = os.path.join("custom_voices", voice, line[0])
            anno.write(f"{wav_file_p}|{voice_n}|{transcript}\n")

AnyaCoder commented 1 year ago

Examples: (screenshot of annotation lines showing the ZH -> EN format)

CypherpunkSamurai commented 1 year ago

Ah, ok. So a speaker name instead of an id, plus language tagging.

Like

./wav/ljspeech/wav1.wav|ljspeech_v2|[EN]Printing in the only...[EN]

CypherpunkSamurai commented 1 year ago

I'm using a complete replica of LJSpeech; I just want to use this repo for training.

The TTS from Mozilla is really weird.

AnyaCoder commented 1 year ago

This repo is great. I'm sure it will satisfy you.

CypherpunkSamurai commented 12 months ago

Just a question, can I use sentences like this:

There was nothing so _very_ remarkable in that; nor did Alice think it so _very_ much out of the way to hear the Rabbit say to itself, “Oh dear! Oh dear! I shall be late!” (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually _took a watch out of its waistcoat-pocket_, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.

The sentence runs on with commas, and punctuation like the quotes and semicolons shapes the delivery of the voice. It would be great to retain that punctuation.

Can we use long sentences? Can we keep the punctuation during training?

AnyaCoder commented 12 months ago
  1. Punctuation issues: I haven't tested all punctuation marks, but it is known that commas, full stops, apostrophes, and exclamation points produce pauses. It's best to replace everything else with one of these four if you can (see the sketch after this list).
  2. It's best not to use long sentences: they consume a lot of video memory and can run you out of it. 2 s to 10 s per clip is appropriate.
  3. Whether the punctuation you want survives depends on the corresponding sentence in the final_annotation_train.txt file.
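
A minimal sketch of both suggestions in Python, assuming the wav_path|speaker|[EN]text[EN] annotation format shown earlier in this thread; the regex, the 2-10 s bounds, and the output filename are illustrative choices, not values from this repo:

import re
import wave

def clip_seconds(path):
    # Duration of a PCM wav file in seconds.
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def normalize_punct(text):
    # Map punctuation outside the four "safe" marks
    # (comma, full stop, apostrophe, exclamation point) to commas.
    return re.sub(r'[;:"“”()\[\]?]', ',', text)

kept = []
for line in open("short_character_anno.txt", encoding="utf-8"):
    wav_path, speaker, transcript = line.rstrip("\n").split("|", 2)
    if 2.0 <= clip_seconds(wav_path) <= 10.0:  # keep 2-10 s clips only
        kept.append(f"{wav_path}|{speaker}|{normalize_punct(transcript)}")

with open("short_character_anno_filtered.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(kept) + "\n")
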
CypherpunkSamurai commented 12 months ago

Alright...

  1. Ok. Will try.
  2. I'm using a P100 / T4 (online GPU) for training. Is it ok to keep long sentences?
AnyaCoder commented 12 months ago

> Alright...
>
>   1. Ok. Will try.
>   2. I'm using a P100 / T4 (online GPU) for training. Is it ok to keep long sentences?

For Q2, give it a try. If the problem appears, shorten the length of the wav files~
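
One way to shorten overlong clips, sketched with pydub (a third-party package, pip install pydub, which also needs ffmpeg installed); the thresholds and filenames are guesses to tune, not values from this repo:

from pydub import AudioSegment
from pydub.silence import split_on_silence

# Split a long recording on silences so each chunk lands closer to the 2-10 s range.
audio = AudioSegment.from_wav("long_voice.wav")
chunks = split_on_silence(
    audio,
    min_silence_len=300,             # silence must last at least 300 ms
    silence_thresh=audio.dBFS - 16,  # threshold relative to average loudness
    keep_silence=150,                # keep 150 ms of silence at each edge
)
for i, chunk in enumerate(chunks):
    chunk.export(f"long_voice_part{i}.wav", format="wav")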

CypherpunkSamurai commented 11 months ago

I think I missed a point somewhere. Can you please let me know how to train on the original LJSpeech dataset using this code? I can then adapt it accordingly.

AnyaCoder commented 11 months ago

> I think I missed a point somewhere. Can you please let me know how to train on the original LJSpeech dataset using this code? I can then adapt it accordingly.

What problems occurred?

CypherpunkSamurai commented 11 months ago

> > I think I missed a point somewhere. Can you please let me know how to train on the original LJSpeech dataset using this code? I can then adapt it accordingly.
>
> What problems occurred?

I don't want to run Whisper; I want to use the metadata from the zip only. There are about 1k wav files, and running Whisper would take a lot of time. I already have the metadata and want to use it instead.

AnyaCoder commented 11 months ago

That's ok, here is the Python code.

with open('metadata.csv', 'r', encoding='utf-8') as source_file:
    # Open (or create) short_character_anno.txt for writing
    with open('short_character_anno.txt', 'w', encoding='utf-8') as target_file:
        # Read source_file line by line
        for line in source_file:
            # Strip the trailing newline and split the line on "|"
            parts = line.strip().split('|')
            if len(parts) != 2:
                # Skip lines that don't split into exactly two parts
                continue
            voice_name, transcript = parts
            # Convert to the expected annotation format
            new_line = f"{voice_name}.wav|ljspeech|[EN]{transcript}[EN]\n"
            # Write the line to short_character_anno.txt
            target_file.write(new_line)

Make sure that metadata.csv is in the same folder as your *.wav files.
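
For example, a metadata.csv line like the one quoted earlier in this thread,

voice_1|Printing in the only...

would become

voice_1.wav|ljspeech|[EN]Printing in the only...[EN]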

vic-yes commented 10 months ago

> I think I missed a point somewhere. Can you please let me know how to train on the original LJSpeech dataset using this code? I can then adapt it accordingly.

@CypherpunkSamurai , have you tried to train the original LJSpeech?

@AnyaCoder , I would also like to train LJSpeech using this code. Here is the LJSpeech model I'd like to start from: https://huggingface.co/espnet/kan-bayashi_ljspeech_vits/tree/main/exp/tts_train_vits_raw_phn_tacotron_g2p_en_no_space, but its config is a yaml file. Do you know how to convert that file to the "finetune_speaker.json" format? (screenshot of the yaml config)

I also found that there are two models, "G" and "D". What is the difference between them? Do I have to split the LJSpeech model into G and D models?

CypherpunkSamurai commented 9 months ago

@vic-yes no, I haven't. I tried it on my custom voice; it turned out robotic, and the tone is weird.

AnyaCoder commented 9 months ago

@CypherpunkSamurai @vic-yes come over to my repository Bert-VITS2; it is very good in 3 languages: ZH, JP, EN.

vic-yes commented 9 months ago

@AnyaCoder , thanks, I just noticed this project too, and I am currently studying it.

vic-yes commented 9 months ago

@AnyaCoder , do you have a pre-trained English model used in Bert-VITS2 for finetuning?

AnyaCoder commented 9 months ago

yes, you can get them from here: pretrained_models

CypherpunkSamurai commented 9 months ago

@AnyaCoder any colab scripts?

vic-yes commented 9 months ago

> yes, you can get them from here: pretrained_models

@AnyaCoder , it seems to be a CN pre-trained model, not EN.

AnyaCoder commented 9 months ago

> > yes, you can get them from here: pretrained_models
>
> @AnyaCoder , it seems to be a CN pre-trained model, not EN.

Oh, sorry, here are the models for EN: https://huggingface.co/SpicyqSama007/Bert-VITS2-v2.0.1-ZH-JP-EN/tree/main

vic-yes commented 9 months ago

> yes, you can get them from here: pretrained_models
>
> @AnyaCoder , it seems to be a CN pre-trained model, not EN. Oh, sorry, here are the models for EN: https://huggingface.co/SpicyqSama007/Bert-VITS2-v2.0.1-ZH-JP-EN/tree/main

@AnyaCoder grateful for your help! Could you also provide the config.json file? (screenshot of the model folder)

AnyaCoder commented 9 months ago

> @AnyaCoder any colab scripts?

My repo's Bert-VITS2 script; you can modify the path to your data: base.ipynb

Or these links (in Chinese, labels translated below):

[Tencent Docs] 2.0 lightweight deployment guide: https://docs.qq.com/doc/DS3V5d3dnZXlxVUlL
[Bilibili] usage instructions: https://docs.qq.com/doc/DS3V3QnFLQVJYZHdn
[Tencent Docs] 2.0 maintenance: https://docs.qq.com/doc/DS3V3QnFLQVJYZHdn
[Tencent Docs] Bert-VITS2-v2.0 error troubleshooting: https://docs.qq.com/doc/DS0FqQ0VGZ2dnVWlP

If that's too much trouble, you can use the all-in-one package (Integration Package): "Bert-VITS2-v2.0.1", link: https://pan.quark.cn/s/f18e81404a63 , extraction code: SepH

AnyaCoder commented 9 months ago

> yes, you can get them from here: pretrained_models
>
> @AnyaCoder , it seems to be a CN pre-trained model, not EN. Oh, sorry, here are the models for EN: https://huggingface.co/SpicyqSama007/Bert-VITS2-v2.0.1-ZH-JP-EN/tree/main
>
> @AnyaCoder grateful for your help! Could you also provide the config.json file?

config.json can be found in the Bert-VITS2 repo, in the configs folder.