adelacvg / NS2VC

Unofficial implementation of NaturalSpeech2 for Voice Conversion and Text to Speech

Does training work on the v4 branch? #33

Open lpscr opened 7 months ago

lpscr commented 7 months ago

Hi! Thank you very much for your work and this amazing repo.

I tried training the v4 branch and something seems very wrong: after about 3 hours of training the output doesn't change, I just get noise at every step. These are the steps I ran:

1. `python preprocess.py`
2. `python model1.py`

29000 steps, v4 branch (image)

In v3 or the main branch, after some steps I get this:

5000 steps, v3/main branch (image)

As you can see, in v4 I get only noise, so maybe I'm doing something wrong.

Can you please tell me whether training on the v4 branch works, or what I'm doing wrong?

Thank you for your time.

adelacvg commented 7 months ago

You haven't done anything wrong. Because the v4 model has over 200 million parameters, training is very slow. I am currently experimenting with features such as offset noise, normalization, and classifier-free guidance (CFG) to make training more stable. Your results look quite normal; theoretically, the convergence time of the v4 model is close to that of Stable Diffusion 1.5. The previous three versions used smaller noise and predicted x0, which made training faster, whereas v4 uses the classic approach of predicting the noise as the target.
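
For reference, the difference between the two setups is just the regression target of the diffusion loss. A minimal PyTorch sketch, assuming a generic DDPM-style schedule (the model and schedule here are placeholders, not the repo's actual code):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, alphas_cumprod, predict="eps"):
    """One training step of a DDPM-style diffusion model.

    predict="x0"  -> the network regresses the clean sample (v1-v3 style, faster to train).
    predict="eps" -> the network regresses the added noise (v4 / classic DDPM objective).
    """
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    # forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    pred = model(x_t, t)
    target = x0 if predict == "x0" else eps
    return F.mse_loss(pred, target)
```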

lpscr commented 7 months ago

This is so cool! I understand now. I'm going to retrain and see. Thank you very much for the explanation and the quick reply.

rishikksh20 commented 7 months ago

@lpscr were you able to get the model to converge?

rishikksh20 commented 6 months ago

@adelacvg I see you updated the model architecture on v4. Is the implementation complete? And does the new model converge faster? I have collected a lot of audio data and am now waiting for GPU availability to start training.

adelacvg commented 6 months ago

Yes, the previous training process was slow to converge due to issues with the UNet. Additionally, there were semantic problems caused by a bug in the diffusion training architecture borrowed from ControlNet. The current diffusion training framework is now based on Tortoise, eliminating the semantic faults. Furthermore, the architecture uses plain transformer blocks without up/down sampling, leading to much faster convergence.
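
For anyone curious, "transformer blocks without up/down sampling" can look roughly like the sketch below: the sequence stays at full frame resolution throughout, unlike a UNet. The dimensions and module names are illustrative assumptions, not the repo's actual architecture:

```python
import torch
import torch.nn as nn

class TransformerDenoiser(nn.Module):
    """Diffusion denoiser built from plain transformer blocks; no UNet-style down/upsampling."""

    def __init__(self, dim=512, depth=8, heads=8):
        super().__init__()
        self.time_emb = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, x_t, t):
        # x_t: (batch, frames, dim) noisy features; t: (batch,) diffusion step
        temb = self.time_emb(t.float().unsqueeze(-1)).unsqueeze(1)
        return self.out(self.blocks(x_t + temb))
```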

rishikksh20 commented 6 months ago

Thanks :) Are you using HuBERT only for the content vector? My use case is a non-English language, so I thought of using Whisper layer-24 features rather than HuBERT.

adelacvg commented 6 months ago

Regarding ContentVec, I chose it primarily to prevent timbre leakage. HuBERT and Whisper features have noticeable timbre leakage issues when trained with self-supervision. I have trained a model, and although there is some loss in audio quality in zero-shot scenarios, it performs better than the previous model at the same data scale.
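
For anyone comparing content encoders, here is a hedged sketch of pulling frame-level features from a HuBERT-style model with Hugging Face transformers; the checkpoint name and layer index are illustrative, and a ContentVec checkpoint would be loaded from its own release in a similar way:

```python
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

# Illustrative public checkpoint; swap in a ContentVec checkpoint to reduce timbre leakage.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

wav = torch.randn(16000)  # one second of dummy 16 kHz audio, stand-in for a real utterance

inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states

features = hidden[9]       # which intermediate layer works best is an empirical choice
print(features.shape)      # (1, frames, 768)
```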

rishikksh20 commented 5 months ago

Hi @adelacvg, is it possible to also transfer a bit of prosody and style with the NS2VC architecture, not just the voice? For plain voice conversion it works well; the voice doesn't match exactly, but it's still fine.

adelacvg commented 5 months ago

Certainly, but I believe that prosody and speed are better suited for GPT or an acoustic model. The diffusion part, working as a good decoder, should suffice.

rishikksh20 commented 5 months ago

Just need to ask one more question: do semantic tokens like HuBERT, wav2vec, and ContentVec carry prosody information?

adelacvg commented 5 months ago

Of course, prosody encompasses fundamental frequency, pause duration, intonation, and other essential information. Semantic tokens inherently carry duration information and intonation.

rishikksh20 commented 5 months ago

Yes, I have the same intuition because pronunciation is an integral part of linguistics.

rishikksh20 commented 4 months ago

Hi @adelacvg Have you checked YODAS (https://huggingface.co/datasets/espnet/yodas), a 370k-hour dataset? The data quality is uneven, since some samples contain music or are empty, but it is still useful data for VC pretraining. If you are not GPU-poor :cry: you could pretrain on YODAS :sweat_smile:.
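
If anyone wants to inspect YODAS before committing GPUs, here is a hedged sketch using the datasets library in streaming mode; the subset name and column names are assumptions, so check the dataset card for the actual configs:

```python
from datasets import load_dataset

# Stream so the 370k-hour corpus is not downloaded up front.
# "en000" is assumed to be one of the English shards listed on the dataset card.
ds = load_dataset("espnet/yodas", "en000", split="train", streaming=True)

for sample in ds.take(3):
    audio = sample["audio"]                     # assumption: standard HF audio column
    if len(audio["array"]) < audio["sampling_rate"]:
        continue                                # skip the near-empty clips mentioned above
    print({k: type(v) for k, v in sample.items()})
```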

adelacvg commented 4 months ago

@rishikksh20 Thank you very much for your suggestion. However, I'm currently short on GPU resources, and all GPUs are being used for experiments with the GPT-based AR TTS model. A pre-trained model may be trained once GPUs become available.

rishikksh20 commented 4 months ago

@adelacvg Everyone is GPU-poor; I am also waiting for my GPUs to free up. By the way, how is TTTS training progressing? Do you have any samples to share? I have tested HierSpeech++'s non-autoregressive text-to-vector module together with NS2VC, which acts as an end-to-end TTS, and it is performing well. The GPT-based text-to-vector model I tested before shows a lot of hallucination.

adelacvg commented 4 months ago

@rishikksh20 The model in the master branch of TTTS is based on Tortoise, and the results are comparable to Tortoise. I have provided a Colab link for testing the pre-trained model. For the v2 version, I would like to use a training method similar to VALL-E's, while still using diffusion as the decoder, in the hope of achieving better zero-shot results.

rishikksh20 commented 2 months ago

For v4 I am planning to train on EnCodec features for better speaker generalization, as commented here: https://github.com/adelacvg/NS2VC/issues/16#issuecomment-2084663655. Has anyone tried this before, or would anyone like to give me a heads-up or share thoughts?
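
In case it helps, here is a hedged sketch of extracting EnCodec representations with the transformers port of the public 24 kHz checkpoint; whether to feed the model the discrete codes or the continuous encoder latents is an open design choice, and nothing here is specific to NS2VC:

```python
import torch
from transformers import EncodecModel, AutoProcessor

model = EncodecModel.from_pretrained("facebook/encodec_24khz").eval()
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

wav = torch.randn(24000)  # one second of dummy 24 kHz audio for illustration
inputs = processor(raw_audio=wav.numpy(), sampling_rate=24000, return_tensors="pt")

with torch.no_grad():
    # Discrete codebook indices, the usual EnCodec representation...
    codes = model.encode(inputs["input_values"], inputs["padding_mask"]).audio_codes
    # ...or continuous pre-quantization latents straight from the encoder.
    latents = model.encoder(inputs["input_values"])

print(codes.shape, latents.shape)
```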