acul3 opened 3 months ago
- There are many differences, such as the method of adding speaker information in diffusion, the approach to normalization, and which latent feature to use, among others. All these changes were made to create a more stable and hi-fi model.
- No SSL features like Whisper or ContentVec were used; that code was merely copied over from other projects.
- Yes, the training sequence is "flowvae" -> "vqvae" -> "gpt" -> "diff". The reason for adding step-by-step training is that it allows for better gradient accumulation, which is crucial for training VQVAE and GPT.
- Yes, to train from scratch, you just need to remove the load code. Please make sure to pre-process the data into a text/audio-path pair format in advance (see the sketch after this list). The dataset code is written very simply, so it should be easy to modify.
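Below is a minimal sketch of the text/audio-path pre-processing mentioned in the last point, assuming a simple `audio_path|transcript` line format with transcripts stored next to the wav files. The directory layout, file names, and separator are illustrative assumptions, not values taken from the repo.

```python
# Hedged sketch: build a text/audio-path pair list for training.
# Assumes one .txt transcript sits next to each .wav file; adapt to the
# format actually expected by the dataset code in this repo.
from pathlib import Path

data_root = Path("data/malay_english")      # hypothetical dataset root
pairs = []
for wav in sorted(data_root.rglob("*.wav")):
    txt = wav.with_suffix(".txt")           # transcript stored next to the audio
    if txt.exists():
        pairs.append(f"{wav}|{txt.read_text(encoding='utf-8').strip()}")

Path("train_pairs.txt").write_text("\n".join(pairs), encoding="utf-8")
```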
Thank you for your quick answer @adelacvg.
One last question, if you don't mind: for point number 3, is there a specific config (target layers, dimensions, etc.) for flowvae in particular? I see there are specific configs for gpt and diff.
Thanks once again.
I am planning to reproduce your code, but with multilingual data (English and Malay), so I need to train a BPE tokenizer first.
For the vqvae and flowvae specific config, you can check the vaegan part of config_24k.json. For multilingual, you can use voice_tokenizer.py to train your custom BPE tokenizer.
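For the custom BPE step, here is a minimal sketch using the Hugging Face `tokenizers` library, assuming voice_tokenizer.py follows the usual Tortoise-style setup; the special tokens, file names, and 512-entry vocabulary (chosen for the English + Malay case described in this thread) are assumptions, not values taken from the repo.

```python
# Hedged sketch: train a custom BPE text tokenizer on pooled English + Malay transcripts.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=512,                                 # multilingual vocab size (assumption)
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],  # Tortoise-style specials (assumption)
)

# transcripts.txt: one normalized transcript per line, both languages pooled
tokenizer.train(files=["transcripts.txt"], trainer=trainer)
tokenizer.save("bpe_en_ms.json")
```

Whatever vocabulary size you pick here is what the GPT config's text-token fields have to account for later.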
Just finished 50% of the flowvae steps (13M samples, 300k of 600k steps).
For the next training step (vqvae), I need to load the flowvae model .pt and then continue with the next train_target, right? @adelacvg
Here is a sample from flowvae: https://github.com/user-attachments/assets/a0b5151e-e13a-4f5f-86bc-e38edb4ead2a
Yes, just use the results from the previous step for the next step of the training.
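As a rough illustration of carrying the previous stage's weights into the next one, here is a hedged sketch; the checkpoint filename and the nested "model" key are assumptions, and the authoritative reference is the load code in train.py mentioned earlier in this thread.

```python
# Hedged sketch: reuse the flowvae checkpoint when starting the vqvae stage.
import torch
from torch import nn

def load_previous_stage(model: nn.Module, ckpt_path: str) -> None:
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("model", ckpt)       # unwrap if the weights are nested (assumption)
    missing, unexpected = model.load_state_dict(state, strict=False)
    print(f"loaded {ckpt_path}: {len(missing)} missing, {len(unexpected)} unexpected keys")
```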
Hmm, it seems my vqvae training loss is stuck; after 2 days it stays the same, and the samples are also not intelligible compared to the ground truth.
It's normal; VQ-VAE only needs to capture the semantics approximately.
OK, I am at the GPT stage now. After training for 2 days, the result now sounds close to the ground truth, but it's still not there.
ground truth: https://github.com/user-attachments/assets/5c27cf96-7921-4ca1-af1f-dc8d2050bfe2
sample:
https://github.com/user-attachments/assets/6ecc76d3-1727-40cf-80c4-8fe353bf3ce6
@adelacvg
By the way, I changed my GPT vocab size to 512 due to multilinguality.
I just changed the config:
"gpt":{
"model_dim":768,
"max_mel_tokens":1600,
"max_text_tokens":800,
"heads":16,
"mel_length_compression":1024,
"use_mel_codes_as_input":true,
"layers":10,
"number_text_tokens":513,
"number_mel_codes":8194,
"start_mel_token":8192,
"stop_mel_token":8193,
"start_text_token":512,
"train_solo_embeddings":false,
"spec_channels":128
I changed number_text_tokens and start_text_token accordingly. Is that correct? Thank you again.
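For reference, here is the token-index bookkeeping implied by those two values, assuming the usual Tortoise-style layout where BPE ids occupy 0..vocab_size-1 and the start-of-text token sits one id past them:

```python
# Hedged sketch of the text-token index layout implied by the config above.
bpe_vocab_size = 512                       # size of the custom BPE tokenizer
start_text_token = bpe_vocab_size          # 512, one id past the BPE range
number_text_tokens = bpe_vocab_size + 1    # 513, BPE ids plus the start token

assert start_text_token == 512
assert number_text_tokens == 513
```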
In the GPT step, the inference results are close to those of the VQ-VAE. You just need to ensure that the semantics are correct; after diffusion, they will become high quality.
Ensure that the reference mel is a short segment of audio to avoid GPT overfitting on the speaker's conditions. I have updated some parameters of the VQ-VAE, resulting in higher codebook utilization, which should lead to better results.
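As an illustration of the "short reference segment" advice, here is a hedged sketch of cropping the conditioning mel; the 300-frame window is an arbitrary example, not a value from the repo.

```python
# Hedged sketch: take a short, randomly positioned window from the reference mel
# so GPT conditions on a snippet rather than the whole utterance.
import torch

def crop_reference(mel: torch.Tensor, max_frames: int = 300) -> torch.Tensor:
    """mel: (n_mels, T). Return a random window of at most max_frames frames."""
    total = mel.shape[-1]
    if total <= max_frames:
        return mel
    start = torch.randint(0, total - max_frames + 1, (1,)).item()
    return mel[..., start:start + max_frames]
```

A short, randomly positioned window gives GPT less opportunity to latch onto utterance-level speaker conditions than the full clip would.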
@adelacvg By the way, how can I run inference for the diffusion part? It seems api.py only provides vqvae and gpt (old commit).
I'm finishing the GPT training and continuing with diff now.
The infer_diffusion function is the same as the infer function; the do_spectrogram_diffusion part does the sampling process.
@adelacvg have you gotten good results? After training diff for 2 days I get the same result as GPT (robotic sound, but the semantics are there).
After using the last commit, I finally got good results. Thank you!
Any tips on how to make inference faster @adelacvg? (maybe Tortoise-style)
For the GPT part, you can use acceleration frameworks like vLLM, which also support GPT-2. For the diffusion part, you can adopt faster sampling methods with fewer sampling steps. Alternatively, like XTTS, you can use a GAN instead of diffusion; although performance may decrease, it can achieve very fast results for the timbres in the training dataset.
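As a sketch of the "fewer sampling steps" idea, here is a DDIM-style spaced timestep schedule; the step counts are illustrative, and the repo's diffusion utilities may expose this differently.

```python
# Hedged sketch: pick a small number of evenly spaced timesteps from the full
# training schedule, so sampling runs far fewer denoising steps.
import numpy as np

def spaced_timesteps(train_steps: int = 1000, sample_steps: int = 50) -> np.ndarray:
    """Return sample_steps evenly spaced timesteps, largest first."""
    return np.linspace(0, train_steps - 1, sample_steps, dtype=np.int64)[::-1]

schedule = spaced_timesteps()   # iterate over this instead of all 1000 steps
```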
Hey @adelacvg, thanks for sharing the code.
After reading the code I want to ask you a few questions about the new 24k model, if you don't mind:
1. What makes this model different from the previous one (https://huggingface.co/adelacvg/Detail/tree/main) besides the sample rate?
2. Did you not use a speech encoder in the 24k model? (I see there are speech encoders in utils, like HuBERT, Whisper, etc., but I think they are from the previous model.) Do you also still use ContentVec768L12.py?
3. I see train_target in https://github.com/adelacvg/detail_tts/blob/master/vqvae/configs/config_24k.json, so I assume there are multiple steps of training. If I want to train from scratch, do I need to change it, say "gpt" first, then flowvae, and diff (is this correct)?
4. If I want to train from scratch, I just remove https://github.com/adelacvg/detail_tts/blob/7e2466855f401637fe94f39c185121990f679f31/train.py#L461, right?
Sorry if this is a lot of questions; thanks in advance.