OlaWod / FreeVC

FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion
MIT License

Questions #4

Closed francqz31 closed 1 year ago

francqz31 commented 2 years ago

Hello, in the paper you said: "Our models are trained up to 900k steps on a single NVIDIA 3090 GPU. The batch size is set to 64 with a maximum segment length of 128 frames."

1. How many hours/days did it take you to train these 900k steps, if you can be specific?

2. Is this method data-hungry? In the paper you said "Only VCTK corpus is used for training", and VCTK has almost 44 hours of speech from 107 speakers. Can this algorithm be used with, for example, 2 speakers with 4-5 hours of speech, or 5 speakers with 5-6 hours of speech (seen-to-seen of course, or unseen-to-seen), and give the same quality and similarity as in the paper?

3. Is feeding the algorithm 48kHz, or at least 22.05kHz, audio instead of 16kHz going to make that huge of a difference?

4. You said "And a HiFi-GAN v1 vocoder is used to transform the modified mel-spectrogram into waveform." Is using a better vocoder than HiFi-GAN v1 (like the newly released ones in 2022, GAN-based or DDPM-based) going to make that huge of a difference?

OlaWod commented 2 years ago
  1. About 10 days.
  2. We have not attempted to train in this low-resource setting. I'll give it a try in a few weeks.
  3. Audio with a higher sampling rate sounds better than audio with a lower sampling rate. The WavLM module operates at 16kHz, so the model structure needs to be redesigned to synthesize audio at a different sampling rate. For example, this paper uses a length resampling decoder to tackle this problem. Also, there are many works on speech super-resolution, and it is possible to jointly train a 16kHz VC model and a 16kHz-to-xxkHz speech super-resolution model. (A small sketch of the 16kHz input constraint follows this list.)
  4. At the very beginning of our experiments we used a HiFi-GAN trained by ourselves, and trained the VC model to 800k steps. Later we switched to the official HiFi-GAN, as it is available to everyone. But after training the new VC model to 800k steps, we found that the objective results (WER, CER, SSIM, etc.) were slightly worse than the old model's. I was unhappy with this, so I continued training to 900k steps so that its performance could match our old model. So I think a better vocoder can make a difference, but it won't be huge.
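A minimal sketch of the 16kHz point above, assuming the Hugging Face mirror of WavLM-Large (FreeVC itself loads the official WavLM-Large checkpoint; the input file name is a placeholder): higher-rate audio is simply resampled to 16kHz before content features are extracted.

import librosa
import torch
from transformers import WavLMModel

# Load at the native rate, then resample to the 16kHz rate WavLM expects.
wav, sr = librosa.load("source_48k.wav", sr=None)  # placeholder path
wav16 = librosa.resample(wav, orig_sr=sr, target_sr=16000)

# WavLM-Large via Hugging Face transformers (illustration only).
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()
with torch.no_grad():
    feats = wavlm(torch.from_numpy(wav16).unsqueeze(0)).last_hidden_state  # (1, frames, 1024)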
OlaWod commented 1 year ago
Below are the testing results for the low-resource setting asked about in question 2:

Model 1: FreeVC trained up to 540k steps with data from only 6 VCTK speakers (2079 utterances, 69.753 minutes in total)
Model 2: FreeVC trained up to 540k steps with the same dataset split as in the paper

results of 1200 VCTK-to-seen conversions:

|         | WER% (↓) | CER% (↓) | F0-PCC (↑) | O-Nat. (↑) | O-Sim. (↑) |
|---------|----------|----------|------------|------------|------------|
| Model 1 | 7.17     | 2.85     | 76.69      | 4.30       | 78.70      |
| Model 2 | 7.71     | 2.97     | 81.79      | 4.47       | 80.10      |

results of 1200 LibriTTS-to-seen conversions:

|         | WER% (↓) | CER% (↓) | F0-PCC (↑) | O-Nat. (↑) | O-Sim. (↑) |
|---------|----------|----------|------------|------------|------------|
| Model 1 | 3.53     | 1.20     | 66.69      | 4.48       | 81.67      |
| Model 2 | 3.22     | 1.05     | 71.64      | 4.59       | 82.59      |
francqz31 commented 1 year ago

@OlaWod OMG, thank you so much, Mr. Jingyi, for updating me with your results under the low-resource setting. Could you upload some .wav results so I can hear the quality and naturalness of the low-resource outputs?

I also upsampled some of the 16kHz results on your demo page to 48kHz (using the 3x model that upsamples from 16kHz to 48kHz):

1. https://drive.google.com/file/d/1LVoVoknVy-Y0iz6psqIlTFqf33w8vFGx/view?usp=share_link
2. https://drive.google.com/file/d/1D3vYuBnOGLyCbhp5l_V7dYW7md50LjY4/view?usp=share_link
3. https://drive.google.com/file/d/1ItMHQajGxhGiUOXkBMId73QXCoZLkJkf/view?usp=share_link

5. Finally, do you think that this vocoder would make a huge difference? https://arxiv.org/abs/2206.13404 They claim it is artifact-free, although some people who trained it said it is not that impressive. I haven't tried it myself yet, so I can't judge, but I think they might be doing something wrong!

OlaWod commented 1 year ago

I've uploaded some results here.

  1. Sorry I don't understand what you are trying to say.
  2. I think the difference won't be huge.
francqz31 commented 1 year ago

No problem. Anyway, the results of model2-540k are amazing; there is a big difference in the quality and naturalness of the s2s and u2s conversions between model2-540k and model1-540k. The model2-540k samples are outstanding; I would argue that they are better than the original demo, or almost the same! Since model2 is trained with the same dataset split as in the paper, I think training up to 900k steps would improve things a lot, given that 540k steps already gives such good results.

Ashraf-Ali-aa commented 1 year ago

@OlaWod could you share the code you used for the results? I want to reproduce it with a different audio set. Thanks!

OlaWod commented 1 year ago

> @OlaWod could you share the code you used for the results? I want to reproduce it with a different audio set. Thanks!

WER, CER: here
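(The linked script is the authoritative reference. Purely as a rough sketch, and assuming an off-the-shelf ASR model plus the jiwer package, neither of which is necessarily what produced the numbers above:)

import jiwer
import whisper  # openai-whisper; any reasonable ASR model would do here

asr = whisper.load_model("base")

def transcribe(path):
    # ASR transcript of a converted utterance, lightly normalized.
    return asr.transcribe(path)["text"].strip().lower()

def wer_cer(ref_text, converted_wav):
    # Compare the transcript of the converted audio against the
    # ground-truth transcript of the source utterance.
    hyp = transcribe(converted_wav)
    return jiwer.wer(ref_text, hyp), jiwer.cer(ref_text, hyp)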

F0-PCC:

from tqdm import tqdm
import numpy as np
import pyworld as pw
import argparse
import librosa

def get_f0(x, fs=16000, n_shift=160):
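    # Primary F0 estimator: pyworld DIO + StoneMask refinement (n_shift=160 samples = 10 ms hop at 16 kHz).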
    x = x.astype(np.float64)
    frame_period = n_shift / fs * 1000
    f0, timeaxis = pw.dio(x, fs, frame_period=frame_period)
    f0 = pw.stonemask(x, f0, timeaxis, fs)
    return f0

def compute_f0(wav, sr=16000, frame_period=10.0):
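    # Fallback F0 estimator: pyworld Harvest, used when DIO returns an all-zero contour.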
    wav = wav.astype(np.float64)
    f0, timeaxis = pw.harvest(
        wav, sr, frame_period=frame_period, f0_floor=20.0, f0_ceil=600.0)
    return f0

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--txtpath", type=str, default="samples.txt", help="path to txt file")
    parser.add_argument("--title", type=str, default="1", help="output title")
    args = parser.parse_args()

    pccs = []
    with open(args.txtpath, "r") as f:
        for rawline in tqdm(f.readlines()):
            src, tgt = rawline.strip().split("|")
            src = librosa.load(src, sr=16000)[0]
            src_f0 = get_f0(src)
            tgt = librosa.load(tgt, sr=16000)[0]
            tgt_f0 = get_f0(tgt)
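            # DIO occasionally yields an all-zero F0 track; recompute both contours with Harvest in that case.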
            if sum(src_f0) == 0:
                src_f0 = compute_f0(src)
                tgt_f0 = compute_f0(tgt)
                print(rawline)
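            # Truncate both contours to the common length and take the Pearson correlation coefficient.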
            pcc = np.corrcoef(src_f0[:tgt_f0.shape[-1]], tgt_f0[:src_f0.shape[-1]])[0, 1]
            if not np.isnan(pcc.item()):
                pccs.append(pcc.item())

    with open(f"result/{args.title}.txt", "w") as f:
        for pcc in pccs:
            f.write(f"{pcc}\n")
        pcc = sum(pccs) / len(pccs)
        f.write(f"mean: {pcc}")
    print("mean: ", pcc)
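(samples.txt is inferred from the code above: one pair of wav paths per line, separated by "|". Note that the result/ directory must already exist, since the script writes result/<title>.txt without creating it. A possible invocation, with a placeholder filename: python compute_f0_pcc.py --txtpath samples.txt --title model1. The O-Sim script below expects the same samples.txt format.)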

O-Nat.: here

O-Sim.:

from resemblyzer import VoiceEncoder, preprocess_wav
from tqdm import tqdm
import numpy as np
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--txtpath", type=str, default="samples.txt", help="path to txt file")
    parser.add_argument("--title", type=str, default="1", help="output title")
    args = parser.parse_args()

    encoder = VoiceEncoder()    

    ssims = []
    with open(args.txtpath, "r") as f:
        for rawline in tqdm(f.readlines()):
            src, tgt = rawline.strip().split("|")
            src = preprocess_wav(src)
            src = encoder.embed_utterance(src)
            tgt = preprocess_wav(tgt)
            tgt = encoder.embed_utterance(tgt)
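            # Resemblyzer utterance embeddings are L2-normalized, so the inner product below equals the cosine similarity.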
            ssim = np.inner(src, tgt)
            ssims.append(ssim.item())

    with open(f"result/{args.title}.txt", "w") as f:
        for ssim in ssims:
            f.write(f"{ssim}\n")
        ssim = sum(ssims) / len(ssims)
        f.write(f"mean: {ssim}")
    print("mean: ", ssim)