gemelo-ai / vocos

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
https://gemelo-ai.github.io/vocos/
MIT License

One click installer + usage question #2

Open rsxdalv opened 1 year ago

rsxdalv commented 1 year ago

Hi, thank you for the great project you have made available!

I added it to my one-click installer package of AI-based audio generators (link).

Here's the notebook I quickly created: https://github.com/rsxdalv/tts-generation-webui/blob/main/notebooks/vocos.ipynb

I wonder whether using this in a pipeline with SunoAI/Bark has a different impact than with something else. I couldn't manage to link up the raw EnCodec codes, so I used the final WAV files. I saw the best result with the 12 kbps bandwidth, although if I remember correctly the Bark model runs at 6 kbps. In my small sample I didn't see a consistent improvement, although I found one example where it gives more "quality" to a sound sample (I included it next to the notebook).

I would love to see how it would go if I could link it up with the EnCodec tokens from Bark, and how best to go about using it.
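The bandwidth question above comes down to simple arithmetic, sketched here under stated assumptions: the 24 kHz EnCodec model emits 75 frames per second with 1024-entry codebooks, and the pretrained Vocos EnCodec model accepts 1.5, 3, 6 and 12 kbps. The helper names are hypothetical and only illustrate the calculation.

```python
import math

# Assumed EnCodec 24 kHz parameters (not stated in this thread).
FRAME_RATE_HZ = 75        # token frames per second
CODEBOOK_SIZE = 1024      # entries per codebook, i.e. 10 bits per token
# Assumed bandwidths supported by the pretrained Vocos EnCodec model.
VOCOS_BANDWIDTHS_KBPS = [1.5, 3.0, 6.0, 12.0]


def kbps_per_codebook():
    """Bitrate contributed by a single codebook, in kbps."""
    bits_per_token = math.log2(CODEBOOK_SIZE)
    return FRAME_RATE_HZ * bits_per_token / 1000


def codebooks_for(bandwidth_kbps):
    """Number of codebooks needed to reach the given bandwidth."""
    return round(bandwidth_kbps / kbps_per_codebook())


def bandwidth_id_for(bandwidth_kbps):
    """Index of the bandwidth in the assumed list of supported values."""
    return VOCOS_BANDWIDTHS_KBPS.index(bandwidth_kbps)


if __name__ == "__main__":
    for bw in VOCOS_BANDWIDTHS_KBPS:
        print(f"{bw} kbps: {codebooks_for(bw)} codebooks, bandwidth_id={bandwidth_id_for(bw)}")
```

Under these assumptions, an 8-codebook token stream corresponds to 6 kbps, while 12 kbps would need 16 codebooks. If Bark does emit 8 codebooks, that matches the recollection that it runs at 6 kbps.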

hubertsiuzdak commented 1 year ago

Hey, I've updated the Vocos API (#4) to make it easier to integrate with Bark. Take a look at the example notebook.

Hope it helps!
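For readers without the notebook at hand, here is a minimal sketch of what feeding Bark's tokens into Vocos might look like. It assumes the `Vocos.from_pretrained`, `codes_to_features` and `decode(..., bandwidth_id=...)` calls and the `charactr/vocos-encodec-24khz` checkpoint name as I recall them from the project README; check them against the current API before relying on this. `decode_bark_tokens` is a hypothetical helper name, and `fine_tokens` is assumed to be an integer array of shape `[n_codebooks, n_frames]`.

```python
def decode_bark_tokens(fine_tokens, bandwidth_id=2, device="cpu"):
    """Decode EnCodec-style tokens to a waveform with Vocos (sketch).

    bandwidth_id is assumed to index [1.5, 3.0, 6.0, 12.0] kbps,
    so the default of 2 means 6 kbps.
    """
    if bandwidth_id not in range(4):
        raise ValueError(f"bandwidth_id must be 0..3, got {bandwidth_id}")

    # Imported lazily so the argument check above runs without the heavy dependencies.
    import torch
    from vocos import Vocos

    # Assumed checkpoint name for the pretrained EnCodec-based model.
    vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz").to(device)

    codes = torch.as_tensor(fine_tokens, dtype=torch.long, device=device)
    features = vocos.codes_to_features(codes)
    bw = torch.tensor([bandwidth_id], device=device)
    return vocos.decode(features, bandwidth_id=bw)
```

The design point is that the raw codes go straight into Vocos, so there is no round trip through a decoded WAV file and a second encoding pass.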

rsxdalv commented 1 year ago

Thank you! For now this is the initial UI, but it will grow from here. https://github.com/rsxdalv/tts-generation-webui/pull/35

gitihobo commented 1 year ago

Hey rsxdalv, could you make a training section that lets us train our own Vocos model at a higher sample rate?

rsxdalv commented 1 year ago

It's possible. Do you have a sample of the command/dataset/config?

gitihobo commented 1 year ago

For the dataset, I am imagining multiple 10-second audio files. The config I was making for 48 kHz is:

# pytorch_lightning==1.8.6
seed_everything: 4444

data:
  class_path: vocos.dataset.VocosDataModule
  init_args:
    train_params:
      filelist_path: E:\anaconda3\envs\vocos\TrainFiles\filelist.train
      sampling_rate: 48000
      num_samples: 16384
      batch_size: 16
      num_workers: 8

    val_params:
      filelist_path: E:\anaconda3\envs\vocos\TrainFiles\filelist.val
      sampling_rate: 48000
      num_samples: 48384
      batch_size: 16
      num_workers: 8

model:
  class_path: vocos.experiment.VocosExp
  init_args:
    sample_rate: 48000
    initial_learning_rate: 2e-4
    mel_loss_coeff: 45
    mrd_loss_coeff: 0.1
    num_warmup_steps: 0  # Optimizers warmup steps
    pretrain_mel_steps: 0  # 0 means GAN objective from the first iteration

    # automatic evaluation
    evaluate_utmos: true
    evaluate_pesq: true
    evaluate_periodicty: true

    feature_extractor:
      class_path: vocos.feature_extractors.MelSpectrogramFeatures
      init_args:
        sample_rate: 48000
        n_fft: 1024
        hop_length: 256
        n_mels: 100
        padding: center

    backbone:
      class_path: vocos.models.VocosBackbone
      init_args:
        input_channels: 100
        dim: 512
        intermediate_dim: 1536
        num_layers: 8

    head:
      class_path: vocos.heads.ISTFTHead
      init_args:
        dim: 512
        n_fft: 1024
        hop_length: 256
        padding: center

trainer:
  logger:
    class_path: pytorch_lightning.loggers.TensorBoardLogger
    init_args:
      save_dir: logs/
  callbacks:
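The config points at `filelist.train` and `filelist.val`. A minimal sketch of generating them from a folder of clips follows, assuming each filelist is a plain text file with one audio path per line (my understanding of how the Vocos dataset reads it, not something confirmed in this thread). The function name, the `.wav` extension and the 5% validation split are illustrative choices.

```python
import random
from pathlib import Path


def write_filelists(audio_dir, out_dir, val_fraction=0.05, seed=4444):
    """Split all .wav files under audio_dir into train/val filelists.

    Assumption: one absolute path per line is the expected filelist format.
    Returns (number of training files, number of validation files).
    """
    files = sorted(str(p.resolve()) for p in Path(audio_dir).rglob("*.wav"))
    if len(files) < 2:
        raise ValueError("need at least two audio files to make a train/val split")

    # Shuffle deterministically so the split is reproducible.
    random.Random(seed).shuffle(files)
    n_val = max(1, int(len(files) * val_fraction))
    val_files, train_files = files[:n_val], files[n_val:]

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "filelist.train").write_text("\n".join(train_files) + "\n", encoding="utf-8")
    (out / "filelist.val").write_text("\n".join(val_files) + "\n", encoding="utf-8")
    return len(train_files), len(val_files)
```

Calling `write_filelists("clips", "TrainFiles")` (both paths hypothetical) would produce the two files that `filelist_path` refers to. As far as I recall, upstream training is started with something like `python train.py -c <config>.yaml`, but verify that against the repository before wiring it into a UI.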