rsxdalv commented 1 year ago

Hi, thank you for the great project you have made available!

I added it to my one-click installer package of AI-based audio generators. Link

Here's the notebook I quickly created: https://github.com/rsxdalv/tts-generation-webui/blob/main/notebooks/vocos.ipynb

I wonder whether using this in a pipeline with SunoAI/Bark has a different impact than with something else. I couldn't manage to link up the raw EnCodec codes, so I used the final wav files. I saw the best result when using the 12 kbps bandwidth, although if I remember correctly the Bark model runs at 6 kbps. In my small sample I didn't see an unsupervised improvement, although I found an example where it gives more "quality" to a sound sample (I included it next to the notebook).

I would love to see how it would go if I could link it up with the EnCodec tokens from Bark, and how best to go about using it.
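For reference, the relationship between the bandwidths mentioned above and the number of EnCodec codebooks can be worked out with simple arithmetic. This is a minimal sketch, assuming the 24 kHz EnCodec model (75 frames per second, 1024-entry codebooks) and assuming that Vocos indexes its bandwidths in the order 1.5, 3, 6, 12 kbps; the helper names are made up for illustration.

```python
# EnCodec 24 kHz emits 75 frames per second, and each codebook has
# 1024 entries, i.e. 10 bits per frame per codebook (assumed here).
FRAME_RATE_HZ = 75
BITS_PER_CODE = 10

# Bandwidths (kbps) in the order assumed for the Vocos `bandwidth_id` index.
BANDWIDTHS_KBPS = [1.5, 3.0, 6.0, 12.0]


def codebooks_for(bandwidth_kbps):
    """Number of residual codebooks needed to reach a target bitrate."""
    bits_per_second = bandwidth_kbps * 1000
    return int(bits_per_second // (FRAME_RATE_HZ * BITS_PER_CODE))


def bandwidth_id_for(num_codebooks):
    """Index into BANDWIDTHS_KBPS that matches a given codebook count."""
    kbps = num_codebooks * FRAME_RATE_HZ * BITS_PER_CODE / 1000
    return BANDWIDTHS_KBPS.index(kbps)


for kbps in BANDWIDTHS_KBPS:
    print(f"{kbps:>4} kbps -> {codebooks_for(kbps)} codebooks")
```

Under these assumptions, 8 codebooks correspond to 6 kbps, and 12 kbps would require 16 codebooks, which is more than a model emitting 8 codebooks provides.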

hubertsiuzdak commented 1 year ago

Hey, I've updated the Vocos API (#4) to make it easier to integrate with Bark. Take a look at the example notebook.

Hope it helps!
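A rough sketch of what decoding EnCodec codes with Vocos might look like, assuming the `Vocos.from_pretrained`, `codes_to_features`, and `decode` calls from the Vocos README. The `decode_codes` and `demo` names are hypothetical, and the random codes only stand in for real Bark output.

```python
def decode_codes(vocos, codes, bandwidth_id):
    """Turn EnCodec codes into a waveform using a Vocos-style model."""
    # Look up and sum the codebook embeddings, then run the vocoder.
    features = vocos.codes_to_features(codes)
    return vocos.decode(features, bandwidth_id=bandwidth_id)


def demo():
    """Needs the `vocos` and `torch` packages and downloads pretrained weights."""
    import torch
    from vocos import Vocos

    vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
    # Stand-in for Bark tokens: 8 codebooks x 200 frames, ids in [0, 1024).
    codes = torch.randint(low=0, high=1024, size=(8, 200))
    # Index 2 is assumed to select the 6 kbps bandwidth.
    bandwidth_id = torch.tensor([2])
    return decode_codes(vocos, codes, bandwidth_id)
```

Calling `demo()` should return a waveform tensor; with real Bark output, the tensor of fine tokens would take the place of the random `codes`.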

rsxdalv commented 1 year ago

Thank you! For now this is the initial UI, but it will grow from here. https://github.com/rsxdalv/tts-generation-webui/pull/35

gitihobo commented 1 year ago

Hey rsxdalv, could you make a training section that lets us train our own Vocos model at a higher sample rate?

rsxdalv commented 1 year ago

It's possible. Do you have a sample of the command, dataset, and config?

gitihobo commented 1 year ago

For the dataset, I am imagining multiple 10-second audio files. The config I was making for 48 kHz is:

# pytorch_lightning==1.8.6
seed_everything: 4444

data:
  class_path: vocos.dataset.VocosDataModule
  init_args:
    train_params:
      filelist_path: E:\anaconda3\envs\vocos\TrainFiles\filelist.train
      sampling_rate: 48000
      num_samples: 16384
      batch_size: 16
      num_workers: 8

    val_params:
      filelist_path: E:\anaconda3\envs\vocos\TrainFiles\filelist.val
      sampling_rate: 48000
      num_samples: 48384
      batch_size: 16
      num_workers: 8

model:
  class_path: vocos.experiment.VocosExp
  init_args:
    sample_rate: 48000
    initial_learning_rate: 2e-4
    mel_loss_coeff: 45
    mrd_loss_coeff: 0.1
    num_warmup_steps: 0  # Optimizers warmup steps
    pretrain_mel_steps: 0  # 0 means GAN objective from the first iteration

    # automatic evaluation
    evaluate_utmos: true
    evaluate_pesq: true
    evaluate_periodicty: true

    feature_extractor:
      class_path: vocos.feature_extractors.MelSpectrogramFeatures
      init_args:
        sample_rate: 48000
        n_fft: 1024
        hop_length: 256
        n_mels: 100
        padding: center

    backbone:
      class_path: vocos.models.VocosBackbone
      init_args:
        input_channels: 100
        dim: 512
        intermediate_dim: 1536
        num_layers: 8

    head:
      class_path: vocos.heads.ISTFTHead
      init_args:
        dim: 512
        n_fft: 1024
        hop_length: 256
        padding: center

trainer:
  logger:
    class_path: pytorch_lightning.loggers.TensorBoardLogger
    init_args:
      save_dir: logs/
  callbacks:
    - class_path: pytorch_lightning.callbacks.LearningRateMonitor
    - class_path: pytorch_lightning.callbacks.ModelSummary
      init_args:
        max_depth: 2
    - class_path: pytorch_lightning.callbacks.ModelCheckpoint
      init_args:
        monitor: val_loss
        filename: vocos_checkpoint_{epoch}_{step}_{val_loss:.4f}
        save_top_k: 3
        save_last: true
    - class_path: vocos.helpers.GradNormCallback

  # Lightning calculates max_steps across all optimizer steps (rather than number of batches)
  # This equals to 1M steps per generator and 1M per discriminator
  max_steps: 2000000
  # You might want to limit val batches when evaluating all the metrics, as they are time-consuming
  limit_val_batches: 100
  accelerator: gpu
  strategy: ddp
  devices: [0]
  log_every_n_steps: 100
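As a quick sanity check on the numbers in this config, the sketch below converts the sample counts and STFT settings into durations at 48 kHz and splits `max_steps` between the two optimizers. It only does arithmetic on values copied from the config above; the variable names are illustrative, and the even generator/discriminator split follows the comment in the config.

```python
# Values copied from the 48 kHz config above.
SAMPLE_RATE = 48_000
N_FFT = 1024
HOP_LENGTH = 256
TRAIN_NUM_SAMPLES = 16_384
VAL_NUM_SAMPLES = 48_384
MAX_STEPS = 2_000_000


def seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Duration in seconds of a crop of `num_samples` samples."""
    return num_samples / sample_rate


train_seconds = seconds(TRAIN_NUM_SAMPLES)
val_seconds = seconds(VAL_NUM_SAMPLES)
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE
window_ms = 1000 * N_FFT / SAMPLE_RATE
hops_per_train_crop = TRAIN_NUM_SAMPLES // HOP_LENGTH
# max_steps counts optimizer steps, shared by generator and discriminator.
generator_steps = MAX_STEPS // 2

print(f"train crop: {train_seconds:.3f} s, val crop: {val_seconds:.3f} s")
print(f"hop: {hop_ms:.2f} ms, window: {window_ms:.2f} ms")
print(f"hops per train crop: {hops_per_train_crop}")
print(f"generator steps: {generator_steps}")
```

Because the sample counts are unchanged while the sample rate is 48 kHz, each crop, hop, and window covers half the time it would at 24 kHz. Whether `num_samples`, `n_fft`, and `hop_length` should be scaled up to compensate is a judgment call that this sketch does not settle.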

gemelo-ai / vocos

One click installer + usage question #2
