Open rsxdalv opened 1 year ago
Hey, I've updated the Vocos API (#4) to make it easier to integrate with Bark. Take a look at the example notebook.
Hope it helps!
Thank you! For now this is the initial UI, but it will grow from here. https://github.com/rsxdalv/tts-generation-webui/pull/35
Hey rsxdalv, could you add a training section that lets us train our own Vocos model at a higher sample rate?
It's possible, do you have a sample of the command/dataset/config?
For the dataset, I am imagining multiple 10-second audio files. The config I was making for 48 kHz is:
seed_everything: 4444

data:
  class_path: vocos.dataset.VocosDataModule
  init_args:
    train_params:
      filelist_path: E:\anaconda3\envs\vocos\TrainFiles\filelist.train
      sampling_rate: 48000
      num_samples: 16384
      batch_size: 16
      num_workers: 8

    val_params:
      filelist_path: E:\anaconda3\envs\vocos\TrainFiles\filelist.val
      sampling_rate: 48000
      num_samples: 48384
      batch_size: 16
      num_workers: 8

model:
  class_path: vocos.experiment.VocosExp
  init_args:
    sample_rate: 48000
    initial_learning_rate: 2e-4
    mel_loss_coeff: 45
    mrd_loss_coeff: 0.1
    num_warmup_steps: 0  # Optimizers warmup steps
    pretrain_mel_steps: 0  # 0 means GAN objective from the first iteration

    # automatic evaluation
    evaluate_utmos: true
    evaluate_pesq: true
    evaluate_periodicty: true

    feature_extractor:
      class_path: vocos.feature_extractors.MelSpectrogramFeatures
      init_args:
        sample_rate: 48000
        n_fft: 1024
        hop_length: 256
        n_mels: 100
        padding: center

    backbone:
      class_path: vocos.models.VocosBackbone
      init_args:
        input_channels: 100
        dim: 512
        intermediate_dim: 1536
        num_layers: 8

    head:
      class_path: vocos.heads.ISTFTHead
      init_args:
        dim: 512
        n_fft: 1024
        hop_length: 256
        padding: center

trainer:
  logger:
    class_path: pytorch_lightning.loggers.TensorBoardLogger
    init_args:
      save_dir: logs/
  callbacks:
    - class_path: vocos.helpers.GradNormCallback
  max_steps: 2000000
  limit_val_batches: 100
  accelerator: gpu
  strategy: ddp
  devices: [0]
  log_every_n_steps: 100
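The config above points at `filelist.train` and `filelist.val`. As far as I understand the Vocos training setup, each filelist is a plain text file with one audio path per line. Here is a rough sketch of how the two files could be generated from a folder of clips. The function name `write_filelists`, the 5% validation split, and the folder layout are my own assumptions for illustration, not something from the Vocos repo.

```python
import random
from pathlib import Path


def write_filelists(audio_dir, out_dir, val_fraction=0.05, seed=4444):
    """Split all .wav files under audio_dir into train/val filelists.

    Hypothetical helper: writes one absolute path per line to
    out_dir/filelist.train and out_dir/filelist.val.
    Returns (number of train files, number of val files).
    """
    wavs = sorted(str(p.resolve()) for p in Path(audio_dir).rglob("*.wav"))
    if len(wavs) < 2:
        raise ValueError("need at least two .wav files to make a split")

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(wavs)

    # Always keep at least one file for validation.
    n_val = max(1, int(len(wavs) * val_fraction))
    val, train = wavs[:n_val], wavs[n_val:]

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "filelist.train").write_text("\n".join(train) + "\n", encoding="utf-8")
    (out / "filelist.val").write_text("\n".join(val) + "\n", encoding="utf-8")
    return len(train), len(val)
```

Calling `write_filelists("path/to/clips", "path/to/TrainFiles")` would then produce the two files that `filelist_path` refers to. The clips themselves would still need to be at the sample rate given in the config.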
Hi, thank you for the great project you have made available!
I added it to my one-click install package of AI-based audio generators. Link
Here's the notebook I quickly created: https://github.com/rsxdalv/tts-generation-webui/blob/main/notebooks/vocos.ipynb
I wonder whether using this in a pipeline with SunoAI/Bark has a different impact than with something else. I couldn't manage to link up the raw EnCodec codes, so I used the final wav files. I saw the best result when using the 12 kbps bandwidth, although if I remember correctly the Bark model runs at 6 kbps. In my small sample I didn't see an unsupervised improvement, although I found an example where it gives more "quality" to a sound sample (I included it next to the notebook).
I would love to see how it would go if I could link it up with the EnCodec tokens from Bark, and how best to go about using it.
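For anyone trying the same link-up, here is a minimal sketch of how Bark's EnCodec tokens might be passed to Vocos directly instead of going through the final wav. It assumes `fine_tokens` is the integer array of shape (8, frames) that comes out of Bark's fine generation stage, and it uses the `Vocos.from_pretrained`, `codes_to_features`, and `decode` calls as shown in the Vocos README. The helper names `bandwidth_kbps`, `bandwidth_id`, and `decode_bark_tokens` are my own. The arithmetic also shows where the 6 kbps figure comes from: 8 codebooks at 75 frames per second with 10 bits per code.

```python
# EnCodec 24 kHz produces 75 frames per second, and each codebook has
# 1024 entries, so every code carries 10 bits.
FRAME_RATE_HZ = 75
BITS_PER_CODE = 10

# Bandwidths (kbps) of the pretrained Vocos EnCodec model, in index order;
# the index is what gets passed as bandwidth_id.
VOCOS_BANDWIDTHS_KBPS = [1.5, 3.0, 6.0, 12.0]


def bandwidth_kbps(n_codebooks):
    """Bitrate implied by the number of EnCodec codebooks."""
    return n_codebooks * FRAME_RATE_HZ * BITS_PER_CODE / 1000


def bandwidth_id(n_codebooks):
    """Index into VOCOS_BANDWIDTHS_KBPS matching the codebook count."""
    kbps = bandwidth_kbps(n_codebooks)
    if kbps not in VOCOS_BANDWIDTHS_KBPS:
        raise ValueError(f"unsupported bandwidth: {kbps} kbps")
    return VOCOS_BANDWIDTHS_KBPS.index(kbps)


def decode_bark_tokens(fine_tokens):
    """Decode Bark fine tokens of shape (n_codebooks, frames) with Vocos.

    Assumption: fine_tokens holds raw EnCodec codes in the range 0..1023.
    Imports are local so the helpers above work without torch installed.
    """
    import torch
    from vocos import Vocos

    vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
    codes = torch.as_tensor(fine_tokens, dtype=torch.long)
    features = vocos.codes_to_features(codes)
    bw = torch.tensor([bandwidth_id(codes.shape[0])])
    return vocos.decode(features, bandwidth_id=bw)
```

With Bark's 8 codebooks, `bandwidth_id(8)` selects the 6 kbps setting, so the tokens are decoded at the bitrate they were generated for. Running the wav files through Vocos at 12 kbps is a different path, since that re-encodes the audio with 16 codebooks first.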