Miipher2.0 is a custom implementation of the feature cleaner from Miipher, a model developed by Google to restore degraded speech (it was used in the creation of LibriTTS-R). We do not implement the paper's vocoder; instead we use an in-house whisper-to-wav vocoder.
The best checkpoint and its associated files are: model_training-test4.py, dataloading_test1.py, and the corresponding TensorBoard outputs in tb_test4. The checkpoint itself is available on Google Drive here: Best checkpoints. CheckpointTest4.zip has the lower test loss, while CheckpointTest4-iter3.zip has the lower train loss.
Here is more information about Miipher and the thought process behind it: Presentation
Here are some general notes on SOTA techniques used in this space.
There are three components of Miipher that we needed to replace:
We chose PL-BERT in place of PNG-BERT, ECAPA-TDNN in place of their custom speaker encoder, and Whisper embeddings in place of w2v-BERT.
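As a rough sketch of how two of these replacements plug in (the helper below is illustrative, not the exact code in our training files; PL-BERT has no standard pip package, so loading it is repo-specific and left as a placeholder):

```python
import torch
import whisper  # openai-whisper
from speechbrain.pretrained import EncoderClassifier

# ECAPA-TDNN speaker embeddings via SpeechBrain's pretrained VoxCeleb model.
speaker_encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb"
)

# Whisper model; we only use the encoder's hidden states, never the decoder.
whisper_model = whisper.load_model("base")

def extract_conditioning(wav_16k: torch.Tensor):
    """wav_16k: (1, num_samples) mono waveform at 16 kHz."""
    # ECAPA-TDNN speaker embedding, shape (1, 1, 192) -> (1, 192).
    spk = speaker_encoder.encode_batch(wav_16k).squeeze(1)

    # Whisper encoder features over a 30 s padded window, shape (1, 1500, d_model).
    mel = whisper.log_mel_spectrogram(whisper.pad_or_trim(wav_16k.squeeze(0)))
    with torch.no_grad():
        feats = whisper_model.encoder(mel.unsqueeze(0).to(whisper_model.device))

    # PL-BERT embeddings come from the phonemized transcript; loading that
    # model is repo-specific, so it is omitted here.
    return spk, feats
```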
We do not implement the fixed-point iteration from the paper.
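For context, that fixed-point scheme just re-feeds the cleaned features back into the cleaner until they stop changing. A minimal sketch of what we skip (all names hypothetical):

```python
# Hypothetical sketch of the Miipher-style fixed-point refinement we omit:
# run the feature cleaner repeatedly on its own output until convergence.
feats = noisy_feats
for _ in range(max_iters):
    cleaned = cleaner(feats, plbert_emb, speaker_emb)
    if (cleaned - feats).norm() < tol:
        break
    feats = cleaned
```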
We had to scale the architecture down a bit to get good results, but we believe that with the right dataset and the right hyperparameters, the full-size model would perform better. Here is what the current, successful architecture looks like:
We ignore the speaker-identity pathway, feed the PL-BERT embeddings directly into the cross-attention, and stack the block only 2x.
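A minimal PyTorch sketch of that layout, with illustrative dimensions and module names (see model_training-test4.py for the real definitions):

```python
import torch
import torch.nn as nn

class CleanerBlock(nn.Module):
    """One block: self-attention over the Whisper features, cross-attention
    into the PL-BERT embeddings, then a feed-forward layer (pre-norm)."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, plbert_emb):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # PL-BERT embeddings go in directly as keys/values (assumed projected
        # to d_model); no speaker conditioning in this scaled-down variant.
        h = self.norm2(x)
        x = x + self.cross_attn(h, plbert_emb, plbert_emb)[0]
        x = x + self.ff(self.norm3(x))
        return x

class FeatureCleaner(nn.Module):
    def __init__(self, d_model=512, n_blocks=2):  # stacked only 2x
        super().__init__()
        self.blocks = nn.ModuleList([CleanerBlock(d_model) for _ in range(n_blocks)])

    def forward(self, whisper_feats, plbert_emb):
        x = whisper_feats
        for block in self.blocks:
            x = block(x, plbert_emb)
        return x
```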
The PL-BERT and speaker-encoder embeddings are computed on the fly during training. The Whisper embeddings, however, are preprocessed and stored on disk (WARNING: this takes up a lot of space). Ideally, with better resources, all inputs would be computed on the fly during training.
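A hedged sketch of that preprocessing step (the paths, model size, and fp16-on-disk choice are illustrative):

```python
import os
import torch
import whisper

model = whisper.load_model("base")

def cache_whisper_embeddings(wav_paths, out_dir):
    """Precompute Whisper encoder features and store one .pt file per clip.
    WARNING: each clip yields a (1500, d_model) tensor, so this gets large."""
    os.makedirs(out_dir, exist_ok=True)
    for path in wav_paths:
        audio = whisper.load_audio(path)  # 16 kHz mono float32
        mel = whisper.log_mel_spectrogram(whisper.pad_or_trim(audio))
        with torch.no_grad():
            feats = model.encoder(mel.unsqueeze(0).to(model.device)).squeeze(0)
        name = os.path.splitext(os.path.basename(path))[0] + ".pt"
        torch.save(feats.half().cpu(), os.path.join(out_dir, name))  # fp16 to save space
```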
To see how the data is loaded, look at dataloading_test1.py. The architecture used for the best checkpoint is in model_training-test4.py; this is also the training script, so use it (and load any necessary checkpoint) for further training.
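For orientation, the loading pattern is roughly the following; this is a sketch with a hypothetical file layout, not the actual dataloading_test1.py code:

```python
import os
import torch
import torchaudio
from torch.utils.data import Dataset

class MiipherDataset(Dataset):
    """Pairs each utterance with its cached Whisper features; the PL-BERT and
    ECAPA-TDNN embeddings are computed on the fly in the training loop."""

    def __init__(self, wav_dir, whisper_dir):
        self.items = sorted(os.path.splitext(f)[0] for f in os.listdir(wav_dir)
                            if f.endswith(".wav"))
        self.wav_dir, self.whisper_dir = wav_dir, whisper_dir

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        name = self.items[i]
        wav, sr = torchaudio.load(os.path.join(self.wav_dir, name + ".wav"))
        whisper_feats = torch.load(os.path.join(self.whisper_dir, name + ".pt"))
        return wav, sr, whisper_feats.float()
```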
The original architecture is in model_training.py.
Results can be seen in unseen and trainOutputs: unseen holds audio the model never saw during training, and trainOutputs holds audio it did see. Each subfolder contains three audio files: original, noisy (passed directly through the vocoder), and miipher (passed through the Miipher2.0 model and then the vocoder).
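So the output layout looks roughly like this (file names are illustrative; each subfolder corresponds to one utterance):

```
unseen/            # audio the model never saw during training
  <utterance>/
    original       # clean reference
    noisy          # degraded audio passed straight through the vocoder
    miipher        # cleaned by miipher2.0, then vocoded
trainOutputs/      # audio seen during training, same structure
```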