bshall / acoustic-model

Acoustic models for: A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion
https://bshall.github.io/soft-vc/
MIT License

Information about a complete training pipeline? #3

Closed agonzalezd closed 2 years ago

agonzalezd commented 2 years ago

Greetings.

I am aware of the different repositories involved in building a voice conversion model. However, little information about the complete training pipeline is covered in them. Could the README.md be extended with instructions for training a voice conversion model from scratch, similar to the information provided in your parallel hubert repository? Information such as:

Thanks in advance for your time.

bshall commented 2 years ago

Hi @agonzalezd, thanks for the interest. I'm working on the README.md now and will have it done by the end of the week. I'll let you know as soon as I push it.

eschmidbauer commented 2 years ago

Any updates on the training README? I'm very impressed by this framework, and we would very much like to be able to train our own model for testing.

bshall commented 2 years ago

@agonzalezd and @eschmidbauer, sorry about the long delay. I've updated the README with instructions on how to train the acoustic model on LJSpeech. It should be pretty simple to adapt to your own datasets. Let me know if you need anything else.

Good luck with the training!

eschmidbauer commented 2 years ago

@bshall thank you so much! I've got a model training already. One thing: I was hoping to fine-tune from the model you have posted, but I get an error when I try to resume from it.

python3 train.py --resume ./pretrained/hubert-soft-aa8d82f5.pt custom_dataset custom_model
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 3
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 4
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:torch.distributed.distributed_c10d:Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:torch.distributed.distributed_c10d:Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:__mp_main__:Loading checkpoint from pretrained/hubert-soft-aa8d82f5.pt
Traceback (most recent call last):
  File "/root/bshall/acoustic-model/train.py", line 306, in <module>
    mp.spawn(
  File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/root/bshall/acoustic-model/train.py", line 121, in train
    global_step, best_loss = load_checkpoint(
  File "/root/bshall/acoustic-model/acoustic/utils.py", line 83, in load_checkpoint
    acoustic.load_state_dict(checkpoint["acoustic-model"])
KeyError: 'acoustic-model'
bshall commented 2 years ago

@eschmidbauer, glad you've got it working. I edited the checkpoint to decrease the upload size, so it doesn't include the optimizer state either. I think the easiest way to get fine-tuning working for now would be to replace the load_checkpoint function with:

def load_checkpoint(
    load_path,
    acoustic,
    optimizer,
    rank,
    logger,
):
    logger.info(f"Loading checkpoint from {load_path}")
    checkpoint = torch.load(load_path, map_location={"cuda:0": f"cuda:{rank}"})
    # The released checkpoint is a bare state_dict with no optimizer state or
    # step counter, so load it directly into the model.
    acoustic.load_state_dict(checkpoint)
    # Start fine-tuning from step 0 with the best loss reset.
    return 0, float("inf")

Then, after your training process saves a checkpoint, you can revert to the original load_checkpoint function and use that going forward.

I'll add proper fine-tuning functionality early next week.

agonzalezd commented 2 years ago

Hi, everyone

This is so nice! Thanks for the effort and sharing your code with us!

By the way, is there a requirements file for the required Python packages? Are only torch and torchaudio needed?

Thanks in advance!

eschmidbauer commented 2 years ago

Hi, quick update. Trying to resume training, I get the following error:

python3 train.py --resume ljspeech_model/model-best.pt LJSpeech-1.1 ./ljspeech_model2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 4
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 3
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:torch.distributed.distributed_c10d:Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:torch.distributed.distributed_c10d:Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:torch.distributed.distributed_c10d:Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:__mp_main__:Loading checkpoint from ljspeech_model/model-best.pt
Traceback (most recent call last):
  File "/root/bshall/acoustic-model/train.py", line 306, in <module>
    mp.spawn(
  File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/root/bshall/acoustic-model/train.py", line 121, in train
    global_step, best_loss = load_checkpoint(
  File "/root/bshall/acoustic-model/acoustic/utils.py", line 83, in load_checkpoint
    acoustic.load_state_dict(checkpoint["acoustic-model"])
  File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DistributedDataParallel:
    Missing key(s) in state_dict: "module.encoder.prenet.net.0.weight", "module.encoder.prenet.net.0.bias", "module.encoder.prenet.net.3.weight", "module.encoder.prenet.net.3.bias", "module.encoder.convs.0.weight", "module.encoder.convs.0.bias", "module.encoder.convs.3.weight", "module.encoder.convs.3.bias", "module.encoder.convs.4.weight", "module.encoder.convs.4.bias", "module.encoder.convs.7.weight", "module.encoder.convs.7.bias", "module.decoder.prenet.net.0.weight", "module.decoder.prenet.net.0.bias", "module.decoder.prenet.net.3.weight", "module.decoder.prenet.net.3.bias", "module.decoder.lstm1.weight_ih_l0", "module.decoder.lstm1.weight_hh_l0", "module.decoder.lstm1.bias_ih_l0", "module.decoder.lstm1.bias_hh_l0", "module.decoder.lstm2.weight_ih_l0", "module.decoder.lstm2.weight_hh_l0", "module.decoder.lstm2.bias_ih_l0", "module.decoder.lstm2.bias_hh_l0", "module.decoder.lstm3.weight_ih_l0", "module.decoder.lstm3.weight_hh_l0", "module.decoder.lstm3.bias_ih_l0", "module.decoder.lstm3.bias_hh_l0", "module.decoder.proj.weight".
    Unexpected key(s) in state_dict: "encoder.prenet.net.0.weight", "encoder.prenet.net.0.bias", "encoder.prenet.net.3.weight", "encoder.prenet.net.3.bias", "encoder.convs.0.weight", "encoder.convs.0.bias", "encoder.convs.3.weight", "encoder.convs.3.bias", "encoder.convs.4.weight", "encoder.convs.4.bias", "encoder.convs.7.weight", "encoder.convs.7.bias", "decoder.prenet.net.0.weight", "decoder.prenet.net.0.bias", "decoder.prenet.net.3.weight", "decoder.prenet.net.3.bias", "decoder.lstm1.weight_ih_l0", "decoder.lstm1.weight_hh_l0", "decoder.lstm1.bias_ih_l0", "decoder.lstm1.bias_hh_l0", "decoder.lstm2.weight_ih_l0", "decoder.lstm2.weight_hh_l0", "decoder.lstm2.bias_ih_l0", "decoder.lstm2.bias_hh_l0", "decoder.lstm3.weight_ih_l0", "decoder.lstm3.weight_hh_l0", "decoder.lstm3.bias_ih_l0", "decoder.lstm3.bias_hh_l0", "decoder.proj.weight".
bshall commented 2 years ago

@eschmidbauer, sorry about that. I've re-uploaded the pretrained checkpoints and updated the load_checkpoint function to handle resuming from the pretrained weights. Would you mind trying again and letting me know if you have any issues?
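
For reference, here's a minimal sketch of what handling both checkpoint formats could look like; the function name and checkpoint keys below are illustrative, based on the tracebacks in this thread, and not necessarily the updated repository code:

import torch

def load_acoustic_state(acoustic, load_path, rank):
    # Illustrative sketch, not the repository's actual load_checkpoint.
    checkpoint = torch.load(load_path, map_location={"cuda:0": f"cuda:{rank}"})
    # Training checkpoints keep the weights under the "acoustic-model" key,
    # while the released pretrained file is a bare state_dict.
    state_dict = checkpoint.get("acoustic-model", checkpoint)
    # DistributedDataParallel prefixes every parameter name with "module.",
    # so add the prefix if the checkpoint was saved without it.
    if not next(iter(state_dict)).startswith("module."):
        state_dict = {f"module.{key}": value for key, value in state_dict.items()}
    acoustic.load_state_dict(state_dict)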

eschmidbauer commented 2 years ago

Hi, quick update. I'm still unable to "resume" training from my own checkpoint, but I am able to resume from the one you provided here. Below is the output I get when trying to resume from my own checkpoint (trained from scratch):

python3 train.py --resume ljspeech/acoustic.pt LJSpeech-1.1 ./ljspeech_model
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 3
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 5
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 4
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 6 nodes.
INFO:torch.distributed.distributed_c10d:Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 6 nodes.
INFO:torch.distributed.distributed_c10d:Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 6 nodes.
INFO:torch.distributed.distributed_c10d:Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 6 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 6 nodes.
INFO:torch.distributed.distributed_c10d:Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 6 nodes.
INFO:__mp_main__:Loading checkpoint from pretrained_models/acoustic.pt
Traceback (most recent call last):
  File "/root/bshall/acoustic-model/train.py", line 306, in <module>
    mp.spawn(
  File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/root/bshall/acoustic-model/train.py", line 121, in train
    global_step, best_loss = load_checkpoint(
  File "/root/bshall/acoustic-model/acoustic/utils.py", line 83, in load_checkpoint
    acoustic.load_state_dict(checkpoint["acoustic-model"])
  File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DistributedDataParallel:
    Missing key(s) in state_dict: "module.encoder.prenet.net.0.weight", "module.encoder.prenet.net.0.bias", "module.encoder.prenet.net.3.weight", "module.encoder.prenet.net.3.bias", "module.encoder.convs.0.weight", "module.encoder.convs.0.bias", "module.encoder.convs.3.weight", "module.encoder.convs.3.bias", "module.encoder.convs.4.weight", "module.encoder.convs.4.bias", "module.encoder.convs.7.weight", "module.encoder.convs.7.bias", "module.decoder.prenet.net.0.weight", "module.decoder.prenet.net.0.bias", "module.decoder.prenet.net.3.weight", "module.decoder.prenet.net.3.bias", "module.decoder.lstm1.weight_ih_l0", "module.decoder.lstm1.weight_hh_l0", "module.decoder.lstm1.bias_ih_l0", "module.decoder.lstm1.bias_hh_l0", "module.decoder.lstm2.weight_ih_l0", "module.decoder.lstm2.weight_hh_l0", "module.decoder.lstm2.bias_ih_l0", "module.decoder.lstm2.bias_hh_l0", "module.decoder.lstm3.weight_ih_l0", "module.decoder.lstm3.weight_hh_l0", "module.decoder.lstm3.bias_ih_l0", "module.decoder.lstm3.bias_hh_l0", "module.decoder.proj.weight".
    Unexpected key(s) in state_dict: "encoder.prenet.net.0.weight", "encoder.prenet.net.0.bias", "encoder.prenet.net.3.weight", "encoder.prenet.net.3.bias", "encoder.convs.0.weight", "encoder.convs.0.bias", "encoder.convs.3.weight", "encoder.convs.3.bias", "encoder.convs.4.weight", "encoder.convs.4.bias", "encoder.convs.7.weight", "encoder.convs.7.bias", "decoder.prenet.net.0.weight", "decoder.prenet.net.0.bias", "decoder.prenet.net.3.weight", "decoder.prenet.net.3.bias", "decoder.lstm1.weight_ih_l0", "decoder.lstm1.weight_hh_l0", "decoder.lstm1.bias_ih_l0", "decoder.lstm1.bias_hh_l0", "decoder.lstm2.weight_ih_l0", "decoder.lstm2.weight_hh_l0", "decoder.lstm2.bias_ih_l0", "decoder.lstm2.bias_hh_l0", "decoder.lstm3.weight_ih_l0", "decoder.lstm3.weight_hh_l0", "decoder.lstm3.bias_ih_l0", "decoder.lstm3.bias_hh_l0", "decoder.proj.weight".
bshall commented 2 years ago

@eschmidbauer, thanks for the update! I've fixed the save_checkpoint function so you should now be able to resume from your own checkpoints as well. Sorry about these little bugs.
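
For anyone hitting the same key mismatch: DistributedDataParallel prefixes parameter names with "module.", so a checkpoint saved from the wrapped model can't be loaded directly into an unwrapped one, and vice versa. Below is a rough sketch of a save routine that always stores the unwrapped weights; the field names and signature are assumptions, not necessarily the repository's final code:

import torch

def save_checkpoint(checkpoint_dir, acoustic, optimizer, step, loss, logger):
    # Save the state_dict of the underlying module (acoustic.module) so the
    # keys carry no "module." prefix, matching the released pretrained files.
    state = {
        "acoustic-model": acoustic.module.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
        "loss": loss,
    }
    checkpoint_path = checkpoint_dir / f"model-{step}.pt"
    torch.save(state, checkpoint_path)
    logger.info(f"Saved checkpoint: {checkpoint_path}")

A loader like the sketch further up can then add the prefix back before calling load_state_dict on the DDP-wrapped model.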

youssefabdelm commented 1 year ago

Hi @agonzalezd, thanks so much for making this incredible model!

I'm also trying to fine-tune and am getting this error; I'd be curious to know what I'm doing wrong:

!python train.py --resume /content/acoustic-model/checkpoint/hubert-soft-0321fd7e.pt /content/dataset/ /content/new_checkpoint/

INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7
DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7
DEBUG:root:Initializing MLIR with module: _site_initialize_0
DEBUG:root:Registering dialects from initializer <module 'jaxlib.mlir._mlir_libs._site_initialize_0' from '/usr/local/lib/python3.7/dist-packages/jaxlib/mlir/_mlir_libs/_site_initialize_0.so'>
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:566: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
INFO:__mp_main__:Loading checkpoint from /content/acoustic-model/checkpoint/hubert-soft-0321fd7e.pt
Traceback (most recent call last):
  File "train.py", line 310, in <module>
    join=True,
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/content/acoustic-model/train.py", line 135, in train
    n_epochs = STEPS // len(train_loader) + 1
ZeroDivisionError: integer division or modulo by zero
youssefabdelm commented 1 year ago

I converted all my .wav files to 16 kHz, mono channel. My dataset is about 30 minutes of audio, split into 5-10 second chunks.
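
Side note, in case it's useful to others: a minimal torchaudio sketch of that 16 kHz mono conversion, using a hypothetical file name.

import torchaudio
import torchaudio.functional as F

# Hypothetical input file; repeat for every clip in the dataset.
wav, sr = torchaudio.load("example.wav")
wav = wav.mean(dim=0, keepdim=True)                  # down-mix to mono
wav = F.resample(wav, orig_freq=sr, new_freq=16000)  # resample to 16 kHz
torchaudio.save("example_16k.wav", wav, 16000)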

My folder structure looks like this:

dataset/
  wavs/
    train/
    test/
    dev/
  mels/
  soft/

Edit: I printed len(train_dataset) and it gives me 1, and len(train_loader) gives me 0.
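
That would explain the ZeroDivisionError above: with only one usable item and a batch size larger than the dataset, a DataLoader that drops incomplete batches has length 0, so STEPS // len(train_loader) divides by zero. A minimal illustration (the batch size and drop_last setting here are assumed, not the repository's actual values):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(1, 10))                  # one usable utterance
loader = DataLoader(dataset, batch_size=32, drop_last=True)  # assumed settings

print(len(dataset))  # 1
print(len(loader))   # 0 -> STEPS // len(loader) raises ZeroDivisionError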

youssefabdelm commented 1 year ago

When I ran mels.py I got:

Extracting features for /content/dataset/wavs
100% 234/234 [00:01<00:00, 191.27it/s]
Wrote 1 utterances, 725 frames (0.00 hours)

Not sure if this is the expected output or not.

youssefabdelm commented 1 year ago

I think I may have fixed that issue. My dataset's filenames had a long string at the front, like testingtestingtesting$1, testingtestingtesting$2, and so on.

I renamed all the files to simple numbers: 1, 2, 3, ...

And this is what I get when running mels.py:

Extracting features for /content/dataset/wavs
100% 234/234 [00:01<00:00, 181.79it/s]
Wrote 230 utterances, 171887 frames (0.48 hours)

I reran encode.py after that.

Now I do have another bug though when running train.py:

INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
Traceback (most recent call last):
  File "train.py", line 452, in <module>
    join=True,
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/content/hubert/train.py", line 106, in train
    train=True,
  File "/content/hubert/hubert/dataset.py", line 25, in __init__
    with open(root / "lengths.json") as file:
FileNotFoundError: [Errno 2] No such file or directory: '/content/dataset/lengths.json'
youssefabdelm commented 1 year ago

Sorry, I just saw the hubert encoding repo. I tried to add the entries just as you described, with the path as the key and the number of samples as the value. I also made sure to remove the .wav extension, as shown in the repo (not sure if this is what is desired), so the key looks like "train/32" rather than "/content/dataset/train/32.wav".
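
In case it helps someone else, here's a rough sketch of generating such a lengths.json with torchaudio; the dataset root and key format are assumptions based on the description above, not taken from the hubert repository:

import json
from pathlib import Path

import torchaudio

wav_root = Path("/content/dataset/wavs")  # assumed dataset layout

lengths = {}
for wav_path in sorted(wav_root.rglob("*.wav")):
    info = torchaudio.info(str(wav_path))
    # Key: path relative to the wav root without the .wav extension,
    # e.g. "train/32"; value: the number of samples in the file.
    key = wav_path.relative_to(wav_root).with_suffix("").as_posix()
    lengths[key] = info.num_frames

with open("/content/dataset/lengths.json", "w") as file:
    json.dump(lengths, file, indent=2)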

Now I'm getting this error when running again:

INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:566: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
INFO:__mp_main__:Loading checkpoint from /content/acoustic-model/checkpoint/hubert-soft-0321fd7e.pt
Traceback (most recent call last):
  File "train.py", line 452, in <module>
    join=True,
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/content/hubert/train.py", line 143, in train
    logger=logger,
  File "/content/hubert/hubert/utils.py", line 55, in load_checkpoint
    hubert.load_state_dict(checkpoint["hubert"])
KeyError: 'hubert'