Hi @agonzalezd, thanks for the interest. I'm working on the README.md
now and will have it done by the end of the week. I'll let you know as soon as I push it.
Any updates on a training README? I'm very impressed by this framework; we would very much like to be able to train our own model for testing.
@agonzalezd and @eschmidbauer, sorry about the long delay. I've updated the README with instructions on how to train the acoustic model on LJSpeech. It should be pretty simple to adapt to your own datasets. Let me know if you need anything else.
Good luck with the training!
@bshall thank you so much! I've got a model training already. One thing: I was hoping to fine-tune from the model you have posted, but I get an error when I try to resume from it.
python3 train.py --resume ./pretrained/hubert-soft-aa8d82f5.pt custom_dataset custom_model
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 3
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 4
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:torch.distributed.distributed_c10d:Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:torch.distributed.distributed_c10d:Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:__mp_main__:Loading checkpoint from pretrained/hubert-soft-aa8d82f5.pt
Traceback (most recent call last):
File "/root/bshall/acoustic-model/train.py", line 306, in <module>
mp.spawn(
File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/root/bshall/acoustic-model/train.py", line 121, in train
global_step, best_loss = load_checkpoint(
File "/root/bshall/acoustic-model/acoustic/utils.py", line 83, in load_checkpoint
acoustic.load_state_dict(checkpoint["acoustic-model"])
KeyError: 'acoustic-model'
@eschmidbauer, glad you've got it working. I did some editing of the checkpoint to reduce the upload size, so it doesn't contain the optimizer state either. I think the easiest way to get fine-tuning working for now is to replace the load_checkpoint
function with:
def load_checkpoint(
    load_path,
    acoustic,
    optimizer,
    rank,
    logger,
):
    logger.info(f"Loading checkpoint from {load_path}")
    checkpoint = torch.load(load_path, map_location={"cuda:0": f"cuda:{rank}"})
    # The trimmed pretrained checkpoint is a bare state dict (no "acoustic-model"
    # or optimizer entries), so load it directly and start from step 0.
    acoustic.load_state_dict(checkpoint)
    return 0, float("inf")
Then, after your training process saves a checkpoint, you can revert to the original load_checkpoint
function and use it going forward.
I'll add proper fine-tuning functionality early next week.
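For completeness, here's roughly what the original-style load_checkpoint does once you have your own full checkpoints again. This is just a sketch; apart from "acoustic-model", the key names ("optimizer", "step", "loss") are illustrative and might not match the repo exactly:

def load_checkpoint(
    load_path,
    acoustic,
    optimizer,
    rank,
    logger,
):
    logger.info(f"Loading checkpoint from {load_path}")
    checkpoint = torch.load(load_path, map_location={"cuda:0": f"cuda:{rank}"})
    # Restore the model weights and the optimizer state,
    # then resume from the saved step and best validation loss.
    acoustic.load_state_dict(checkpoint["acoustic-model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint["step"], checkpoint["loss"]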
Hi, everyone.
This is so nice! Thanks for the effort and for sharing your code with us!
By the way, is there a requirements file for the necessary Python packages? Are only torch and torchaudio needed?
Thanks in advance!
Hi, quick update. Trying to resume training, I get the following error:
python3 train.py --resume ljspeech_model/model-best.pt LJSpeech-1.1 ./ljspeech_model2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 4
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 3
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:torch.distributed.distributed_c10d:Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:torch.distributed.distributed_c10d:Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:torch.distributed.distributed_c10d:Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 5 nodes.
INFO:__mp_main__:Loading checkpoint from ljspeech_model/model-best.pt
Traceback (most recent call last):
File "/root/bshall/acoustic-model/train.py", line 306, in <module>
mp.spawn(
File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/root/bshall/acoustic-model/train.py", line 121, in train
global_step, best_loss = load_checkpoint(
File "/root/bshall/acoustic-model/acoustic/utils.py", line 83, in load_checkpoint
acoustic.load_state_dict(checkpoint["acoustic-model"])
File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DistributedDataParallel:
Missing key(s) in state_dict: "module.encoder.prenet.net.0.weight", "module.encoder.prenet.net.0.bias", "module.encoder.prenet.net.3.weight", "module.encoder.prenet.net.3.bias", "module.encoder.convs.0.weight", "module.encoder.convs.0.bias", "module.encoder.convs.3.weight", "module.encoder.convs.3.bias", "module.encoder.convs.4.weight", "module.encoder.convs.4.bias", "module.encoder.convs.7.weight", "module.encoder.convs.7.bias", "module.decoder.prenet.net.0.weight", "module.decoder.prenet.net.0.bias", "module.decoder.prenet.net.3.weight", "module.decoder.prenet.net.3.bias", "module.decoder.lstm1.weight_ih_l0", "module.decoder.lstm1.weight_hh_l0", "module.decoder.lstm1.bias_ih_l0", "module.decoder.lstm1.bias_hh_l0", "module.decoder.lstm2.weight_ih_l0", "module.decoder.lstm2.weight_hh_l0", "module.decoder.lstm2.bias_ih_l0", "module.decoder.lstm2.bias_hh_l0", "module.decoder.lstm3.weight_ih_l0", "module.decoder.lstm3.weight_hh_l0", "module.decoder.lstm3.bias_ih_l0", "module.decoder.lstm3.bias_hh_l0", "module.decoder.proj.weight".
Unexpected key(s) in state_dict: "encoder.prenet.net.0.weight", "encoder.prenet.net.0.bias", "encoder.prenet.net.3.weight", "encoder.prenet.net.3.bias", "encoder.convs.0.weight", "encoder.convs.0.bias", "encoder.convs.3.weight", "encoder.convs.3.bias", "encoder.convs.4.weight", "encoder.convs.4.bias", "encoder.convs.7.weight", "encoder.convs.7.bias", "decoder.prenet.net.0.weight", "decoder.prenet.net.0.bias", "decoder.prenet.net.3.weight", "decoder.prenet.net.3.bias", "decoder.lstm1.weight_ih_l0", "decoder.lstm1.weight_hh_l0", "decoder.lstm1.bias_ih_l0", "decoder.lstm1.bias_hh_l0", "decoder.lstm2.weight_ih_l0", "decoder.lstm2.weight_hh_l0", "decoder.lstm2.bias_ih_l0", "decoder.lstm2.bias_hh_l0", "decoder.lstm3.weight_ih_l0", "decoder.lstm3.weight_hh_l0", "decoder.lstm3.bias_ih_l0", "decoder.lstm3.bias_hh_l0", "decoder.proj.weight".
@eschmidbauer, sorry about that. I've re-uploaded the pretrained checkpoints and updated the load_checkpoint
function to handle resuming from the pretrained weights. Would you mind trying again and letting me know if you have any issues?
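In case it's useful to anyone hitting the same traceback: the missing/unexpected key lists above are the usual DistributedDataParallel prefix mismatch. The pretrained weights were saved from the bare model, so their keys have no "module." prefix, while the training script loads them into a DDP-wrapped model. A generic workaround (a sketch of the idea, not necessarily the exact code in the updated function) is to load into the wrapped model's .module attribute:

# Sketch: acoustic is assumed to be wrapped in DistributedDataParallel.
checkpoint = torch.load(load_path, map_location={"cuda:0": f"cuda:{rank}"})
# Handle both layouts: a dict with an "acoustic-model" entry or a bare state dict.
state_dict = checkpoint.get("acoustic-model", checkpoint)
# .module is the underlying model, so its keys carry no "module." prefix.
acoustic.module.load_state_dict(state_dict)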
Hi, quick update. I'm still unable to resume training from my own checkpoint, but I am able to resume from the one you provided here. Below is the output I get trying to resume from my own checkpoint (trained from scratch):
python3 train.py --resume ljspeech/acoustic.pt LJSpeech-1.1 ./ljspeech_model
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 3
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 5
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 4
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 6 nodes.
INFO:torch.distributed.distributed_c10d:Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 6 nodes.
INFO:torch.distributed.distributed_c10d:Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 6 nodes.
INFO:torch.distributed.distributed_c10d:Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 6 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 6 nodes.
INFO:torch.distributed.distributed_c10d:Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 6 nodes.
INFO:__mp_main__:Loading checkpoint from pretrained_models/acoustic.pt
Traceback (most recent call last):
File "/root/bshall/acoustic-model/train.py", line 306, in <module>
mp.spawn(
File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/root/bshall/acoustic-model/train.py", line 121, in train
global_step, best_loss = load_checkpoint(
File "/root/bshall/acoustic-model/acoustic/utils.py", line 83, in load_checkpoint
acoustic.load_state_dict(checkpoint["acoustic-model"])
File "/root/miniconda3/envs/softvc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DistributedDataParallel:
Missing key(s) in state_dict: "module.encoder.prenet.net.0.weight", "module.encoder.prenet.net.0.bias", "module.encoder.prenet.net.3.weight", "module.encoder.prenet.net.3.bias", "module.encoder.convs.0.weight", "module.encoder.convs.0.bias", "module.encoder.convs.3.weight", "module.encoder.convs.3.bias", "module.encoder.convs.4.weight", "module.encoder.convs.4.bias", "module.encoder.convs.7.weight", "module.encoder.convs.7.bias", "module.decoder.prenet.net.0.weight", "module.decoder.prenet.net.0.bias", "module.decoder.prenet.net.3.weight", "module.decoder.prenet.net.3.bias", "module.decoder.lstm1.weight_ih_l0", "module.decoder.lstm1.weight_hh_l0", "module.decoder.lstm1.bias_ih_l0", "module.decoder.lstm1.bias_hh_l0", "module.decoder.lstm2.weight_ih_l0", "module.decoder.lstm2.weight_hh_l0", "module.decoder.lstm2.bias_ih_l0", "module.decoder.lstm2.bias_hh_l0", "module.decoder.lstm3.weight_ih_l0", "module.decoder.lstm3.weight_hh_l0", "module.decoder.lstm3.bias_ih_l0", "module.decoder.lstm3.bias_hh_l0", "module.decoder.proj.weight".
Unexpected key(s) in state_dict: "encoder.prenet.net.0.weight", "encoder.prenet.net.0.bias", "encoder.prenet.net.3.weight", "encoder.prenet.net.3.bias", "encoder.convs.0.weight", "encoder.convs.0.bias", "encoder.convs.3.weight", "encoder.convs.3.bias", "encoder.convs.4.weight", "encoder.convs.4.bias", "encoder.convs.7.weight", "encoder.convs.7.bias", "decoder.prenet.net.0.weight", "decoder.prenet.net.0.bias", "decoder.prenet.net.3.weight", "decoder.prenet.net.3.bias", "decoder.lstm1.weight_ih_l0", "decoder.lstm1.weight_hh_l0", "decoder.lstm1.bias_ih_l0", "decoder.lstm1.bias_hh_l0", "decoder.lstm2.weight_ih_l0", "decoder.lstm2.weight_hh_l0", "decoder.lstm2.bias_ih_l0", "decoder.lstm2.bias_hh_l0", "decoder.lstm3.weight_ih_l0", "decoder.lstm3.weight_hh_l0", "decoder.lstm3.bias_ih_l0", "decoder.lstm3.bias_hh_l0", "decoder.proj.weight".
@eschmidbauer, thanks for the update! I've fixed the save_checkpoint
function so you should now be able to resume from your own checkpoints as well. Sorry about these little bugs.
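For context, the fix amounts to saving the underlying module's weights (acoustic.module) so the keys load cleanly later, whether or not the model is wrapped in DDP. A rough sketch of the idea; the actual function, its signature, and the key names other than "acoustic-model" may differ:

def save_checkpoint(checkpoint_dir, acoustic, optimizer, step, loss, logger):
    # Save the underlying module's weights rather than the DDP wrapper's,
    # so the keys carry no "module." prefix when loaded again.
    state = {
        "acoustic-model": acoustic.module.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
        "loss": loss,
    }
    checkpoint_path = checkpoint_dir / f"model-{step}.pt"
    torch.save(state, checkpoint_path)
    logger.info(f"Saved checkpoint: {checkpoint_path}")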
Hi @agonzalezd, thanks so much for making this incredible model!
I'm also trying to fine-tune and am getting this error; I'd be curious to know what I'm doing wrong:
!python train.py --resume /content/acoustic-model/checkpoint/hubert-soft-0321fd7e.pt /content/dataset/ /content/new_checkpoint/
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7
DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7
DEBUG:root:Initializing MLIR with module: _site_initialize_0
DEBUG:root:Registering dialects from initializer <module 'jaxlib.mlir._mlir_libs._site_initialize_0' from '/usr/local/lib/python3.7/dist-packages/jaxlib/mlir/_mlir_libs/_site_initialize_0.so'>
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:566: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
INFO:__mp_main__:Loading checkpoint from /content/acoustic-model/checkpoint/hubert-soft-0321fd7e.pt
Traceback (most recent call last):
File "train.py", line 310, in <module>
join=True,
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/content/acoustic-model/train.py", line 135, in train
n_epochs = STEPS // len(train_loader) + 1
ZeroDivisionError: integer division or modulo by zero
I converted all my .wav files to 16kHz, mono channel. My dataset is about 30 minutes of audio, split into 5-10 second chunks.
My folder structure looks like this:
dataset
  wavs
    train
    test
    dev
  mels
  soft
Edit: I printed len(train_dataset) and it gives me 1, and len(train_loader) gives me 0.
When I ran mels.py I got:
Extracting features for /content/dataset/wavs
100% 234/234 [00:01<00:00, 191.27it/s]
Wrote 1 utterances, 725 frames (0.00 hours)
Not sure if this is the expected output or not.
I think I may have fixed that issue. My dataset's filenames had a long string at the front, like testingtestingtesting$1, testingtestingtesting$2, and so on.
I renamed all the files to plain numbers: 1, 2, 3, ...
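In case it helps anyone, the renaming was just a small script along these lines (the paths are the ones from my setup):

# Rename wavs like "testingtestingtesting$1.wav" to plain "1.wav" so the
# preprocessing scripts pick them up. Run once per split (train/test/dev).
from pathlib import Path

wav_dir = Path("/content/dataset/wavs/train")
for path in sorted(wav_dir.glob("*.wav")):
    number = path.stem.split("$")[-1]  # keep only the trailing number
    path.rename(path.with_name(f"{number}.wav"))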
And this is what I get when running mels.py:
Extracting features for /content/dataset/wavs
100% 234/234 [00:01<00:00, 181.79it/s]
Wrote 230 utterances, 171887 frames (0.48 hours)
I reran encode.py after that.
Now I hit another error, though, when running train.py:
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
Traceback (most recent call last):
File "train.py", line 452, in <module>
join=True,
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/content/hubert/train.py", line 106, in train
train=True,
File "/content/hubert/hubert/dataset.py", line 25, in __init__
with open(root / "lengths.json") as file:
FileNotFoundError: [Errno 2] No such file or directory: '/content/dataset/lengths.json'
Sorry, I just saw the hubert encoding repo. I tried to add entries just as you described, with the path as the key and the number of samples as the value. I also made sure to remove the .wav
extension as shown in the repo (not sure if this is what's desired), so the key looks like "train/32" rather than "/content/dataset/train/32.wav".
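For reference, I generated lengths.json with roughly the following; this is only a sketch, and whether the keys should be relative to the dataset root or to the wavs folder is exactly the part I'm unsure about:

# Build lengths.json mapping extension-less relative paths (e.g. "train/32")
# to the number of samples in each wav.
import json
from pathlib import Path

import torchaudio

root = Path("/content/dataset")
lengths = {}
for path in sorted((root / "wavs").rglob("*.wav")):
    info = torchaudio.info(str(path))
    key = path.relative_to(root / "wavs").with_suffix("").as_posix()
    lengths[key] = info.num_frames

with open(root / "lengths.json", "w") as file:
    json.dump(lengths, file, indent=2)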
Now I'm getting this error when running again:
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:566: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
INFO:__mp_main__:Loading checkpoint from /content/acoustic-model/checkpoint/hubert-soft-0321fd7e.pt
Traceback (most recent call last):
File "train.py", line 452, in <module>
join=True,
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/content/hubert/train.py", line 143, in train
logger=logger,
File "/content/hubert/hubert/utils.py", line 55, in load_checkpoint
hubert.load_state_dict(checkpoint["hubert"])
KeyError: 'hubert'
Greetings.
I am aware of the existence of the different repositories for the generation of a voice conversion model. However, little information about the whole training pipeline is covered in the repositories. Could the README.md file be extended with information for training a voice conversion model from scratch, similar to the information provided in your parallel repository hubert, so that a full training pipeline can be run? Information such as:
- a requirements.txt file
- preprocess.py -i foo -o bar, then train.py -i bar -o model_output, ...
Thanks in advance for your time.