OlaWod / FreeVC

FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion
MIT License

Error while finetuning #56

Closed · MaN0bhiR closed this 1 year ago

MaN0bhiR commented 1 year ago

Hi, I am trying to finetune the FreeVC-s model with a small dataset formatted in the VCTK format. I have made the necessary changes in the config file, and I am not using SR-based augmentation. But when I run train.py it throws this error.

Traceback (most recent call last):
  File "train.py", line 284, in <module>
    main()
  File "train.py", line 49, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/content/FreeVC/train.py", line 115, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/content/FreeVC/train.py", line 136, in train_and_evaluate
    for batch_idx, items in enumerate(train_loader):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 543, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/content/FreeVC/data_utils.py", line 164, in __call__
    c_padded[i, :, :c.size(1)] = c
RuntimeError: The expanded size of the tensor (399) must match the existing size (487) at non-singleton dimension 1.  Target sizes: [1024, 399].  Tensor sizes: [1024, 487]

PS: In my dataset a few speakers have wav files at a 22050 Hz sampling rate, while other speakers' wav files are at 32000 Hz. But I have used downsample.py to ensure that all of them are downsampled to 16000 Hz.
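For reference, a quick sanity check along these lines can confirm that every wav the training config points to really is 16 kHz (the "dataset/vctk-16k" path here is an assumption, not FreeVC's fixed layout; adjust it to wherever downsample.py wrote the files):

import glob
import soundfile as sf

# Hypothetical location of the downsampled wavs; change to your own layout.
bad = []
for path in glob.glob("dataset/vctk-16k/**/*.wav", recursive=True):
    sr = sf.info(path).samplerate
    if sr != 16000:
        bad.append((path, sr))

print(f"{len(bad)} files are not 16 kHz")
for path, sr in bad[:20]:
    print(path, sr)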

OlaWod commented 1 year ago

It seems that the WavLM feature length is not consistent with the spectrogram length. Delete all '.spec.pt' files, make sure the wavs in the wav dir (named 'DUMMY') are all 16 kHz, and run again?
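A rough way to locate the offending utterances is to compare each cached WavLM feature against the wav it should correspond to. The sketch below is only an illustration: it assumes a layout of dataset/wavlm/<spk>/<utt>.pt plus DUMMY/<spk>/<utt>.wav, and that one feature frame covers 320 samples at 16 kHz (WavLM's stride, which also matches the spectrogram hop), so treat the paths and the tolerance as assumptions.

import glob
import os
import torch
import soundfile as sf

HOP = 320  # assumed frames-to-samples ratio at 16 kHz

for pt_path in glob.glob("dataset/wavlm/**/*.pt", recursive=True):
    rel = os.path.relpath(pt_path, "dataset/wavlm")
    wav_path = os.path.join("DUMMY", rel[:-3] + ".wav")  # assumed naming scheme
    if not os.path.exists(wav_path):
        continue
    c = torch.load(pt_path)               # cached WavLM feature, last dim = frames
    n_frames = c.shape[-1]
    expected = sf.info(wav_path).frames // HOP
    if abs(n_frames - expected) > 2:       # allow a small off-by-one tolerance
        print(wav_path, "feature frames:", n_frames, "expected ~", expected)

Any file this flags usually means the .pt was extracted from a wav at a different sample rate, or the wav was replaced after extraction; re-running preprocess_ssl.py on the 16 kHz wavs for those speakers should remove the mismatch.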

MaN0bhiR commented 1 year ago

Well, I have .pt files in the dataset/wavlm folders (not .spec.pt), which were generated using preprocess_ssl.py (I also confirmed that I am passing the downsampled 16 kHz wav folder in the --in_dir argument). I'll try excluding the folders with a different sampling rate to see if that solves the issue.

LeMoyenAge commented 1 year ago

I had the same problem. Using the former versions of data_utils.py and train.py, which didn't contain TextAudioSpeakerCollate(), will temporarily solve it.

MaN0bhiR commented 1 year ago

> I had the same problem. Using the former versions of data_utils.py and train.py, which didn't contain TextAudioSpeakerCollate(), will temporarily solve it.

Thank you, it worked. @OlaWod, does using the older versions of these files affect performance?

OlaWod commented 1 year ago

> I had the same problem. Using the former versions of data_utils.py and train.py, which didn't contain TextAudioSpeakerCollate(), will temporarily solve it.

> Thank you, it worked. @OlaWod, does using the older versions of these files affect performance?

Yes, it has more distortions according to my small-scale test.

MaN0bhiR commented 1 year ago

> Yes, it has more distortions according to my small-scale test.

The model I fine-tuned is messing up the content and pronunciation, but the voices are really accurate. Do you think using the newer version of data_utils would help? Or is it simply a case of overfitting?

OlaWod commented 1 year ago

> Yes, it has more distortions according to my small-scale test.

> The model I fine-tuned is messing up the content and pronunciation, but the voices are really accurate. Do you think using the newer version of data_utils would help? Or is it simply a case of overfitting?

I think it is more likely because of overfitting.

LeMoyenAge commented 1 year ago

> Yes, it has more distortions according to my small-scale test.

> The model I fine-tuned is messing up the content and pronunciation, but the voices are really accurate. Do you think using the newer version of data_utils would help? Or is it simply a case of overfitting?

https://github.com/OlaWod/FreeVC/issues/57#issuecomment-1426857492

Try this in the release version.

MaN0bhiR commented 1 year ago

Hey, after finetuning, the voices are more accurate, but the content is kind of messed up. When I run the same conversion without finetuning (on the provided pre-trained models), the content is preserved while the voices are inaccurate. Do you think more training (finetuning for a little longer) would help? My dataset contains sound files from 10 different speakers with about 2 hours of total duration; is that too little? If not, how many epochs of training would be ideal?

vinypan commented 8 months ago

Hello @MaN0bhiR, how did you end up improving the quality of the converted voices after fine-tuning? Are 2 hours of audio enough for fine-tuning?