bshall / knn-vc

Voice Conversion With Just Nearest Neighbors
https://bshall.github.io/knn-vc/

Hints on improvements for training and matching #33

Closed asusdisciple closed 7 months ago

asusdisciple commented 8 months ago

First of all, thanks for the great model! I have tested it extensively by now and ran across a few problems and performance issues which you might be able to help with.

  1. Matching takes a long time with big datasets (1000+ two-minute files), since it is not multi-GPU. Do you intend to change that in the future?

  2. General behaviour: for training it seems better to have a few big files rather than many small ones (2 min vs. 10 s). I think this might be related to the overhead introduced by all the small .pt feature files. Can you confirm this, or do you believe it is plausible?

  3. My biggest issue so far is fine-tuning the HiFi-GAN vocoder. My notebook with an A4500 seems to be on par with, even outperforming, my DGX Station with 4 x V100 32 GB GPUs, which is strange.

I identified the following things:

During validation the operation is performed on all files (1000+) rather than on a batch. The station and the notebook are about equally fast at validating all files, yet the station uses 4 GPUs running at 100% the whole time and should be much faster. Since this is really slow, how often do you think I should perform validation?

During batch training the notebook also outperforms the station slightly, completing one epoch in about 40 s (station: 48 s). However, when I look at nvidia-smi on the station, GPU usage is at 0% the whole time.

Unfortunately there seem to be some serious issues with the multi-GPU approach. If I use only one GPU on the station, one epoch takes about 17 seconds. Maybe you have an idea of what goes wrong here?

Edit: when I monitored the epochs in the multi-GPU setting, the epoch itself trained really fast, in about 5 seconds. However, before the progress bar appears, some loading happens that takes the other 50 seconds. Do you know which process is responsible for that time gap, or how to minimize it?

What I have tried so far: different batch sizes and adjusting the number of workers in the config, but neither really changed the results much.
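One interim workaround for the single-GPU matching bottleneck in point 1, sketched here purely as an assumption and not anything knn-vc provides, is to shard the source file list across devices and run one matching process per GPU:

```python
# Sketch (assumption, not knn-vc code): split a file list round-robin so
# each GPU gets its own shard, then each shard can be matched in a
# separate process pinned to one device.
def shard(files, n_gpus):
    """Round-robin split of `files` into n_gpus lists, one per device."""
    return [files[i::n_gpus] for i in range(n_gpus)]

files = [f"utt{i}.wav" for i in range(10)]  # hypothetical file names
chunks = shard(files, 4)
# chunks[i] would then be processed on device f"cuda:{i}" in its own process
print([len(c) for c in chunks])  # [3, 3, 2, 2]
```

The actual per-device matching loop would still use the knn-vc matching code unchanged; only the file list is partitioned up front.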

asusdisciple commented 8 months ago

I was able to solve the issue. The problem lies with the train_loader: most of the "training time" in a multi-GPU setting comes from heavy data loading. By default the workers are shut down after every epoch, so the dataset has to be loaded again. With persistent_workers=True the training time drops to 13 s/epoch on 4 GPUs. It is still not optimal, since actual training happens during only a few seconds of that window, but it is a large improvement.
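The fix above can be sketched as follows. This is a minimal illustration of the persistent_workers flag on a toy dataset, not the knn-vc train_loader itself:

```python
# Minimal sketch (assumption: standard PyTorch DataLoader behaviour):
# with persistent_workers=True, worker processes survive between epochs
# instead of being torn down and respawned, so the dataset is not
# re-initialised at every epoch boundary.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8).float())
loader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=2,            # persistent_workers requires num_workers > 0
    persistent_workers=True,  # reuse workers across epochs
)

for epoch in range(2):        # workers stay alive between these epochs
    n = sum(b[0].numel() for b in loader)
print(n)  # 8 items seen per epoch
```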

Another thing I noticed: set the batch_size down to a rather small value, since the loading to the device in

    for i, batch in pb:
        if rank == 0:
            start_b = time.time()
        x, y, _, y_mel = batch
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        y_mel = y_mel.to(device, non_blocking=True)
        y = y.unsqueeze(1)
took a long time. Now one epoch trains in 7 seconds. Just wanted to let you guys know, in case you run into the same problems.
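A related point, stated as an assumption about standard PyTorch behaviour rather than anything specific to knn-vc: non_blocking=True copies like the ones in that loop only overlap with compute when the host tensors live in pinned memory, which the DataLoader can arrange via pin_memory=True:

```python
# Sketch: pinned host buffers enable asynchronous host-to-device copies,
# so .to(device, non_blocking=True) can overlap with GPU compute.
import torch
from torch.utils.data import DataLoader, TensorDataset

loader = DataLoader(
    TensorDataset(torch.randn(16, 4)),
    batch_size=4,
    pin_memory=torch.cuda.is_available(),  # pin host buffers when a GPU exists
)
device = "cuda" if torch.cuda.is_available() else "cpu"
for (x,) in loader:
    # async only if x is pinned and device is CUDA; otherwise a normal copy
    x = x.to(device, non_blocking=True)
print(x.shape)  # torch.Size([4, 4])
```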

HninLwin-byte commented 8 months ago

I was fine-tuning the model with my own dataset and got this error. If someone has encountered the same error, please share how you solved it.

checkpoints directory: /content/drive/MyDrive/data/knn-vc/pertained_model

    /usr/local/lib/python3.10/dist-packages/torchaudio/transforms/_transforms.py:580: UserWarning: Argument 'onesided' has been deprecated and has no influence on the behavior of this module.
      warnings.warn(
    Epoch: 1:   0% 0/100 [00:00<?, ?it/s]
      0% 0/31 [00:00<?, ?it/s]
    Before padding - Wav shape: torch.Size([1, 7040])
    After padding - Wav shape: torch.Size([1, 7744])
    Before padding - Wav shape: torch.Size([1, 7040])
    After padding - Wav shape: torch.Size([1, 7744])
    Before padding - Wav shape: torch.Size([1, 0])
    Before padding - Wav shape: torch.Size([1, 7040])
    After padding - Wav shape: torch.Size([1, 7744])
      0% 0/31 [00:01<?, ?it/s]
    Epoch: 1:   0% 0/100 [00:01<?, ?it/s]
    Traceback (most recent call last):
      File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/content/drive/.shortcut-targets-by-id/1PdgEkagsNCmi1R4J43LiU8ygwHM4ooqv/data/knn-vc/hifigan/train.py", line 342, in <module>
        main()
      File "/content/drive/.shortcut-targets-by-id/1PdgEkagsNCmi1R4J43LiU8ygwHM4ooqv/data/knn-vc/hifigan/train.py", line 338, in main
        train(0, a, h)
      File "/content/drive/.shortcut-targets-by-id/1PdgEkagsNCmi1R4J43LiU8ygwHM4ooqv/data/knn-vc/hifigan/train.py", line 145, in train
        for i, batch in pb:
      File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1182, in __iter__
        for obj in iterable:
      File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
        data = self._next_data()
      File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
        return self._process_data(data)
      File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
        data.reraise()
      File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 694, in reraise
        raise exception
    RuntimeError: Caught RuntimeError in DataLoader worker process 0.
    Original Traceback (most recent call last):
      File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
        data = fetcher.fetch(index)
      File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/content/drive/.shortcut-targets-by-id/1PdgEkagsNCmi1R4J43LiU8ygwHM4ooqv/data/knn-vc/hifigan/meldataset.py", line 203, in __getitem__
        mel_loss = self.alt_melspec(audio)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
      File "/content/drive/.shortcut-targets-by-id/1PdgEkagsNCmi1R4J43LiU8ygwHM4ooqv/data/knn-vc/hifigan/meldataset.py", line 77, in forward
        wav = F.pad(wav, ((self.n_fft - self.hop_size) // 2, (self.n_fft - self.hop_size) // 2), "reflect")
    RuntimeError: Expected 2D or 3D (batch mode) tensor with possibly 0 batch size and other non-zero dimensions for input, but got: [1, 0]

RF5 commented 8 months ago

Hi @asusdisciple, thanks for the suggestions. We have added the persistent-workers trick to the training code now.

And @HninLwin-byte, thanks for your issue. It looks like one of your audio files might be corrupt or shorter than the minimum allowable length (about 160 ms); the torch.Size([1, 0]) in your log shows one clip loading with zero samples. I would double-check that every file in your dataset is readable, not corrupt, and at least 160 ms long.
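A quick sanity check along these lines can be sketched with the stdlib wave module. This is only an illustration: knn-vc itself loads audio differently, and the 160 ms threshold here is just the figure quoted in the reply above:

```python
# Sketch: reject WAV files that are unreadable or shorter than ~160 ms,
# mirroring the failure mode in the traceback (a clip with 0 samples).
import os
import tempfile
import wave

MIN_MS = 160  # approximate minimum clip length quoted above (assumption)

def long_enough(path, min_ms=MIN_MS):
    """Return True if the WAV at `path` is readable and >= min_ms long."""
    try:
        with wave.open(path, "rb") as f:
            frames, sr = f.getnframes(), f.getframerate()
    except (wave.Error, EOFError, OSError):
        return False  # unreadable / corrupt file
    return frames / sr >= min_ms / 1000.0

# Usage example: a 100 ms silent clip should be rejected.
tmp = os.path.join(tempfile.gettempdir(), "clip.wav")
with wave.open(tmp, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(16000)
    f.writeframes(b"\x00\x00" * 1600)  # 1600 frames at 16 kHz = 100 ms
print(long_enough(tmp))  # False: shorter than 160 ms
```

Running this over the whole dataset before training would surface any zero-length or corrupt clips up front instead of mid-epoch.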