bshall / knn-vc

Voice Conversion With Just Nearest Neighbors
https://bshall.github.io/knn-vc/

Maybe mention memory consumption in readme.md? #29

Closed: asusdisciple closed this issue 11 months ago

asusdisciple commented 11 months ago

I just tried to test your model, but unfortunately I only have a GPU with 16 GB of VRAM. Apparently WavLM takes about 12 GB and HiFi-GAN needs another 5 GB, so you need around 20 GB of GPU memory in total to run inference. It would be nice to clarify that in the requirements section :)
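A quick way to check the actual peak usage on a given setup, as a hedged sketch (only the hub call is from the README; the measurement uses standard PyTorch CUDA memory counters):

```python
import torch

# Load as in the README, then measure peak CUDA memory around one inference.
knn_vc = torch.hub.load('bshall/knn-vc', 'knn_vc', prematched=True,
                        trusted_repo=True, device='cuda')
torch.cuda.reset_peak_memory_stats()

# ... run get_matching_set / get_features / match here ...

print(f'peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB')
```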

asusdisciple commented 11 months ago

Also, I found what looks like a bug. I tried to run inference with your standard example on a cluster; this is what I got:

Traceback (most recent call last):
  File "/raid/asus/p_knn-vc/knn-vc/inf_test.py", line 12, in <module>
    matching_set = knn_vc.get_matching_set(ref_wav_paths)
  File "/home/asus/.cache/torch/hub/bshall_knn-vc_master/matcher.py", line 66, in get_matching_set
    feats.append(self.get_features(p, weights=self.weighting if weights is None else weights, vad_trigger_level=vad_trigger_level))
  File "/raid/asus/v1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/asus/.cache/torch/hub/bshall_knn-vc_master/matcher.py", line 117, in get_features
    features = self.wavlm.extract_features(wav_input_16khz, output_layer=SPEAKER_INFORMATION_LAYER, ret_layer_results=False)[0]
  File "/home/asus/.cache/torch/hub/bshall_knn-vc_master/wavlm/WavLM.py", line 364, in extract_features
    x, layer_results = self.encoder(
  File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/asus/.cache/torch/hub/bshall_knn-vc_master/wavlm/WavLM.py", line 565, in forward
    x, layer_results = self.extract_features(x, padding_mask, streaming_mask, layer)
  File "/home/asus/.cache/torch/hub/bshall_knn-vc_master/wavlm/WavLM.py", line 598, in extract_features
    x, z, pos_bias = layer(x, self_attn_padding_mask=padding_mask, need_weights=False,
  File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/asus/.cache/torch/hub/bshall_knn-vc_master/wavlm/WavLM.py", line 693, in forward
    x, attn, pos_bias = self.self_attn(
  File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/asus/.cache/torch/hub/bshall_knn-vc_master/wavlm/modules.py", line 505, in forward
    position_bias = self.compute_bias(tgt_len, src_len)
  File "/home/asus/.cache/torch/hub/bshall_knn-vc_master/wavlm/modules.py", line 453, in compute_bias
    values = self.relative_attention_bias(relative_position_bucket)
  File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 103.11 GiB. GPU 0 has a total capacty of 31.74 GiB of which 12.00 GiB is free. Including non-PyTorch memory, this process has 19.71 GiB memory in use. Of the allocated memory 14.69 GiB is allocated by PyTorch, and 4.64 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

RF5 commented 11 months ago

Hi @asusdisciple, I think I know the cause of your problem: how long are the audio files you are using as the reference and the input? WavLM's memory usage scales quadratically with input length, because its self-attention compares every pair of frames, so very long clips for either the input or the reference will cause an OOM error.
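As a rough back-of-the-envelope (approximate numbers: WavLM-Large emits about 50 feature frames per second and uses 16 attention heads): the failing allocation in your traceback is the relative position bias, a tgt_len × src_len × num_heads float tensor. Solving n² × 16 heads × 4 bytes ≈ 103 GiB gives n ≈ 41 000 frames, i.e. an input on the order of 13-14 minutes, which is exactly this quadratic blow-up.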

If you stick to references under 30 s, or chunk any inputs into smaller segments and stitch the WavLM features together after computing them (see the sketch below), you should not see any OOM issues. All evaluations for the original paper were done on a single RTX 2070 SUPER GPU with only 8 GB of memory, using the current codebase here. Evaluation was on the LibriSpeech test sets, where the input audios are well under a minute long.
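A minimal sketch of that chunking idea, assuming the torch.hub entry point from the README; the chunk length and file name are illustrative, and layer 6 is the speaker-information layer that matcher.py extracts:

```python
import torch
import torchaudio

# Load the pipeline as in the README.
knn_vc = torch.hub.load('bshall/knn-vc', 'knn_vc', prematched=True,
                        trusted_repo=True, device='cuda')

wav, sr = torchaudio.load('long_reference.wav')       # hypothetical path
wav = wav.mean(0, keepdim=True)                       # mono, shape (1, samples)
wav = torchaudio.functional.resample(wav, sr, 16000)  # WavLM expects 16 kHz

chunk = 30 * 16000                                    # ~30 s per piece
feats = []
with torch.inference_mode():
    for start in range(0, wav.shape[-1], chunk):
        piece = wav[:, start:start + chunk].to('cuda')
        # Extract layer-6 features for this piece only, then move them off-GPU.
        f = knn_vc.wavlm.extract_features(piece, output_layer=6,
                                          ret_layer_results=False)[0]
        feats.append(f.squeeze(0).cpu())

# Stitch the per-chunk features into one matching set of shape (frames, dim).
matching_set = torch.cat(feats, dim=0)
```

Chunking at a fixed sample offset can smear a phoneme across a boundary, so cutting at silences (e.g. with a VAD) would be slightly cleaner, but for a matching set of reference frames a hard cut is usually fine.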

Hope that helps!

asusdisciple commented 11 months ago

@RF5 you are completely right: I used 13-minute audio files, which did not work at all; after cutting them into two-minute pieces it worked just fine! Thanks for the clarification.
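For completeness, a sketch of the resulting working setup in README terms (file names are placeholders; the references are assumed to be pre-cut into short pieces):

```python
import torch

knn_vc = torch.hub.load('bshall/knn-vc', 'knn_vc', prematched=True,
                        trusted_repo=True, device='cuda')

# References pre-cut into short (e.g. two-minute) pieces, as discussed above.
ref_wav_paths = ['ref_part01.wav', 'ref_part02.wav', 'ref_part03.wav']

query_seq = knn_vc.get_features('source.wav')            # features to convert
matching_set = knn_vc.get_matching_set(ref_wav_paths)    # target-speaker frames
out_wav = knn_vc.match(query_seq, matching_set, topk=4)  # converted 16 kHz audio
```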