asusdisciple closed this issue 11 months ago
Also, I found what looks like a bug. I tried to run inference with your standard example on a cluster, and this is what I got:
Traceback (most recent call last):
File "/raid/asus/p_knn-vc/knn-vc/inf_test.py", line 12, in <module>
matching_set = knn_vc.get_matching_set(ref_wav_paths)
File "/home/asus/.cache/torch/hub/bshall_knn-vc_master/matcher.py", line 66, in get_matching_set
feats.append(self.get_features(p, weights=self.weighting if weights is None else weights, vad_trigger_level=vad_trigger_level))
File "/raid/asus/v1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/asus/.cache/torch/hub/bshall_knn-vc_master/matcher.py", line 117, in get_features
features = self.wavlm.extract_features(wav_input_16khz, output_layer=SPEAKER_INFORMATION_LAYER, ret_layer_results=False)[0]
File "/home/asus/.cache/torch/hub/bshall_knn-vc_master/wavlm/WavLM.py", line 364, in extract_features
x, layer_results = self.encoder(
File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/asus/.cache/torch/hub/bshall_knn-vc_master/wavlm/WavLM.py", line 565, in forward
x, layer_results = self.extract_features(x, padding_mask, streaming_mask, layer)
File "/home/asus/.cache/torch/hub/bshall_knn-vc_master/wavlm/WavLM.py", line 598, in extract_features
x, z, pos_bias = layer(x, self_attn_padding_mask=padding_mask, need_weights=False,
File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/asus/.cache/torch/hub/bshall_knn-vc_master/wavlm/WavLM.py", line 693, in forward
x, attn, pos_bias = self.self_attn(
File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/asus/.cache/torch/hub/bshall_knn-vc_master/wavlm/modules.py", line 505, in forward
position_bias = self.compute_bias(tgt_len, src_len)
File "/home/asus/.cache/torch/hub/bshall_knn-vc_master/wavlm/modules.py", line 453, in compute_bias
values = self.relative_attention_bias(relative_position_bucket)
File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/raid/asus/v1/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 103.11 GiB. GPU 0 has a total capacty of 31.74 GiB of which 12.00 GiB is free. Including non-PyTorch memory, this process has 19.71 GiB memory in use. Of the allocated memory 14.69 GiB is allocated by PyTorch, and 4.64 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Hi @asusdisciple, I think I know the cause of your problem: how long is the audio you are using as the reference and running inference on? WavLM's memory usage scales quadratically with input length, so very long input or reference clips will cause an OOM error.
If you keep references under 30 s, or chunk longer inputs into smaller segments and stitch the WavLM features together after computing them, you should not see any OOM issues. All evaluations for the original paper were done on a single RTX 2070 SUPER GPU with only 8 GB of memory, using the current codebase. They were run on the LibriSpeech test sets, where the input audios are well under a minute long.
Hope that helps!
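A minimal sketch of the chunk-and-stitch idea described above. The 16 kHz sample rate and the `extract_features` call come from the traceback; the chunk length, function names, and the placeholder extractor (one frame per 320 samples, matching WavLM's 16 kHz → 50 Hz downsampling) are illustrative assumptions, not the repo's actual API:

```python
# Sketch: split a long waveform into fixed-length chunks so WavLM's
# quadratic attention cost stays bounded, then stitch the per-chunk
# features back together along the time axis.

SAMPLE_RATE = 16_000
CHUNK_SECONDS = 30                         # stay under ~30 s, per the advice above
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS
FRAME_HOP = 320                            # WavLM emits one frame per 320 samples

def extract_features(wav_chunk):
    """Placeholder for wavlm.extract_features(chunk, ...)[0] on one chunk.

    Returns a fake (n_frames, 1024) feature list so the stitching
    logic can be checked without loading the model.
    """
    n_frames = len(wav_chunk) // FRAME_HOP
    return [[0.0] * 1024 for _ in range(n_frames)]

def features_for_long_audio(wav):
    """Run the extractor chunk-by-chunk and concatenate the results."""
    feats = []
    for start in range(0, len(wav), CHUNK_SAMPLES):
        chunk = wav[start:start + CHUNK_SAMPLES]
        feats.extend(extract_features(chunk))  # stitch along time
    return feats

# A 13-minute clip like the one in the report would OOM in a single
# pass, but processes fine in 30 s pieces.
long_wav = [0.0] * (13 * 60 * SAMPLE_RATE)
feats = features_for_long_audio(long_wav)
```

One caveat: frames near chunk boundaries lose a little of the attention context they would have had in a single pass, so very short chunks may slightly degrade the matching set; a small overlap between chunks is a common mitigation.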
@RF5 you are completely right - I used 13-minute audio files, which did not work at all; after cutting them into two-minute pieces it worked just fine! Thanks for the clarification.
I just tried to test your model, but unfortunately I only have a GPU with 16 GB of VRAM. Apparently WavLM takes about 12 GB and HiFi-GAN needs another 5 GB, so you need roughly 20 GB of VRAM to run inference. It would be nice to clarify that in the requirements section :)