PlayVoice / whisper-vits-svc

Core Engine of Singing Voice Conversion & Singing Voice Clone
https://huggingface.co/spaces/maxmax20160403/sovits5.0
MIT License

svc_trainer.py error #130

Closed · bukhalmae145 closed this issue 5 months ago

bukhalmae145 commented 11 months ago

python svc_trainer.py -c configs/base.yaml -n sovits5.0

Batch size per GPU : 8
/Users/workstation/Music/whisper-vits-svc/whisper-vits-svc/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
----------10----------
2023-10-15 12:28:02,861 - INFO - Start from 32k pretrain model: ./vits_pretrain/sovits5.0.pretrain.pth
2023-10-15 12:28:03,123 - INFO - Starting new training run.
----------373----------
Validation loop:   0%|          | 0/2 [00:00<?, ?it/s]
/Users/workstation/Music/whisper-vits-svc/vits/attentions.py:319: UserWarning: MPS: The constant padding of more than 3 dimensions is not currently supported natively. It uses View Ops default implementation to run. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Pad.mm:474.)
  x = F.pad(x, commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, 1]]))
/Users/workstation/Music/whisper-vits-svc/whisper-vits-svc/lib/python3.8/site-packages/torch/functional.py:660: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error. Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/SpectralOps.cpp:879.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/Users/workstation/Music/whisper-vits-svc/whisper-vits-svc/lib/python3.8/site-packages/torch/functional.py:660: UserWarning: The operator 'aten::_fft_r2c' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:13.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
Validation loop: 100%|████████████████████████████| 2/2 [00:33<00:00, 16.92s/it]
Loading train data:   0%|          | 0/48 [00:00<?, ?it/s]
/Users/workstation/Music/whisper-vits-svc/whisper-vits-svc/lib/python3.8/site-packages/torch/functional.py:660: UserWarning: A window was not provided. A rectangular window will be applied, which is known to cause spectral leakage. Other windows such as torch.hann_window or torch.hamming_window are recommended to reduce spectral leakage. To suppress this warning and use a rectangular window, explicitly set window=torch.ones(n_fft, device=<device>). (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/SpectralOps.cpp:843.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
Loading train data:   0%|          | 0/48 [01:16<?, ?it/s]
Traceback (most recent call last):
  File "svc_trainer.py", line 41, in <module>
    train(0, args, args.checkpoint_path, hp, hp_str)
  File "/Users/workstation/Music/whisper-vits-svc/vits_extend/train.py", line 223, in train
    loss_g.backward()
  File "/Users/workstation/Music/whisper-vits-svc/whisper-vits-svc/lib/python3.8/site-packages/torch/_tensor.py", line 503, in backward
    torch.autograd.backward(
  File "/Users/workstation/Music/whisper-vits-svc/whisper-vits-svc/lib/python3.8/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Unsupported type byte size: ComplexFloat

MaxMax2016 commented 11 months ago

The PyTorch version should be >= 1.9.
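
For reference, you can check which version is actually installed in the active environment with:

python -c "import torch; print(torch.__version__)"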

bukhalmae145 commented 11 months ago

The PyTorch version should be >= 1.9.

I downloaded the latest version of PyTorch.

MaxMax2016 commented 11 months ago

Sorry, I have no experience with macOS.

bukhalmae145 commented 11 months ago

Sorry, I have no experience with macOS.

Sorry to bother you, but I have another question. Can I use the Korean HuBERT model (https://huggingface.co/team-lucid/hubert-base-korean) with any of your SVC models? The pronunciation of the model I trained with the current HuBERT model sounds weird and awkward.

MaxMax2016 commented 11 months ago

Yes, but you must train a pretrain model using hubert-base-korean with many singers' data.

bukhalmae145 commented 11 months ago

Yes, but you must train a pretrain model using hubert-base-korean with many singers' data.

So you basically mean that I can't use the model from the link above directly?

MaxMax2016 commented 11 months ago

In theory, it can't be used.

bukhalmae145 commented 11 months ago

In theory, it can't be used.

Can you please describe the specific procedure, so that I can adapt the Korean HuBERT model for Grad-SVC?

MaxMax2016 commented 11 months ago

It will take some time.

bukhalmae145 commented 11 months ago

It will take some time.

Yeah, sure, please.

MaxMax2016 commented 11 months ago

I will give a demo in two days.

MaxMax2016 commented 11 months ago
import sys,os
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import numpy as np
import argparse
import torch
import librosa

from transformers import HubertModel

def load_audio(file: str, sr: int = 16000):
    x, sr = librosa.load(file, sr=sr)
    return x

def load_model(path, device):
    model = HubertModel.from_pretrained(path)
    model.eval()
    if not (device == "cpu"):
        model.half()
    model.to(device)
    return model

def pred_vec(model, wavPath, vecPath, device):
    audio = load_audio(wavPath)
    feats = audio
    feats = torch.from_numpy(feats).to(device)
    feats = feats[None, :]
    if not (device == "cpu"):
        feats = feats.half()
    with torch.no_grad():
        vec = model(feats).last_hidden_state
        vec = vec.squeeze().data.cpu().float().numpy()
        print(feats.shape)
        print(vec.shape)   # [length, dim=768] hop=320
    np.save(vecPath, vec, allow_pickle=False)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-w", "--wav", help="wav", dest="wav")
    parser.add_argument("-v", "--vec", help="vec", dest="vec")
    args = parser.parse_args()
    print(args.wav)
    print(args.vec)

    wavPath = args.wav
    vecPath = args.vec

    device = "cuda" if torch.cuda.is_available() else "cpu"
    hubert = load_model('./hubert-base-korean', device)
    pred_vec(hubert, wavPath, vecPath, device)

[image: hubert-korean]

prepare/preprocess_hubert.py should be changed in the same way,

and

in the configs, n_vecs: 256 should be changed to n_vecs: 768.
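
As a rough sketch of that config edit (the surrounding keys in configs/base.yaml may differ), n_vecs must match the 768-dimensional hidden_size of hubert-base-korean:

n_vecs: 768   # was 256; must equal the HuBERT hidden_size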

bukhalmae145 commented 11 months ago
[quotes the extraction script and config notes from the previous comment]

I got this error message:

python prepare/preprocess_hubert.py None None
Some weights of the model checkpoint at ./hubert-base-korean were not used when initializing HubertModel: ['encoder.pos_conv_embed.conv.weight_g', 'encoder.pos_conv_embed.conv.weight_v']

MaxMax2016 commented 11 months ago

Have you changed prepare/preprocess_hubert.py in the same way as this:

def load_model(path, device):
    model = HubertModel.from_pretrained(path)
    model.eval()
    if not (device == "cpu"):
        model.half()
    model.to(device)
    return model

def pred_vec(model, wavPath, vecPath, device):
    audio = load_audio(wavPath)
    feats = audio
    feats = torch.from_numpy(feats).to(device)
    feats = feats[None, :]
    if not (device == "cpu"):
        feats = feats.half()
    with torch.no_grad():
        vec = model(feats).last_hidden_state
        vec = vec.squeeze().data.cpu().float().numpy()
        print(feats.shape)
        print(vec.shape)   # [length, dim=768] hop=320
    np.save(vecPath, vec, allow_pickle=False)
bukhalmae145 commented 11 months ago

Have you changed prepare/preprocess_hubert.py in the same way as this:

[quotes the code from the previous comment]

[two screenshots]

Traceback (most recent call last):
  File "prepare/preprocess_hubert.py", line 53, in <module>
    pred_vec(hubert, wavPath, vecPath, device)
  File "prepare/preprocess_hubert.py", line 26, in pred_vec
    audio = load_audio(wavPath)
  File "prepare/preprocess_hubert.py", line 12, in load_audio
    x, sr = librosa.load(file, sr=sr)
  File "/Users/workstation/Music/Grad-SVC/Grad-SVC/lib/python3.8/site-packages/librosa/core/audio.py", line 183, in load
    y, sr_native = __audioread_load(path, offset, duration, dtype)
  File "/Users/workstation/Music/Grad-SVC/Grad-SVC/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/Users/workstation/Music/Grad-SVC/Grad-SVC/lib/python3.8/site-packages/librosa/util/decorators.py", line 59, in __wrapper
    return func(*args, **kwargs)
  File "/Users/workstation/Music/Grad-SVC/Grad-SVC/lib/python3.8/site-packages/librosa/core/audio.py", line 239, in __audioread_load
    reader = audioread.audio_open(path)
  File "/Users/workstation/Music/Grad-SVC/Grad-SVC/lib/python3.8/site-packages/audioread/__init__.py", line 127, in audio_open
    return BackendClass(path)
  File "/Users/workstation/Music/Grad-SVC/Grad-SVC/lib/python3.8/site-packages/audioread/rawread.py", line 59, in __init__
    self._fh = open(filename, 'rb')
IsADirectoryError: [Errno 21] Is a directory: 'data_gvc/waves-16k/'

MaxMax2016 commented 11 months ago
import sys,os
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import numpy as np
import argparse
import torch
import librosa

from tqdm import tqdm
from transformers import HubertModel

def load_audio(file: str, sr: int = 16000):
    x, sr = librosa.load(file, sr=sr)
    return x

def load_model(path, device):
    model = HubertModel.from_pretrained(path)
    model.eval()
    if not (device == "cpu"):
        model.half()
    model.to(device)
    return model

def pred_vec(model, wavPath, vecPath, device):
    audio = load_audio(wavPath)
    feats = audio
    feats = torch.from_numpy(feats).to(device)
    feats = feats[None, :]
    if not (device == "cpu"):
        feats = feats.half()
    with torch.no_grad():
        vec = model(feats).last_hidden_state
        vec = vec.squeeze().data.cpu().float().numpy()
        # print(feats.shape)
        # print(vec.shape)   # [length, dim=768] hop=320
    np.save(vecPath, vec, allow_pickle=False)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-w", "--wav", help="wav", dest="wav", required=True)
    parser.add_argument("-v", "--vec", help="vec", dest="vec", required=True)

    args = parser.parse_args()
    print(args.wav)
    print(args.vec)
    os.makedirs(args.vec, exist_ok=True)

    wavPath = args.wav
    vecPath = args.vec

    device = "cuda" if torch.cuda.is_available() else "cpu"
    hubert = load_model('./hubert-base-korean', device)

    for spks in os.listdir(wavPath):
        if os.path.isdir(f"./{wavPath}/{spks}"):
            os.makedirs(f"./{vecPath}/{spks}", exist_ok=True)

            files = [f for f in os.listdir(f"./{wavPath}/{spks}") if f.endswith(".wav")]
            for file in tqdm(files, desc=f'Processing vec {spks}'):
                file = file[:-4]
                pred_vec(hubert, f"{wavPath}/{spks}/{file}.wav", f"{vecPath}/{spks}/{file}.vec", device)
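
A possible invocation, assuming the data_gvc/waves-16k layout mentioned earlier in this thread (the output directory name data_gvc/hubert is only an example):

python prepare/preprocess_hubert.py -w data_gvc/waves-16k -v data_gvc/hubert
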
bukhalmae145 commented 11 months ago

[quotes the batch extraction script from the previous comment]

The preprocess_hubert.py problem is solved, but I got this error message when I ran gvc_trainer.py:

Traceback (most recent call last):
  File "gvc_trainer.py", line 30, in <module>
    train(hps, args.checkpoint_path)
  File "/Users/workstation/Music/Grad-SVC/grad_extend/train.py", line 46, in train
    load_model(model, checkpoint['model'])
  File "/Users/workstation/Music/Grad-SVC/grad_extend/utils.py", line 24, in load_model
    model.load_state_dict(new_state_dict)
  File "/Users/workstation/Music/Grad-SVC/Grad-SVC/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GradTTS:
    size mismatch for encoder.prenet.conv_layers.0.weight: copying a param with shape torch.Size([192, 256, 5]) from checkpoint, the shape in current model is torch.Size([192, 768, 5]).

MaxMax2016 commented 11 months ago

[link: model]

Use this model.

bukhalmae145 commented 11 months ago

[link: model]

[screenshot]

Use this model.

I have already put those models in the project. (I actually feel bad that you are working so hard to solve this problem for me. I appreciate it! :) )

MaxMax2016 commented 11 months ago

Maybe check the MD5:

md5sum hubert-base-korean/pytorch_model.bin
be042a8b2ed7126c03b1159f86893b8c  hubert-base-korean/pytorch_model.bin

md5sum hubert-base-korean/config.json
3bec78d9502a1446df4afae5320b450e  hubert-base-korean/config.json

MaxMax2016 commented 11 months ago

md5sum is a command that computes the MD5 hash of a file; the hash is unique to a file's contents, so identical files have identical MD5 hashes.
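
On macOS, which ships md5 instead of md5sum, the -r flag prints output in the same hash-then-filename order:

md5 -r hubert-base-korean/pytorch_model.bin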

bukhalmae145 commented 11 months ago

md5sum is a command that computes the MD5 hash of a file; the hash is unique to a file's contents, so identical files have identical MD5 hashes.

I got this on my terminal:

md5 hubert-base-korean/pytorch_model.bin
MD5 (hubert-base-korean/pytorch_model.bin) = be042a8b2ed7126c03b1159f86893b8c
md5 hubert-base-korean/config.json
MD5 (hubert-base-korean/config.json) = 50e9057abdd7d9944bbfa920cd480596

MaxMax2016 commented 11 months ago

re-download hubert-base-korean/config.json

bukhalmae145 commented 11 months ago

re-download hubert-base-korean/config.json

I get the same MD5 hash even though I re-downloaded config.json.

MaxMax2016 commented 11 months ago

[screenshot: config.json]

Is "hidden_size": 768 right?

bukhalmae145 commented 11 months ago

[screenshot: config.json]

Is "hidden_size": 768 right?

[screenshot]

I still get this error message:

Traceback (most recent call last):
  File "gvc_trainer.py", line 30, in <module>
    train(hps, args.checkpoint_path)
  File "/Users/workstation/Music/Grad-SVC/grad_extend/train.py", line 46, in train
    load_model(model, checkpoint['model'])
  File "/Users/workstation/Music/Grad-SVC/grad_extend/utils.py", line 24, in load_model
    model.load_state_dict(new_state_dict)
  File "/Users/workstation/Music/Grad-SVC/Grad-SVC/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GradTTS:
    size mismatch for encoder.prenet.conv_layers.0.weight: copying a param with shape torch.Size([192, 256, 5]) from checkpoint, the shape in current model is torch.Size([192, 768, 5]).

MaxMax2016 commented 11 months ago

Change config/base.yaml: n_vecs: 256 -> n_vecs: 768

[screenshot]

bukhalmae145 commented 11 months ago

Change config/base.yaml: n_vecs: 256 -> n_vecs: 768

[screenshot]

I have already done that. Can I have the whole Grad-SVC project folder you have?

MaxMax2016 commented 11 months ago

pretrain: "grad_pretrain/gvc.pretrain.pth" is based on the 256-dimensional HuBERT, so it can no longer be used. Set pretrain: "", and note that you need more than 10,000 wavs to train your model.
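
As a sketch, the relevant lines in configs/base.yaml would then be (exact key placement may differ):

pretrain: ""   # the shipped 256-dim gvc.pretrain.pth no longer matches
n_vecs: 768    # dimension of hubert-base-korean features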

bukhalmae145 commented 11 months ago

Change config/base.yaml: n_vecs: 256 -> n_vecs: 768

[screenshot]

I have already done that.

pretrain: "grad_pretrain/gvc.pretrain.pth" is based on the 256-dimensional HuBERT, so it can no longer be used. Set pretrain: "", and note that you need more than 10,000 wavs to train your model.

What are the minimum and maximum lengths for the wav files? And do I have to include breathing-sound wav files in the dataset?

MaxMax2016 commented 11 months ago

2s < length < 20s. There should not be too many breathing-sound wav files, and you will need more epochs to train your own model: full_epochs: 500, fast_epochs: 400.
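
A minimal sketch for checking which dataset files fall outside that range, assuming the data_gvc/waves-16k speaker-per-folder layout used earlier in this thread:

import os
import librosa

root = "data_gvc/waves-16k"
for spk in sorted(os.listdir(root)):
    spk_dir = os.path.join(root, spk)
    if not os.path.isdir(spk_dir):
        continue
    for name in sorted(os.listdir(spk_dir)):
        if not name.endswith(".wav"):
            continue
        path = os.path.join(spk_dir, name)
        y, sr = librosa.load(path, sr=None)  # keep the native sample rate
        dur = len(y) / sr
        if not (2.0 < dur < 20.0):
            print(f"{path}: {dur:.1f}s is outside the 2s-20s range")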

bukhalmae145 commented 11 months ago

2s < length < 20s. There should not be too many breathing-sound wav files, and you will need more epochs to train your own model: full_epochs: 500, fast_epochs: 400.

How many epochs would generate the best quality? And what is the difference between full and fast epochs?

MaxMax2016 commented 11 months ago

I don't know either. Fast epochs train only the transformer; full epochs train the transformer and the diffusion model.
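
In config terms (a sketch using the values suggested above):

full_epochs: 500   # trains the transformer and the diffusion model
fast_epochs: 400   # trains the transformer only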

bukhalmae145 commented 11 months ago

2s < length < 20s. There should not be too many breathing-sound wav files, and you will need more epochs to train your own model: full_epochs: 500, fast_epochs: 400.

How many epochs would generate the best quality?

I don't know either. Fast epochs train only the transformer; full epochs train the transformer and the diffusion model.

While running gvc_inference.py I got this message:

Traceback (most recent call last):
  File "hubert/inference.py", line 57, in <module>
    for spks in os.listdir(wavPath):
NotADirectoryError: [Errno 20] Not a directory: 'test.wav'
Auto run : python pitch/inference.py -w test.wav -p gvc_tmp.pit.csv
test.wav
gvc_tmp.pit.csv
Initializing Grad-TTS...
/Users/workstation/Music/Grad-SVC/Grad-SVC/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Number of encoder parameters = 17.48m
Number of decoder parameters = 16.87m
Temperature: 1.015
Traceback (most recent call last):
  File "gvc_inference.py", line 217, in <module>
    main(args)
  File "gvc_inference.py", line 98, in main
    vec = np.load(args.vec)
  File "/Users/workstation/Music/Grad-SVC/Grad-SVC/lib/python3.8/site-packages/numpy/lib/npyio.py", line 405, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
IsADirectoryError: [Errno 21] Is a directory: 'gvc_tmp.vec.npy'

Should ./hubert/inference.py look the same as ./prepare/preprocess_hubert.py?

MaxMax2016 commented 11 months ago

NotADirectoryError: [Errno 20] Not a directory: 'test.wav'. You should set the real path of your wav file.

bukhalmae145 commented 11 months ago

NotADirectoryError: [Errno 20] Not a directory: 'test.wav'. You should set the real path of your wav file.

Well, I have the file test.wav in my directory. And this is my ./hubert/inference.py:

import sys,os
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import numpy as np
import argparse
import torch
import librosa

from tqdm import tqdm
from transformers import HubertModel

def load_audio(file: str, sr: int = 16000):
    x, sr = librosa.load(file, sr=sr)
    return x

def load_model(path, device):
    model = HubertModel.from_pretrained(path)
    model.eval()
    if not (device == "cpu"):
        model.half()
    model.to(device)
    return model

def pred_vec(model, wavPath, vecPath, device):
    audio = load_audio(wavPath)
    feats = audio
    feats = torch.from_numpy(feats).to(device)
    feats = feats[None, :]
    if not (device == "cpu"):
        feats = feats.half()
    with torch.no_grad():
        vec = model(feats).last_hidden_state
        vec = vec.squeeze().data.cpu().float().numpy()
        # print(feats.shape)
        # print(vec.shape)   # [length, dim=768] hop=320
    np.save(vecPath, vec, allow_pickle=False)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-w", "--wav", help="wav", dest="wav", required=True)
    parser.add_argument("-v", "--vec", help="vec", dest="vec", required=True)

    args = parser.parse_args()
    print(args.wav)
    print(args.vec)
    os.makedirs(args.vec, exist_ok=True)

    wavPath = args.wav
    vecPath = args.vec

    device = "mps" if torch.backends.mps.is_available() else "cpu"
    hubert = load_model('./hubert-base-korean', device)

    for spks in os.listdir(wavPath):
        if os.path.isdir(f"./{wavPath}/{spks}"):
            os.makedirs(f"./{vecPath}/{spks}", exist_ok=True)

            files = [f for f in os.listdir(f"./{wavPath}/{spks}") if f.endswith(".wav")]
            for file in tqdm(files, desc=f'Processing vec {spks}'):
                file = file[:-4]
                pred_vec(hubert, f"{wavPath}/{spks}/{file}.wav", f"{vecPath}/{spks}/{file}.vec", device)

MaxMax2016 commented 11 months ago

Should ./hubert/inference.py look the same as ./prepare/preprocess_hubert.py?

No:

import sys,os
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import numpy as np
import argparse
import torch
import librosa

from transformers import HubertModel

def load_audio(file: str, sr: int = 16000):
    x, sr = librosa.load(file, sr=sr)
    return x

def load_model(path, device):
    model = HubertModel.from_pretrained(path)
    model.eval()
    if not (device == "cpu"):
        model.half()
    model.to(device)
    return model

def pred_vec(model, wavPath, vecPath, device):
    audio = load_audio(wavPath)
    feats = audio
    feats = torch.from_numpy(feats).to(device)
    feats = feats[None, :]
    if not (device == "cpu"):
        feats = feats.half()
    with torch.no_grad():
        vec = model(feats).last_hidden_state
        vec = vec.squeeze().data.cpu().float().numpy()
        print(feats.shape)
        print(vec.shape)   # [length, dim=768] hop=320
    np.save(vecPath, vec, allow_pickle=False)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-w", "--wav", help="wav", dest="wav")
    parser.add_argument("-v", "--vec", help="vec", dest="vec")
    args = parser.parse_args()
    print(args.wav)
    print(args.vec)

    wavPath = args.wav
    vecPath = args.vec

    device = "cuda" if torch.cuda.is_available() else "cpu"
    hubert = load_model('./hubert-base-korean', device)
    pred_vec(hubert, wavPath, vecPath, device)
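
For reference, a possible invocation using the file names from earlier in this thread (note that this single-file version does not create any directory itself):

python hubert/inference.py -w test.wav -v test.vec.npy
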
bukhalmae145 commented 11 months ago

Should ./hubert/inference.py look the same as ./prepare/preprocess_hubert.py?

No.

Is it the same code as the original Grad-SVC project?

MaxMax2016 commented 11 months ago

[posts the same single-file hubert/inference.py script as in the comment above]

bukhalmae145 commented 11 months ago

[quotes the script from the previous comment]

I still get this message :(

Traceback (most recent call last):
  File "hubert/inference.py", line 53, in <module>
    pred_vec(hubert, wavPath, vecPath, device)
  File "hubert/inference.py", line 37, in pred_vec
    np.save(vecPath, vec, allow_pickle=False)
  File "<__array_function__ internals>", line 200, in save
  File "/Users/workstation/Music/Grad-SVC/Grad-SVC/lib/python3.8/site-packages/numpy/lib/npyio.py", line 518, in save
    file_ctx = open(file, "wb")
IsADirectoryError: [Errno 21] Is a directory: 'test.vec.npy'

MaxMax2016 commented 11 months ago

rm test.vec.npy and re-test
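
For context: the modified script's os.makedirs(args.vec, exist_ok=True) call created test.vec.npy as a directory on the earlier run, so np.save could not write to it. Since it is a directory, removing it likely needs the recursive flag:

rm -r test.vec.npy
python hubert/inference.py -w test.wav -v test.vec.npy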

bukhalmae145 commented 11 months ago

rm test.vec.npy and re-test

It works!! I appreciate you with all of my heart!!

bukhalmae145 commented 11 months ago

New problem: the wav file I exported with the trained model sounds weird. Is it true that I have to leave "pretrain:" blank in base.yaml? And what is the gvc_pretrained.pth file that was exported along with gvc.pth?

[screenshot]

https://drive.google.com/file/d/1djlKYFexiTaSa75QyGYqOfHZxeVnwcSU/view?usp=sharing (Audio file)

Some weights of the model checkpoint at ./hubert-base-korean were not used when initializing HubertModel: ['encoder.pos_conv_embed.conv.weight_g', 'encoder.pos_conv_embed.conv.weight_v']

MaxMax2016 commented 11 months ago

I don't think you've finished your training

Some weights of the model checkpoint at ./hubert-base-korean were not used when initializing HubertModel:
['encoder.pos_conv_embed.conv.weight_g', 'encoder.pos_conv_embed.conv.weight_v']

I didn't have that problem

bukhalmae145 commented 11 months ago

I don't think you've finished your training

Some weights of the model checkpoint at ./hubert-base-korean were not used when initializing HubertModel:
['encoder.pos_conv_embed.conv.weight_g', 'encoder.pos_conv_embed.conv.weight_v']

I didn't have that problem

[image: generated_dec_0]

I get this message whenever I use the HuBERT model (preprocessing and inference). Even though the message appears in the terminal, the process still continues, and I get those weird wav files. Can I have the exact same code you have used?