PlayVoice / whisper-vits-svc

Core Engine of Singing Voice Conversion & Singing Voice Clone
https://huggingface.co/spaces/maxmax20160403/sovits5.0
MIT License
2.6k stars, 919 forks

First test of USP inference #83

Open Taiwan1912 opened 1 year ago

Taiwan1912 commented 1 year ago

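【岩崎良美《涼風》Cover by 岩崎宏美 | Sovits5.0 Bigvgan-mix-v2 USP inference - bilibili】 https://b23.tv/mKdq4qL

The results are very good: a single pass produces clean output with no odd artifacts. A big improvement over the previous version.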

zhyoung24 commented 12 months ago

Is there a concrete code implementation of this USP?

Taiwan1912 commented 12 months ago

The current version, 5.0, should already come with USP inference built in.

zhyoung24 commented 12 months ago

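Where exactly is it implemented? As far as I can see, the inference code just reads the pitch and passes it to inference; I don't see any other operation.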

Taiwan1912 commented 12 months ago

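You would have to ask the author. The documentation does not explain this in detail; it only mentions that the project uses USP inference.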

MaxMax2016 commented 12 months ago

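> Where exactly is it implemented? As far as I can see, the inference code just reads the pitch and passes it to inference; I don't see any other operation.

It works like this: originally, the pitch produced by crepe went through a UV (unvoiced) step that removed the pitch on unvoiced frames. In current testing, results are better when the pitch is not removed by the UV step.

The original version: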

def compute_f0_sing(filename, device):
    audio, sr = librosa.load(filename, sr=16000)
    assert sr == 16000
    audio = torch.tensor(np.copy(audio))[None]
    # Here we'll use a 20 millisecond hop length
    hop_length = 320
    fmin = 50
    fmax = 1000
    model = "full"
    batch_size = 512
    pitch, periodicity = torchcrepe.predict(
        audio,
        sr,
        hop_length,
        fmin,
        fmax,
        model,
        batch_size=batch_size,
        device=device,
        return_periodicity=True,
    )
    pitch = np.repeat(pitch, 2, -1)  # 320 -> 160 * 2
    periodicity = np.repeat(periodicity, 2, -1)  # 320 -> 160 * 2
    # CREPE was not trained on silent audio, so its estimates on silent frames are unreliable and need filtering.
    periodicity = torchcrepe.filter.median(periodicity, 9)
    pitch = torchcrepe.filter.mean(pitch, 9)
    pitch[periodicity < 0.1] = 0
    pitch = pitch.squeeze(0)
    return pitch

The current USP approach:

def compute_f0_sing(filename, device):
    audio, sr = librosa.load(filename, sr=16000)
    assert sr == 16000
    audio = torch.tensor(np.copy(audio))[None]
    audio = audio + torch.randn_like(audio) * 0.001
    # Here we'll use a 20 millisecond hop length
    hop_length = 320
    fmin = 50
    fmax = 1000
    model = "full"
    batch_size = 512
    pitch = crepe.predict(
        audio,
        sr,
        hop_length,
        fmin,
        fmax,
        model,
        batch_size=batch_size,
        device=device,
        return_periodicity=False,
    )
    pitch = np.repeat(pitch, 2, -1)  # 320 -> 160 * 2
    pitch = crepe.filter.mean(pitch, 5)
    pitch = pitch.squeeze(0)
    return pitch

In short, this line was removed: pitch[periodicity < 0.1] = 0
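
To make the effect of that one line concrete, here is a minimal sketch in plain Python. The helper name `mask_unvoiced` and the frame values are hypothetical and chosen only for illustration; the project itself works on the tensors returned by its CREPE predictor.

```python
def mask_unvoiced(pitch, periodicity, threshold=0.1):
    """Old behaviour: zero the pitch on frames whose periodicity is below threshold."""
    return [f0 if p >= threshold else 0.0 for f0, p in zip(pitch, periodicity)]


# Made-up values: a note around 220 Hz with one low-confidence frame in the middle.
pitch = [220.0, 221.0, 219.0, 220.0, 222.0]
periodicity = [0.9, 0.8, 0.05, 0.7, 0.9]

old_style = mask_unvoiced(pitch, periodicity)  # UV masking applied
usp_style = list(pitch)                        # USP: pitch kept on every frame

print(old_style)  # [220.0, 221.0, 0.0, 220.0, 222.0]
print(usp_style)  # [220.0, 221.0, 219.0, 220.0, 222.0]
```

The two functions as posted also differ in two smaller ways that this sketch ignores: the USP version adds a small amount of noise to the audio before prediction and uses a mean-filter window of 5 instead of 9.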

zhyoung24 commented 12 months ago

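Oh, I see, thanks. I noticed the preprocess code still keeps pitch[periodicity < 0.1] = 0. So training still zeroes the pitch on those frames, and USP is used only at inference? And this mismatched setup actually works better?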

MaxMax2016 commented 12 months ago

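That is what testing shows. Using USP in preprocessing makes inference uncontrollable; for example, long sustained notes break off in the middle.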

panxin801 commented 5 days ago

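Hello. Regarding this issue: after my audio's pitch goes through USP, the inferred output puts a pitch on the breath sounds, so they come out sounding like asthmatic wheezing. Have you run into this problem, and if so, how did you solve it?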
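
The thread does not record an answer. As a purely illustrative sketch of one direction that could be tried (not something the author proposed, and not something the project is known to implement), the function below zeroes the pitch only across long low-periodicity runs, such as a breath, and leaves short dips inside a sustained note untouched. The name `mask_long_unvoiced_runs`, the threshold, and the minimum run length are all assumptions.

```python
def mask_long_unvoiced_runs(pitch, periodicity, threshold=0.1, min_run=5):
    """Zero pitch only where periodicity stays below threshold for >= min_run frames."""
    out = list(pitch)
    n = len(pitch)
    i = 0
    while i < n:
        if periodicity[i] < threshold:
            # Find the end of this low-periodicity run.
            j = i
            while j < n and periodicity[j] < threshold:
                j += 1
            # Treat only long runs as unvoiced (for example a breath).
            if j - i >= min_run:
                for k in range(i, j):
                    out[k] = 0.0
            i = j
        else:
            i += 1
    return out


# Made-up values: one isolated dip (frame 1) and one three-frame run (frames 4-6).
pitch = [200.0] * 10
periodicity = [0.9, 0.05, 0.9, 0.9, 0.05, 0.05, 0.05, 0.9, 0.9, 0.9]

gated = mask_long_unvoiced_runs(pitch, periodicity, min_run=3)
print(gated)  # [200.0, 200.0, 200.0, 200.0, 0.0, 0.0, 0.0, 200.0, 200.0, 200.0]
```

Whether such gating helps in practice would need testing, since the model is trained on pitch that has been fully UV-masked.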