facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Textless NLP Vocoder Gives Empty Array #4031

Open zubeyirgenc opened 2 years ago

zubeyirgenc commented 2 years ago

❓ Questions and Help

When I try to use fairseq's textless NLP implementation with the pretrained weights linked in the README files, it returns an empty wav file for my test set. When I inspect the code, I see that the mel spectrogram is predicted correctly, but the audio comes back as NaN:

def synthesize_audio(model, waveglow, denoiser, inp, lab=None, strength=0.0):
    assert inp.size(0) == 1
    inp = inp.cuda()
    if lab is not None:
        lab = torch.LongTensor(1).cuda().fill_(lab)

    with torch.no_grad():
        # Tacotron2 predicts the mel spectrogram from the input units
        _, mel, _, ali, has_eos = model.inference(inp, lab, ret_has_eos=True)
        print("mel: ", mel)
        # WaveGlow converts the mel spectrogram into a waveform
        aud = waveglow.infer(mel, sigma=0.666)
        print("aud: ", aud)

When I step into WaveGlow's inference, it breaks down inside the flow loop:

for k in reversed(range(self.n_flows)):
    n_half = int(audio.size(1) / 2)
    audio_0 = audio[:, :n_half, :]
    print("audio_0: ", audio_0)
    audio_1 = audio[:, n_half:, :]
    print("audio_1: ", audio_1)

    output = self.WN[k]((audio_0, spect))
    print("output: ", output)

    s = output[:, n_half:, :]
    print("s: ", s)
    b = output[:, :n_half, :]
    print("b: ", b)
    # inverse affine coupling: subtract b, then divide by exp(s)
    audio_1 = (audio_1 - b) / torch.exp(s)
    print("audio_1: ", audio_1)
    audio = torch.cat([audio_0, audio_1], 1)
    print("audio: ", audio)

    audio = self.convinv[k](audio, reverse=True)
    print("audio: ", audio)

The loop runs 9 or 10 times, but after that every array becomes NaN or infinite. I tried switching the pretrained model from HuBERT to CPC, but that changed nothing. When I try Griffin-Lim instead, the resulting wav file contains only static; I am not sure, but that may be because the number of mel channels is too low for it. How can I inspect the model's output after the mel stage? Maybe I could use a different vocoder, but the repository is built around this one, and I don't know the vocoder parameters such as the analysis window (e.g., Hann) or the hop size. Where can I find these parameters?
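
For reference, NVIDIA's Tacotron 2 / WaveGlow reference implementation defaults to sampling_rate=22050, filter_length=1024, hop_length=256, win_length=1024 with a Hann window, and n_mel_channels=80 with fmax=8000 Hz. I cannot confirm that the fairseq textless checkpoints were trained with exactly these values, so treat them as assumptions; a minimal torchaudio Griffin-Lim sketch under those assumptions:

import torch
import torchaudio

# Assumed analysis parameters (NVIDIA Tacotron 2 / WaveGlow defaults);
# verify them against the config the fairseq checkpoints were trained with.
sample_rate = 22050
n_fft = 1024        # filter_length; GriffinLim uses a Hann window by default
hop_length = 256
win_length = 1024
n_mels = 80

inv_mel = torchaudio.transforms.InverseMelScale(
    n_stft=n_fft // 2 + 1, n_mels=n_mels,
    sample_rate=sample_rate, f_min=0.0, f_max=8000.0)
griffin_lim = torchaudio.transforms.GriffinLim(
    n_fft=n_fft, n_iter=60, win_length=win_length,
    hop_length=hop_length, power=1.0)

# mel: (1, n_mels, frames), log-compressed Tacotron2 output;
# undo the log compression and run everything in fp32 on the CPU
magnitudes = inv_mel(torch.exp(mel.float().cpu().squeeze(0)))
wav = griffin_lim(magnitudes)
torchaudio.save("griffin_lim_test.wav", wav.unsqueeze(0), sample_rate)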

giymen commented 2 years ago

Did you solve this problem? I am having the exact same situation with wav2vec2.0, k=100 checkpoints.

Features (output of wav2vec 2.0):

[[ 8.144833 -15.159848 -7.4920955 ... 2.5299425 -0.90555054 -0.347736 ]
 [ 2.3151884 -7.0690145 -2.5165465 ... 6.244416 6.663908 2.930407 ]
 [ 12.45649 -5.7856936 -1.6048135 ... 13.753728 -5.6796875 0.8940346 ]
 ...
 [ 1.4279835 -13.357233 -2.4323854 ... -1.4477346 4.150449 7.5033054 ]
 [ 4.333761 -8.829683 -0.5041002 ... 1.6009731 3.1656103 4.9252124 ]
 [ 10.381403 -17.971233 -5.801667 ... 3.591822 -2.0576916 2.273886 ]]

Units (output of quantization):

48 32 25 64 3 64 64 64 32 32 89 32 32 32 32 25 30 25 10 9 11 11 67 67 67 67 75 17 17 17
68 22 22 81 17 56 56 63 60 77 77 92 92 21 41 21 76 76 76 75 75 80 80 81 83 83 84 68 16 22
65 96 57 68 16 58 58 58 24 24 21 21 21 91 21 21 76 76 46 75 13 55 59 68 22 22 81 81 49 49
86 68 68 58 99 99 99 75 13 13 58 2 16 65 35 39 9 68 68 91 91 91 91 97 80 80 46 46 80 80
52 16 91 16 99 65 99 46 80 13 59 60 58 43 43 43 43 43 43 43 43 20 20 70 20 87 88 88 89 10
87 89 11 67 67 67 75 55 81 83 83 84 77 72 14 14 44 44 56 56 57 77 92 92 21 21 72 72 71 56
56 68 78 73 58 73 22 22 81 81 25 56 25 62 62 21 91 21 21 21 62 21 25 25 25 87 87 77 77 91
5 5 5 91 91 91 91 71 91 91 16 16 16 66 62 42 68 41 41 75 59 91 91 41 91 22 65 59 2 60
59 2 46 75 59 16 16 65 56 8 68 98 98 97 97 62 62 60 60 54 43 43 43 43 43 43 43 43 20 20
20 20 87 88 23 23 23 53 23 89 10 10 10 87 11 60 60 91 91 72 29 62 99 99 99 99 99 64 99 64
64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
64 64 64 64 64 64 64 64 64 64 70 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
64 99 25 25 25 25 25 25 87 25 25 87 25 87 30 10 10 89 11 68 91 91 91 91 91 25 35 25 25 25
25 25 25 25 25 25 10 87 10 89 11 90 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99
99 99 99 99 99 99 99 25 20 20 25 25 64 64 25 64 64 87 64 64 64 64 64 64 89 64 39 64 64 64
64 64 64 64 64 53 89 64 64 87 47 89 64 47 53 64 53 53 53 85 53 85 85 85

Mel (output of Tacotron 2):

tensor([[[ -7.9844, -8.2812, -8.2344, ..., -9.1094, -8.9609, -8.6484],
         [ -7.3203, -7.4414, -7.2930, ..., -7.9023, -7.8398, -7.6328],
         [ -6.7773, -6.7891, -6.6406, ..., -6.8477, -6.8789, -6.7344],
         ...,
         [ -8.2031, -7.7539, -7.9609, ..., -9.9609, -10.0234, -9.8906],
         [ -8.1719, -7.6875, -7.8320, ..., -9.9844, -10.0312, -9.9062],
         [ -8.1875, -7.6211, -7.7031, ..., -9.9844, -10.0156, -9.9219]]],
       device='cuda:0', dtype=torch.float16)

Audio (output of WaveGlow):

tensor([[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0', dtype=torch.float16)

zubeyirgenc commented 2 years ago

I changed only torch.float16 to torch.float32 and that solved it. With torch.float16 these values vanish, and somewhere there is a division by zero (most likely the (audio_1 - b) / torch.exp(s) step above). I hope it helps with your problem too.
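
To make the failure concrete: fp16 tops out around 65504 and its smallest positive value is about 6e-8, so torch.exp(s) saturates to inf or 0 for quite moderate values of s, and the division in the coupling step then produces inf/NaN. A small demonstration of both failure modes, plus the fix applied to the earlier snippet (mel and waveglow as before):

import torch

one = torch.tensor([1.0], dtype=torch.float16)

s = torch.tensor([12.0], dtype=torch.float16)
print(torch.exp(s))        # inf: exp(12) ~ 1.6e5 exceeds the fp16 max (~65504)
print(one / torch.exp(s))  # 0.0

s = torch.tensor([-20.0], dtype=torch.float16)
print(torch.exp(s))        # 0.0: exp(-20) ~ 2e-9 underflows fp16
print(one / torch.exp(s))  # inf -- and inf/NaN then propagate through the flows

# The fix: run the vocoder stage in fp32
# mel = mel.float()
# waveglow = waveglow.float()
# aud = waveglow.infer(mel, sigma=0.666)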

giymen commented 2 years ago

Yes, thank you. This solved my problem too.