Why is "audio = audio.astype('int16')" used?

NVIDIA / waveglow

A Flow-based Generative Network for Speech Synthesis

BSD 3-Clause "New" or "Revised" License

Why is "audio = audio.astype('int16')" used? #271

Closed by wsywsywsywsywsy979 1 year ago

wsywsywsywsywsy979 commented 2 years ago

I'm curious about the effect of this line ("audio = audio.astype('int16')"): if I remove it, the result is much worse and contains a lot of noise. I'm new to TTS, so I'd like to know whether converting float to int here is some kind of trick.

NinoSkopac commented 2 years ago

@authors can this gentleman and a scholar get a response please?

NinoSkopac commented 2 years ago

@wsywsywsywsywsy979 my guess is that it increases integer precision, e.g. from a default int8 to int16, so when you remove it the variable can store less integer precision. Note that I've never worked with Python or AI.

wsywsywsywsywsy979 commented 2 years ago

First, I'm very grateful for your answer :) But before the conversion, the type of 'audio' is float, not int8. What I can't understand is why turning float into int gives such a surprising result: from noise to great synthetic speech.

realratchet commented 1 year ago

I'm pretty sure it's because the code uses scipy instead of torchaudio for loading and saving audio. Scipy represents 16-bit WAV files as np.int16, but you need floating point math to actually train stuff. There are a lot of int16->float->int16 conversions in there using MAX_WAV_VALUE = 32768.0, which is the magnitude of the most negative value a signed int16 can take (the range is -32768 to 32767). Dividing by it normalizes int16 input to [-1, 1]; multiplying by it scales the float output back to the int16 range before the cast.
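
The round trip described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the helper names int16_to_float and float_to_int16 are hypothetical, and the clipping step is an added safety measure. It also suggests a likely reason for the noise: scipy.io.wavfile.write chooses the WAV sample format from the array's dtype, and float data is expected to lie in [-1, 1], so a float array that was already multiplied by MAX_WAV_VALUE and never cast to int16 would be far out of range.

```python
import numpy as np

MAX_WAV_VALUE = 32768.0  # constant quoted in the thread

def int16_to_float(pcm: np.ndarray) -> np.ndarray:
    """Scale int16 PCM samples into [-1.0, 1.0) for floating point math.

    Hypothetical helper, named here only for illustration.
    """
    return pcm.astype(np.float32) / MAX_WAV_VALUE

def float_to_int16(audio: np.ndarray) -> np.ndarray:
    """Scale float samples back to the int16 range and cast for saving.

    Hypothetical helper. The clip is an assumption added for safety:
    out-of-range floats do not convert to int16 reliably.
    """
    scaled = np.clip(audio * MAX_WAV_VALUE, -32768, 32767)
    return scaled.astype(np.int16)

# Samples as they would look after loading a 16-bit WAV file.
pcm = np.array([-32768, -16384, 0, 16384, 32767], dtype=np.int16)

as_float = int16_to_float(pcm)      # what the model works with
restored = float_to_int16(as_float)  # what gets written back to disk

print(as_float.dtype, restored.dtype)
print(np.array_equal(restored, pcm))
```

Under these assumptions, the cast to int16 is less a trick than a way of matching the dtype to the value range that the saved file is expected to contain.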

wsywsywsywsywsy979 commented 1 year ago

Oh, I get it now, thank you very much!

wsywsywsywsywsy979 commented 1 year ago

Thanks to everyone who helped me understand this issue, thank you!