Sxjdwang / TalkLip


There is a bug that is ignored in the wav data processing. #12

Open Ironieser opened 1 year ago

Ironieser commented 1 year ago

If the input audio is multi-channel, the loaded wav data will have shape [16k·t, X], where X is the number of channels. Using L160 to extract the spectrogram will then increase the temporal length T by a factor of X.

So users need to ensure that the input audio has only one channel, for example by converting it first:

ffmpeg -i input.wav -ac 1 -ar 16000 output.wav  # -ac sets the number of audio channels, -ar the sample rate

or revise L160 as follows:

import numpy as np
from python_speech_features import logfbank
if len(wav_data.shape) > 1:
    # Multi-channel input: keep only the first channel
    audio_feats = logfbank(wav_data[:, 0], samplerate=sample_rate).astype(np.float32)  # [T, F]
else:
    audio_feats = logfbank(wav_data, samplerate=sample_rate).astype(np.float32)  # [T, F]
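As an alternative to keeping only the first channel, the channels could be averaged into a mono signal before feature extraction. This is a minimal sketch, not code from the repository: `to_mono` is a hypothetical helper name, and it assumes the loaded array has shape [num_samples] or [num_samples, num_channels].

```python
import numpy as np

def to_mono(wav_data: np.ndarray) -> np.ndarray:
    """Hypothetical helper: return a 1-D signal from [N] or [N, C] audio."""
    if wav_data.ndim == 1:
        # Already mono, nothing to do
        return wav_data
    if wav_data.ndim != 2:
        raise ValueError(f"expected 1-D or 2-D audio, got shape {wav_data.shape}")
    # Convert to float before averaging so integer PCM data does not overflow
    return wav_data.astype(np.float32).mean(axis=1)

# Synthetic 1-second stereo signal at 16 kHz (illustration only)
stereo = np.stack([np.ones(16000), 3 * np.ones(16000)], axis=1)
mono = to_mono(stereo)
print(mono.shape)  # (16000,)
```

The result can then be passed to logfbank in place of wav_data. Averaging keeps content from every channel, while taking wav_data[:, 0] as above is simpler and leaves the original sample values untouched.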

Other issues have run into the same bug, such as #7.

Additionally, the ignore this potential situation too.

Sxjdwang commented 1 year ago

Thanks for your contribution!