There is an overlooked bug in the wav data processing.

If the input audio is multi-channel, the loaded wav data will have shape [16kt, X] (samples by channels), where X is the number of channels. Using L160 to extract the spectrogram then increases the temporal length T by a factor of X.

https://github.com/Sxjdwang/TalkLip/blob/main/inf_demo.py#L160

So users need to either ensure that the input audio has only one channel,

ffmpeg -i input.wav -ac 1 -ar 16000 output.wav  # -ac sets the number of audio channels, -ar the sample rate

or revise L160 as follows.

import numpy as np
from python_speech_features import logfbank

# If the audio is multi-channel ([num_samples, num_channels]), keep only the first channel.
mono = wav_data[:, 0] if wav_data.ndim > 1 else wav_data
audio_feats = logfbank(mono, samplerate=sample_rate).astype(np.float32)  # [T, F]
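
As an alternative to keeping only the first channel, the channels could be averaged into one before feature extraction. This is only a sketch of that variant, not what the repository does: `to_mono` is a hypothetical helper, and it assumes the [num_samples, num_channels] layout described above.

```python
import numpy as np

def to_mono(wav_data):
    """Collapse [num_samples, num_channels] audio to [num_samples].

    Hypothetical helper, not part of TalkLip. Assumes channels are on axis 1.
    """
    wav_data = np.asarray(wav_data)
    if wav_data.ndim == 1:
        # Already single-channel: return unchanged.
        return wav_data
    # Average across channels; integer input becomes float64 here.
    return wav_data.mean(axis=1)
```

The result can be passed to `logfbank` in place of `wav_data[:, 0]`. Averaging keeps content that appears only in the second channel, while taking the first channel leaves the samples untouched; which one is preferable depends on the recordings.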

Other issues appear to hit the same bug, for example #7.

Additionally, train.py does not handle this situation either.
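
Since both scripts are affected, one option is to check the channel count of each file before extracting features. Below is a minimal sketch using Python's standard `wave` module; `count_channels` and `assert_mono` are hypothetical names, not functions from the repository, and `wave` only reads PCM wav files.

```python
import wave

def count_channels(path):
    """Return the number of channels of a PCM wav file."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels()

def assert_mono(path):
    """Raise ValueError if the wav file has more than one channel."""
    n = count_channels(path)
    if n != 1:
        raise ValueError(
            f"{path} has {n} channels; expected 1 "
            "(convert it first, e.g. with ffmpeg -ac 1)"
        )
```

Calling `assert_mono("input.wav")` before loading the audio would fail early with a clear message instead of silently producing a stretched spectrogram.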
