Open Ironieser opened 1 year ago
If the input audio is multi-channel, the loaded wav data will have shape [16kt, X], where X is the number of channels. Extracting the spectrogram at L160 then increases the temporal length T by a factor of X.
https://github.com/Sxjdwang/TalkLip/blob/main/inf_demo.py#L160
So users need to either make sure the input audio has only one channel, e.g.
ffmpeg -i input.wav -ac 1 -ar 16000 output.wav # -ac sets the number of audio channels, -ar the sample rate
or revise L160 as follows:
from python_speech_features import logfbank

if len(wav_data.shape) > 1:
    audio_feats = logfbank(wav_data[:, 0], samplerate=sample_rate).astype(np.float32)  # [T, F]
else:
    audio_feats = logfbank(wav_data, samplerate=sample_rate).astype(np.float32)  # [T, F]
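The fix above keeps only the first channel. If you would rather not discard the other channels, a possible alternative is to average them before feature extraction. This is only a minimal sketch, not part of the repo: `to_mono` is a hypothetical helper name, and it assumes `wav_data` is a NumPy array shaped [num_samples] or [num_samples, num_channels].

```python
import numpy as np


def to_mono(wav_data: np.ndarray) -> np.ndarray:
    """Hypothetical helper: collapse [num_samples, num_channels] audio to 1-D."""
    if wav_data.ndim == 1:
        # Already mono, nothing to do.
        return wav_data
    if wav_data.ndim != 2:
        raise ValueError(f"expected 1-D or 2-D audio, got shape {wav_data.shape}")
    # Average in float64 so integer PCM samples cannot overflow.
    mono = wav_data.astype(np.float64).mean(axis=1)
    if np.issubdtype(wav_data.dtype, np.integer):
        # Round before casting back to the original integer type.
        mono = np.rint(mono)
    return mono.astype(wav_data.dtype)
```

With this helper, L160 could become `audio_feats = logfbank(to_mono(wav_data), samplerate=sample_rate).astype(np.float32)`, which should work for both mono and multi-channel input.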
Some other issues seem to have hit the same bug, such as #7.
Additionally, train.py does not handle this situation either.
Thanks for your contribution!