Sxjdwang / TalkLip

373 stars 34 forks

Is it a typo or a bug? #16

Open Ironieser opened 1 year ago

Ironieser commented 1 year ago

In the paper, the implementation details state that

Audio waveforms are preprocessed to mel-spectrogram with hop and window lengths, and mel bins are 12.5 ms, 50 ms, and 80.

But in the code, the hop length, window length, and number of mel bins are 10 ms, 25 ms, and 26, both in the function `fre_audio` of `info_demo.py` and in the class `Talklipdata`.

# train.py, line 231
audio_feats = logfbank(wav_data, samplerate=sample_rate).astype(np.float32)  # [T, F]
# info_demo.py, line 160
audio_feats = logfbank(wav_data, samplerate=sample_rate).astype(np.float32)  # [T, F]

Both calls rely on the default values of `logfbank` for the window length, hop length, and number of filters.
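To make the size of the mismatch concrete, here is a minimal sketch that converts both settings into samples and frame rates. The 16 kHz sample rate and the helper name `frame_params` are assumptions for illustration, not taken from the repository.

```python
def frame_params(sample_rate, win_ms, hop_ms, n_mels):
    """Convert window/hop lengths given in milliseconds into sample counts."""
    return {
        "win_samples": round(sample_rate * win_ms / 1000),
        "hop_samples": round(sample_rate * hop_ms / 1000),
        "n_mels": n_mels,
        "frames_per_second": 1000 / hop_ms,
    }

SAMPLE_RATE = 16000  # assumption: 16 kHz audio

# Values quoted from the paper: hop 12.5 ms, window 50 ms, 80 mel bins.
paper = frame_params(SAMPLE_RATE, win_ms=50, hop_ms=12.5, n_mels=80)

# Values the code ends up with when the defaults are used:
# hop 10 ms, window 25 ms, 26 filters.
defaults = frame_params(SAMPLE_RATE, win_ms=25, hop_ms=10, n_mels=26)

print("paper   :", paper)
print("defaults:", defaults)
```

Under these assumptions the two settings differ in feature dimension (80 vs 26) and in frame rate (80 vs 100 frames per second), so features from one could not be swapped in for the other without changing the encoder input.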

Sxjdwang commented 1 year ago

In the paper, I implemented two audio encoders. The input of the local encoder is preprocessed with the parameters given in the paper. The global encoder, which is the one provided on GitHub, follows the parameters used in AV-HuBERT.

Ironieser commented 1 year ago

Got it, and thanks for your work.

I have another question, about fine-tuning. The paper states that only the last three transformer blocks of the audio encoder are fine-tuned during TFG training, but I can't find this in train.py. The code looks more like it freezes the entire audio encoder when "self.ft == True". https://github.com/Sxjdwang/TalkLip/blob/main/models/talklip.py#L122-L123

Could you give me any hint? Thx :D
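For reference, this is roughly what I would expect "fine-tune only the last three transformer blocks" to look like. It is only a sketch of the idea, not the repository's code: `Param`, `Layer`, and `freeze_all_but_last` are hypothetical stand-ins, and the count of 12 blocks is an assumption. The same loop would apply to PyTorch modules, since they also expose `parameters()` and a `requires_grad` flag on each parameter.

```python
class Param:
    """Stand-in for a trainable tensor; only the requires_grad flag matters here."""
    def __init__(self):
        self.requires_grad = True


class Layer:
    """Stand-in for one transformer block holding a few parameters."""
    def __init__(self, n_params=2):
        self._params = [Param() for _ in range(n_params)]

    def parameters(self):
        return iter(self._params)


def freeze_all_but_last(layers, n_trainable):
    """Freeze every block except the last n_trainable ones."""
    first_trainable = len(layers) - n_trainable
    for i, layer in enumerate(layers):
        trainable = i >= first_trainable
        for p in layer.parameters():
            p.requires_grad = trainable


layers = [Layer() for _ in range(12)]  # assumption: 12 transformer blocks
freeze_all_but_last(layers, n_trainable=3)

# One flag per block: True if all of its parameters are still trainable.
trainable_flags = [all(p.requires_grad for p in l.parameters()) for l in layers]
print(trainable_flags)
```

With this pattern the first nine blocks stay frozen and only the last three receive gradient updates, which is different from freezing the whole encoder.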