Sxjdwang / TalkLip


The number of output video frames increases unexpectedly #7

Open weizmann opened 1 year ago

weizmann commented 1 year ago

I tried inf_demo.py and found that the frame count of the output video was doubled.

The input video is 10 s at 25 fps (250 frames), but the output video is 20 s at 25 fps (501 frames).

The audio feature array has length 501.

Maybe the audio and video frames are not aligned in my case. I am not sure whether the project has any fps or sample-rate constraints.
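The numbers reported above can be sanity-checked with a little arithmetic. This is only an illustrative sketch: the helper names `expected_frames` and `frame_ratio` are hypothetical and not part of TalkLip, and it assumes the output should contain one frame per input video frame.

```python
def expected_frames(duration_s: float, fps: float) -> int:
    """Number of video frames expected for a clip of the given length."""
    return round(duration_s * fps)


def frame_ratio(output_frames: int, input_frames: int) -> float:
    """How many output frames were produced per input frame."""
    return output_frames / input_frames


# Figures reported in this issue: a 10 s clip at 25 fps, 501 output frames.
input_frames = expected_frames(10, 25)
ratio = frame_ratio(501, input_frames)
print(input_frames, ratio)
```

A ratio close to 2 rather than 1 suggests that the audio side yields about twice as many feature frames as the video has frames.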

Waiting for your reply, thank you.


You can find my input/output video and audio files at the following link.

talklip-issue.zip

I ran inf_demo.py with the following command:

python inf_demo.py --video_path ./input.mp4 --wav_path ./input.wav --ckpt_path ./checkpoints/global_contrastive.pth --avhubert_root /root/workspace/av_hubert

The ffmpeg version is 4.2.3: [image]

Some debug logs: [image]

weizmann commented 1 year ago

I changed the logfbank call by adding winstep=0.02 (the default is 0.01; I just hardcoded it as a test), which brings the number of audio frames down to 250. [image]

But the output is still not correct either: the lip sync is off and the frames lag.
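To see why changing winstep alters the frame count, here is a minimal sketch of a sliding-window frame counter. The function name `num_frames` is hypothetical, and the framing rule and the `winlen` default of 0.025 s are assumptions (I believe they match how python_speech_features frames a signal, but treat that as unverified). The 16 kHz rate and 10 s length are example values, not measurements from the attached files.

```python
import math


def num_frames(num_samples: int, sample_rate: int,
               winlen: float = 0.025, winstep: float = 0.01) -> int:
    """Count analysis windows for a signal (assumed framing rule)."""
    frame_len = round(winlen * sample_rate)    # window length in samples
    frame_step = round(winstep * sample_rate)  # hop between windows in samples
    if num_samples <= frame_len:
        return 1
    # One window at the start, plus one per hop needed to cover the rest.
    return 1 + math.ceil((num_samples - frame_len) / frame_step)


samples = 10 * 16_000  # example: 10 s of audio at 16 kHz
default_count = num_frames(samples, 16_000)
doubled_step_count = num_frames(samples, 16_000, winstep=0.02)
print(default_count, doubled_step_count)
```

Doubling winstep roughly halves the count, but the time step between features doubles as well, so each feature row covers a different span than before. That may be one reason the lip sync stayed off after the change.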

https://user-images.githubusercontent.com/2306111/234510806-dabae7ed-e9b5-4d85-94e9-d16f813595ca.mp4

Sxjdwang commented 1 year ago

Hi, the problem is that the expected sampling rate of the audio file is 16 kHz, but your audio file is 44.1 kHz. I recommend downsampling your audio to 16 kHz. Also, since LRS2 videos are 25 fps, I set the output fps to 25. I will modify the code so that it automatically matches the fps of the input video.

Sxjdwang commented 1 year ago

Sorry, my code only works with 25 fps videos, because the audio encoder outputs audio embeddings at 25 fps.
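Given the two constraints stated above (16 kHz audio, 25 fps video), a pre-flight check before running inference might look like the sketch below. It is not part of TalkLip: the function `check_inputs`, its parameters, and the one-frame tolerance are all hypothetical. It uses only Python's standard `wave` module, so it handles uncompressed PCM WAV files only, and it expects the caller to supply the video frame count and fps from another tool.

```python
import io
import wave

EXPECTED_SAMPLE_RATE = 16_000  # Hz, as stated by the maintainer
EXPECTED_FPS = 25              # frames per second, as stated by the maintainer


def check_inputs(wav_file, video_frames: int, video_fps: float) -> list[str]:
    """Return a list of problems found; an empty list means the inputs look consistent."""
    problems = []
    with wave.open(wav_file, "rb") as wav:
        sample_rate = wav.getframerate()
        duration_s = wav.getnframes() / sample_rate
    if sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(
            f"audio sample rate is {sample_rate} Hz, expected {EXPECTED_SAMPLE_RATE} Hz")
    if abs(video_fps - EXPECTED_FPS) > 1e-3:
        problems.append(f"video is {video_fps} fps, expected {EXPECTED_FPS} fps")
    expected_frames = round(duration_s * EXPECTED_FPS)
    if abs(expected_frames - video_frames) > 1:  # hypothetical tolerance
        problems.append(
            f"audio lasts {duration_s:.2f} s (about {expected_frames} frames), "
            f"but the video has {video_frames} frames")
    return problems


# Demo: one second of silence at 44.1 kHz, built in memory so the sketch runs as-is.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(44_100)
    w.writeframes(b"\x00\x00" * 44_100)
buf.seek(0)
problems = check_inputs(buf, video_frames=25, video_fps=25.0)
for problem in problems:
    print(problem)
```

With a real file, pass its path instead of the in-memory buffer. If the sample rate check fails, one common way to resample is ffmpeg's `-ar 16000` option.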

devsvarun commented 1 year ago

The video is 25 fps and the audio is 16 kHz, but the output video still has more frames than the input.

Sxjdwang commented 1 year ago

The video is 25 fps and the audio is 16 kHz, but the output video still has more frames than the input.

Could you provide details of the problem you are facing, such as the input and output videos?

Ironieser commented 1 year ago

You could refer to #12.

Ironieser commented 1 year ago

I changed the logfbank call by adding winstep=0.02 (the default is 0.01; I just hardcoded it as a test), which brings the number of audio frames down to 250. [image]

But the output is still not correct either: the lip sync is off and the frames lag.

talklip.mp4

The frame lag is a bug too. The padding step is unreasonable and needs revising. I will update it later.