Closed roodrallec closed 5 years ago
@roodrallec our code had been written before theirs was released, and we chose the filterbank size according to their paper, which I believe was 13 in the original version. Or at least it was 13 in SyncNet :)
Turns out it doesn't make a difference anyway, as the filterbank information is not used. Thanks for the reply though :)
https://github.com/Hangz-nju-cuhk/Talking-Face-Generation-DAVS/blob/c0233ace95be15fb1665dfcd056d82117822a797/preprocess/savemfcc.m#L7
In the README it is suggested to use audio pre-processing similar to that of Zisserman et al. However, they use 40 filterbank channels throughout their code (e.g. in the yousaidthat repository: https://github.com/joonson/yousaidthat/blob/98b51812894497cb6c2b65a7ae147067609fc6ca/run_demo.m#L22). I was wondering whether there was a reason for choosing 13, or whether it had just been mixed up with the number of cepstral coefficients.
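For context on the possible mix-up: in a standard MFCC pipeline the number of mel filterbank channels (often 26 or 40) and the number of cepstral coefficients kept after the DCT (typically 13) are two independent parameters, which is why 13 filterbank channels looks suspicious. A minimal NumPy sketch of the final DCT step, assuming log mel energies are already computed (the function name and random input are illustrative, not from either repository):

```python
import numpy as np

def mel_to_mfcc(log_mel_energies, n_ceps=13):
    """Apply a DCT-II across the filterbank axis to get cepstral coefficients.

    log_mel_energies: array of shape (n_frames, n_filt), e.g. n_filt=40
    n_ceps: number of cepstral coefficients to keep, typically 13
    """
    n_filt = log_mel_energies.shape[-1]
    k = np.arange(n_filt)
    # DCT-II basis: rows are cepstral indices, columns are filterbank channels
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), k + 0.5) / n_filt)
    return log_mel_energies @ basis.T

# 40 filterbank channels (as in yousaidthat) reduced to 13 coefficients
log_mel = np.random.default_rng(0).standard_normal((100, 40))
mfcc = mel_to_mfcc(log_mel, n_ceps=13)
print(mfcc.shape)  # (100, 13)
```

So "13" naturally describes the cepstral output, not the filterbank size, unless the paper deliberately used a 13-channel filterbank.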
Thanks,