DanielMengLiu / AudioVisualLip


Question about the files #1

Closed CloudWalkerW closed 5 months ago

CloudWalkerW commented 6 months ago

Hi, I am doing research on audio-visual speaker verification, and I have some questions about this repo:

Is there any documentation for the files in this repo describing how to reproduce your experiments? If I just want to run evaluation, which file should I use? Also, will you provide a pretrained model in the future? Thanks a lot!

DanielMengLiu commented 6 months ago

Hi, we provide some data extraction scripts, which can be found in ./preprocessing. After extracting the lip data for the training and test sets, you can run ./main_audiovisuallip_DATASET_CM.py for both training and testing by simply switching the stage in the code. When doing this, be sure to change ./conf/config_audiovisuallip_DATASET_new.yaml to your own configuration.
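To make the stage switch concrete, the flow is roughly like the sketch below; the config keys and function names here are illustrative placeholders, not the exact code in main_audiovisuallip_DATASET_CM.py.

```python
# Illustrative only: a minimal sketch of the train/test switch described above.
# The structure and the "stage" name are assumptions, not the repo's exact code.
import yaml

def main(config_path="./conf/config_audiovisuallip_DATASET_new.yaml", stage="train"):
    # Load your own configuration (paths to lip data, manifests, etc.)
    with open(config_path, "r") as f:
        config = yaml.safe_load(f)

    if stage == "train":
        print("Training with config:", config_path)
        # train(config)     # training loop would go here
    elif stage == "test":
        print("Evaluating with config:", config_path)
        # evaluate(config)  # scoring against the test trials would go here

if __name__ == "__main__":
    main(stage="train")  # switch to "test" after training
```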

As for pretrained models, we have released the pretrained audio-only and visual-only models. We would like to provide more in the future, but that still needs time. Thanks for your interest.

CloudWalkerW commented 6 months ago

Hi, thanks for your reply! I would like to ask what happens in ./preprocessing, because I want to use it on my own subset of VoxCeleb2. I would also like to test it without cropping the lip region, that is, using the whole face as input. How does your preprocessing separate the audio and video parts? I would also like to ask about some attributes in your "savedsolution_audiovisuallip_vox_CM_pretrained.yaml", for example train_manifest, test_lrs3, and test_lomgrid: are they all necessary? What does the manifest do?

Thanks for your reply again!

CloudWalkerW commented 6 months ago

Hi, I followed your code and it seems to run. However, the accuracy remains at 0.00% while the EER decreases very slowly. The only thing I didn't do is crop the lips; for the visual part, I used the whole face. Is there anything else to do besides creating the manifest in the preprocessing part? By the way, I'm using savedsolution_audiovisuallip_vox_CM_pretrained.yaml as my config. Or maybe is there another way to contact you, so I can ask about more details? Thanks a lot!

DanielMengLiu commented 6 months ago

Hi, sorry for the late reply; I've been quite busy with my thesis recently. This repo implements audio-visual speaker verification using lips, so you should follow the order of the files in the preprocessing directory, i.e.:

1. face detection
2. get the mean face
3. crop the mouth from the video
4. generate the test trials

We have described the preprocessing procedure in our paper.
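Conceptually, the preprocessing order looks like the sketch below; the function names and bodies are placeholders, and the real implementations are the scripts under ./preprocessing.

```python
# Rough outline of the preprocessing order described above (placeholders only).
def detect_faces(video_dir):
    """Step 1: run a face/landmark detector over every video frame."""
    return {}  # video_id -> per-frame landmarks (placeholder)

def compute_mean_face(landmarks):
    """Step 2: average the landmarks to get a reference (mean) face for alignment."""
    return None  # placeholder

def crop_mouths(video_dir, landmarks, mean_face, out_dir):
    """Step 3: align frames to the mean face and crop the mouth region of interest."""
    pass

def generate_trials(out_dir):
    """Step 4: write the enrolment/test trial pairs used for verification scoring."""
    pass

def preprocess(video_dir, out_dir):
    landmarks = detect_faces(video_dir)
    mean_face = compute_mean_face(landmarks)
    crop_mouths(video_dir, landmarks, mean_face, out_dir)
    generate_trials(out_dir)
```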

May I confirm that what you want to do is face recognition rather than lip biometrics? From what you described, I think you may want to train an audio-visual speaker verification framework using face and audio. If so, this repo may not satisfy your needs well, because all the models were trained on lips and we didn't focus much on the face.

If you want to do face research, I recommend a ResNet for extracting visual embeddings, which should be good enough. I would also recommend several related papers:

@inproceedings{tao2020audio, title={Audio-Visual Speaker Recognition with a Cross-Modal Discriminative Network}, author={Tao, Ruijie and Das, Rohan Kumar and Li, Haizhou}, booktitle={Proc. Interspeech 2020}, pages={2242--2246} }

@article{qian2021audio, title={Audio-Visual Deep Neural Network for Robust Person Verification}, author={Qian, Yanmin and Chen, Zhengyang and Wang, Shuai}, journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, volume={29}, pages={1079--1092}, year={2021}, publisher={IEEE} }

@article{sadjadi20202019, title={The 2019 NIST audio-visual speaker recognition evaluation}, author={Sadjadi, Seyed Omid and Greenberg, Craig S and Singer, Elliot and Reynolds, Douglas A and Mason, Lisa and Hernandez-Cordero, Jaime}, journal={Proc. Speaker Odyssey (submitted), Tokyo, Japan}, year={2020} }

@inproceedings{sari2021multi, title={A Multi-View Approach to Audio-Visual Speaker Verification}, author={Sar{\i}, Leda and Singh, Kritika and Zhou, Jiatong and Torresani, Lorenzo and Singhal, Nayan and Saraf, Yatharth}, booktitle={Proc. ICASSP 2021}, pages={6194--6198}, year={2021}, organization={IEEE} }

@article{sun2022learning, title={Learning Audio-Visual embedding for Wild Person Verification}, author={Sun, Peiwen and Zhang, Shanshan and Liu, Zishan and Yuan, Yougen and Zhang, Taotao and Zhang, Honggang and Hu, Pengfei}, journal={arXiv preprint arXiv:2209.04093}, year={2022} }

CloudWalkerW commented 6 months ago

Hello, thanks for your reply!

Yes, I want to train an audio-visual speaker verification framework using face and audio. I thought that using the whole face could provide more information than only the lips used in your visual verification, so I assumed your model could still at least work for me, since I'm not starting from any pretrained model.

Also, may I ask how much time you spent training your model, since it's not a very light model? For example, how many hours on what kind of setup (GPUs, etc.)?

Thanks for suggesting those papers; however, yours is the only one for which I could find a GitHub repo. Since it's a bit hard for me to build a model from scratch, I was hoping to adapt yours. Thanks again for your suggestions, and good luck with your thesis!

DanielMengLiu commented 6 months ago

You're welcome. Since you are using face and audio, there's no need to use the lip pretrained models; there would be a mismatch, which I have already tried. Face focuses on global, static, appearance-related speaker characteristics, whereas a lip sequence learns local, dynamic, semantics-related speaker features. Temporal dynamics are a disturbance for the former but the key point for the latter, so in your case there's no need to use them. What I recommend (ResNet-101) is quite suitable for you and will be very fast, and I think there are pretrained models available. Please follow VGG and read the papers. In my case, I trained on 8 x 2080Ti GPUs for 2 days.
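To make the ResNet suggestion concrete, below is a minimal sketch of using torchvision's ResNet-101 as a visual embedding extractor. Note these are ImageNet weights for illustration only; for face verification you would normally load weights trained on a face dataset such as VGGFace2, which this snippet does not do.

```python
# Minimal sketch: pretrained ResNet-101 as a visual embedding extractor.
import torch
from torchvision.models import resnet101, ResNet101_Weights

model = resnet101(weights=ResNet101_Weights.DEFAULT)
model.fc = torch.nn.Identity()  # drop the classifier, keep the 2048-d pooled feature
model.eval()

face = torch.randn(1, 3, 224, 224)  # a preprocessed face crop (dummy tensor here)
with torch.no_grad():
    embedding = model(face)         # shape: (1, 2048)
print(embedding.shape)
```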