KAIST-AILab / SyncVSR

SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization (Interspeech 2024)
https://www.isca-archive.org/interspeech_2024/ahn24_interspeech.pdf
MIT License

Chinese dataset #10

Closed · daiyingjie2024 closed this issue 1 week ago

daiyingjie2024 commented 1 week ago

I'm sorry to disturb you. I have read your paper in detail, and it's really great. I am particularly interested in the LRW-1000 dataset. I noticed that, when answering someone else's question, you mentioned there might be audio-video synchronization issues in these Chinese datasets. How serious is this problem? I'm unsure whether I should continue experimenting with LRW-1000, as I'm worried it might be a waste of time and not yield effective results. Should I look for other datasets with better audio-video synchronization? Additionally, would you be willing to share the code related to LRW-1000, including preprocessing? (If there are any inappropriate word choices in my question, I hope you can forgive me; my English is not very good. Thank you very much!)

snoop2head commented 1 week ago

Hello @daiyingjie2024 ,

Thank you for your interest in our work! Please feel free to ask any questions.

Quick Recap

When it comes to Mandarin, lip reading can be inherently more challenging due to the high occurrence of homophenes caused by shared 拼音 (pinyin). SyncVSR's goal is to leverage audio information to make the encoder's latent space more discriminative, particularly for homophenes. As demonstrated on the CAS-VSR-W1K dataset, this challenge in Mandarin VSR can be mitigated with our framework.
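For intuition, here is a rough sketch of the synchronization objective (module and dimension names below are illustrative, not the repo's actual code): each video frame's feature predicts the quantized audio tokens aligned to that frame via cross-entropy, on top of the usual VSR loss.

```python
# Illustrative sketch of frame-synchronous audio token prediction.
# hidden_dim / vocab_size / tokens_per_frame are placeholder values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioSyncHead(nn.Module):
    """Predict frame-aligned quantized audio tokens from visual features."""
    def __init__(self, hidden_dim=512, vocab_size=320, tokens_per_frame=4):
        super().__init__()
        self.tokens_per_frame = tokens_per_frame
        self.vocab_size = vocab_size
        self.proj = nn.Linear(hidden_dim, tokens_per_frame * vocab_size)

    def forward(self, visual_feats, audio_tokens):
        # visual_feats: (B, T, hidden_dim); audio_tokens: (B, T, tokens_per_frame)
        b, t, _ = visual_feats.shape
        logits = self.proj(visual_feats).view(b, t, self.tokens_per_frame, self.vocab_size)
        # Predicting the audio tokens separates frames that look alike but sound different
        return F.cross_entropy(logits.reshape(-1, self.vocab_size),
                               audio_tokens.reshape(-1))

# total_loss = vsr_loss + lambda_sync * sync_head(visual_feats, audio_tokens)
```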

Dataset Recommendations

However, like other semi-supervised VSR approaches, our method requires datasets where video and audio are well-aligned. You might consider using the CN-CVS series for training and evaluating on the CNVSRC series.

If your training dataset lacks transcriptions, you can generate pseudo-labels using available Chinese ASR models. Whisper models, which we used for transcribing VoxCeleb2, are a good starting point. As demonstrated in AutoAVSR, the choice of ASR model for transcription doesn't significantly impact performance. Code for transcribing with Whisper models is available at LRS/video/preprocess/transcribe_whisper.py.
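In case it helps, pseudo-labeling with Whisper can be as simple as the following minimal sketch (this calls the openai-whisper package directly with a hypothetical file name; refer to LRS/video/preprocess/transcribe_whisper.py for what we actually ran):

```python
# Minimal pseudo-labeling sketch with openai-whisper (not the repo's script).
import whisper

model = whisper.load_model("large-v2")

def pseudo_label(audio_path: str) -> str:
    # Force Chinese decoding so short clips aren't misdetected as another language
    result = model.transcribe(audio_path, language="zh")
    return result["text"].strip()

print(pseudo_label("clip_0001.wav"))  # hypothetical file name
```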

Additional Tips

For a Chinese neural audio quantizer, you might explore wav2vec 2.0 models such as kehanlu/mandarin-wav2vec2 and TencentGameMate/chinese-wav2vec2-base. Guidance on using the wav2vec 2.0 quantizer can be found in this issue comment.
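As a concrete starting point, below is a rough sketch of dumping discrete token ids from a wav2vec 2.0 Gumbel quantizer via HuggingFace transformers. Attribute names (feature_extractor, feature_projection, quantizer.weight_proj) follow transformers' Wav2Vec2ForPreTraining and may differ across versions; see the linked issue comment for the authoritative recipe.

```python
# Rough sketch: extracting discrete audio token ids from a wav2vec 2.0 quantizer.
# Assumes the HuggingFace transformers implementation and that the checkpoint
# ships its quantizer weights; internals may vary across library versions.
import torch
from transformers import Wav2Vec2ForPreTraining

model = Wav2Vec2ForPreTraining.from_pretrained(
    "TencentGameMate/chinese-wav2vec2-base"
).eval()

input_values = torch.randn(1, 16000)  # stand-in for 1 s of 16 kHz audio

with torch.no_grad():
    # CNN features -> layer-normed features that feed the Gumbel quantizer
    feats = model.wav2vec2.feature_extractor(input_values).transpose(1, 2)
    _, norm_feats = model.wav2vec2.feature_projection(feats)
    # Codebook logits: (batch, frames, num_groups * num_vars)
    logits = model.quantizer.weight_proj(norm_feats)
    b, t, _ = logits.shape
    logits = logits.view(b, t, model.quantizer.num_groups, -1)
    token_ids = logits.argmax(dim=-1)  # (batch, frames, groups) discrete ids

print(token_ids.shape)
```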

Code for LRW-1000

We used the same training code as LRW/video, but with the preprocessing script prepare_lrw1000.py.

Thank you, and don't hesitate to reach out if you need further assistance!

daiyingjie2024 commented 1 week ago

Thank you so much for your detailed reply. I'm currently looking into the prepare_lrw1000.py file you mentioned, but I'm not sure what processing was applied to the dataset before running prepare_lrw1000.py. The code for directly processing the original dataset isn't provided, which is quite confusing for me (as a beginner, I find it difficult to handle these tasks). If it wouldn't trouble you too much, could you share the relevant code? If it's too complicated, please don't bother; forgive me for asking. Thank you again for your patient help.

Additionally, I have another question: do you think the architecture for lip-reading tasks has already become standardized? I may not have read many papers, but it seems the visual front end is mostly a 3D-ResNet, and the sequence back end is usually a Conformer or TCN.

daiyingjie2024 commented 1 week ago

Sorry, I missed one more question. I'm having some issues while trying to run your code. Which version of fairseq should I use? Is it the latest one, v0.12.2?

snoop2head commented 1 week ago

Please open separate issues! You can additionally file 1) a question about the model architecture, and 2) an installation issue.