KAIST-AILab / SyncVSR

SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization (Interspeech 2024)
https://www.isca-archive.org/interspeech_2024/ahn24_interspeech.pdf
MIT License

Request for Accuracy Graphs to Address Replication Issues #14

Closed · davidingram123 closed this 4 hours ago

davidingram123 commented 5 hours ago

Hi, I read your paper and found it very interesting. However, I encountered some issues with the code and am unable to replicate your results. Could you share the "train/accuracy_top1" and "val/accuracy_top1" graphs over 200 epochs (or more) with me? This would give me a reference, since I noticed that when running the code, the "val/accuracy_top1" curve tends to plateau between 40 and 50 epochs. (It might be that I missed a data augmentation strategy, but that shouldn't cause such a significant drop in accuracy, right?)

[W&B charts: train/accuracy_top1 and val/accuracy_top1]

davidingram123 commented 5 hours ago

The configuration file I used is bert-12l-512d_LRW_96_bf16_rrc_WB, with no more than two minor modifications.

snoop2head commented 5 hours ago

@davidingram123

The data augmentation strategy is crucial, as stated in Ma et al. (2022), so please stick with our implementation for replication purposes.
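For reference, the kind of clip-level augmentation Ma et al. (2022) describe for word-level VSR looks roughly like the sketch below (random spatial crop, horizontal flip, time masking). The function name, crop size, and mask length here are illustrative assumptions, not copied from our code, so please follow the actual implementation and config in this repo:

```python
import torch

def augment_video(frames: torch.Tensor, crop_size: int = 88, max_mask_frames: int = 15) -> torch.Tensor:
    """Augment a grayscale lip-ROI clip of shape (T, H, W).

    Illustrative sketch of the augmentations reported as important by
    Ma et al. (2022): random spatial crop, random horizontal flip, and
    time masking. All parameter values are placeholders.
    """
    T, H, W = frames.shape

    # Random spatial crop (e.g. 88x88 out of a 96x96 ROI).
    top = torch.randint(0, H - crop_size + 1, (1,)).item()
    left = torch.randint(0, W - crop_size + 1, (1,)).item()
    frames = frames[:, top:top + crop_size, left:left + crop_size]

    # Random horizontal flip, applied consistently to the whole clip.
    if torch.rand(1).item() < 0.5:
        frames = torch.flip(frames, dims=[2])

    # Time masking: zero out a short contiguous run of frames.
    mask_len = torch.randint(0, max_mask_frames + 1, (1,)).item()
    if 0 < mask_len < T:
        start = torch.randint(0, T - mask_len + 1, (1,)).item()
        frames = frames.clone()
        frames[start:start + mask_len] = 0.0

    return frames
```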

🔗 WandB Report

[Screenshots: train/accuracy_top1 and val/accuracy_top1 curves from the WandB report]

val/accuracy_top1 plateaus around 95.1%–95.2%, and train/accuracy_top1 oscillates around 80%. For more details, please refer to our attached WandB training log of word-level VSR above.

davidingram123 commented 4 hours ago

Thank you for your help.

snoop2head commented 4 hours ago

You're welcome! Please file another issue if you need any further help! @davidingram123