KAIST-AILab / SyncVSR

SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization (Interspeech 2024)
https://www.isca-archive.org/interspeech_2024/ahn24_interspeech.pdf
MIT License

[Training] The issue regarding inference #18

Closed davidingram123 closed 1 month ago

davidingram123 commented 1 month ago

Hello, I tried to run inference with the LRW_CKPT_epoch_167_step_213864.ckpt you provided, but I only achieved a top-1 accuracy of 70%. What could be the issue? I am currently regenerating the npy and pkl files; I randomly checked over a dozen images and they seem fine. Still, LRW_CKPT_epoch_167_step_213864.ckpt does not reach the performance reported in your paper. I'm not sure whether the issue is with the pkl files. Do you have a good way to check whether the pkl files are correct? What do you think is preventing me from reproducing the paper's numbers with this checkpoint?
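For context, the most I know how to do is a quick load-and-inspect like the sketch below; the "video" key, the sample path, and the (T, H, W) layout are just my guesses, not necessarily what preprocess_pkl.py actually writes.

# Quick spot-check of one preprocessed sample (sketch only; the "video" key,
# the path, and the (T, H, W) uint8 layout are assumptions, not the repo's schema).
import pickle

import numpy as np

with open("ABOUT_00001.pkl", "rb") as f:   # hypothetical sample path
    sample = pickle.load(f)

print(type(sample))                        # dict, ndarray, or a custom object?
if isinstance(sample, dict):
    print(sample.keys())

video = np.asarray(sample["video"] if isinstance(sample, dict) else sample)
print(video.shape, video.dtype, video.min(), video.max())  # e.g. (T, 96, 96) uint8 in [0, 255]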

snoop2head commented 1 month ago

Hi @davidingram123 Did you use config ./LRW/video/config/bert-12l-512d_LRW_96_bf16_rrc_WB.yaml?

davidingram123 commented 1 month ago

@snoop2head yes

davidingram123 commented 1 month ago

@snoop2head In the YAML file you just mentioned, use_word_boundary is set to false. Is that a mistake? I changed it to true when running the inference script.

I think the contents of bert-12l-512d_LRW_96_bf16_rrc_WB.yaml and bert-12l-512d_LRW_96_bf16_rrc_noWB.yaml were mixed up, so I simply swapped the names of the two files.
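A quick way to see which file carries which setting is to print the flag from both configs, e.g. with the sketch below (it assumes PyYAML and a top-level use_word_boundary key, which may be nested differently in the real configs):

# Print the word-boundary flag of both configs (sketch; assumes a flat
# top-level `use_word_boundary` key, which may not match the real layout).
import yaml  # pip install pyyaml

for path in (
    "./LRW/video/config/bert-12l-512d_LRW_96_bf16_rrc_WB.yaml",
    "./LRW/video/config/bert-12l-512d_LRW_96_bf16_rrc_noWB.yaml",
):
    with open(path) as f:
        cfg = yaml.safe_load(f)
    print(path, "->", cfg.get("use_word_boundary"))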

snoop2head commented 1 month ago

@davidingram123

I think mine still works!

wandb: Run data is saved locally in ./wandb/run-20241023_034217-1vpf8nhg
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run vq-transformer_lambda10_bf16_rrc_TimeMaskFixed
wandb: ⭐️ View project at https://wandb.ai/quoqa-nlp/cross-modal-sync
wandb: πŸš€ View run at https://wandb.ai/quoqa-nlp/cross-modal-sync/runs/1vpf8nhg
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch.
using x-transformers bert implementation
using x-transformers bert implementation
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA H100 80GB HBM3') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing DataLoader 0: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 261/261 [00:14<00:00, 18.48it/s]
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   test/accuracy_top1        0.949720025062561
   test/accuracy_top5       0.9932799935340881
   test/loss_category       0.20327399671077728
     test/loss_total        0.20327399671077728
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Updated the config according to your findings about use_word_boundary

snoop2head commented 1 month ago

@davidingram123 I refactored the code in commit ed6b97885f321781fd19d1ab4a2d2f9a17768226 with the following changes:

  1. replicating the result (94.972%) mentioned above
  2. resolving the fairseq dependency import error
  3. integrating the pre-tokenized audio tokens that are available in the release section

davidingram123 commented 1 month ago

@snoop2head Hello, I regenerated the npy and pkl files using your code and reran the inference code, but the result is exactly the same. I can't figure out why this is happening. (screenshot attached) What do you think could be the issue? I generated the corresponding pkl files according to your code, and both runs produced identical files. If it isn't a problem with the pkl files, what else could it be?

snoop2head commented 1 month ago

@davidingram123 Hm... I am not sure for now. What about training and validation accuracy?

davidingram123 commented 1 month ago

@snoop2head I didn’t train or validate; I just performed inference using your bert-12l-512d_LRW_96_bf16_rrc_WB.yaml.

snoop2head commented 1 month ago

@davidingram123 Yeah, I understand that. I asked about train/val accuracy to sanity-check the preprocessing procedure by looking at the intermediate metrics from training. Loss and validation accuracy should be similar to what we covered in issue #14.

davidingram123 commented 1 month ago

@snoop2head Sorry, I haven’t run the program yet. I will let it run for a day or two and then get back to you with the details on the training and validation accuracy. Thank you for your help.

snoop2head commented 1 month ago

@davidingram123 No problem! I will double-check the preprocessing procedure in the meantime.

davidingram123 commented 1 month ago

@snoop2head Hello, I’ve been running the training code for some time now, and it seems to be performing well, but it doesn’t seem to reach the level reported in the paper. Here is the link: wandb.

Additionally, I’m curious why I can’t achieve the results you demonstrated using the checkpoint weight-audio-v1/LRW_CKPT_epoch_167_step_213864.ckpt.

Finally, I also have a question about #11. Could you reopen it? I would like to ask some related questions.

snoop2head commented 1 month ago

@davidingram123 I reopened the issue! Can you make the wandb log public so that I can access it?

davidingram123 commented 1 month ago

@snoop2head Here is the wandb link. Sorry about that!

snoop2head commented 1 month ago

@davidingram123 It turns out that I had uploaded the wrong version of the preprocessing code, similar to the cause of issue #16. I've double-checked commit da5055ec7f367d4813d65d3daa2cbb1222e5cfc4; please do the following:

  1. git pull to pick up the updates.
  2. Do NOT rerun the coordinate extraction (to save time); simply run python preprocess_pkl.py to overwrite the previous pkl files with the newly cropped ones.
  3. Run inference with python inference.py ./config/bert-12l-512d_LRW_96_bf16_rrc_WB.yaml and you should get the result below:
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   test/accuracy_top1        0.948199987411499
   test/accuracy_top5        0.993120014667511
     test/loss_audio        3.2306928634643555
   test/loss_category       0.20880813896656036
     test/loss_total         32.5157470703125
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

I am pretty sure this fix will also resolve the train/validation performance issue. Thank you for the feedback!

snoop2head commented 1 month ago

Here's a simple visualization before and after the change in da5055ec7f367d4813d65d3daa2cbb1222e5cfc4 for your reference.

[output1: before the change] [output2: after the change]
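If you want to reproduce this comparison yourself, something along these lines should work; it is only a rough sketch that assumes each pkl decodes to a (T, H, W) uint8 video array, so adapt it to the actual structure.

# Side-by-side middle-frame comparison of an old vs. newly cropped pkl
# (sketch only; assumes each pkl decodes to a (T, H, W) uint8 array).
import pickle

import matplotlib.pyplot as plt
import numpy as np

def load_video(path):
    with open(path, "rb") as f:
        sample = pickle.load(f)
    return np.asarray(sample["video"] if isinstance(sample, dict) else sample)

old = load_video("ABOUT_00001_old.pkl")   # hypothetical paths
new = load_video("ABOUT_00001_new.pkl")

fig, axes = plt.subplots(1, 2, figsize=(6, 3))
for ax, video, title in zip(axes, (old, new), ("before", "after")):
    ax.imshow(video[len(video) // 2], cmap="gray")  # middle frame
    ax.set_title(title)
    ax.axis("off")
plt.show()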

davidingram123 commented 1 month ago

@snoop2head Thank you for your help; I got the same result. (screenshot attached)

snoop2head commented 1 month ago

@davidingram123 Thank you as well! Without your feedback, I wouldn't have discovered those issues.