KAIST-AILab / SyncVSR

SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization (Interspeech 2024)
https://www.isca-archive.org/interspeech_2024/ahn24_interspeech.pdf
MIT License

How and Where Should the Audio-Token .pkl File Be Used in the Inference Process? #13

Open SUMIN080 opened 2 days ago

SUMIN080 commented 2 days ago

I have two types of files:

  1. The audio-token .pkl file obtained through Audio Tokens Preparation.
  2. The .pkl file generated after preprocessing the LRW dataset into MP4, which contains audio, text, and video data.
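
As a side note, a quick way to check which modalities a preprocessed .pkl actually contains is to load it and list its keys. This is a self-contained sketch with a dummy sample; the key names ("video", "audio", "text") are assumptions and may differ from what preprocess_pkl.py actually emits.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for one preprocessed LRW sample; the real
# key names written by preprocess_pkl.py may differ.
sample = {"video": b"...", "audio": b"...", "text": "ABOUT"}

path = os.path.join(tempfile.mkdtemp(), "sample.pkl")
with open(path, "wb") as f:
    pickle.dump(sample, f)

# Loading a .pkl like this shows which modalities it contains.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(sorted(loaded.keys()))  # → ['audio', 'text', 'video']
```

Running the same two lines of loading code against a real preprocessed file is an easy first sanity check that audio, text, and video all made it through preprocessing.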

When I ran ./LRW/src/inference.py using the second .pkl file, the following results were produced:

────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────
   test/accuracy_top1       0.0
   test/accuracy_top5       0.0
   test/loss_audio          5.454221248626709
   test/loss_category       9.774922370910645
   test/loss_total          64.317138671875
────────────────────────────────────────────────────────────────

These results seem incorrect, and I did not use the audio-token .pkl file anywhere in this process. Where and how should the audio-token file be used?

snoop2head commented 2 days ago

Dear @SUMIN080 ,

Thank you for filing an issue! Although I double-checked the pipeline myself, I was curious whether the results replicate in other people's development environments.

To begin with, you don't need audio as an input (or as a target) at inference time. Audio is used for training purposes only.
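
To illustrate the training-vs-inference distinction, here is a minimal sketch: audio tokens serve only as an auxiliary training target, so the inference path consumes video alone. The class and method names below are hypothetical stand-ins, not the repo's actual API.

```python
class VSRModelSketch:
    """Hypothetical sketch: audio tokens are an auxiliary training
    target, so inference needs only the video input."""

    def classify(self, video):
        # Stand-in for the visual backbone + word classifier head.
        return [0.0] * 500  # LRW has 500 word classes

    def audio_loss(self, video, audio_tokens):
        # Stand-in for the crossmodal audio-token prediction loss.
        return float(len(audio_tokens))

    def forward(self, video, audio_tokens=None):
        logits = self.classify(video)
        if audio_tokens is not None:
            # Training: add the audio-token synchronization loss.
            return logits, self.audio_loss(video, audio_tokens)
        # Inference: video only, no audio required.
        return logits


model = VSRModelSketch()
logits = model.forward("frames")                  # inference path
logits, loss = model.forward("frames", [1, 2, 3])  # training path
```

The point is structural: the audio-token .pkl feeds the extra loss branch during training, and that branch is simply unused at test time.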

Here are a few questions that will help us understand your situation:

  1. Test accuracy can't be 0.0 even if the model is extremely weak, because a random prediction already yields about 0.2% accuracy. Did you use our pretrained weights uploaded in the release section? That checkpoint yields 94.984% test accuracy.
  2. Did you run preprocess_roi.py first and then run preprocess_pkl.py?
  3. What does the preprocessed LRW dataset look like? Is it cropped properly?
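
The 0.2% figure in question 1 follows directly from LRW's 500-word vocabulary: a uniformly random top-1 guess is correct with probability 1/500.

```python
# LRW classifies among 500 word classes, so a uniformly random
# prediction sets the chance-level accuracy floor.
num_classes = 500
random_top1 = 1 / num_classes   # one guess
random_top5 = 5 / num_classes   # correct if any of 5 guesses hits

print(f"{random_top1:.1%}", f"{random_top5:.1%}")  # → 0.2% 1.0%
```

A reported top-1 of exactly 0.0 therefore sits below even the chance floor, which usually points to a label or preprocessing mismatch rather than a weak model.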
snoop2head commented 6 hours ago

Please refer to our inference log on the test set, based on the trained model with run name i9umrm1x mentioned in Issue #14.

🔗 Wandb Log

[Screenshot of the Wandb inference log, 2024-10-19]
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
using vq neural audio codec
using x-transformers bert implementation
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Restoring states from the checkpoint path at /root/CMTS-VSR/cross-modal-sync/i9umrm1x/checkpoints/epoch=167-step=213864.ckpt
Testing DataLoader 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 66/66 [00:45<00:00,  1.45it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Loaded model weights from checkpoint at /root/CMTS-VSR/cross-modal-sync/i9umrm1x/checkpoints/epoch=167-step=213864.ckpt
────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────
   test/accuracy_top1       0.9498400092124939
   test/accuracy_top5       0.9933199882507324
   test/loss_category       0.20326192677021027
────────────────────────────────────────────────────────────────