This repository provides the baseline code for CNVSRC2023 (Chinese Continuous Visual Speech Recognition Challenge 2023). The code is based on mpc001/auto_avsr, the state-of-the-art method on the LRS3 dataset. We have added configuration files to run the code on CN-CVS and the datasets provided in this challenge, removed code that is not needed for running this baseline, and modified the implementation of some functionalities.
```bash
git clone git@github.com:MKT-Dataoceanai/CNVSRC2023Baseline.git
cd CNVSRC2023Baseline
git submodule init
git submodule update
cd tools/face_detection
git lfs pull
cd ../../
```
If you don't have git-lfs installed, please download the model files manually and put them into `CNVSRC2023Baseline/tools/face_detection/ibug/face_detection/retina_face/weights/`:
| Model name | md5sum |
|---|---|
| Resnet50_Final.pth | bce939bc22d8cec91229716dd932e56e |
| mobilenet0.25_Final.pth | d308262876f997c63f79c7805b2cdab0 |
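If you downloaded the weights manually, you can check them against the checksums above; a minimal sketch, assuming the default repository layout:

```bash
# Verify the downloaded weights against the md5sums listed above
cd tools/face_detection/ibug/face_detection/retina_face/weights/
md5sum Resnet50_Final.pth mobilenet0.25_Final.pth
```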
```bash
# create environment
conda create -y -n cnvsrc python==3.10.11
conda activate cnvsrc
# install pytorch, torchaudio, torchvision
conda install pytorch-lightning==1.9.3 pytorch==2.0.1 torchaudio==2.0.2 torchvision==0.15.2 torchmetrics==0.11.2 pytorch-cuda==11.8 cudatoolkit==11.8 -c pytorch -c nvidia -y
# install fairseq
cd tools/fairseq/
pip install --editable .
# install face_alignment
cd ../face_alignment/
pip install --editable .
# install face_detection
cd ../face_detection/
pip install --editable .
# install other dependencies
cd ../../
pip install -r reqs.txt
```
Download and preprocess the dataset; see the instructions in the `preparation` folder. Then download the pretrained models into `pretrained_models/`.
We use TensorBoard as the logger. TensorBoard logging files will be written to `outputs/$DATE/$TIME/tblog/`.
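To monitor a run, point TensorBoard at its log directory (replace `$DATE/$TIME` with the timestamp of the run you want to inspect):

```bash
tensorboard --logdir outputs/$DATE/$TIME/tblog/
```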
Please modify the specified yaml configuration file in `main.py` to select the training configuration. The `conf` folder lists the configuration files required for this baseline.

Before running any training or testing, make sure to set `code_root_dir` and `data_root_dir` in the corresponding yaml file to the path of this repository and the path where the dataset is located, respectively.
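As a quick sanity check, you can print the current values of these keys; a sketch assuming the keys appear verbatim in the yaml and that you are editing `train_cncvs_4s.yaml` (mentioned below) in the `conf` folder:

```bash
# Show the path settings that must point at this repository and your dataset root
grep -E 'code_root_dir|data_root_dir' conf/train_cncvs_4s.yaml
```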
`data.dataset` specifies the path of the dataset and the paths of the `.csv` files. Taking `cncvs` as an example:
After data preprocessing is complete, set `${data_root_dir}` to the parent directory of `cncvs/` and copy `data/cncvs/*.csv` to `${data_root_dir}/cncvs/*.csv`.
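For example (with `${data_root_dir}` standing for the dataset root you configured above):

```bash
# Copy the split lists produced by preprocessing next to the data
cp data/cncvs/*.csv ${data_root_dir}/cncvs/
```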
At this point, the folder structure of `${data_root_dir}` is as follows:
```
${data_root_dir}
└── cncvs
    ├── test.csv
    ├── train.csv
    ├── valid.csv
    ├── news
    |   ├── n001
    |   |   └── video
    |   |       ├── ...
    |   |       └── n001_00001_001.mp4
    |   ├── n002
    |   ├── ...
    |   └── n028
    └── speech
        ├── s00001
        ├── s00002
        ├── ...
        └── s02529
```
The contents of the `*.csv` files at this point are as follows:

```
cncvs,news/n001/video/n001_00001_001.mp4,x,x x x x x x x x x x x
```

Each line contains the following fields:

```
${dataset_name},${video_filename},${frame_count},${token_idx_split_by_blank}
```
The configuration file `train_cnvsrc-multi.yaml` provides an example configuration for fine-tuning a model pretrained on CN-CVS. The `ckpt_path` in the configuration file specifies the path of the pretrained model, and the `remove_ctc` option indicates whether to use the classification layer of the pretrained model.
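To see how the example config sets these options, a sketch assuming the keys appear verbatim in the yaml under the `conf` folder:

```bash
# Inspect the fine-tuning options in the example config
grep -E 'ckpt_path|remove_ctc' conf/train_cnvsrc-multi.yaml
```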
Then start training:

```bash
python main.py
```
[Stage 1] Train the model on the subset that contains only short utterances lasting no more than 4 seconds (100 frames). We set `optimizer.lr` to 0.0002 in the first stage, and set `data.max_length=100` in the config file so that training uses only short utterances. See `train_cncvs_4s.yaml` for more information.
[Stage 2] Use the checkpoint from stage 1 to initialise the model, then train on the full dataset. See `train_cncvs_4s_30s.yaml` for more information.
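Putting the two stages together, the run sequence looks roughly like this (select the corresponding yaml in `main.py` before each run):

```bash
# Stage 1: short utterances only (train_cncvs_4s.yaml: data.max_length=100, optimizer.lr=0.0002)
python main.py
# Stage 2: full-length data, initialised from the stage-1 checkpoint (train_cncvs_4s_30s.yaml)
python main.py
```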
You can use either `main.py` or `predict.py` to run testing, as long as you select the proper config filename in the Python file. We prefer `predict.py`, which lets you observe the recognized text in real time.
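For example, after selecting the test config in `predict.py`:

```bash
# Run inference and print the recognized text as it is produced
python predict.py
```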
The table below lists the character error rate (CER) of each model on its own task. Download the model files from Hugging Face or ModelScope.
| Task | Training Data | CER on Dev | CER on Eval | File Name |
|---|---|---|---|---|
| Pre-train | CN-CVS (<4s) | / | / | model_avg_14_23_cncvs_4s.pth |
| Pre-train | CN-CVS (full) | / | / | model_avg_last10_cncvs_4s_30s.pth |
| Single-speaker VSR (T1) | CN-CVS + CNVSRC-Single.Dev | 48.57% | 48.60% | model_avg_last5_cncvs_cnvsrc-single.pth |
| Multi-speaker VSR (T2) | CN-CVS + CNVSRC-Multi.Dev | 58.77% | 58.37% | model_avg_last5_cncvs_cnvsrc-multi.pth |
Note that the code may only be used for comparative or benchmarking purposes. Code supplied under the license may be used for non-commercial purposes only.
[Chen Chen](chenchen[at]cslt.org)
[CNVSRC2023](cnvsrc[at]cnceleb.org)