HLTCHKUST / CI-AVSR

Code repository for the Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR) dataset.
Creative Commons Zero v1.0 Universal

how to preprocess the data for model training? #2

Open · yan159yan opened this issue 2 years ago

yan159yan commented 2 years ago

Great work on the audio-visual data. Is there a recommended parameter configuration for "preprocess_data.py"?

SamuelCahyawijaya commented 2 years ago

Hi @yan159yan: Thank you for your interest in our work. preprocess_data.py is used to run the preprocessing step before running the evaluation in eval.py.

As an example, to evaluate on dataset/mm_test_metadata_noisy.csv using the pretrained wav2vec 2.0 model CAiRE/wav2vec2-large-xlsr-53-cantonese, you can run the preprocessing and the evaluation like this:

python preprocess_data.py \
    --output_dir=<CACHE_DIR_PATH> \
    --model_name_or_path=CAiRE/wav2vec2-large-xlsr-53-cantonese \
    --test_manifest_path=dataset/mm_test_metadata_noisy.csv \
    --preprocessing_num_workers=32 \
    --seed=0 \
    --use_video \
    --audio_column_name=audio_path \
    --text_column_name=text_path \
    --video_column_name=lip_image_path

python eval.py \
    --output_dir=<OUTPUT_DIR_PATH> \
    --model_name_or_path=CAiRE/wav2vec2-large-xlsr-53-cantonese \
    --test_manifest_path=<CACHE_DIR_PATH>/preprocess_data.arrow \
    --num_workers=8 \
    --preprocessing_num_workers=8 \
    --use_video \
    --audio_column_name=audio_path \
    --text_column_name=text_path \
    --video_column_name=lip_image_path \
    --per_device_eval_batch_size=16 \
    --dataloader_num_workers=32 \
    --seed=0 \
    --logging_strategy=steps \
    --logging_steps=10 \
    --report_to=tensorboard \
    --evaluation_strategy=epoch \
    --eval_steps=1 \
    --eval_accumulation_steps=100
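The --audio_column_name, --text_column_name, and --video_column_name flags suggest the test manifest is a CSV with (at least) those three columns. A minimal sketch of building such a manifest with Python's csv module — the column names are taken from the flags above, but the file paths and exact schema are illustrative assumptions, not confirmed by the repo:

```python
import csv

# Column names mirror the flags passed to preprocess_data.py:
# --audio_column_name, --text_column_name, --video_column_name.
# The paths below are placeholders, not real dataset files.
rows = [
    {
        "audio_path": "clips/utt_0001.wav",
        "text_path": "transcripts/utt_0001.txt",
        "lip_image_path": "lips/utt_0001",
    },
]

with open("my_test_metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["audio_path", "text_path", "lip_image_path"]
    )
    writer.writeheader()
    writer.writerows(rows)
```

You would then pass the resulting file via --test_manifest_path.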

Note that --use_video is used to also include the lip image data. If you don't need the visual part, you can remove that argument.
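For instance, an audio-only preprocessing run might look like the following (a sketch based on the command above, with the same placeholder paths and the video-related arguments dropped; whether --video_column_name can simply be omitted is an assumption):

```shell
python preprocess_data.py \
    --output_dir=<CACHE_DIR_PATH> \
    --model_name_or_path=CAiRE/wav2vec2-large-xlsr-53-cantonese \
    --test_manifest_path=dataset/mm_test_metadata_noisy.csv \
    --preprocessing_num_workers=32 \
    --seed=0 \
    --audio_column_name=audio_path \
    --text_column_name=text_path
```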

Hope it helps!