Toy example inspired by Kaldi for Dummies.
This tutorial is a hands-on, practical introduction to Kaldi (a modern toolkit used for ASR and other speech processing tasks). The only prerequisite is having Kaldi installed.
It deviates only slightly from the Kaldi for Dummies tutorial (https://kaldi-asr.org/doc/kaldi_for_dummies.html): the data is already prepared, and there is an extra step that retrieves the best transcriptions generated by the ASR system.
After cloning the repository, only one preparation step is needed before training and decoding:
1- cd into pedro_scripts and run "python format_wavscp.py"
Then, to train, decode and get the results, just run: ./run.sh
You might need to change DATA_ROOT in path.sh if you did not clone this repo in the directory kaldi/egs.
To get the transcriptions generated by the ASR system type the following:
../../src/latbin/lattice-best-path ark:'gunzip -c exp/tri1/decode/lat.1.gz |' 'ark,t:| utils/int2sym.pl -f 2- exp/tri1/graph/words.txt > out.txt'
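Because int2sym.pl is called with -f 2-, only the fields from the second one onward are mapped to words, so each line of out.txt should consist of an utterance ID followed by the recognised words. A minimal Python sketch for loading that output into a dictionary could look like the following; the function name and the sample utterance IDs are made up for illustration and are not part of this repository.

```python
def parse_transcriptions(lines):
    """Map utterance IDs to transcriptions.

    Assumes each line looks like: <utterance_id> <word> <word> ...
    (the layout expected from the lattice-best-path command above).
    """
    transcriptions = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        utt_id, words = parts[0], parts[1:]
        transcriptions[utt_id] = " ".join(words)
    return transcriptions


# Hypothetical sample lines, only to show the expected layout.
sample = ["spk1_utt1 one two three\n", "spk1_utt2 four five\n"]
result = parse_transcriptions(sample)
```

With a real decode, the same function can be fed the file directly, for example: with open("out.txt") as f: result = parse_transcriptions(f).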
Directories:
Command to convert a directory of m4a files to wav:
for f in *.m4a; do ffmpeg -i "$f" "${f/%m4a/wav}"; done
Command to downsample to 16 kHz (an extra directory is needed for the output because sox cannot use the same file as both input and output; otherwise an error occurs):
mkdir -p tmp; for file in *.wav; do sox "$file" -r 16000 "./tmp/$file"; done
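As an alternative to the two shell loops above, the conversion and the downsampling could be scripted from Python in a single pass. This is only a sketch under assumptions: it uses ffmpeg's -ar option (output sample rate) instead of sox, and the function names and directory arguments are illustrative, not part of this repository.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, out_dir, sample_rate=16000):
    """Build the ffmpeg call that converts one file to wav at sample_rate Hz."""
    src = Path(src)
    dst = Path(out_dir) / (src.stem + ".wav")
    return ["ffmpeg", "-i", str(src), "-ar", str(sample_rate), str(dst)]


def convert_directory(in_dir, out_dir, sample_rate=16000):
    """Convert every .m4a file in in_dir, writing the results to out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob("*.m4a")):
        # check=True raises CalledProcessError if ffmpeg fails on a file.
        subprocess.run(build_ffmpeg_command(src, out_dir, sample_rate), check=True)
```

Writing to a separate output directory avoids the same-file input/output problem mentioned above. A call such as convert_directory(".", "tmp") would mirror the shell commands.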
data: This is the directory used in the experiment, with the data already downsampled and converted to wav. Untouched is only for pedagogic purposes. Inside this directory, the train and test folders already contain the 5 required files mentioned in the Kaldi for Dummies tutorial (spk2gender, wav.scp, text, utt2spk and corpus.txt). The local folder contains the 4 files needed for the language data (lexicon.txt, nonsilence_phones.txt, silence_phones.txt and optional_silence.txt). score.sh in local computes metrics such as WER (word error rate) and SER (sentence error rate) to evaluate the system.
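To make the two metrics concrete: WER is the number of word substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the number of reference words, and SER is the fraction of utterances containing at least one error. The sketch below computes both in plain Python. It is not the scoring script itself, and the function names and sample utterances are invented for illustration.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_and_ser(references, hypotheses):
    """Return (WER, SER) in percent.

    Both arguments map utterance IDs to transcription strings.
    Assumes references is non-empty and contains at least one word.
    """
    total_words = total_errors = wrong_sentences = 0
    for utt_id, ref_text in references.items():
        ref = ref_text.split()
        hyp = hypotheses.get(utt_id, "").split()
        errors = edit_distance(ref, hyp)
        total_errors += errors
        total_words += len(ref)
        if errors > 0:
            wrong_sentences += 1
    wer = 100.0 * total_errors / total_words
    ser = 100.0 * wrong_sentences / len(references)
    return wer, ser


# Hypothetical data: one deleted word in the first utterance.
refs = {"utt1": "one two three", "utt2": "four five"}
hyps = {"utt1": "one three", "utt2": "four five"}
wer, ser = wer_and_ser(refs, hyps)
```

In this made-up case there is 1 error over 5 reference words and 1 wrong utterance out of 2.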
conf: This folder contains 2 files (taken from the Kaldi for Dummies tutorial): decode.config, which holds the settings for the beam used in decoding, and mfcc.config, which configures the feature extraction process.
pedro_scripts: This directory contains my scripts. format_wavscp.py rewrites the wav.scp files in the train and test folders, since they need the full paths of the audio files and those paths change with the machine where the repo is located. sort_dir sorts the content of all files in a given directory, since Kaldi typically requires the content of these files to be sorted.
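The two ideas can be illustrated with a short Python sketch. This is not the code of format_wavscp.py or sort_dir; it only assumes that each wav.scp line has the form "<utterance_id> <path to wav file>" and that the paths share a common root that differs between machines. All function names and paths below are hypothetical.

```python
import os
from pathlib import Path


def rewrite_wav_scp_line(line, old_root, new_root):
    """Re-anchor the audio path of one wav.scp line under new_root."""
    utt_id, wav_path = line.split(maxsplit=1)
    relative = os.path.relpath(wav_path.strip(), old_root)
    return f"{utt_id} {os.path.join(new_root, relative)}"


def sort_file_in_place(path):
    """Sort the non-empty lines of one file.

    Python compares strings by code point, which for UTF-8 text matches
    the byte order of a C-locale sort.
    """
    path = Path(path)
    lines = [l for l in path.read_text(encoding="utf-8").splitlines() if l.strip()]
    path.write_text("\n".join(sorted(lines)) + "\n", encoding="utf-8")


def sort_directory(directory):
    """Sort every regular file directly inside directory."""
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file():
            sort_file_in_place(entry)
```

For example, rewrite_wav_scp_line("utt1 /old/root/spk1/utt1.wav", "/old/root", "/new/root") would give a line pointing at /new/root/spk1/utt1.wav.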