SlangLab-NU / torgo_inference

Train unigram and trigram LMs on the Europarl corpus #17

Open aanchan opened 12 months ago

WWW As a researcher who wants to understand the impact of language modelling on isolated words versus sentences in the Torgo dataset, I would like to train both a unigram and a trigram language model using the KenLM toolkit. We previously trained a 5-gram language model by following this tutorial: https://huggingface.co/blog/wav2vec2-with-ngram

References: https://github.com/kpu/kenlm

Training and inference code for Torgo: https://github.com/SlangLab-NU/links

AC A notebook or script that trains the unigram and trigram language models, uploads them to the Hugging Face Hub, and records the perplexity of each model. The models will then be used for inference/decoding with some of our pre-trained acoustic models.
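A possible starting point for the script, as a minimal sketch only. It assumes the KenLM `lmplz` binary is built and on `PATH`, and that the corpus is a plain-text file with one sentence per line. The helper names (`lmplz_command`, `train_arpa`, `perplexity`) and the file names in the usage note are hypothetical. Whether KenLM accepts `-o 1` for the unigram case is not confirmed here and should be checked. Uploading to the Hub is left out.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def lmplz_command(order: int, lmplz_bin: str = "lmplz") -> List[str]:
    """Build the argument list for KenLM's lmplz for an n-gram of the given order.

    `lmplz_bin` is an assumption: it must point at a locally built lmplz binary.
    """
    if order < 1:
        raise ValueError("order must be at least 1")
    return [lmplz_bin, "-o", str(order)]


def train_arpa(corpus: Path, arpa: Path, order: int) -> None:
    """Run lmplz with the corpus on stdin and write the ARPA file from stdout."""
    with corpus.open("rb") as src, arpa.open("wb") as dst:
        subprocess.run(lmplz_command(order), stdin=src, stdout=dst, check=True)


def perplexity(log10_probs: Sequence[float]) -> float:
    """Perplexity from per-token log10 probabilities.

    KenLM reports log10 probabilities, so perplexity is
    10 ** (-(sum of log10 probabilities) / number of tokens).
    """
    if not log10_probs:
        raise ValueError("need at least one token probability")
    return 10.0 ** (-sum(log10_probs) / len(log10_probs))
```

Usage would look something like `train_arpa(Path("europarl.txt"), Path("europarl_3gram.arpa"), order=3)`, with the per-token log10 scores from the trained model then passed to `perplexity` on a held-out set.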