SlangLab-NU / torgo_inference


Train a baseline spelling corrector based on hallucinated data #18

Open aanchan opened 1 year ago

aanchan commented 1 year ago

WWW As a researcher wanting to understand the impact of synthetic data used for training a spelling corrector, I would like to train a sequence-to-sequence model based on BART on the artificially generated data from the Tatoeba corpus.

AC A training notebook or script to train a seq2seq model. There is already code in the repository for training a BART-based seq2seq model on the TORGO transcripts; see the sketch below for the general shape of such a script.
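
A minimal sketch of what the training script could look like, using Hugging Face transformers to fine-tune BART on corrupted/clean sentence pairs. This is not the repository's exact code; the dataset file name ("tatoeba_pairs.csv") and the column names ("noisy", "clean") are illustrative assumptions.

```python
# Hedged sketch: fine-tune BART as a noisy -> clean spelling corrector.
from datasets import load_dataset
from transformers import (BartForConditionalGeneration, BartTokenizerFast,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/bart-base"
tokenizer = BartTokenizerFast.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Hypothetical CSV of artificially corrupted Tatoeba sentences and their originals.
data = load_dataset("csv", data_files={"train": "tatoeba_pairs.csv"})

def preprocess(batch):
    # Tokenize the corrupted sentence as the input and the clean sentence as the label.
    inputs = tokenizer(batch["noisy"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["clean"], truncation=True, max_length=128)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = data["train"].map(preprocess, batched=True,
                              remove_columns=data["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="bart-spell-corrector",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=3e-5,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```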

jindaznb commented 12 months ago

Sentence-level: https://colab.research.google.com/drive/1fWZmiD5_8960gb-7h94B8mxZbPsJ_y08
Before training, on the test set: Word Error Rate (WER) 66.26%, Character Error Rate (CER) 46.43%
After training: WER 61.93%, CER 42.05%

Word-level: https://colab.research.google.com/drive/12nDcUrxZ76qoV1AhdJvbl-HK7-pXQ2RP
Before training: WER 38.08%, CER 27.04%
After training: WER 18.45%, CER 13.03%
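
For reference, WER and CER like those above can be computed with a standard metric library such as `jiwer`; a minimal sketch follows (the notebooks may compute these differently, and the reference/hypothesis strings here are placeholders).

```python
# Hedged sketch: compute WER and CER for reference vs. corrected transcripts.
import jiwer

references = ["the quick brown fox jumps over the lazy dog"]
hypotheses = ["the quick brwn fox jump over the lazy dog"]

wer = jiwer.wer(references, hypotheses)  # word error rate
cer = jiwer.cer(references, hypotheses)  # character error rate
print(f"WER: {wer:.2%}  CER: {cer:.2%}")
```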