Train a baseline spelling corrector based on hallucinated data

WWW As a reasearcher wanting to understand the impact of synthetic data used for training a spelling corrector, I would like to train a sequence to sequence model based on BART from the artificially generated data from the Tatoeba corpus.

There is existing code for rudimentary generation of erroneous phrases : https://github.com/SlangLab-NU/torgo_inference/tree/add_noise_v0
The spelling correction notebooks we trained earlier are here : https://github.com/SlangLab-NU/torgo_inference/tree/main/Machine%20Translation

AC Training notebook or script to train a seq2seq model, there is code already in the repository for training seq2seq based on BART for the TORGO transcripts.

SlangLab-NU / torgo_inference

Train a baseline spelling corrector based on hallucinated data #18