The dataloader was set up to train a neural punctuator (an old, unused model), but for the language model we don't need to provide separate labels. Instead, the labels come from the input itself: prepend a start-of-string token to form the model input and append an end-of-string token to form the labels, so the two sequences have the same length.
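A minimal sketch of that shift, assuming integer token ids and placeholder `BOS_ID`/`EOS_ID` values (the actual special-token ids depend on the tokenizer):

```python
# Assumed special-token ids; the real values come from the tokenizer.
BOS_ID = 1
EOS_ID = 2

def make_lm_pair(token_ids):
    """Return (inputs, labels) of equal length for next-token prediction."""
    inputs = [BOS_ID] + token_ids  # prepend start-of-string token
    labels = token_ids + [EOS_ID]  # append end-of-string token
    return inputs, labels

inputs, labels = make_lm_pair([10, 11, 12])
# inputs = [1, 10, 11, 12]; labels = [10, 11, 12, 2]
```

At each position the model sees `inputs[i]` and is trained to predict `labels[i]`, which is exactly the next token, so no separate label files are needed.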
The scripts.clean_wikitext module also generates two sets of files per dataset split (train, valid, test): labels with punctuation intact and inputs with it stripped. We want to keep the labels logic and use it to generate only a single file per split.
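A hedged sketch of the proposed single-file output, assuming the existing cleaning step can be factored into a `normalize` function; the function names and paths here are illustrative, not the actual scripts.clean_wikitext API:

```python
import tempfile
from pathlib import Path

def normalize(text: str) -> str:
    # Placeholder for the existing "labels" cleaning logic; crucially it
    # keeps punctuation rather than stripping it the way the inputs did.
    return " ".join(text.split())

def clean_split(raw_path: Path, out_path: Path) -> None:
    """Write a single cleaned file per split instead of an input/label pair."""
    cleaned = normalize(raw_path.read_text(encoding="utf-8"))
    out_path.write_text(cleaned, encoding="utf-8")

# Demo on a throwaway directory standing in for the train/valid/test splits.
with tempfile.TemporaryDirectory() as d:
    raw = Path(d) / "valid.raw"
    raw.write_text("Hello ,  world !\n", encoding="utf-8")
    clean_split(raw, Path(d) / "valid.txt")
```

With the punctuator gone, the punctuation-stripped "inputs" files have no consumer, so each split collapses to one punctuated text file.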