danielwatson6 / hate-speech-project


Fix wikitext data loader and cleaning script #9

Closed: danielwatson6 closed 4 years ago

danielwatson6 commented 4 years ago

The dataloader was set up to train a neural punctuator (an old, unused model), but for the language model we don't need to provide explicit labels. Instead, we derive them from the input sequence itself: prepend a start-of-string token to form the model inputs, and append an end-of-string token to form the labels, so both have the same length.
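This input/label shifting can be sketched as follows. The function name and the `<s>`/`</s>` token strings are illustrative assumptions, not identifiers from the repository:

```python
def make_lm_pair(tokens, sos="<s>", eos="</s>"):
    """Derive a language-model training pair from one token sequence.

    Inputs get the start-of-string token prepended; labels get the
    end-of-string token appended, so the two sequences have equal length
    and labels[i] is the next-token target for inputs[i].
    """
    inputs = [sos] + tokens
    labels = tokens + [eos]
    return inputs, labels
```

For example, `make_lm_pair(["the", "cat"])` yields inputs `["<s>", "the", "cat"]` and labels `["the", "cat", "</s>"]`, both of length 3.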

The scripts.clean_wikitext module also generates two sets of files per dataset split (train, valid, test): labels that keep punctuation and inputs that have it removed. We want to keep the labeling logic (punctuation retained) but use it to generate only a single file per split.
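A minimal sketch of the single-file cleaning pass described above. The function names, directory layout, and `wiki.{split}.tokens` / `{split}.txt` file names are assumptions for illustration, not the module's actual interface:

```python
import os

SPLITS = ("train", "valid", "test")

def clean_line(line: str) -> str:
    """Normalize whitespace but keep punctuation (the old 'labels' behavior)."""
    return " ".join(line.split())

def clean_wikitext(raw_dir: str, out_dir: str) -> None:
    """Write one cleaned file per split instead of an input/label file pair."""
    os.makedirs(out_dir, exist_ok=True)
    for split in SPLITS:
        raw_path = os.path.join(raw_dir, f"wiki.{split}.tokens")
        out_path = os.path.join(out_dir, f"{split}.txt")
        with open(raw_path, encoding="utf-8") as fin, \
                open(out_path, "w", encoding="utf-8") as fout:
            for line in fin:
                cleaned = clean_line(line)
                if cleaned:  # skip blank lines
                    fout.write(cleaned + "\n")
```

Since the language model derives labels from the inputs at load time, the punctuation-stripped "inputs" files are no longer needed, and each split collapses to one file.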