DRAGNLabs / 301r_retnet


Hugging Face Tokenizer and Datasets #15

Closed nprisbrey closed 8 months ago

nprisbrey commented 8 months ago

This pull request overhauls the way we load data and create DataLoaders. Due to the significance of the change and conflicting namespaces, I've replaced the datasets.py file with a new load_data.py file.

The old code in datasets.py was hard-coded to accept only one version of wikitext. The new data loading code lets the user choose any text dataset on Hugging Face via the --dataset-name, --dataset-subset, and --dataset-feature command-line arguments and use it to train the model.

As for the tokenizer, I've implemented the BPE tokenizer closely following the Hugging Face tutorial. Note that Hugging Face supports three other tokenizer models that could be implemented in the future, if desired.

I've chosen to train the tokenizer only on the training dataset, thereby maintaining the integrity of the validation and testing datasets as data the model has never seen.
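For reference, training a BPE tokenizer on the training split alone looks roughly like the following, using the `tokenizers` quicktour API (the vocabulary size, special tokens, and toy corpus below are placeholders, not values from this PR):

```python
# Sketch of training a BPE tokenizer on the training split only,
# per the Hugging Face `tokenizers` quicktour; all sizes are placeholders.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[BOS]", "[PAD]"])

# Stand-in for the training split; validation/test text is never passed in,
# so those splits remain unseen by the tokenizer.
train_texts = ["the quick brown fox", "jumps over the lazy dog"]
tokenizer.train_from_iterator(train_texts, trainer=trainer)

ids = tokenizer.encode("the quick fox").ids
```

Because only the training split feeds the trainer, any merges the tokenizer learns reflect training data alone.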

While writing this code, I explored appending an end-of-sequence token to each sequence, but it became messy. As stated in the RetNet paper, however, I have included a beginning-of-sequence token for each sequence. Due to these changes, I see this PR as resolving Issue https://github.com/DRAGNLabs/301r_retnet/issues/6.
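One way to prepend a beginning-of-sequence token to every encoded sequence is a `TemplateProcessing` post-processor, as sketched below (whether the PR uses this mechanism or inserts the token manually is an assumption on my part; the "[BOS]" string is also a placeholder):

```python
# Sketch: automatically prepend a [BOS] token using a post-processor.
# Token names and the tiny training corpus are placeholders.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import BpeTrainer

tok = Tokenizer(BPE(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
tok.train_from_iterator(
    ["some training text", "more text"],
    trainer=BpeTrainer(special_tokens=["[UNK]", "[BOS]"]),
)

# Every single-sequence encoding becomes: [BOS] <tokens of the sequence>
tok.post_processor = TemplateProcessing(
    single="[BOS] $A",
    special_tokens=[("[BOS]", tok.token_to_id("[BOS]"))],
)

enc = tok.encode("some text")
# enc.tokens[0] == "[BOS]"
```

With the post-processor in place, no manual insertion is needed at data-loading time; every call to `encode` starts the sequence with the BOS id.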

I've also updated the model-training scripts to include all the possible arguments, alphabetize them, and use values closer to the scale we'll want for a full data run.

All formatting, comments, and docstrings have been updated, though I haven't touched README.md. I feel the README needs a large overhaul and would be worth completing in a separate pull request.

nprisbrey commented 8 months ago

@KimballNJardine, you recognize that the 5 comments above refer to outdated versions of the code, correct? The referenced code has already been updated or removed in later commits.