Closed — amacfie closed this issue 4 years ago
This repo is not actively maintained and there are some rough edges, but roughly: create a new encoder that maps characters to tokens (Hugging Face has a good tokenizer library that's better than the one used here), then modify the create_tfrecords.py script to encode your text with that encoder. Finally, set the "n_vocab" parameter to the size of your character vocabulary (note that "n_ctx" is the context length, not the vocabulary size) and train the model on your new data.
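A minimal sketch of the character-level encoder step, assuming a plain encode/decode interface; the class name and methods here are illustrative and not the repo's actual encoder API:

```python
# Character-level encoder sketch: one token per distinct character.
# Names (CharEncoder, encode, decode, vocab_size) are hypothetical.

class CharEncoder:
    def __init__(self, texts):
        # Build the vocabulary from every distinct character in the corpus.
        chars = sorted(set("".join(texts)))
        self.char_to_id = {c: i for i, c in enumerate(chars)}
        self.id_to_char = {i: c for c, i in self.char_to_id.items()}

    @property
    def vocab_size(self):
        # This is the value the model's vocabulary-size parameter should match.
        return len(self.char_to_id)

    def encode(self, text):
        # Map each character to its integer id.
        return [self.char_to_id[c] for c in text]

    def decode(self, ids):
        # Invert the mapping back to a string.
        return "".join(self.id_to_char[i] for i in ids)


enc = CharEncoder(["hello world"])
ids = enc.encode("hello")
assert enc.decode(ids) == "hello"
```

In create_tfrecords.py you would then call something like `enc.encode(text)` wherever the original BPE encoder is used, and write the resulting id lists to the TFRecord files.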
Is there a way to build a character-level model?