Use character-based tokenization instead of word-based

Instead of a shallower model that predicts whole words, use a deeper model that predicts one character at a time.

A character-based model may give better accuracy, for a few reasons:

Fewer input features and output variables, which could increase overall layer density and improve training
Better training data: text is easier to parse into characters, and each sample gives the model more examples to train against
Characters such as ampersands and punctuation need no special handling, since the model can learn how to use them on its own
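
To make the points above concrete, here is a minimal sketch of character-level tokenization. It is not taken from the existing code base: the function names, the sample text, and the context length of 4 are all assumptions chosen for illustration.

```python
def build_char_vocab(texts):
    """Map every distinct character in the corpus to an integer id."""
    chars = sorted(set("".join(texts)))
    return {ch: i for i, ch in enumerate(chars)}


def encode_chars(text, vocab):
    """Turn a string into a list of character ids."""
    return [vocab[ch] for ch in text]


def decode_chars(ids, vocab):
    """Turn a list of character ids back into a string."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


def make_training_pairs(ids, context_len):
    """Slide a window over the ids: each window predicts the next character."""
    return [
        (ids[i:i + context_len], ids[i + context_len])
        for i in range(len(ids) - context_len)
    ]


# Hypothetical sample text; punctuation and "&" are ordinary characters here.
sample_texts = ["Jobs & growth!", "Great jobs, great growth."]
char_vocab = build_char_vocab(sample_texts)

encoded = encode_chars(sample_texts[0], char_vocab)
pairs = make_training_pairs(encoded, context_len=4)
```

In this sketch a single 14-character sample produces 10 (context, next character) pairs, and the ampersand and punctuation get vocabulary entries like any other character. On a real corpus the character vocabulary stays bounded by the character set, while a word vocabulary keeps growing with the data.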

The only open question is how to handle hyperlinks, though it could be fun to see what the AI comes up with for them.

Ranked strategies for incorporation:

Create a separate, protected branch for the old model and continue with the new model on the main branch
Create a separate branch for the new model and keep the old one on the main branch
Keep both models in the same code, with a selector in the program options to switch between them
Create a separate repo for the new model
Make a monorepo with both projects in separate environments
Replace the old model
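
For the "selector in the program options" strategy, a rough sketch could look like the following. The `--tokenizer` flag, the function names, and the registry dict are hypothetical and do not reflect the project's current options.

```python
import argparse


def word_tokenize(text):
    """Word-level tokens: split on whitespace (stand-in for the old model's input)."""
    return text.split()


def char_tokenize(text):
    """Character-level tokens: every character, including spaces and punctuation."""
    return list(text)


# Hypothetical registry mapping option values to tokenizer functions.
TOKENIZERS = {"word": word_tokenize, "char": char_tokenize}


def build_parser():
    parser = argparse.ArgumentParser(description="Tokenizer selector sketch")
    parser.add_argument(
        "--tokenizer",
        choices=sorted(TOKENIZERS),
        default="word",  # assumption: the old behaviour stays the default
        help="which tokenization scheme to use",
    )
    return parser


def tokenize_with_options(argv, text):
    """Parse the given argument list and apply the selected tokenizer."""
    args = build_parser().parse_args(argv)
    return TOKENIZERS[args.tokenizer](text)
```

Calling `tokenize_with_options(["--tokenizer", "char"], "a b")` would go through the character path, while an empty argument list falls back to the word path. Each tokenizer would still need its own model behind it, so this only covers the switching part.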

andydevs / robot-trump