chathasphere / script-monkey

Text Generation with Neural Nets
MIT License

About

Inspired by Andrej Karpathy's Char-RNN, in this project I train a PyTorch LSTM (long short-term memory) recurrent neural network on text sequences in order to create parody text. The LSTM learns a probabilistic model of text sequences, from which characters are sampled to generate new text.
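
To make the sampling step concrete, here is a minimal sketch of character-by-character generation. The helper names (sample, char_to_ix, ix_to_char), the seed text, and the model's forward signature are illustrative assumptions rather than the exact code in this repository:

```python
import torch
import torch.nn.functional as F

def sample(model, char_to_ix, ix_to_char, seed="ROMEO:", length=200):
    # Hypothetical helper: the model is assumed to accept a one-hot encoded
    # (batch, seq_len, vocab_size) tensor and return (logits, hidden_state).
    vocab_size = len(char_to_ix)
    model.eval()
    chars = list(seed)
    hidden = None
    with torch.no_grad():
        # warm up the hidden state on the seed text
        for c in seed[:-1]:
            x = F.one_hot(torch.tensor([[char_to_ix[c]]]), vocab_size).float()
            _, hidden = model(x, hidden)
        # repeatedly sample the next character from the predicted distribution
        for _ in range(length):
            x = F.one_hot(torch.tensor([[char_to_ix[chars[-1]]]]), vocab_size).float()
            logits, hidden = model(x, hidden)
            probs = F.softmax(logits[0, -1], dim=-1)
            next_ix = torch.multinomial(probs, num_samples=1).item()
            chars.append(ix_to_char[next_ix])
    return "".join(chars)
```

Sampling from the distribution (rather than always taking the most likely character) keeps the generated text varied; adding a temperature parameter is a common extension.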

Why Script Monkey?

Because of the old conjecture that, given enough monkey-years and typewriters, it would be possible to replicate the complete works of Shakespeare. The example given here trains on Shakespeare's complete works, though it is easily adaptable to other large (think a few MB) text corpora.

How it works

Text preprocessing

A corpus is split into sequences of a fixed length (say, 100 characters). Each sequence yields an input sequence and a target sequence, with the latter offset forward by one timestep: the sequence "TO BE OR NOT TO BE" yields the input "TO BE OR NOT TO B" and the target "O BE OR NOT TO BE". Of course, we first have to vectorize the text by mapping each unique character to an integer (about 80 distinct characters in Shakespeare's works, including punctuation and escape characters such as newlines). For efficiency, sequences are grouped into batches of fixed size.
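
A minimal sketch of this preprocessing in PyTorch follows; the sequence length, batch size, and variable names are illustrative choices, not necessarily the settings used in this project:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

def make_dataset(text, seq_len=100, batch_size=64):
    # map each unique character to an integer index
    chars = sorted(set(text))
    char_to_ix = {c: i for i, c in enumerate(chars)}
    encoded = torch.tensor([char_to_ix[c] for c in text], dtype=torch.long)

    # split the corpus into fixed-length windows; the target sequence is the
    # input sequence shifted forward by one character
    n_seqs = (len(encoded) - 1) // seq_len
    inputs = encoded[: n_seqs * seq_len].view(n_seqs, seq_len)
    targets = encoded[1 : n_seqs * seq_len + 1].view(n_seqs, seq_len)

    # group sequences into fixed-size batches for efficiency
    loader = DataLoader(TensorDataset(inputs, targets),
                        batch_size=batch_size, shuffle=True)
    return loader, char_to_ix, chars
```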

Training

Input sequences are one-hot encoded and fed into a Char-RNN neural network.
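
As a rough illustration, a Char-RNN along these lines can be written in PyTorch as follows; the hidden size, number of layers, and dropout are placeholder values, not necessarily this project's exact architecture:

```python
import torch.nn as nn

class CharRNN(nn.Module):
    """Illustrative Char-RNN: a stacked LSTM over one-hot character vectors,
    followed by a linear layer that scores the next character."""
    def __init__(self, vocab_size, hidden_size=256, num_layers=2, dropout=0.5):
        super().__init__()
        self.lstm = nn.LSTM(input_size=vocab_size, hidden_size=hidden_size,
                            num_layers=num_layers, dropout=dropout,
                            batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        # x: (batch, seq_len, vocab_size) one-hot input
        out, hidden = self.lstm(x, hidden)
        logits = self.fc(out)  # (batch, seq_len, vocab_size)
        return logits, hidden
```

Training then typically minimizes the cross-entropy loss between the logits and the target characters at every timestep.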

Suggested Reading