Inspired by Andrej Karpathy's Char-RNN, in this project I train a PyTorch LSTM (long short-term memory) recurrent neural network (RNN) on text sequences in order to create parody text. The LSTM learns a probabilistic model of text sequences, from which characters are sampled to generate new text.
Why Shakespeare? It's conjectured that, given enough monkey-years and typewriters, it would be possible to replicate his complete works. The example I give trains on Shakespeare's complete works, though it's easily adaptable to other large (think a few MB) text corpora.
A corpus is split into sequences of a fixed length (say, 100 characters). Each sequence yields an input sequence and a target sequence, with the latter offset forward by one timestep. So the sequence "TO BE OR NOT TO BE" yields the input "TO BE OR NOT TO B" and the target "O BE OR NOT TO BE". Of course, we first have to vectorize text sequences by mapping unique characters to integers (about 80 in Shakespeare's works, including punctuation and escape characters). For efficiency, sequences are grouped in batches of fixed size.
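The preprocessing above can be sketched in plain Python. Note that the helper name `make_pairs` and the `seq_len` default are illustrative, not names from the project's actual code:

```python
def make_pairs(text, seq_len=100):
    """Map characters to integers, then split the corpus into
    (input, target) pairs with the target offset by one timestep."""
    chars = sorted(set(text))                     # unique characters in the corpus
    char2int = {c: i for i, c in enumerate(chars)}
    encoded = [char2int[c] for c in text]         # vectorized corpus

    pairs = []
    for start in range(0, len(encoded) - seq_len, seq_len):
        chunk = encoded[start:start + seq_len + 1]  # one extra char for the target
        pairs.append((chunk[:-1], chunk[1:]))       # target = input shifted by one
    return pairs, char2int

pairs, vocab = make_pairs("TO BE OR NOT TO BE", seq_len=17)
```

Decoding the first pair back to text recovers exactly the input/target example given above.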
Input sequences get one-hot encoded and fed into a Char-RNN neural network with the following architecture:
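As a minimal illustration of the one-hot step (the helper name and vocabulary size here are illustrative, not from the project's code):

```python
def one_hot(indices, vocab_size):
    """Turn a list of integer character ids into one-hot vectors."""
    vectors = []
    for idx in indices:
        vec = [0.0] * vocab_size       # all zeros...
        vec[idx] = 1.0                 # ...except a 1 at the character's index
        vectors.append(vec)
    return vectors

# e.g. with a 4-character vocabulary:
# one_hot([2, 0], 4) == [[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
```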
Using the `sample` function in `helpers.py`, you can generate sample strings of a designated length given a "prime string" that initializes the RNN's hidden state (viz. the model's long- and short-term memory) and a probability distribution over characters.
Each subsequent iteration randomly samples a character from this distribution, then updates the hidden state and the distribution for the next step.
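The generation loop can be sketched as follows, with the trained network abstracted behind a stand-in `step` function; `step`, `hidden`, and the other names here are illustrative assumptions, not the project's actual API:

```python
import random

def generate(step, char2int, int2char, prime="TO BE", length=20, seed=0):
    """Prime the hidden state on `prime`, then repeatedly sample the
    next character from the predicted distribution."""
    rng = random.Random(seed)
    hidden = None
    # Feed the prime string to initialize the hidden state.
    for ch in prime:
        probs, hidden = step(char2int[ch], hidden)
    out = prime
    for _ in range(length):
        # Sample the next character id in proportion to its probability.
        idx = rng.choices(range(len(probs)), weights=probs)[0]
        out += int2char[idx]
        # Feed the sampled character back in to update state and distribution.
        probs, hidden = step(idx, hidden)
    return out
```

Here `step(idx, hidden)` stands in for one forward pass of the RNN, returning the next-character probability distribution and the updated hidden state.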