Also, this is harder than it might first appear, I think, because we need the data to arrive in a continuous stream, and we also need the model state to propagate across batches - I think this is possible using `stateful=True` in Keras' LSTM, etc., but I'm not exactly sure.
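For reference, a minimal sketch of what that looks like in Keras (the layer sizes and batch shape here are purely illustrative, not anything from this repo): with `stateful=True`, row `j` of batch `i+1` picks up the LSTM state left by row `j` of batch `i`, and you reset the state explicitly.

```python
# Minimal stateful-LSTM sketch in Keras (all sizes are illustrative only).
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, TimeDistributed

batch_size, timesteps, vocab_size, embedding_dim = 32, 20, 10000, 100

model = Sequential()
# stateful=True requires a fixed batch size, specified up front.
model.add(Embedding(vocab_size, embedding_dim,
                    batch_input_shape=(batch_size, timesteps)))
model.add(LSTM(256, return_sequences=True, stateful=True))
model.add(TimeDistributed(Dense(vocab_size, activation='softmax')))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

# State carries over between successive calls to train_on_batch;
# reset it explicitly at epoch / corpus boundaries with:
# model.reset_states()
```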
Usually language models add begin / end of sentence tokens, so each sentence is `<s> w0 w1 ... wn </s>`. They then predict the next token for `<s> w0 ... wn`, so the targets are `w0 ... wn </s>`.
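Concretely, something like this toy sketch (the token strings and variable names are just for illustration):

```python
# Toy example of the next-token shift with begin/end of sentence markers.
sentence = ['w0', 'w1', 'w2']
tokens = ['<s>'] + sentence + ['</s>']   # <s> w0 w1 w2 </s>

inputs = tokens[:-1]    # <s> w0 w1 w2
targets = tokens[1:]    # w0 w1 w2 </s>
```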
You could pad your batches to the maximum length of a sentence in the batch, or, as Mark said and is more common, fill each batch exactly by splitting training sentences and carrying over the LSTM state from one batch to the next.
Yeah, I'll add the start and stop tokens like @matt-peters suggested. The `stateful=True` bit goes in the model code, not here, so the decision about whether to use these as individual sentence instances or to put them in a continuous stream should probably also belong to the model. I think the right way to handle this is to make a `LanguageModelingDataset` (or just a method inside of `models.language_modeling`) that can transform a list of `SentenceInstances` into a continuous stream, and have the model decide whether to do the transformation inside its `load_dataset_from_files` method, similar to how the multiple choice memory network code does a transformation here.
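Something like this rough sketch, maybe (none of this is the real API here - in particular, `instance.words` is an assumed attribute for the tokens of a `SentenceInstance`):

```python
# Rough sketch (not the real API): flatten per-sentence instances into one
# continuous token stream, marking sentence boundaries with <s> / </s>.
def to_continuous_stream(instances, start_token='<s>', end_token='</s>'):
    stream = []
    for instance in instances:
        # Assumes each instance exposes its tokens as a list of strings.
        stream.append(start_token)
        stream.extend(instance.words)
        stream.append(end_token)
    return stream
```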
Using them as a continuous stream means that the sentences should be ordered, right? It doesn't make much sense to me to keep LSTM state while jumping to a new random sentence. Oh, and if you're keeping state across batches, you need to be a lot more careful about how you actually construct the batches - no random shuffling of the batches; instead you want to stratify the sampling so that each batch has randomized instances, so that `batch_i[j]` and `batch_i[k]` are uncorrelated, but `batch_i[j]` is the continuation of `batch_{i-1}[j]`... Yeah, getting this right in the modeling code is going to be a bit of work, requiring overriding some of the `DataGenerator` methods. And maybe that should be instead of the `load_dataset_from_files` technique I mentioned above, because that would mess things up a bit...
But anyway, getting the data generator right here is for another PR. We can still train a language model on single sentences with this as it is, as a proof of concept. Can either of you point me to a paper that looks at performance differences between training with single sentences vs. training with streams of sentences?
Also @roys174, FYI, I'm putting some basic language modeling ability into this code. Hopefully you'll eventually find it useful. If you have input on what you want to be able to do with this, that'd be helpful.
Yes, they should definitely be ordered - I think the way this is normally done is to split your dataset up into `batch_size` chunks and then, to create a batch, take an element from each of these chunks.
Here is an example of that: https://github.com/DeNeutoy/bayesian-rnn/blob/master/reader.py#L105
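In other words, something roughly like this (my own sketch of that scheme, not the code at that link; the function name and shapes are made up):

```python
import numpy as np

def stateful_batches(token_ids, batch_size, timesteps):
    # Reshape the stream into batch_size parallel rows (chunks), so that
    # row j of batch i is the direct continuation of row j of batch i-1.
    num_tokens = (len(token_ids) // batch_size) * batch_size
    data = np.array(token_ids[:num_tokens]).reshape(batch_size, -1)

    num_batches = (data.shape[1] - 1) // timesteps
    for i in range(num_batches):
        inputs = data[:, i * timesteps:(i + 1) * timesteps]
        targets = data[:, i * timesteps + 1:(i + 1) * timesteps + 1]
        yield inputs, targets
```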
In terms of papers, I'm fairly sure that most results benchmarked in terms of perplexity on something like PTB always treat the data as a continuous stream (i.e., per-word perplexity on the entire corpus) rather than averaging per-word perplexity over mini-batches, and I imagine the difference is fairly drastic, as you would be resetting the LSTM state every `batch_size` steps in your corpus. But @matt-peters may know of a paper comparing the two approaches.
I don't know of any papers that compare the results from the two approaches. If you are treating the data as a continuous stream and filling batches exactly without padding, keeping the LSTM states from batch to batch is important; otherwise you'll reset the states in the middle of a sentence. Otherwise, I don't see anything detrimental about handling padding/masking correctly and resetting the states after each sentence, just an efficiency hit (both due to using dynamic RNNs vs. static ones in TensorFlow, and due to the extra computation on padding).
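For reference, the padded/masked alternative looks roughly like this in Keras (a sketch with made-up sizes): pad each sentence, let the mask hide the padded timesteps, and keep the LSTM non-stateful so the state is reset after every batch of sentences.

```python
# Sketch of the padded-and-masked (non-stateful) alternative in Keras.
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, TimeDistributed

vocab_size, embedding_dim = 10000, 100

# Sentences of different lengths, already converted to word ids (0 = padding).
batch = pad_sequences([[2, 5, 9], [2, 7, 7, 8, 3], [2, 4, 3]], padding='post')

model = Sequential()
# mask_zero=True makes downstream layers skip the 0-padded timesteps.
model.add(Embedding(vocab_size, embedding_dim, mask_zero=True))
model.add(LSTM(256, return_sequences=True))  # not stateful: state resets each batch
model.add(TimeDistributed(Dense(vocab_size, activation='softmax')))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

predictions = model.predict(batch)  # shape (3, max_sentence_length, vocab_size)
```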
FWIW, we have other optimized LM code that we plan to open source that handles these issues -- so since this is just for proof of concept / testing here I personally wouldn't worry about these issues too much.
@DeNeutoy, I fixed a couple of things; do you want to take another look, or is this good to merge?
Yep, changes look good, FFTM.
Duh, thanks, that's a good point =). You can tell I haven't really done language modeling before... I should also worry explicitly about start and stop tokens, probably...