allenai / deep_qa

A deep NLP library, based on Keras / tf, focused on question answering (but useful for other NLP too)
Apache License 2.0

Added a simple language modeling instance #368

Closed - matt-gardner closed this 7 years ago

matt-gardner commented 7 years ago

Duh, thanks, that's a good point =). You can tell I haven't really done language modeling before... I should also worry explicitly about start and stop tokens, probably...

DeNeutoy commented 7 years ago

Also, this is harder than it might first appear, I think, because we need the data to arrive in a continuous stream, and we also need the model state to propagate across batches. I think this is possible using stateful=True in Keras' LSTM etc., but I'm not exactly sure.
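
Roughly, a stateful Keras LSTM looks something like this (a minimal sketch with made-up hyperparameters, not anything from deep_qa):

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, TimeDistributed

batch_size, num_steps, vocab_size, embedding_dim = 32, 35, 10000, 128

model = Sequential()
# stateful=True requires a fixed batch size, supplied via batch_input_shape.
model.add(Embedding(vocab_size, embedding_dim,
                    batch_input_shape=(batch_size, num_steps)))
# The LSTM's final state after batch i becomes its initial state for batch i+1.
model.add(LSTM(256, return_sequences=True, stateful=True))
model.add(TimeDistributed(Dense(vocab_size, activation='softmax')))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

# The state is only reset when you ask for it, e.g. at the start of each epoch:
# model.reset_states()
```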

matt-peters commented 7 years ago

Usually language models add begin/end-of-sentence tokens so each sentence is <s> w0 w1 ... wn </s>. They then predict the next token at each position of <s> w0 ... wn, so the targets are w0 ... wn </s>.
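
Concretely, the shift looks something like this (a small sketch; the helper name and token strings are just illustrative, not deep_qa code):

```python
def make_lm_pair(tokens, start_token="<s>", end_token="</s>"):
    """Turn ['w0', ..., 'wn'] into (inputs, targets) for next-token prediction."""
    padded = [start_token] + tokens + [end_token]
    return padded[:-1], padded[1:]

inputs, targets = make_lm_pair(["the", "cat", "sat"])
# inputs  == ['<s>', 'the', 'cat', 'sat']
# targets == ['the', 'cat', 'sat', '</s>']
```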

You could pad your batches to the maximum sentence length in each batch, or, as Mark said and as is more common, fill each batch exactly by splitting training sentences and carrying the LSTM state over from one batch to the next.

matt-gardner commented 7 years ago

Yeah, I'll add the start and stop tokens like @matt-peters suggested. The stateful=True bit goes in the model code, not here, so probably the decision about whether to use these as individual sentence instances, or put them in a continuous stream, should also belong to the model. I think the right way to handle this is to make a LanguageModelingDataset (or just a method inside of models.language_modeling), that can transform a list of SentenceInstances into a continuous stream, and have the model decide whether to do the transformation inside its load_dataset_from_files method, similar to how the multiple choice memory network code does a transformation here.
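
Roughly what I mean by that transformation (just a sketch; the function and the way I'm treating a SentenceInstance as a plain token list are hypothetical, not actual deep_qa code):

```python
def sentences_to_stream(sentence_instances, start_token="<s>", end_token="</s>"):
    """Concatenate ordered sentences into one continuous token stream, keeping the
    sentence-boundary tokens so the model still sees <s> ... </s>."""
    stream = []
    for tokens in sentence_instances:  # here each instance is just a list of tokens
        stream.append(start_token)
        stream.extend(tokens)
        stream.append(end_token)
    return stream
```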

Using them as a continuous stream means that the sentences should be ordered, right? It doesn't make much sense to me to keep LSTM state while jumping to a new random sentence. Oh, and if you're keeping state across batches, you need to be a lot more careful about how you actually construct the batches - no random shuffling of batches; instead you want to stratify the sampling so that instances within a batch are uncorrelated (batch_i[j] and batch_i[k]), but batch_i[j] is the continuation of batch_{i-1}[j]... Yeah, getting this right in the modeling code is going to be a bit of work, requiring overriding some of the DataGenerator methods. And maybe that should be done instead of the load_dataset_from_files technique I mentioned above, because that would mess things up a bit...

But anyway, getting the data generator right here is for another PR. We can still train a language model on single sentences with this as it is, as a proof of concept. Can either of you point me to a paper that looks at performance differences between training with single sentences vs. training with streams of sentences?

matt-gardner commented 7 years ago

Also @roys174, FYI, I'm putting some basic language modeling ability into this code. Hopefully you'll eventually find it useful. If you have input on what you want to be able to do with this, that'd be helpful.

DeNeutoy commented 7 years ago

Yes, they should definitely be ordered. I think the way this is normally done is to split your dataset up into batch_size chunks and then, to create a batch, take an element from each of these chunks.

Here is an example of that: https://github.com/DeNeutoy/bayesian-rnn/blob/master/reader.py#L105
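
In the same spirit as that reader (a rough sketch, not a copy of it, and not deep_qa code): split the stream into batch_size contiguous rows, then slice off num_steps columns at a time, so that row j of batch t continues row j of batch t-1.

```python
import numpy as np

def stream_to_batches(token_ids, batch_size, num_steps):
    """Yield (inputs, targets) batches from one continuous stream of token ids."""
    token_ids = np.asarray(token_ids)
    num_batches = (len(token_ids) - 1) // (batch_size * num_steps)
    usable = num_batches * batch_size * num_steps
    # Each of the batch_size rows is a contiguous chunk of the corpus; targets are
    # the inputs shifted by one position.
    inputs = token_ids[:usable].reshape(batch_size, -1)
    targets = token_ids[1:usable + 1].reshape(batch_size, -1)
    for t in range(num_batches):
        columns = slice(t * num_steps, (t + 1) * num_steps)
        yield inputs[:, columns], targets[:, columns]
```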

In terms of papers, I'm fairly sure that most results benchmarked by perplexity on something like PTB always use the data as a continuous stream (i.e. per-word perplexity on the entire corpus) rather than averaging per-word perplexity over mini-batches. I imagine the difference is fairly drastic, as you would be resetting the LSTM state every batch_size steps in your corpus, but @matt-peters may know of a paper comparing the two approaches.

matt-peters commented 7 years ago

I don't know of any papers that compare the results from the two approaches. If you are treating the data as a continuous stream and filling batches exactly without padding, keeping the LSTM states from batch to batch is important; otherwise you'll reset the states in the middle of a sentence. Beyond that, I don't see anything detrimental about handling padding/masking correctly and resetting the states after each sentence, just an efficiency hit (both from using dynamic RNNs vs. static ones in TensorFlow, and from the computation spent on padding).

FWIW, we have other optimized LM code that handles these issues and that we plan to open source -- so since this is just a proof of concept / for testing here, I personally wouldn't worry about them too much.

matt-gardner commented 7 years ago

@DeNeutoy, I fixed a couple of things; do you want to take another look, or is this good to merge?

DeNeutoy commented 7 years ago

Yep changes look good, FFTM.