Closed DeNeutoy closed 7 years ago
Hmm, I'm not sure there's a lot of value in doing it that way. If you want to just keep the code separate for now, that's fine. Also, it appears that your repo is private on your own personal account; I don't have access to it.
So, I think this work is great, and it'd be good to add it to deep_qa. But it should be added as a python module with all necessary files, not a git submodule. You don't need to worry about doing that soon (focusing on getting results first), but at some point it'd be nice to do. I'm going to close this PR.
This module provides two sequence-to-sequence models, with multi-GPU training support, TensorBoard logging and beam search decoding.
The first model, `Seq2SeqAttentionModel`, is a plain multi-layer sequence-to-sequence model with attention.
The second model, `Seq2SeqCopyModel`, is similar, except that its output distribution is an interpolation between the vocabulary distribution and a distribution over the unique words in the article (the model's source input, in our case the SQuAD paragraph).
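As a rough sketch of that interpolation (my formulation; the exact weighting may differ from what the code actually does), the final output distribution is a mixture of the generation distribution over the vocabulary and a copy distribution over the source words:

```python
def combined_distribution(vocab_dist, source_dist, p_copy):
    """Mix generation and copy distributions (sketch only, not the PR's exact code).

    Both inputs are assumed to be probability vectors laid out over the same
    extended vocabulary, and p_copy is a mixture weight in [0, 1].
    """
    return (1.0 - p_copy) * vocab_dist + p_copy * source_dist
```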
The code is run using the `seq2seq_attention.py` file at the outermost level of the directory. This script has three modes: one for training, one for evaluation on the dev set and one for beam search decoding. Once started, the script runs indefinitely, and the three modes are designed to be run simultaneously, with the `eval` and `decode` modes waiting for the training mode to generate model files before beginning their work.
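As a rough sketch of how that waiting can work (my assumption of the general pattern, not the PR's exact code), the `eval` and `decode` modes simply poll the training directory until a checkpoint shows up:

```python
import time

import tensorflow as tf


def wait_for_checkpoint(train_dir, poll_seconds=60):
    """Block until the training mode has written at least one checkpoint."""
    while True:
        checkpoint_path = tf.train.latest_checkpoint(train_dir)
        if checkpoint_path is not None:
            return checkpoint_path
        # No model files yet; sleep and poll again.
        time.sleep(poll_seconds)
```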
Another key concept here is how the beam search works. In order to do the search, we need to sample from the output distribution at time t and re-embed the result for the decoder to generate the distribution at time t + 1. To make this possible, the number of timesteps of the decoder is set to 1, and `encode_top_state` (which retrieves the last state of the last layer of the encoder and the first state of the decoder) and `decode_topk` (which returns the top k most likely words from the output distribution, along with the next decoder state) are called iteratively, as in the sketch below.
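The method names in this sketch come from the description above, but the exact signatures (and helpers such as `model`, `start_id` and `end_id`) are my assumptions for illustration, not the PR's actual API:

```python
def beam_search(model, sess, enc_inputs, enc_lens, beam_size, max_steps, start_id, end_id):
    """Simplified beam search sketch; signatures are illustrative only."""
    # Run the encoder once; the decoder is built with a single timestep and is
    # re-fed its own output at every step.
    enc_top_states, dec_init_state = model.encode_top_state(sess, enc_inputs, enc_lens)

    # Each hypothesis is (token ids so far, total log probability, decoder state).
    beams = [([start_id], 0.0, dec_init_state)]
    for _ in range(max_steps):
        candidates = []
        for tokens, score, state in beams:
            if tokens[-1] == end_id:
                candidates.append((tokens, score, state))
                continue
            # decode_topk returns the k most likely next words and the new decoder state.
            top_ids, top_log_probs, new_state = model.decode_topk(
                sess, tokens[-1], enc_top_states, state, k=beam_size)
            for word_id, log_prob in zip(top_ids, top_log_probs):
                candidates.append((tokens + [word_id], score + log_prob, new_state))
        # Keep only the beam_size highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]

    # Return the token sequence of the best hypothesis.
    return max(beams, key=lambda b: b[1])[0]
```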
Finally, the copy mechanism is implemented in the `create_combined_distribution` method. The main unusual part is its use of the `tf.scatter_nd` function, which takes a list of indices, a list of values and a shape, and creates a tensor of zeros of that shape with the values assigned at the given indices. The map function just applies this across the batch in parallel, rather than using a for loop.
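For concreteness, here is a minimal example of what `tf.scatter_nd` and the batched map look like (the indices, values and vocabulary size are made up for illustration):

```python
import tensorflow as tf

# Single example: scatter two copy probabilities into an (assumed) extended
# vocabulary of size 8.
indices = tf.constant([[2], [5]])      # positions of source words in the vocabulary
updates = tf.constant([0.3, 0.7])      # probability mass to place at those positions
shape = tf.constant([8])               # size of the output distribution
copy_dist = tf.scatter_nd(indices, updates, shape)
# copy_dist -> [0.0, 0.0, 0.3, 0.0, 0.0, 0.7, 0.0, 0.0]

# Over a batch, tf.map_fn applies the same scatter to every example in parallel
# instead of looping over the batch in Python.
batch_indices = tf.constant([[[2], [5]], [[1], [4]]])
batch_updates = tf.constant([[0.3, 0.7], [0.6, 0.4]])
batch_copy_dist = tf.map_fn(
    lambda args: tf.scatter_nd(args[0], args[1], shape),
    (batch_indices, batch_updates),
    dtype=tf.float32)
```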