Closed nelson-liu closed 7 years ago
Also, I'm pretty concerned about how masking is treated in lines 120 through 126 - I think it's wrong. Can you just use MatrixAttention instead of the MaskedBatchDot and then MaskedSoftmax?
I tried using MatrixAttention before, but ran into memory issues from tiling the matrix (hence why I swapped to `MaskedBatchDot` and `MaskedSoftmax`)
The trouble with how you did it, though, is that `MaskedBatchDot` returns `None` for a mask, opting instead for zeros in the similarity matrix. Then you pass those zeros through `MaskedSoftmax` with no mask. It's not doing what you think it's doing. Instead, I'd recommend a `BatchDot` layer that handles the mask properly, and then you can pass the result into `MaskedSoftmax` to get the right behavior.
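To make the problem concrete, here is a small numpy sketch (not the project's code) of why zeros in the similarity matrix are not a substitute for a mask: an unmasked softmax still assigns probability mass to the zeroed positions, because exp(0) = 1, not 0.

```python
import numpy as np

def softmax(x, axis=-1):
    # Standard numerically-stable softmax with no masking.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# The last two entries stand in for padded positions that were zeroed out
# in the similarity matrix instead of being masked.
scores = np.array([2.0, 1.0, 0.0, 0.0])
probs = softmax(scores)
# The two "padded" positions each still receive nonzero probability,
# silently attending to padding.
```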
i see, yeah that makes sense. instead of creating a `BatchDot` layer, could i just add a mask to the `MaskedBatchDot` layer?
If you're returning a mask, there's no reason to use `masked_batch_dot`; you can just use `K.batch_dot`.
Just one high-level comment: you can just specify the default key in the seq2seq_encoder params, and give the right fallback behavior when you get the seq2seq encoder on lines 109 and 115. This would make it so you don't have to repeat yourself so much in the config file.
I was trying to favor being explicit over implicit, but yeah i agree.
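As a rough illustration of the suggested fallback pattern (the function and parameter names here are hypothetical, not the project's actual Params API): look up an encoder-specific spec first, and fall back to a shared `default` entry so the config file doesn't repeat the same encoder block for every use.

```python
# Hypothetical sketch of the suggested config fallback. A real version
# would use the project's Params class; a plain dict shows the idea.
def get_seq2seq_encoder_params(params: dict, name: str) -> dict:
    encoder_params = params.get("seq2seq_encoder", {})
    # Use the encoder-specific entry if present, else the shared default.
    return encoder_params.get(name, encoder_params.get("default", {}))

# One "default" entry covers every encoder that has no specific override.
params = {"seq2seq_encoder": {"default": {"type": "bi_gru", "units": 100}}}
question_encoder_params = get_seq2seq_encoder_params(params, "question_encoder")
```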
> If you're returning a mask, there's no reason to use `masked_batch_dot`; you can just use `K.batch_dot`.
that makes sense, I'll make a `BatchDot` layer that properly passes on a mask. though I'm not sure how I feel about the naming --- `MaskedBatchDot` doesn't pass forward a mask, while `BatchDot` does.
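A minimal numpy sketch of the mask-propagating behavior such a `BatchDot` layer would need (shapes and names are illustrative, not the actual Keras implementation): compute the similarity matrix as usual, and separately combine the two input masks into a pairwise mask for downstream layers.

```python
import numpy as np

def batch_dot_with_mask(a, b, a_mask, b_mask):
    """Illustrative sketch: batched similarity matrix plus a propagated mask.

    a: (batch, len_a, dim), b: (batch, len_b, dim)
    a_mask: (batch, len_a), b_mask: (batch, len_b), with 1 for real tokens.
    Returns (similarities, pair_mask), where pair_mask[i, j, k] is 1 only
    when both positions are unmasked, so a downstream masked softmax can
    ignore padding instead of treating zeroed scores as real.
    """
    # Batched dot product over the last dimension: (batch, len_a, len_b).
    similarities = np.einsum("bid,bjd->bij", a, b)
    # Outer product of the two masks gives the pairwise validity mask.
    pair_mask = a_mask[:, :, None] * b_mask[:, None, :]
    return similarities, pair_mask

a = np.ones((1, 2, 3))
b = np.ones((1, 4, 3))
a_mask = np.array([[1, 1]])
b_mask = np.array([[1, 1, 1, 0]])  # last position of b is padding
sims, pair_mask = batch_dot_with_mask(a, b, a_mask, b_mask)
# sims has shape (1, 2, 4); pair_mask zeroes out the padded last column.
```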
Honestly, we probably don't need the `MaskedBatchDot` layer or function, because we should just be propagating the mask correctly. It's really unlikely that you'd use the output of a batch dot without some other mask-using operation first (like a softmax). We can leave them there, but it's probably a good idea to leave notes that they probably aren't necessary.
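For reference, the downstream mask-using operation could look like this numpy sketch of a masked softmax (illustrative only, not the project's implementation): masked positions are pushed to a very large negative value before exponentiating, so they end up with effectively zero probability.

```python
import numpy as np

def masked_softmax(scores, mask):
    # Replace masked positions with a large negative value before the
    # softmax so they receive (effectively) zero probability.
    masked_scores = np.where(mask.astype(bool), scores, -1e9)
    e = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.0, 0.0]])
mask = np.array([[1, 1, 1, 0]])  # last position is padding
probs = masked_softmax(scores, mask)
# probs[0, 3] is ~0: the padded position is excluded from the distribution.
```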
This PR won't work until the others are all merged, but this PR adds the gated attention reader model.
todo: