allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Allow Copynet models to use multiple decoder layers #4766

Open shimengfeng opened 4 years ago

shimengfeng commented 4 years ago

Is your feature request related to a problem? Please describe. The current CopyNet architecture only allows a single LSTM decoder layer. This is similar to #4451.

Describe the solution you'd like Give users the choice of multiple decoder layers, making the CopyNet encoder-decoder architecture more flexible.

Describe alternatives you've considered N/A

Additional context N/A

JohnGiorgi commented 3 years ago

This would be great!

annajung commented 3 years ago

Hi all, I am planning on working on this and will update the issue again when a PR is ready for review. Thanks!

annajung commented 3 years ago

Hi @JohnGiorgi I was wondering if you could help me understand how the copy score computation changes when there is more than one decoder layer. Would it be correct to grab the last layer from state["decoder_hidden"]?

https://github.com/allenai/allennlp-models/blob/abc7e1991127540af4a9cd3f7d49129401ae47b9/allennlp_models/generation/models/copynet_seq2seq.py#L357

JohnGiorgi commented 3 years ago

@annajung Sorry I am getting back so late! Are the decoder layers all RNNs? If so I think it makes the most sense to use the last hidden vector output from the last decoder layer.

If the decoder can be something else (like a transformer), I think the answer is a little more complicated.
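For concreteness, a minimal sketch of that suggestion in plain PyTorch (shapes, layer names, and the scoring projection below are illustrative assumptions, not CopyNet's actual internals): with a multi-layer LSTM decoder the hidden state carries one vector per layer, and only the last layer's vector feeds the copy scoring.

import torch

# Illustrative shapes only: a single LSTMCell keeps a (batch, hidden) state,
# while a multi-layer decoder keeps (num_layers, batch, hidden).
batch_size, hidden_dim, num_layers, source_len = 4, 8, 2, 10

decoder_hidden = torch.randn(num_layers, batch_size, hidden_dim)
encoded_source = torch.randn(batch_size, source_len, hidden_dim)

# Take the hidden vector from the *last* decoder layer, so the rest of the
# copy-scoring code can stay unchanged.
last_layer_hidden = decoder_hidden[-1]  # (batch, hidden)

# Stand-in for CopyNet's copy-score projection (hypothetical layer name).
output_copying_layer = torch.nn.Linear(hidden_dim, hidden_dim)
copy_projection = torch.tanh(output_copying_layer(encoded_source))  # (batch, source_len, hidden)
copy_scores = copy_projection.bmm(last_layer_hidden.unsqueeze(-1)).squeeze(-1)
print(copy_scores.shape)  # torch.Size([4, 10])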

annajung commented 3 years ago

No worries! @JohnGiorgi, @epwalsh gave the same feedback on the open PR. PTAL when you get a chance :) https://github.com/allenai/allennlp-models/pull/245

JohnGiorgi commented 3 years ago

@annajung Looks good :) I left a comment!

I think I originally thought the goal was to add a composable decoder to the copynet model, similar to what already exists for the ComposedSeq2Seq, so that you could easily experiment with the different decoders implemented in AllenNLP.

But re-reading #4451 and #4766, it seems I might have misunderstood, and what I was thinking is probably trickier to implement.

annajung commented 3 years ago

oh interesting! Do you still want to merge the open PR or look into a composable decoder?

epwalsh commented 3 years ago

@JohnGiorgi @annajung I like this idea of adding a composable decoder to CopyNet, which could also support multi-layer LSTMs. If either of you would like to look into this, I think we should hold off on https://github.com/allenai/allennlp-models/pull/245.

JohnGiorgi commented 3 years ago

@annajung What do you think? I would be game to try and implement it. I am excited to try out a transformer decoder layer within copynet for my own project. I should be able to start next week!

annajung commented 3 years ago

sure that works, no worries! I can close my PR in favor of a composable decoder. thanks both for the feedback!

JohnGiorgi commented 3 years ago

Okay I have made a decent amount of progress on this. I have replaced self._decoder_cell with self.decoder, which is a SeqDecoder subclass. Now I am trying to figure out how to replace the following:

https://github.com/allenai/allennlp-models/blob/f1de60fc5eab482dfd58d47ca160b497c69b4519/allennlp_models/generation/models/copynet_seq2seq.py#L339-L342

state["decoder_hidden"], state["decoder_context"] = self._decoder_cell(
    projected_decoder_input.float(),
    (state["decoder_hidden"].float(), state["decoder_context"].float()),
)

It looks like I will have to drop down to the forward method of self._decoder._decoder_net, something like:

new_state, state["decoder_hidden"] = self._decoder._decoder_net(
    previous_state=state,
    encoder_outputs=state["encoder_outputs"],
    source_mask=state["source_mask"],
    previous_steps_predictions=last_predictions
)
state["decoder_context"] = new_state.get("decoder_context", None)

I am unsure about this approach for a couple of reasons. Partly because it looks like there are differences in how each decoder (LstmCellDecoderNet, StackedSelfAttentionDecoderNet) implements this function, but also because I don't know how projected_decoder_input fits in.

Any help is welcome! @epwalsh @annajung
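To make the mismatch above concrete, here is a minimal, self-contained sketch (plain PyTorch, not AllenNLP's DecoderNet API; all names are illustrative assumptions) of the difference between the current single LSTMCell step and a stacked, multi-layer per-step decoder.

import torch
from torch import nn

batch_size, input_dim, hidden_dim, num_layers = 4, 16, 8, 2
projected_decoder_input = torch.randn(batch_size, input_dim)  # one decoding step

# Current CopyNet: a single LSTMCell, state tensors are (batch, hidden).
single_cell = nn.LSTMCell(input_dim, hidden_dim)
h = torch.zeros(batch_size, hidden_dim)
c = torch.zeros(batch_size, hidden_dim)
h, c = single_cell(projected_decoder_input, (h, c))

# Multi-layer alternative: one cell per layer, state tensors are
# (num_layers, batch, hidden); layer i consumes the output of layer i - 1.
cells = nn.ModuleList(
    nn.LSTMCell(input_dim if i == 0 else hidden_dim, hidden_dim) for i in range(num_layers)
)
hs = torch.zeros(num_layers, batch_size, hidden_dim)
cs = torch.zeros(num_layers, batch_size, hidden_dim)
layer_input = projected_decoder_input
new_hs, new_cs = [], []
for i, cell in enumerate(cells):
    h_i, c_i = cell(layer_input, (hs[i], cs[i]))
    new_hs.append(h_i)
    new_cs.append(c_i)
    layer_input = h_i  # feed this layer's output to the next layer
hs, cs = torch.stack(new_hs), torch.stack(new_cs)
# The copy/generation scores would then be computed from hs[-1].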

epwalsh commented 3 years ago

@JohnGiorgi one thing to keep in mind is that we want to keep backwards compatibility. So a CopyNet model trained before with a _decoder_cell should still work.

And if the DecoderNet API doesn't really fit, then I think it's okay to create a new abstraction that works better here. Something like CopyNetDecoder.
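The thread does not define what such a CopyNetDecoder abstraction would look like. Purely as a hypothetical sketch (the name, signature, and state keys are assumptions, not AllenNLP's API), the per-step contract might be roughly:

from typing import Dict

import torch

class CopyNetDecoder(torch.nn.Module):
    # Hypothetical interface: advance the decoder one step at a time so that
    # an LSTM-cell decoder, a multi-layer LSTM, or a transformer decoder could
    # all plug into CopyNet's decoding loop.

    def forward(
        self,
        projected_decoder_input: torch.Tensor,   # (batch, input_dim), one step
        state: Dict[str, torch.Tensor],          # decoder_hidden, decoder_context, ...
    ) -> Dict[str, torch.Tensor]:
        raise NotImplementedError

    def get_output_dim(self) -> int:
        # Hidden size consumed by the copy/generation scoring layers.
        raise NotImplementedError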

JohnGiorgi commented 3 years ago

> @JohnGiorgi one thing to keep in mind is that we want to keep backwards compatibility. So a CopyNet model trained before with a _decoder_cell should still work.

I see. So the main considerations here are

  1. The "old" CopyNet configs can be loaded in the "new" CopyNet.
  2. the state_dict of the old CopyNet can be loaded into this new CopyNet. I think the only place the state_dicts would differ is that the old CopyNet will have these parameters, and the new CopyNet wont:
0 | 2021-04-13 06:57:14,602 - INFO - allennlp.common.util - _decoder_cell.weight_ih
0 | 2021-04-13 06:57:14,602 - INFO - allennlp.common.util - _decoder_cell.weight_hh
0 | 2021-04-13 06:57:14,602 - INFO - allennlp.common.util - _decoder_cell.bias_ih
0 | 2021-04-13 06:57:14,603 - INFO - allennlp.common.util - _decoder_cell.bias_hh

One solution I can think of:

I think this would achieve backwards compatibility. Let me know if you spot a mistake or have a better solution, though!
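The specific solution referred to above isn't reproduced in this thread. Purely as a generic illustration of one way old checkpoints can stay loadable after parameters are renamed (all target names below are hypothetical), the old state_dict keys could be remapped before calling load_state_dict:

# Hypothetical mapping from the old parameter names to wherever the
# equivalent weights would live in the new model; the targets are made up.
OLD_TO_NEW = {
    "_decoder_cell.weight_ih": "_decoder._decoder_net._decoder_cell.weight_ih",
    "_decoder_cell.weight_hh": "_decoder._decoder_net._decoder_cell.weight_hh",
    "_decoder_cell.bias_ih": "_decoder._decoder_net._decoder_cell.bias_ih",
    "_decoder_cell.bias_hh": "_decoder._decoder_net._decoder_cell.bias_hh",
}

def remap_old_copynet_state_dict(old_state: dict) -> dict:
    # Rename parameters from the old layout to the (hypothetical) new one,
    # leaving every other key untouched.
    return {OLD_TO_NEW.get(key, key): value for key, value in old_state.items()}

# Usage sketch:
# model.load_state_dict(remap_old_copynet_state_dict(old_state))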

> And if the DecoderNet API doesn't really fit, then I think it's okay to create a new abstraction that works better here. Something like CopyNetDecoder.

Okay, I think we may have to do that (or do some subclassing). I don't see a way for me to use the two currently implemented DecoderNets without modification. Will work on this.