OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

Directly using (attentional) decoder #594

Closed. ctlaltdefeat closed this issue 6 years ago.

ctlaltdefeat commented 6 years ago

Hello, despite scouring the docs for a while, I'm having trouble understanding how to adapt the library to my needs.

In my application, my input is a sequence of images which I have already preprocessed. That is, I have a bunch of tensors of shape [source_len, channels, height, width]. The outputs are textual tokens, but I've already preprocessed everything, so that each output is of shape [target_len] and starts and ends with special tokens (I can also one-hot encode them to [target_len, num_of_different_tokens] if need be, as num_of_different_tokens is not large).

I've built my own encoder to my liking for the image sequence, which applies a bunch of 3D (spatio-temporal) convolutions followed by an RNN. I'd now like to use a decoder with attention that uses the encoder's outputs (and, when training, target outputs to feed as inputs). Hopefully I'd like to train this end-to-end with OpenNMT's machinery, and decode at evaluation with a beam search etc. The main problem I'm facing is trying to disentangle the different input/outputs that the library focuses on (using torchtext etc...) and to apply just the pure seq2seq.

A couple of the specific issues I'm having so far:

  1. Initializing a StdRNNDecoder (for example) requires specifying embeddings, and I haven't been able to wrap my head around the Embeddings object in this library or how to define it for my use case. I don't think I technically need one (the number of different tokens is rather small and I'm happy to simply use one-hot encoded vectors), but I'm fine with defining it if I can figure out how.
  2. Adapting my custom encoder to be an EncoderBase object. For example, the forward method expects padded sequences of sparse indices `[src_len x batch x nfeat]`. What does this mean here?
  3. In general, how do I transform/wrap my inputs and outputs to be able to interface with NMTModel, if necessary?

I realise this is a rather open-ended question, but I would appreciate assistance if possible.

srush commented 6 years ago

Sure! Good question.

Just out of curiosity, does our interface not work for you? It sounds like you should be able to just modify this file to your liking:

https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/modules/ImageEncoder.py

And then use the rest of the pipeline as is?
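
For context, here is a minimal sketch of what a custom encoder conforming to that interface can look like. The contract assumed here, a forward taking (src, lengths) and returning (final hidden state, per-step outputs), mirrors the RNN encoders in the codebase at the time of this thread; the module name and layer sizes are hypothetical.

import torch.nn as nn

class MyImageSeqEncoder(nn.Module):
    # Hypothetical encoder over a preprocessed feature sequence
    # [src_len x batch x feat]; returns (hidden, outputs), which is
    # what NMTModel expects from its encoder.
    def __init__(self, feat_size=512, hidden_size=256):
        super(MyImageSeqEncoder, self).__init__()
        self.rnn = nn.GRU(feat_size, hidden_size, num_layers=2,
                          bidirectional=True)

    def forward(self, src, lengths=None):
        outputs, hidden = self.rnn(src)  # outputs: [src_len x batch x 2*hidden]
        return hidden, outputs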

srush commented 6 years ago

Also, I should mention that this file gives an example of using the model as a library:

http://opennmt.net/OpenNMT-py/Library.html

I think it should still work; let us know if there are issues. Something like the following:

import torch.nn as nn
import onmt

emb_size = 10
rnn_size = 6

# Specify the core model. (Assumes `vocab`, `src_padding`, and `tgt_padding`
# have been set up as in the library tutorial linked above.)
encoder_embeddings = onmt.modules.Embeddings(emb_size, len(vocab["src"]),
                                             word_padding_idx=src_padding)
encoder = onmt.modules.RNNEncoder(hidden_size=rnn_size, num_layers=1,
                                  rnn_type="LSTM", bidirectional=True,
                                  embeddings=encoder_embeddings)

decoder_embeddings = onmt.modules.Embeddings(emb_size, len(vocab["tgt"]),
                                             word_padding_idx=tgt_padding)
decoder = onmt.modules.InputFeedRNNDecoder(hidden_size=rnn_size, num_layers=1,
                                           bidirectional_encoder=True,
                                           rnn_type="LSTM",
                                           embeddings=decoder_embeddings)
model = onmt.modules.NMTModel(encoder, decoder)

# Specify the tgt word generator and loss computation module.
model.generator = nn.Sequential(
    nn.Linear(rnn_size, len(vocab["tgt"])),
    nn.LogSoftmax())
loss = onmt.Loss.NMTLossCompute(model.generator, vocab["tgt"])

ctlaltdefeat commented 6 years ago

I've got the encoder like this:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self):
        super(Encoder, self).__init__()
        self.gru = nn.GRU(input_size=512, hidden_size=256, num_layers=2,
                          batch_first=True, bidirectional=True)
        # 3D (spatio-temporal) convolutions over the whole image sequence.
        self.conv1 = nn.Conv3d(in_channels=1, out_channels=128, kernel_size=(2, 3, 3), stride=2)
        self.bn1 = nn.BatchNorm3d(128)
        self.conv2 = nn.Conv3d(in_channels=128, out_channels=256, kernel_size=(2, 3, 3), stride=2)
        self.bn2 = nn.BatchNorm3d(256)
        # 2D convolutions applied frame by frame.
        self.conv3 = nn.Conv2d(in_channels=256, out_channels=512, kernel_size=(3, 3), stride=2)
        self.bn3 = nn.BatchNorm2d(512)
        self.conv4 = nn.Conv2d(in_channels=512, out_channels=512, kernel_size=(3, 3), stride=2)
        self.bn4 = nn.BatchNorm2d(512)
        self.conv5 = nn.Conv2d(in_channels=512, out_channels=512, kernel_size=(3, 3), stride=2)
        self.bn5 = nn.BatchNorm2d(512)
        self.fc1 = nn.Linear(7680, 512)

    def forward(self, x, lengths=None):
        # x: [batch, channels, src_len, height, width]
        x = F.leaky_relu(self.bn1(self.conv1(x)))
        x = F.leaky_relu(self.bn2(self.conv2(x)))
        lst = []
        for i in x:  # iterate over the batch
            d = i.permute(1, 0, 2, 3)  # [channels, time, H, W] -> [time, channels, H, W]
            d = F.leaky_relu(self.bn3(self.conv3(d)))
            d = F.leaky_relu(self.bn4(self.conv4(d)))
            d = F.leaky_relu(self.bn5(self.conv5(d)))
            d = d.view(len(d), -1)  # flatten each frame
            d = self.fc1(d)
            lst.append(d)
        output, hidden = self.gru(torch.stack(lst))
        return hidden, output
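
As a quick sanity check of the shapes (using the dummy input that comes up later in this thread; the sizes follow from the kernel sizes and strides above):

x = torch.randn(1, 1, 120, 220, 150)  # [batch, channels, src_len, height, width]
hidden, output = Encoder()(x)
print(hidden.size())  # [4, 1, 256]: (num_layers * num_directions) x batch x hidden_size
print(output.size())  # [1, 30, 512]: batch-major, because batch_first=True

Note that output is batch-major here, while the library otherwise works with time-major [len x batch x ...] tensors, so this is worth double-checking when plugging into NMTModel.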

I then have:

encoder = Encoder()
decoder_embeddings = onmt.modules.Embeddings(8, len(distinct_tokens),
                                             word_padding_idx=-1)
decoder = onmt.modules.StdRNNDecoder(hidden_size=512, num_layers=2,
                                     bidirectional_encoder=True,
                                     rnn_type="GRU", embeddings=decoder_embeddings)
model = onmt.modules.NMTModel(encoder, decoder)

Does this look about right?

Trying to run `model(src=Variable(torch.randn(1, 1, 120, 220, 150)), tgt=torch.LongTensor([1, 2, 3]).unsqueeze(1), lengths=None)` gives an error at line 306 of onmt's models.py (as of current master), because tgt doesn't have the correct dimensions. Indeed, as with my question 2 above, I'm not sure what nfeat means in this context. Why should tgt not be [tgt_len x batch]?

srush commented 6 years ago

Oh, so you should unsqueeze one more dimension: tgt should be [tgt_len x batch x 1].
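
Concretely, continuing from the snippet above, something like:

tgt = torch.LongTensor([1, 2, 3])    # [tgt_len]
tgt = tgt.unsqueeze(1).unsqueeze(2)  # [tgt_len x batch x nfeat] = [3, 1, 1]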

ctlaltdefeat commented 6 years ago

Thanks. I think the documentation may be conflicting, because the docstring for NMTModel's forward says it expects [tgt_len x batch], but no unsqueezing happens internally.

Moving on, there seems to be a problem with embeddings. Simply running this code with the current master branch throws RuntimeError: save_for_backward can only save input or output tensors, but argument 0 doesn't satisfy this condition:

import torch
import onmt

emb = onmt.modules.Embeddings(5, 5, word_padding_idx=-1)
input = torch.LongTensor([1, 2, 3]).unsqueeze(1).unsqueeze(1)  # [len x batch x nfeat] = [3, 1, 1]
emb(input)

sebastianGehrmann commented 6 years ago

Wrap it in a (torch.autograd.)Variable and you should be good to go!
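
That is, for the snippet above:

from torch.autograd import Variable

emb(Variable(input))  # works once wrapped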

ctlaltdefeat commented 6 years ago

Right! Just testing your awareness... I've now got NMTModel completing its forward pass successfully.

I think I can now train the model using built-in PyTorch functionality, but I'm still confused about how to do beam search. The classes that deal with this seem to expect torchtext objects, which once again I'm not using.
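
For reference, a minimal sketch of such a training step with plain PyTorch (0.3-era style, matching the Variable usage above). The shapes, and the assumption that NMTModel internally shifts the target (feeding tgt[:-1] and predicting tgt[1:]), are taken from the discussion above; treat this as a sketch rather than the library's sanctioned training loop.

import torch.nn as nn
import torch.optim as optim

criterion = nn.NLLLoss(ignore_index=-1)  # model.generator already ends in LogSoftmax
optimizer = optim.Adam(model.parameters())

def train_step(src, tgt):
    # src: encoder input; tgt: [tgt_len x batch x 1] token indices
    optimizer.zero_grad()
    outputs, attns, _ = model(src, tgt, lengths=None)
    # outputs: [(tgt_len - 1) x batch x hidden]; score every position
    scores = model.generator(outputs.view(-1, outputs.size(2)))
    gold = tgt[1:].squeeze(2).view(-1)  # gold targets, shifted by one
    loss = criterion(scores, gold)
    loss.backward()
    optimizer.step()
    return loss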

srush commented 6 years ago

Can you be more specific?

ctlaltdefeat commented 6 years ago

The Translator and Translation classes expect `fields (dict of Fields): data fields`, which are presumably torchtext entities. I'm not quite sure how to use them, as I'm not using torchtext. This is partly because I am doing my own preprocessing on a custom target dataset and so don't see a need for torchtext, and partly because I don't find torchtext's documentation very clear.

If necessary, I may have to wrap my data with torchtext. On the face of it, though, beam search is a mechanism agnostic of the data domain, so design-wise it might be good to decouple it from torchtext.
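
To illustrate the point, here is a minimal stand-alone beam search sketch over an abstract step function, deliberately independent of torchtext and of this library's API; step_fn and its interface are hypothetical.

def beam_search(step_fn, start_token, end_token, beam_size=5, max_len=100):
    # step_fn(prefix) -> list of (token, log_prob) continuations of that prefix.
    # A real implementation would batch decoder calls and carry decoder state
    # instead of re-running each prefix from scratch.
    beams = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix):
                candidates.append((prefix + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == end_token:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:  # every surviving hypothesis has ended
            break
    return max(finished + beams, key=lambda c: c[1])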

vince62s commented 6 years ago

Closing this due to lack of activity; reopen if needed.