esalesky / visrep

This repository contains an extension of fairseq for pixel / visual representations for machine translation.
https://arxiv.org/abs/2104.08211
MIT License

Access encoder, decoder tensors from standard transformer #2

Closed: long21wt closed this issue 2 years ago

long21wt commented 2 years ago

Hello,

Currently, the encode method in hub_interface.py only returns the tensors of image slices, and decode is only called in translate and is not the transformer model's decoder.

Is there an easier way to access the encoded tensors than through _build_batches, where they are actually generated from the VisualTextDataset? The same goes for the decoded tensors, which are generated by SequenceGenerator as called in inference_step.

Thank you

esalesky commented 2 years ago

Hi! To clarify, are you looking for the tensor sequences of intermediate layers (for example, after the encoder?), or for an easier way to access the input and output tensors?

Intermediate layers and outputs are something that fairseq as a whole doesn't currently expose easily — you'd need to modify the transformer methods themselves to return intermediate layers you're interested in, and then you could return them through the hub interface. This is not something I plan to add in the near-term.

The hub_interface.py code for visrep follows other fairseq interfaces; for example, check out the default fairseq HubInterface. The encode function does tokenization and converts the input into the tensor that is passed to the model (which in this case means taking a string and generating a tensor corresponding to its rendered image), and decode detokenizes the output tensor from the model into a sentence string. The encoder and decoder aren't accessed separately, and the encoded tensors from VisualTextDataset are those returned by encode().

If you want the tensors for the model inputs and outputs, you can get the tensor sequence for the input by calling encode() directly, and for the final output by modifying L62 in hub_interface.py to return not just the output string:

return [self.decode(hypos[0]["tokens"]) for hypos in batched_hypos]
--> return [(self.decode(hypos[0]["tokens"]), hypos[0])  for hypos in batched_hypos]
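With that change, translate() would hand back the full hypothesis dict alongside each output string. A rough usage sketch (loading code omitted; model stands for an already-instantiated visrep hub interface, the input sentence is just a placeholder, and I'm assuming translate() accepts a list of sentences as in the base fairseq interface):

# Hedged sketch, assuming the one-line change above has been applied.
outputs = model.translate(["Das ist ein Beispiel ."])   # placeholder input
for sentence, hypo in outputs:
    print(sentence)                   # detokenized translation string
    print(hypo["tokens"])             # output token ids chosen by beam search
    print(hypo["positional_scores"])  # per-token log-probabilities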

where hypos[0] is the best output hypothesis chosen by beam search, and will look something like this:

[{'tokens': tensor([ 57,   8,   7,  11, 749,   5,   2]), 'score': tensor(-0.4421), 'attention': tensor([]), 'alignment': tensor([]), 'positional_scores': tensor([-1.6331, -0.2708, -0.1837, -0.4788, -0.0630, -0.2784, -0.1871])},
 {'tokens': tensor([152,   8,   7,  11, 749,   5,   2]), 'score': tensor(-0.4499), 'attention': tensor([]), 'alignment': tensor([]), 'positional_scores': tensor([-1.6986, -0.3032, -0.1806, -0.4487, -0.0603, -0.2711, -0.1871])},
 {'tokens': tensor([100,  16,  11, 749,   5,   2]), 'score': tensor(-0.4618), 'attention': tensor([]), 'alignment': tensor([]), 'positional_scores': tensor([-1.6489, -0.2012, -0.4071, -0.0563, -0.2692, -0.1882])},
 {'tokens': tensor([ 14,  22,   8,   7,  11, 749,   5,   2]), 'score': tensor(-0.5500), 'attention': tensor([]), 'alignment': tensor([]), 'positional_scores': tensor([-2.5914, -0.3630, -0.2739, -0.1726, -0.4729, -0.0613, -0.2734, -0.1919])},
 {'tokens': tensor([ 20,  13,   8,   7,  11, 749,   5,   2]), 'score': tensor(-0.6997), 'attention': tensor([]), 'alignment': tensor([]), 'positional_scores': tensor([-3.1400, -0.9755, -0.3081, -0.1903, -0.4878, -0.0541, -0.2568, -0.1851])}]
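A hedged note on reading these fields: with the default length penalty, 'score' is just the mean of 'positional_scores', and the token ids can be mapped back to text through the task's target dictionary (the attribute names below follow standard fairseq and are assumptions for this fork):

# `hypo` is one hypothesis dict as printed above.
avg_logprob = hypo["positional_scores"].mean()   # matches hypo["score"] under the default length penalty
tgt_dict = model.task.target_dictionary          # assumption: standard fairseq task attribute on the hub interface
tokenized = tgt_dict.string(hypo["tokens"])      # token ids -> (still tokenized) target-side string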
long21wt commented 2 years ago

are you looking for the tensor sequences of intermediate layers (for example, after the encoder?), or for an easier way to access the input and output tensors?

I'm interested in the intermediate layers and outputs of the model. I can use the VisualTextTransformerEncoder like this:

self.models[0].encoder(batch['net_input']['src_tokens'], batch['net_input']['src_lengths'])
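Extending that call, I can probably also get the per-layer states, since fairseq's base TransformerEncoder usually accepts a return_all_hiddens flag; that's a guess for this fork, and whether the result is a dict or a namedtuple depends on the fairseq version:

# Rough sketch, assuming the visrep encoder keeps the base TransformerEncoder signature.
enc_out = self.models[0].encoder(
    batch['net_input']['src_tokens'],
    batch['net_input']['src_lengths'],
    return_all_hiddens=True,   # assumption: supported as in the base TransformerEncoder
)
# Depending on the fairseq version, enc_out is a dict or an EncoderOut namedtuple:
layer_states = enc_out["encoder_states"] if isinstance(enc_out, dict) else enc_out.encoder_states
# layer_states[i] should be the (src_len, batch, embed_dim) output of encoder layer i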

The TransformerDecoder is more complicated, since it requires prev_output_tokens, which are created in SequenceGenerator.
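One workaround I might try is to teacher-force the decoder on an already-generated hypothesis instead of going through SequenceGenerator: in fairseq, prev_output_tokens is the target sequence shifted right with EOS moved to the front. A rough, untested sketch along those lines:

import torch

# Build prev_output_tokens from a generated hypothesis (hypo["tokens"] ends in EOS)
# by prepending EOS and dropping the final token.
tokens = hypo["tokens"].unsqueeze(0)                     # (1, tgt_len)
eos = self.task.target_dictionary.eos()                  # assumption: the task is reachable here
prev_output_tokens = torch.cat(
    [tokens.new_full((1, 1), eos), tokens[:, :-1]], dim=1
)

dec_out, extra = self.models[0].decoder(
    prev_output_tokens,
    encoder_out=enc_out,   # encoder output from the call above; its format must match what the decoder expects
)
# dec_out: (1, tgt_len, vocab) logits; extra may contain attention / inner states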

Anyway, thanks for your answer.