Hi! To clarify, are you looking for the tensor sequences of intermediate layers (for example, after the encoder?), or for an easier way to access the input and output tensors?
Intermediate layers and outputs are something that fairseq as a whole doesn't currently expose easily — you'd need to modify the transformer methods themselves to return intermediate layers you're interested in, and then you could return them through the hub interface. This is not something I plan to add in the near-term.
The `hub_interface.py` code for visrep follows other fairseq interfaces; for example, check out the default fairseq `HubInterface`. The `encode` function does tokenization and converts the input into the tensor that is passed to the model (which in this case means taking a string and generating the tensor corresponding to its rendered image), and `decode` detokenizes the output tensor from the model back into a sentence string. The encoder and decoder aren't accessed separately, and the encoded tensors from `VisualTextDataset` are those returned by `encode()`.
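For concreteness, here is a minimal usage sketch of that interface; it assumes `model` is an already-loaded visrep hub-interface instance (loading details aren't covered here) and the input sentence is just a placeholder:

```python
# Minimal sketch; `model` is assumed to be a loaded visrep hub interface that
# mirrors fairseq's GeneratorHubInterface (encode / decode / translate).
src = "Das ist ein Test."        # placeholder source sentence

# encode(): string -> tensor the model consumes (for visrep, a tensor built
# from the rendered image of the sentence rather than token ids)
src_tensor = model.encode(src)
print(src_tensor.shape)

# translate(): the full pipeline (encode -> beam search -> decode) back to a string
print(model.translate(src))
```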
If you want the tensors for the model inputs and outputs, you can get the tensor sequence for the input by calling `encode()` directly, and for the final output by modifying L62 in `hub_interface.py` to return not just the output string:

```diff
- return [self.decode(hypos[0]["tokens"]) for hypos in batched_hypos]
+ return [(self.decode(hypos[0]["tokens"]), hypos[0]) for hypos in batched_hypos]
```
where `hypos[0]` is the best output hypothesis chosen by beam search; the full `hypos` list for one sentence will look something like this:

```python
[{'tokens': tensor([ 57, 8, 7, 11, 749, 5, 2]), 'score': tensor(-0.4421), 'attention': tensor([]), 'alignment': tensor([]), 'positional_scores': tensor([-1.6331, -0.2708, -0.1837, -0.4788, -0.0630, -0.2784, -0.1871])},
 {'tokens': tensor([152, 8, 7, 11, 749, 5, 2]), 'score': tensor(-0.4499), 'attention': tensor([]), 'alignment': tensor([]), 'positional_scores': tensor([-1.6986, -0.3032, -0.1806, -0.4487, -0.0603, -0.2711, -0.1871])},
 {'tokens': tensor([100, 16, 11, 749, 5, 2]), 'score': tensor(-0.4618), 'attention': tensor([]), 'alignment': tensor([]), 'positional_scores': tensor([-1.6489, -0.2012, -0.4071, -0.0563, -0.2692, -0.1882])},
 {'tokens': tensor([ 14, 22, 8, 7, 11, 749, 5, 2]), 'score': tensor(-0.5500), 'attention': tensor([]), 'alignment': tensor([]), 'positional_scores': tensor([-2.5914, -0.3630, -0.2739, -0.1726, -0.4729, -0.0613, -0.2734, -0.1919])},
 {'tokens': tensor([ 20, 13, 8, 7, 11, 749, 5, 2]), 'score': tensor(-0.6997), 'attention': tensor([]), 'alignment': tensor([]), 'positional_scores': tensor([-3.1400, -0.9755, -0.3081, -0.1903, -0.4878, -0.0541, -0.2568, -0.1851])}]
```
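To read those fields back out, something along these lines should work (a sketch; `model` is the loaded hub interface and `hypos` is the list shown above):

```python
# Sketch: pulling tensors and scores out of the best hypothesis shown above.
best = hypos[0]
output_ids = best["tokens"]                     # generated token ids, ending in EOS (id 2)
sentence = model.decode(output_ids)             # detokenize back to a string
total_logprob = best["score"].item()            # beam-search score of this hypothesis
per_token_logprobs = best["positional_scores"]  # log-prob of each generated token
```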
> are you looking for the tensor sequences of intermediate layers (for example, after the encoder?), or for an easier way to access the input and output tensors?
I'm interested in the intermediate layers and outputs of the model. I can use `VisualTextTransformerEncoder` like this:

```python
self.models[0].encoder(batch['net_input']['src_tokens'], batch['net_input']['src_lengths'])
```
The `TransformerDecoder` is more complicated since it requires `prev_output_tokens`, which are created in `SequenceGenerator` (a rough sketch of one way around this is below).
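For example, something like this might work to re-run the decoder on tokens that `SequenceGenerator` has already produced (assuming a single-sentence batch; `model` is the loaded hub interface, `batch` the batch from above, and `best` one hypothesis dict as shown earlier; attribute names such as `tgt_dict` follow stock fairseq and may differ in this fork):

```python
import torch

# Re-run the encoder (as above) to get encoder_out for the decoder.
encoder_out = model.models[0].encoder(
    batch["net_input"]["src_tokens"],
    batch["net_input"]["src_lengths"],
)

# Build prev_output_tokens the way fairseq usually does for teacher forcing:
# the generated tokens shifted right, with EOS prepended (single sentence assumed).
eos = model.tgt_dict.eos()        # assumption: stock fairseq hub-interface attribute
tokens = best["tokens"]           # e.g. tensor([ 57, 8, 7, 11, 749, 5, 2])
prev_output_tokens = torch.cat([tokens.new_tensor([eos]), tokens[:-1]]).unsqueeze(0)

# TransformerDecoder.forward returns (logits, extra); the exact encoder_out
# format (namedtuple vs. dict) depends on the fairseq version underneath visrep.
logits, extra = model.models[0].decoder(
    prev_output_tokens,
    encoder_out=encoder_out,
)
```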
Anyway, thanks for your answer.
Hello,

Currently, the `encode` method in `hub_interface.py` only returns tensors from image slices, and `decode` is only called in `translate` and is not the decoder of the transformer model. Is there an easier way to access the encoded tensors than through `_build_batches`, given that they are actually generated from `VisualTextDataset`? And likewise for the decoded tensors generated in `SequenceGenerator`, which is invoked via `inference_step`.

Thank you
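In case it helps to spell out the path I mean, here is a rough sketch, assuming the visrep hub interface mirrors fairseq's `GeneratorHubInterface` (method and attribute names below are stock fairseq's and may differ in this fork; device handling omitted):

```python
# Sketch only: internal-style access to the encoded and decoded tensors.
# `model` is the loaded hub interface; names follow stock fairseq.
tokenized = [model.encode(s) for s in ["Das ist ein Test."]]      # image-slice tensors
generator = model.task.build_generator(model.models, model.cfg.generation)

for batch in model._build_batches(tokenized, skip_invalid_size_inputs=False):
    # encoded tensors, as produced by VisualTextDataset
    src_tokens = batch["net_input"]["src_tokens"]

    # decoded tensors straight out of SequenceGenerator, via the task's inference_step
    for hypos in model.task.inference_step(generator, model.models, batch):
        print(hypos[0]["tokens"])      # best-hypothesis token ids
```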