facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

BeamSize gets truncated to the size of Vocabulary #358

Closed hadyelsahar closed 2 years ago

hadyelsahar commented 6 years ago

Hey all.

There's a problem when my target vocabulary is smaller than the beam size: the beam size gets truncated to the vocabulary size.

Why can that be a problem? Many sequence labeling tasks are moving from non-autoregressive to autoregressive decoders. For example, a POS tagger would benefit from conditioning on previous decisions, as would extractive sentence compression as in [Filippova et al. 2015](https://aclweb.org/anthology/D/D15/D15-1042.pdf). For such applications you cannot experiment with a beam size larger than the vocabulary size, which can be very small (2 in the case of sentence compression).

How / why does that happen? Let B be the beam size and V the vocabulary size. If B > V, then during the early steps of decoding the pool of candidates for extending the beam is smaller than the beam size, i.e. B > V^t at step t, so there are not enough candidates to fill the beam. For example, with V = 2 and B = 5, the first step offers only 2 candidate continuations for 5 beam slots. This forces you to truncate the beam size to the vocab size.

This happens in these two lines: https://github.com/pytorch/fairseq/blob/master/fairseq/sequence_generator.py#L127 https://github.com/pytorch/fairseq/blob/master/fairseq/search.py#L70
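
To make the truncation concrete, here is a small standalone sketch with toy values; it paraphrases what the linked lines effectively do, not their exact code: the beam is capped at the dictionary size minus one, and the topk selection is capped by the number of available candidates.

    import torch

    bsz, beam_size, vocab_size = 1, 5, 2  # beam size larger than the vocabulary

    # sequence_generator.py (paraphrased): the beam is capped at vocab size - 1,
    # since pad is never selected
    beam_size = min(beam_size, vocab_size - 1)  # 5 -> 1

    # search.py (paraphrased): topk over the flattened (beam x vocab) candidates,
    # again capped by the number of candidates minus one
    lprobs = torch.randn(bsz, 1, vocab_size).log_softmax(dim=-1)
    k = min(beam_size * 2, lprobs.view(bsz, -1).size(1) - 1)  # min(2, 1) == 1
    scores, indices = torch.topk(lprobs.view(bsz, -1), k=k)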

My proposal to fix that: in the early steps of decoding, when V^t < B, pad the prediction probabilities with some -inf scores. We won't need the padding later on anyway, once t is large enough that V^t > B.

# intended as a drop-in replacement for BeamSearch in fairseq/search.py,
# which defines the Search base class
import math

import torch


class BeamSearch(Search):

    def __init__(self, tgt_dict):
        super().__init__(tgt_dict)

    def step(self, step, lprobs, scores):
        super()._init_buffers(lprobs)
        bsz, beam_size, vocab_size = lprobs.size()

        if step == 0:
            # at the first step all hypotheses are equally likely, so use
            # only the first beam
            lprobs = lprobs[:, ::beam_size, :].contiguous()
        else:
            # make probs contain cumulative scores for each hypothesis
            lprobs.add_(scores[:, :, step - 1].unsqueeze(-1))

        ### A quick fix to avoid truncating the beam size to the vocab size:
        # if there are fewer than 2 * beam_size selectable candidates (excluding
        # pad), pad lprobs with -inf columns so topk can always return 2 * beam_size
        num_candidates = lprobs.view(bsz, -1).size(1)
        if num_candidates - 1 < beam_size * 2:
            # pad so that (num_candidates + lprobs_aug_n) - 1 == beam_size * 2
            lprobs_aug_n = beam_size * 2 - (num_candidates - 1)
            lprobs_aug = torch.ones(lprobs.shape[0], lprobs.shape[1], lprobs_aug_n)
            lprobs_aug = lprobs_aug.type(lprobs.type())
            lprobs_aug = lprobs_aug * -math.inf
            # padded positions score -inf, so real candidates always rank ahead of them
            lprobs_aug = torch.cat((lprobs, lprobs_aug), -1)
        else:
            lprobs_aug = lprobs

        torch.topk(
            lprobs_aug.view(bsz, -1),
            k=min(
                # Take the best 2 x beam_size predictions. We'll choose the first
                # beam_size of these which don't predict eos to continue with.
                beam_size * 2,
                lprobs_aug.view(bsz, -1).size(1) - 1,  # -1 so we never select pad
            ),
            out=(self.scores_buf, self.indices_buf),
        )

        # each flat candidate index encodes (beam, token):
        # beam = index // vocab_size, token = index % vocab_size
        torch.div(self.indices_buf, vocab_size, out=self.beams_buf)
        self.indices_buf.fmod_(vocab_size)
        return self.scores_buf, self.indices_buf, self.beams_buf

I have tried it locally and it seems to work fine, along with some small additional changes in sequence_generator.py. I'd like to hear your thoughts on this.
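
For illustration, here is a minimal standalone sketch of the -inf padding idea (toy values, not fairseq code), complementing the truncation sketch above: after padding, topk can return the full 2 * beam_size entries, and the padded slots always rank behind every real candidate.

    import math

    import torch

    bsz, beam_size, vocab_size = 1, 5, 2  # beam size larger than the vocabulary
    lprobs = torch.randn(bsz, 1, vocab_size).log_softmax(dim=-1)

    # pad the flattened candidate dimension with -inf columns
    num_candidates = lprobs.view(bsz, -1).size(1)
    pad_n = beam_size * 2 - (num_candidates - 1)
    padding = torch.full((bsz, 1, pad_n), -math.inf)
    lprobs_aug = torch.cat((lprobs, padding), -1)

    scores, indices = torch.topk(lprobs_aug.view(bsz, -1), k=beam_size * 2)
    print(scores)   # the real candidates come first, then the -inf padding
    print(indices)  # indices >= vocab_size point at padded (dummy) slots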

stale[bot] commented 2 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!