asyml / texar

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
https://asyml.io
Apache License 2.0

Beam search decoding during inference doesn't generate good text. #265

Open fabrahman opened 4 years ago

fabrahman commented 4 years ago

Hi,

I have trained a model using reinforcement learning. When I use beam search to generate text, the output is entirely repetitions:

"raeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraera"

However, when I use greedy or topk sampling the generation is like:

Sam was watching a movie. He was very focused on the action. He fell asleep. Sam's glasses fell off his face <|endoftext|>eraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraera

I used tx.utils.strip_eos to strip everything after <|endoftext|>.
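For reference, the effect of this stripping step can be sketched in plain Python. This is a generic stand-in, not Texar's actual implementation; the real tx.utils.strip_eos operates on id arrays/tensors and its signature may differ:

```python
def strip_after_eos(tokens, eos="<|endoftext|>"):
    """Keep everything up to (excluding) the first EOS token.

    Generic illustration of EOS stripping; tx.utils.strip_eos itself
    works on batches of token ids rather than string lists.
    """
    return tokens[:tokens.index(eos)] if eos in tokens else tokens

# Example: everything after the first EOS marker is dropped.
cleaned = strip_after_eos(
    ["Sam", "was", "watching", "<|endoftext|>", "era", "era"])
```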

1- I am not sure why beam search performs this way; I would appreciate your help. The following is my code for decoding with beam search:

    def _infer_beam_ids(context_name):
        # Decode with beam search, conditioning on the given context.
        predictions = decoder(
            beam_width=10,
            length_penalty=config_train.length_penalty,
            embedding=_embedding_fn,
            context=batch['%s_ids' % context_name],
            context_sequence_length=batch['%s_len' % context_name],
            max_decoding_length=max_decoding_length,
            end_token=end_token,
            mode=tf.estimator.ModeKeys.PREDICT)

        # Take the top beam and roll away the context prefix so only
        # the generated continuation remains.
        beam_output_ids = tx.utils.varlength_roll(
            predictions["sample_id"][:, :, 0],
            -batch['%s_len' % context_name],
            axis=1)

        return beam_output_ids

    beam_search_ids = _infer_beam_ids('x1')

2- Is it better to use beam search at inference for a model trained in a self-critical fashion?

I would appreciate your help with these.

fabrahman commented 4 years ago

Hi, does anyone have thoughts on this? In another experiment, my trained model generated the same beam-search output for different inputs. It's strange that the greedy result is good but the beam-search result isn't. Am I calling the beam decoding method correctly?

jchwenger commented 4 years ago

That is in fact a feature of beam search, see this discussion, this implementation and this paper! Temperature-based random sampling and/or top_p (nucleus) sampling are in my experience always preferable to beam search.

The root cause of the failure of beam search is that 1) a repetitive sequence will have a higher probability than any other, since the more you repeat, the more likely the next token will be (from the perspective of the network), and so will be chosen by beam search; 2) if you ask a network to assign probabilities to human text the distribution is actually highly irregular (not the most likely sentence, but a stream where some steps are extremely likely, others extremely random). Lovely graphs and explanations in the paper!
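The top_p (nucleus) filtering mentioned above can be sketched generically. This is not Texar's API, just an illustration of the idea with a toy 5-token distribution (all probabilities hypothetical): keep the smallest set of tokens whose cumulative probability exceeds p, zero out the rest, renormalize, and sample.

```python
import numpy as np

def top_p_filter(logits, p=0.9):
    """Nucleus (top-p) filtering: keep the smallest set of tokens whose
    cumulative probability exceeds p, then renormalize."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # tokens, most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # number of tokens to keep
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Toy distribution over a 5-token vocabulary.
logits = np.log(np.array([0.5, 0.25, 0.15, 0.07, 0.03]))
filtered = top_p_filter(logits, p=0.8)

# Tokens outside the nucleus have zero probability; sample from the rest.
next_token = np.random.choice(len(filtered), p=filtered)
```

With p=0.8 the nucleus here is the top three tokens; the unlikely tail can never be sampled, but the choice among the remaining tokens stays random, which is what avoids the degenerate loops beam search falls into.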

fabrahman commented 4 years ago

> That is in fact a feature of beam search, see this discussion, this implementation and this paper! Temperature-based random sampling and/or top_p (nucleus) sampling are in my experience always preferable to beam search.
>
> The root cause of the failure of beam search is that 1) a repetitive sequence will have a higher probability than any other, since the more you repeat, the more likely the next token will be (from the perspective of the network), and so will be chosen by beam search; 2) if you ask a network to assign probabilities to human text the distribution is actually highly irregular (not the most likely sentence, but a stream where some steps are extremely likely, others extremely random). Lovely graphs and explanations in the paper!

@jchwenger Thanks for your reply. I understand, and I agree that sampling methods work much better. But the behavior I reported here is not acceptable even for beam search: it didn't generate anything meaningful, and it generates the same thing for every input. Also, greedy decoding works fine, so isn't it strange that beam search cannot generate anything? I suspect there is an issue with how I am calling it, or with the implementation. Also, many papers report that with self-critical reinforcement learning it's better to use beam search at inference.

jchwenger commented 4 years ago

My pleasure! From the network's perspective, "meaningful" is not particularly relevant. If it's a character-level or BPE model, this repetition of characters might still be the output with the highest probability under the model, and that, out of all possible outputs, is what beam search will attempt to pick. Beyond that, however, and how beam search is successfully used in other papers, I'm afraid I can't help you.
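The point can be illustrated with toy numbers. These per-step probabilities are purely hypothetical, not from any real model: once the model has emitted "era" a few times, it assigns high probability to emitting "era" again, while a fluent sentence mixes likely and unlikely steps, so the degenerate loop accumulates a far better log-probability.

```python
import math

# Hypothetical per-step probabilities (illustration only):
repeat_step = 0.9                                   # P(token | already repeating)
fluent_steps = [0.6, 0.3, 0.8, 0.2, 0.7, 0.4] * 5   # varied sentence, 30 steps

# Sequence score = sum of per-step log-probabilities over 30 steps.
repeat_logprob = 30 * math.log(repeat_step)         # ~ -3.2
fluent_logprob = sum(math.log(p) for p in fluent_steps)  # ~ -24.1
```

Since beam search ranks candidates by (length-penalized) log-probability, the repetitive beam dominates the fluent one by a wide margin, which is exactly the failure mode seen in the outputs above.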