Closed hanfeiyu closed 4 years ago
Second update.
After digging into the codebase, I suspect there is something wrong with the interaction between the GPT2 model setup and "dynamic_decode".
Unlike "train_greedy", which simply passes inputs to the "self_attention_layer" and gets outputs, "infer_sample" and "infer_greedy" decode dynamically via "dynamic_decode", and this is where the error happens.
Perhaps "dynamic_decode" needs to create new trainable variables while decoding, but the GPT2 template has already been set up before "dynamic_decode" is called, which eventually results in:
ValueError: Trainable variable created when calling a template after the first time, perhaps you used tf.Variable when you meant tf.get_variable
I'm still not able to get it to work even though the issue is (hopefully) pinpointed, and it seems that no one in this community is responding...
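To illustrate the suspected failure mode, here is a pure-Python analogy (no TensorFlow; the class and function names are hypothetical, not tf.make_template itself) of how a template freezes variable creation after its first call, mirroring the ValueError above:

```python
# Pure-Python analogy (hypothetical, not tf.make_template itself) of how a
# template forbids new trainable variables after its first call.
class Template:
    def __init__(self, fn):
        self._fn = fn
        self._vars = {}
        self._finalized = False

    def get_variable(self, name, init=0.0):
        """Create-or-reuse a variable; creation is only legal before finalization."""
        if name not in self._vars:
            if self._finalized:
                raise ValueError(
                    "Trainable variable created when calling a template "
                    "after the first time")
            self._vars[name] = init
        return self._vars[name]

    def __call__(self, *args, **kwargs):
        out = self._fn(self, *args, **kwargs)
        self._finalized = True  # after the first call, the variable set is fixed
        return out


def layer(tmpl, x, decode=False):
    w = tmpl.get_variable("w", 2.0)
    if decode:
        # mimics dynamic_decode asking for a variable the template has not seen
        b = tmpl.get_variable("b", 1.0)
        return w * x + b
    return w * x


t = Template(layer)
print(t(3.0))            # first call creates "w": prints 6.0
try:
    t(3.0, decode=True)  # "b" is new after finalization: raises ValueError
except ValueError as err:
    print("ValueError:", err)
```

In this analogy, "train_greedy" corresponds to the first call (all variables created up front), while "infer_greedy"/"infer_sample" correspond to the later call that requests an unseen variable.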
Final update.
Now it's finally working.
I overhauled the GPT2Decoder codebase, bypassing the initialization of the super class ModuleBase and moving all the functions/properties I need from ModuleBase into my own GPT2 module. ModuleBase initialization calls tf.make_template to build the template before the real decoding with dynamic_decode, which then kills any possibility of creating new trainable variables once the template is made. I'm not quite sure why TransformerDecoder still works even after initializing its super class ModuleBase.
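The workaround above can be sketched in plain Python (class names are hypothetical stand-ins, not the actual Texar code): skip the base initializer that builds the template, so new variables can still be created during decoding.

```python
# Hypothetical sketch of the workaround: do not run the base-class
# initializer (which, in Texar, builds the template via tf.make_template),
# so variable creation stays legal during dynamic decoding.
class ModuleBase:
    def __init__(self):
        # stand-in for the template construction that freezes variables
        self.template_made = True


class MyGPT2Decoder(ModuleBase):
    def __init__(self):
        # deliberately NOT calling super().__init__(): no template is made
        self.template_made = False
        self._vars = {}

    def get_variable(self, name, init=0.0):
        if name not in self._vars:
            if self.template_made:
                raise ValueError("template already made")
            self._vars[name] = init
        return self._vars[name]


dec = MyGPT2Decoder()
print(dec.get_variable("new_var", 1.0))  # succeeds: no template was made
```

The trade-off, as described above, is that every needed ModuleBase function/property has to be copied into the subclass by hand.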
Another thing worth mentioning is that I had to change the following names in the tensor_map when calling _init_from_checkpoint to load the weights from the GPT2 cache:
"ln_1/b": 'layer_{}/beta',
"ln_1/g": 'layer_{}/gamma',
"ln_2/b": 'layer_{}/past_poswise_ln/beta',
"ln_2/g": 'layer_{}/past_poswise_ln/gamma',
Without this change, GPT2 cannot correctly map its tensor names to the weights from ckpt.model. I had no choice but to adjust the default naming since I gave up on inheriting from ModuleBase.
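The renaming itself amounts to a plain name-mapping step, sketched below (the map_name helper is hypothetical, not Texar's actual _init_from_checkpoint; the tensor_map entries are the ones listed above):

```python
# Map GPT2 checkpoint tensor names to the decoder's variable names.
# The entries are the renamed layer-norm keys from above; map_name is a
# hypothetical helper, not Texar's actual _init_from_checkpoint logic.
tensor_map = {
    "ln_1/b": "layer_{}/beta",
    "ln_1/g": "layer_{}/gamma",
    "ln_2/b": "layer_{}/past_poswise_ln/beta",
    "ln_2/g": "layer_{}/past_poswise_ln/gamma",
}

def map_name(ckpt_name, layer_id):
    """Translate one checkpoint tensor name for a given transformer layer."""
    return tensor_map[ckpt_name].format(layer_id)

print(map_name("ln_2/b", 3))  # prints layer_3/past_poswise_ln/beta
```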
Now I'm closing this issue. I'm glad to see it working even though nobody helped me out lol. Texar is still a great library for NLP anyway.
Hello there,
I encountered some issues when using the GPT2 decoder for generation.
train_greedy worked well, but infer_greedy and infer_sample were always throwing errors like:
Then I tested the GPT2 decoder using gpt2_decoder_test.py, which is provided along with gpt2_decoder.py under the same folder modules/decoder. I modified the code to test three different decoding strategies like this:
The result was that "train_greedy" passed, while both "infer_greedy" and "infer_sample" failed the unit test; the errors were:
Here is my environment:
I don't know if it's only me suffering from this issue, or if anyone else is facing the same problem. Can somebody help check this out? Thanks.