anantzoid / Language-Modeling-GatedCNN

Tensorflow implementation of "Language Modeling with Gated Convolutional Networks"

While training, the model can see all words (besides the last one) #2

Open talbaumel opened 7 years ago

talbaumel commented 7 years ago

Let's say a sentence in the dataset is (1, 2, 3, 4). Then the prepare_data function will create: X = (1, 2, 3), Y = (2, 3, 4).

While predicting 2 and 3, your model can copy them from the input.
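
For concreteness, a minimal sketch of the shifted pairs being described; the function name here is illustrative, not the repo's actual prepare_data:

```python
# Illustrative sketch of the X/Y construction described above (not the repo's exact code).
def make_pairs(sentence):
    # X drops the last token, Y drops the first, so Y[t] == X[t+1] for every position but the last.
    return sentence[:-1], sentence[1:]

X, Y = make_pairs([1, 2, 3, 4])
print(X, Y)  # [1, 2, 3] [2, 3, 4]
# If the model at position t is allowed to look at X[t+1], predicting Y[t] is trivial copying.
```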

anantzoid commented 7 years ago

When convolving inputs, the zero-padding added to the top rows of the input layer makes sure that a hidden state does not contain information from future words.
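
As a rough illustration of that argument, here is a sketch using a 1-D convolution over the time axis (not the repo's 2-D conv_op): prepending filter_h - 1 zeros and running the convolution with no right-side padding means the output at position t can only depend on inputs at positions <= t.

```python
import numpy as np

def causal_conv1d(x, w):
    """1-D convolution with len(w) - 1 zeros prepended, so output[t] only sees x[:t+1]."""
    k = len(w)
    x_pad = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(x_pad[t:t + k], w) for t in range(len(x))])

x = np.arange(1.0, 7.0)          # [1, 2, 3, 4, 5, 6]
w = np.array([0.5, 0.3, 0.2])    # filter height 3
y1 = causal_conv1d(x, w)

x_future = x.copy()
x_future[4] = 100.0              # change a "future" word
y2 = causal_conv1d(x_future, w)
print(np.allclose(y1[:4], y2[:4]))  # True: outputs at t <= 3 never see position 4
```

Whether the single embedding-level mask in this repo gives the same guarantee for the deeper layers is exactly what the rest of this thread questions.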

ruotianluo commented 7 years ago

I feel like zero padding should be used in every convolution layer, like this: https://github.com/openai/pixel-cnn/blob/master/pixel_cnn_pp/nn.py#L296

anantzoid commented 7 years ago

@ruotianluo Zero padding is used in every layer to keep the layer size the same: https://github.com/anantzoid/Language-Modeling-GatedCNN/blob/master/model.py#L62

The zero padding I referred to in the comment above is the extra padding required to prevent the filter from seeing future words.

wangwang110 commented 7 years ago

Can padding only mask_layer[:,0:conf.filter_h/2,:] = 0 prevent the filter from seeing the future words? Why not (conf.filter_h-1)?
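
For what it's worth, a small index calculation (my own sketch, not code from the repo) makes the difference concrete: with SAME padding and filter height k, the window that produces output position t still reaches floor(k/2) positions into the future, whereas prepending k - 1 zeros and convolving with VALID padding keeps the window entirely at or before t.

```python
# Which input positions feed output position t under the two schemes (odd filter height k).
k, t = 5, 10

# SAME padding centres the window on t, so it reaches (k - 1) // 2 steps past t.
same_window = list(range(t - (k - 1) // 2, t + k // 2 + 1))
print(same_window)    # [8, 9, 10, 11, 12] -> 11 and 12 are future words

# Prepending k - 1 zeros (then a VALID conv) shifts the window so it ends exactly at t.
causal_window = list(range(t - (k - 1), t + 1))
print(causal_window)  # [6, 7, 8, 9, 10] -> nothing after t
```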

wangwang110 commented 7 years ago

Can padding at only the first layer prevent the filter from seeing the future words? Sorry, I can't understand it; can you explain it in detail? Thank you very much.

thangduong commented 7 years ago

Yes, I have the same concern here. I output some trace messages:

xbatch[0] = [[ 1 1 3 13 123 5 12 152 7 84 129 21 106 48 5 14 89 30 6 140 6] [ 57 88 5 25 60 23 2 4 1 1 3 13 51 10 22 136 68 28 105 6 52] [104 121 11 54 10 134 10 138 22 64 151 47 133 69 2 4 1 1 3 13 97]]

ybatch[0] = [[ 1 3 13 123 5 12 152 7 84 129 21 106 48 5 14 89 30 6 140 6 118] [ 88 5 25 60 23 2 4 1 1 3 13 51 10 22 136 68 28 105 6 52 90] [121 11 54 10 134 10 138 22 64 151 47 133 69 2 4 1 1 3 13 97 46]]

always comes out to 1. I have changed the batch size to 3 so it's easier to look at. Everything else is default.

I looked at the model code, and it's basically trying to take (for example) [ 1 1 3 13 123 5 12 152 7 84 129 21 106 48 5 14 89 30 6 140 6] into [ 1 3 13 123 5 12 152 7 84 129 21 106 48 5 14 89 30 6 140 6 118]. Here is the gist of the model:

```python
tf.reset_default_graph()
self.X = tf.placeholder(shape=[conf.batch_size, conf.context_size-1], dtype=tf.int32, name="X")
self.y = tf.placeholder(shape=[conf.batch_size, conf.context_size-1], dtype=tf.int32, name="y")
embed = self.create_embeddings(self.X, conf)
h, res_input = embed, embed

for i in range(conf.num_layers):
    fanin_depth = h.get_shape()[-1]
    filter_size = conf.filter_size if i < conf.num_layers-1 else 1
    shape = (conf.filter_h, conf.filter_w, fanin_depth, filter_size)
    with tf.variable_scope("layer_%d"%i):
        conv_w = self.conv_op(h, shape, "linear")
        conv_v = self.conv_op(h, shape, "gated")
        h = conv_w * tf.sigmoid(conv_v)
        if i % conf.block_size == 0:
            h += res_input
            res_input = h

h = tf.reshape(h, (-1, conf.embedding_size))
y_shape = self.y.get_shape().as_list()
self.y = tf.reshape(self.y, (y_shape[0] * y_shape[1], 1))

softmax_w = tf.get_variable("softmax_w", [conf.vocab_size, conf.embedding_size],
                            tf.float32, tf.random_normal_initializer(0.0, 0.1))
softmax_b = tf.get_variable("softmax_b", [conf.vocab_size], tf.float32, tf.constant_initializer(1.0))

# Preference: NCE loss, hierarchical softmax, adaptive softmax
self.loss = tf.reduce_mean(tf.nn.nce_loss(softmax_w, softmax_b, h, self.y, conf.num_sampled, conf.vocab_size))

trainer = tf.train.MomentumOptimizer(conf.learning_rate, conf.momentum)
gradients = trainer.compute_gradients(self.loss)
clipped_gradients = [(tf.clip_by_value(_[0], -conf.grad_clip, conf.grad_clip), _[1]) for _ in gradients]
self.optimizer = trainer.apply_gradients(clipped_gradients)
self.perplexity = tf.exp(self.loss)
self.create_summaries()
```

What is this zero padding you're talking about? Are you talking about mask_layer:

```python
mask_layer[:,0:conf.filter_h/2,:] = 0
embed *= mask_layer
```

Unless I am mistaken, this only zeroes out the first few positions, which you actually want to keep because that's the history. Basically, a model that does y = x[1:] + [0] would do quite well... I would guess, then, that the gating layer is just allowing for a cleaner shift. I feel like there is something I am missing. Maybe you can clarify?

It seems this implementation is bogus. The original implementation is for Torch, and the paper doesn't describe how data preparation is done.
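To make the "y = x[1:] + [0]" observation above concrete, here is a quick sanity check against the first row of the traced batches (my own sketch, nothing from the repo):

```python
# First rows of xbatch[0] and ybatch[0] from the trace above.
x_row = [1, 1, 3, 13, 123, 5, 12, 152, 7, 84, 129, 21, 106, 48, 5, 14, 89, 30, 6, 140, 6]
y_row = [1, 3, 13, 123, 5, 12, 152, 7, 84, 129, 21, 106, 48, 5, 14, 89, 30, 6, 140, 6, 118]

# Every target except the last one is already visible in the input one step ahead,
# so a model that can peek one position to the right can simply copy it.
print(x_row[1:] == y_row[:-1])  # True
```
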
sonack commented 6 years ago

@thangduong I agree with you. I found that the mask and padding are only applied to the embedding layer, while the subsequent conv layers are not masked. I suspect this lets future information be peeked at in the middle conv layers. What do you think about it now?
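
One quick way to see the concern (a sketch that approximates each unmasked, SAME-padded layer with a 1-D "same"-mode convolution): stack two such layers, perturb a future word, and outputs at earlier positions change, so future information does leak in and the leak grows with depth.

```python
import numpy as np

def same_conv1d(x, w):
    """1-D convolution with 'same' padding, standing in for one SAME-padded conv layer."""
    return np.convolve(x, w, mode="same")

w = np.array([0.2, 0.5, 0.3])            # filter width 3
x1 = np.arange(1.0, 11.0)
x2 = x1.copy()
x2[7] = 100.0                            # perturb a "future" word at position 7

h1 = same_conv1d(same_conv1d(x1, w), w)  # two stacked SAME-padded layers
h2 = same_conv1d(same_conv1d(x2, w), w)
print(np.where(h1 != h2)[0])             # [5 6 7 8 9]: positions 5 and 6 already depend on word 7
```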

sonack commented 6 years ago

> @ruotianluo Zero padding is used in every layer to keep the layer size the same: https://github.com/anantzoid/Language-Modeling-GatedCNN/blob/master/model.py#L62 The zero padding I referred to in the comment above is the extra padding required to prevent the filter from seeing future words.

Do you mean that you used SAME padding, which adds zero padding to produce a same-size output and is also supposed to prevent the next conv layer from seeing future information? If so, I don't think that is correct, because SAME padding in TensorFlow pads the left and right sides as evenly as possible, and when the split is uneven it puts the extra zero on the right. But if you want to prevent seeing the future, the padding should all be on the left side.
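
A small illustration of that padding arithmetic (a sketch of TensorFlow's stride-1 SAME rule, not code from this repo):

```python
# How SAME padding splits the zeros for stride 1, versus fully-left (causal) padding.
def same_padding_split(filter_h):
    pad_total = filter_h - 1
    pad_top = pad_total // 2          # SAME puts the smaller half before the sequence...
    pad_bottom = pad_total - pad_top  # ...and the larger half after it, i.e. on the future side.
    return pad_top, pad_bottom

for k in (3, 4, 5):
    print(k, same_padding_split(k), "causal:", (k - 1, 0))
# 3 (1, 1) causal: (2, 0)
# 4 (1, 2) causal: (3, 0)
# 5 (2, 2) causal: (4, 0)
```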

qixiang109 commented 6 years ago

@sonack I agree with you. We need to pad filter_size-1 zeros on the left at each layer.
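
In TF 1.x terms, per-layer left padding could look roughly like this (a hedged sketch with made-up names, not a patch to this repo's conv_op):

```python
import tensorflow as tf  # TF 1.x style, matching the rest of the thread

def causal_conv_layer(h, filter_h, out_channels, name):
    """Pad filter_h - 1 zeros at the top of the time axis, then run a VALID conv.

    h is [batch, time, 1, channels]; the output keeps the same time length
    and position t only sees positions <= t.
    """
    in_channels = h.get_shape().as_list()[-1]
    with tf.variable_scope(name):
        w = tf.get_variable("W", [filter_h, 1, in_channels, out_channels], tf.float32,
                            tf.random_normal_initializer(0.0, 0.1))
        b = tf.get_variable("b", [out_channels], tf.float32, tf.constant_initializer(0.0))
        h_pad = tf.pad(h, [[0, 0], [filter_h - 1, 0], [0, 0], [0, 0]])  # zeros on the left/top only
        return tf.nn.conv2d(h_pad, w, strides=[1, 1, 1, 1], padding="VALID") + b
```

If something like this were used for both the linear and gated branches at every layer, the embedding-level mask would no longer be needed.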

sonack commented 6 years ago

@qixiang109 Are you working on this gated CNN? Have you successfully reproduced the paper's results? I hope we can communicate with each other :)

qixiang109 commented 6 years ago

@sonack Sorry, I haven't looked at this thread in a long time. I haven't actually run experiments with a gated CNN language model yet, but I am quite sure about the way the zeros should be padded.