brightmart / bert_language_understanding

Pre-training of Deep Bidirectional Transformers for Language Understanding: pre-train TextCNN

About "scaled_dot_product_attention_batch" #9

Closed yuanxiaosc closed 6 years ago

yuanxiaosc commented 6 years ago

Regarding the code implementation of Scaled Dot-Product Attention:

2. dot product of Q, K

In your code the scaling is:

`dot_product = dot_product * (1.0 / tf.sqrt(tf.cast(self.d_model, tf.float32)))`

I think it should be:

`dot_product = dot_product * (1.0 / tf.sqrt(tf.cast(self.d_model / h, tf.float32)))`

because the paper scales by the per-head key dimension d_k = d_model / h, not by the full model dimension d_model.
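For context, the paper ("Attention Is All You Need") defines Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, and in multi-head attention each head operates on d_k = d_model / h dimensions, so the scale factor comes from the per-head size. Below is a minimal, self-contained sketch of that scaling; the shapes and names (`batch`, `h`, `seq_len`, `d_model`) are illustrative assumptions, not this repo's actual code.

```python
import tensorflow as tf

def scaled_dot_product_attention(Q, K, V, d_model, h):
    """Scaled dot-product attention across all heads at once.

    Q, K, V: tensors of shape [batch, h, seq_len, d_k], where
    d_k = d_model // h. The scale uses the per-head dimension d_k
    (i.e. d_model / h), as in the paper, not the full d_model.
    """
    d_k = d_model // h  # per-head dimension; this is the point of the issue
    scores = tf.matmul(Q, K, transpose_b=True)            # [batch, h, seq_q, seq_k]
    scores = scores * (1.0 / tf.sqrt(tf.cast(d_k, tf.float32)))
    weights = tf.nn.softmax(scores, axis=-1)              # distribution over keys
    return tf.matmul(weights, V)                          # [batch, h, seq_q, d_k]

# Hypothetical usage with the paper's base config: d_model=512, h=8, so d_k=64.
batch, h, seq_len, d_model = 2, 8, 10, 512
d_k = d_model // h
Q = tf.random.normal([batch, h, seq_len, d_k])
K = tf.random.normal([batch, h, seq_len, d_k])
V = tf.random.normal([batch, h, seq_len, d_k])
out = scaled_dot_product_attention(Q, K, V, d_model, h)   # shape [2, 8, 10, 64]
```

With d_model = 512 and h = 8, the correct divisor is √64 = 8, whereas dividing by √512 ≈ 22.6 would shrink the logits too aggressively before the softmax.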

brightmart commented 6 years ago

OK, got it. Thank you.