Attention score problem

Hi, first of all, thank you for your awesome work and code!

While porting your PyTorch code to TensorFlow, I ran into a question.

In your code, you handle the attention mask with the visible matrix in bert_encoder.py, and then line 59 of multi_headed_attn.py has the following code:

scores = scores + mask

I am wondering whether this corresponds to the attention score function, Eq. (5), in your paper. Is the mask the additional M?
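To make the question concrete, here is a minimal sketch of how I understand the additive mask. It is plain Python and not taken from the repository. It assumes the visible matrix holds 1 for visible and 0 for invisible token pairs, and that the mask is 0 for visible pairs and a large negative number for invisible ones; the constant `NEG` and the helper names are my own placeholders.

```python
import math

# Assumed stand-in for minus infinity; the value used in the repository may differ.
NEG = -10000.0


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(scores, visible):
    """Add the mask to the raw scores, then apply a row-wise softmax.

    scores:  attention scores per query row (hypothetical input)
    visible: matrix of 1 (visible) / 0 (invisible), same shape as scores
    """
    weights = []
    for score_row, vis_row in zip(scores, visible):
        mask_row = [0.0 if v else NEG for v in vis_row]
        # Same operation as the line in question: scores = scores + mask
        masked = [s + m for s, m in zip(score_row, mask_row)]
        weights.append(softmax(masked))
    return weights


scores = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
visible = [[1, 1, 0], [1, 1, 1]]
weights = masked_attention_weights(scores, visible)
```

If this reading is right, the invisible position in the first row ends up with a weight of practically zero after the softmax, which is how I read the role of M in the paper.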

Thank you in advance for your response.

autoliuweijie / K-BERT

Attention score problem #54