arghosh / AKT


"key_padding_mask" in attention mechanism not be implemented? #6

Closed · huajiang123 closed this issue 3 years ago

huajiang123 commented 3 years ago

Hi, arghosh! The idea of the paper is amazing, and the code is beautiful. I would like to confirm some details about your code. I found that the sequence length for a student is 200 in your setting, and 0 is used as the padding value. In your implementation of the attention mechanism I only found an upper triangular matrix used as a mask to ignore the influence of time steps after the current one, but it seems the padded positions of the sequence should also be ignored, i.e. those positions should not contribute to the attention scores. Thanks in advance if you can resolve my question. :)
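
For reference, this is roughly how a key padding mask is usually combined with a causal mask in scaled dot-product attention. The sketch below is illustrative only; the function and tensor names (`masked_attention`, `pad_mask`) are not taken from this repository:

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, pad_mask):
    """Scaled dot-product attention with both a causal mask and a key
    padding mask. pad_mask: (batch, seq_len), True at padded positions."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5   # (batch, heads, L, L)
    L = scores.size(-1)
    # Causal mask: block attention to future timesteps.
    causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(causal, float('-inf'))
    # Key padding mask: block attention to padded key positions.
    scores = scores.masked_fill(pad_mask[:, None, None, :], float('-inf'))
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)
```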

arghosh commented 3 years ago

Hi huajiang123. Thank you for your interest. In L204, I define an upper triangular matrix of ones; in L205 the mask is inverted (so it becomes a lower triangular matrix of ones). In L305, I set the attention scores for all future items (including padded indices) to -inf. However, when computing predictions at the padded time indices, the model does compute attention over possibly padded indices. This does not matter, though, because the prediction loss does not flow through those padded timesteps. Hope my explanation helps.
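
The pattern described above could be sketched roughly as follows. This is a minimal, self-contained sketch, not the repository's actual code; the tensor names and the 150-step valid length are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

seq_len = 200
# Upper triangular matrix of ones (strictly above the diagonal marks the future).
nopeek = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
# Inverted: a lower triangular matrix of ones, 1 = allowed, 0 = masked.
mask = (nopeek == 0)

scores = torch.randn(1, seq_len, seq_len)               # dummy attention scores
scores = scores.masked_fill(mask == 0, float('-inf'))   # future keys get -inf
attn = torch.softmax(scores, dim=-1)                     # padded keys can still receive weight

# Padding is handled in the loss instead: padded timesteps are excluded,
# so no gradient flows through predictions made at those positions.
preds = torch.rand(1, seq_len)                           # dummy predicted probabilities
targets = torch.randint(0, 2, (1, seq_len)).float()      # dummy 0/1 responses
valid = torch.arange(seq_len)[None, :] < 150             # say the first 150 steps are real
loss = F.binary_cross_entropy(preds[valid], targets[valid])
```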

huajiang123 commented 3 years ago

Okay, arghosh, thank you for your detailed explanation. I think it works because the padded values sit at the right end of the sequence, so with the upper triangular (causal) mask the valid timesteps never attend to the padded indices, and the padded timesteps themselves are excluded from the loss. Well, thanks again. :)
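
A tiny sanity check of this reasoning (illustrative only, not from the repository): with right-aligned padding, every non-padded query position attends only to earlier positions under the causal mask, all of which are also non-padded.

```python
import torch

seq_len, valid_len = 6, 4                                    # the last two positions are padding
causal_allowed = torch.tril(torch.ones(seq_len, seq_len)).bool()
is_pad_key = torch.arange(seq_len) >= valid_len

# Every non-padded query row attends only to non-padded keys under the causal mask.
valid_rows = causal_allowed[:valid_len]
assert not (valid_rows & is_pad_key[None, :]).any()
```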