YadaYuki / news-recommendation-llm

Pre-trained Large Language Model (BERT) Based News Recommendation using Python / PyTorch 🌎
MIT License

About the padding mask in the attention structure #21

Open YiboZhao624 opened 6 months ago

YiboZhao624 commented 6 months ago

Dear YadaYuki,

I hope this message finds you well, and I apologize for reaching out once more. While looking through src/recommendation/PLMBasedNewsEncoder.py, I noticed that although the code uses the pretrained bert-base-uncased model, the input passed to it consists only of the token IDs of the texts, without an accompanying padding (attention) mask.

The relevant snippet of code is as follows:

V = self.plm(input_val).last_hidden_state  # [batch_size, seq_len] -> [batch_size, seq_len, hidden_size]   
multihead_attn_output, _ = self.multihead_attention(V, V, V)

In the data processing code, around line 198 of src/mind/MINDDataset.py, the transform function likewise does not return a padding mask, so the omission looks intentional. I would appreciate some insight into the rationale behind this decision.
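
For reference, here is a minimal sketch of how an attention mask could be produced alongside the token IDs. This is only an illustration I put together, not the repository's actual transform; the helper name and its arguments are my assumptions.

from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_with_mask(texts: list[str], max_length: int = 30) -> dict[str, torch.Tensor]:
    # The tokenizer can return the attention mask together with the input IDs:
    # 1 marks real tokens, 0 marks [PAD] positions.
    encoded = tokenizer(
        texts,
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    return {"input_ids": encoded["input_ids"], "attention_mask": encoded["attention_mask"]}

# If only the token IDs are stored, the mask can also be reconstructed afterwards:
# attention_mask = (input_ids != tokenizer.pad_token_id).long()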

Thank you for your time and consideration.

Sincerely, Yibo

dabasmoti commented 3 weeks ago

I have added this:

V = self.plm(input_ids=input_val, attention_mask=attention_mask).last_hidden_state  # [batch_size, seq_len] -> [batch_size, seq_len, hidden_size]
multihead_attn_output, _ = self.multihead_attention(V, V, V)
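
If it helps, here is a sketch of how the same mask could also be forwarded to the downstream nn.MultiheadAttention layer, assuming attention_mask follows the Hugging Face convention of 1 for real tokens and 0 for padding:

# key_padding_mask expects True at positions that should be ignored,
# i.e. the inverse of the Hugging Face attention_mask.
key_padding_mask = attention_mask == 0  # [batch_size, seq_len], bool

V = self.plm(input_ids=input_val, attention_mask=attention_mask).last_hidden_state
multihead_attn_output, _ = self.multihead_attention(
    V, V, V, key_padding_mask=key_padding_mask
)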