YadaYuki / news-recommendation-llm

Pre-trained Large Language Model (BERT) Based News Recommendation using Python / PyTorch 🌎
MIT License

About the padding mask in the attention structure #21

Open YiboZhao624 opened 6 months ago

YiboZhao624 commented 6 months ago

Dear YadaYuki,

I hope this message finds you well, and I apologize for reaching out once more. While looking through src/recommendation/PLMBasedNewsEncoder.py, I noticed that although the code uses the pretrained bert-base-uncased model, the input passed to it consists only of the token IDs of the texts, without an accompanying padding (attention) mask.

The relevant snippet of code is as follows:

V = self.plm(input_val).last_hidden_state  # [batch_size, seq_len] -> [batch_size, seq_len, hidden_size]   
multihead_attn_output, _ = self.multihead_attention(V, V, V)

In the data processing code, around line 198 of src/mind/MINDDataset.py, the transform function likewise does not return a padding mask, so the omission looks intentional. I would appreciate some insight into the rationale behind this decision.
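
For reference, here is a minimal sketch of how an attention mask could be produced alongside the token IDs. This is only an illustration I put together, not the repository's actual transform; the helper name and its arguments are my assumptions.

from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_with_mask(texts: list[str], max_length: int = 30) -> dict[str, torch.Tensor]:
    # The tokenizer can return the attention mask together with the input IDs:
    # 1 marks real tokens, 0 marks [PAD] positions.
    encoded = tokenizer(
        texts,
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    return {"input_ids": encoded["input_ids"], "attention_mask": encoded["attention_mask"]}

# If only the token IDs are stored, the mask can also be reconstructed afterwards:
# attention_mask = (input_ids != tokenizer.pad_token_id).long()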

Thank you for your time and consideration.

Sincerely, Yibo

dabasmoti commented 3 weeks ago

I have added this:

V = self.plm(input_ids=input_val, attention_mask=attention_mask).last_hidden_state  # [batch_size, seq_len] -> [batch_size, seq_len, hidden_size]
multihead_attn_output, _ = self.multihead_attention(V, V, V)
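
If it helps, here is a sketch of how the same mask could also be forwarded to the downstream nn.MultiheadAttention layer, assuming attention_mask follows the Hugging Face convention of 1 for real tokens and 0 for padding:

# key_padding_mask expects True at positions that should be ignored,
# i.e. the inverse of the Hugging Face attention_mask.
key_padding_mask = attention_mask == 0  # [batch_size, seq_len], bool

V = self.plm(input_ids=input_val, attention_mask=attention_mask).last_hidden_state
multihead_attn_output, _ = self.multihead_attention(
    V, V, V, key_padding_mask=key_padding_mask
)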