NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] Where does the attention_mask come from when the gpt_model is not the first or last pipeline stage? #861

Open · janelu9 opened 2 weeks ago

janelu9 commented 2 weeks ago

I know the hidden_states are the output of the previous stage, but I don't understand how the attention_mask is passed to the next transformer block.
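To make the question concrete, here is a minimal, self-contained sketch in plain PyTorch (not Megatron's actual code; `build_causal_mask` and `StageBlock` are hypothetical names) of the mechanism I am asking about: only the activation tensor crosses the stage boundary, so each stage would have to rebuild the attention_mask locally from information it already has.

```python
import torch

def build_causal_mask(seq_len: int, device=None) -> torch.Tensor:
    # Rebuilt locally on every stage from the sequence length alone;
    # True marks positions that must NOT be attended to (upper triangle).
    return torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=device),
        diagonal=1,
    )

class StageBlock(torch.nn.Module):
    # Stand-in for one pipeline stage's transformer block (hypothetical).
    def __init__(self, hidden: int):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor):
        out, _ = self.attn(hidden_states, hidden_states, hidden_states,
                           attn_mask=attention_mask)
        return out

batch, seq_len, hidden = 2, 8, 64

# Stage 0: owns the embedding input and produces hidden_states.
stage0 = StageBlock(hidden)
embeddings = torch.randn(batch, seq_len, hidden)
mask = build_causal_mask(seq_len)            # built from local batch info
hidden_states = stage0(embeddings, mask)

# "p2p send/recv": only the activation tensor crosses the stage boundary.
received = hidden_states.detach()

# Stage 1: no embedding input; the mask is reconstructed locally,
# not received from the previous stage.
stage1 = StageBlock(hidden)
mask_again = build_causal_mask(received.size(1))
output = stage1(received, mask_again)
print(output.shape)  # torch.Size([2, 8, 64])
```

Is this roughly what happens in Megatron-Core, i.e. does each pipeline rank reconstruct the mask from its own batch (or rely on a causal mask type in the attention kernel) rather than receiving it through the p2p communication that carries hidden_states?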