NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] Where does the attention_mask come from when the gpt_model is not the first or last pipeline stage? #861

Open · janelu9 opened 2 weeks ago

janelu9 commented 2 weeks ago

I know the hidden_states are the output of the previous stage, but I don't understand how the attention_mask is passed to the next transformer block.
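To make the question concrete, here is a minimal, self-contained sketch in plain PyTorch (not Megatron's actual code; `build_causal_mask` and `StageBlock` are hypothetical names) of the mechanism I am asking about: only the activation tensor crosses the stage boundary, so each stage would have to rebuild the attention_mask locally from information it already has.

```python
import torch

def build_causal_mask(seq_len: int, device=None) -> torch.Tensor:
    # Rebuilt locally on every stage from the sequence length alone;
    # True marks positions that must NOT be attended to (upper triangle).
    return torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=device),
        diagonal=1,
    )

class StageBlock(torch.nn.Module):
    # Stand-in for one pipeline stage's transformer block (hypothetical).
    def __init__(self, hidden: int):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor):
        out, _ = self.attn(hidden_states, hidden_states, hidden_states,
                           attn_mask=attention_mask)
        return out

batch, seq_len, hidden = 2, 8, 64

# Stage 0: owns the embedding input and produces hidden_states.
stage0 = StageBlock(hidden)
embeddings = torch.randn(batch, seq_len, hidden)
mask = build_causal_mask(seq_len)            # built from local batch info
hidden_states = stage0(embeddings, mask)

# "p2p send/recv": only the activation tensor crosses the stage boundary.
received = hidden_states.detach()

# Stage 1: no embedding input; the mask is reconstructed locally,
# not received from the previous stage.
stage1 = StageBlock(hidden)
mask_again = build_causal_mask(received.size(1))
output = stage1(received, mask_again)
print(output.shape)  # torch.Size([2, 8, 64])
```

Is this roughly what happens in Megatron-Core, i.e. does each pipeline rank reconstruct the mask from its own batch (or rely on a causal mask type in the attention kernel) rather than receiving it through the p2p communication that carries hidden_states?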