NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] Should llama or gpt-like models have padding attention mask? #536

Open kisseternity opened 8 months ago

kisseternity commented 8 months ago

Your question: Hello, as far as I know, Megatron only applies a padding mask in its BERT implementation. Yet in the Hugging Face Transformers library, the Llama model also takes a padding mask, while Megatron's attention mask is just the causal mask. Am I right, or am I missing something? Please take a look, thanks.
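
For illustration only (a minimal PyTorch sketch, not code from either library): the difference in question is whether the mask is purely causal or also zeroes out padded key positions, which is what Hugging Face derives from the `attention_mask` passed to the model.

```python
import torch

seq_len = 6
# Hypothetical right-padded sample: 1 = real token, 0 = padding.
padding_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])  # (batch, seq_len)

# Pure causal mask: position i may attend to positions <= i.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Causal combined with "key position is not padding".
combined = causal.unsqueeze(0) & padding_mask.bool().unsqueeze(1)

print(causal.int())    # lower-triangular ones
print(combined.int())  # the last two key columns are additionally zeroed
```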

github-actions[bot] commented 6 months ago

Marking as stale. No activity in 60 days.

drxmy commented 3 months ago

Did you figure it out? I also only see a causal mask for training. Inference has padding, but the attention mask computed by `get_ltor_masks_and_position_ids` does not take padding into account.
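
As a reference point, here is a simplified sketch (assumed shapes, not the actual Megatron source) of what a left-to-right mask builder like `get_ltor_masks_and_position_ids` conceptually produces; note that it depends only on the sequence length, so padding tokens are attended to like any other token.

```python
import torch

def ltor_mask_and_position_ids_sketch(tokens: torch.Tensor):
    """Simplified illustration. `tokens` is (batch, seq_len) token ids."""
    batch, seq_len = tokens.shape
    # Lower-triangular causal mask, shared across the batch, shaped
    # (1, 1, seq_len, seq_len) for broadcasting over batch and heads.
    attention_mask = torch.tril(
        torch.ones(1, 1, seq_len, seq_len, device=tokens.device)
    ).bool()
    # Position ids 0..seq_len-1 for every sample, regardless of padding.
    position_ids = torch.arange(seq_len, device=tokens.device)
    position_ids = position_ids.unsqueeze(0).expand(batch, -1)
    return attention_mask, position_ids

tokens = torch.randint(0, 100, (2, 8))
mask, pos = ltor_mask_and_position_ids_sketch(tokens)
print(mask.shape, pos.shape)  # torch.Size([1, 1, 8, 8]) torch.Size([2, 8])
```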

kisseternity commented 3 months ago

> Did you figure it out? I also only see a causal mask for training. Inference has padding, but the attention mask computed by `get_ltor_masks_and_position_ids` does not take padding into account.

It turns out the default dataloader in Megatron is designed for pretraining, meaning every sample is expected to be exactly the max sequence length, so no padding is needed. If you want to do fine-tuning, I think padding code needs to be added. You may look at the code in DeepSpeed-Chat for reference.
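
If it helps, here is one hypothetical way to add that padding for fine-tuning (a sketch with an assumed pad token id and collate function, not something Megatron provides): pad variable-length samples to the batch maximum and exclude the padding from both the attention mask and the loss.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumed pad token id

def collate_for_finetuning(samples):
    """`samples` is a list of 1-D LongTensors of token ids (hypothetical)."""
    tokens = pad_sequence(samples, batch_first=True, padding_value=PAD_ID)
    batch, seq_len = tokens.shape

    is_real = tokens.ne(PAD_ID)        # (batch, seq_len)
    loss_mask = is_real.float()        # do not compute loss on padding

    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Causal AND key-is-not-padding, shaped (batch, 1, seq_len, seq_len).
    attention_mask = (causal.unsqueeze(0) & is_real.unsqueeze(1)).unsqueeze(1)

    position_ids = torch.arange(seq_len).unsqueeze(0).expand(batch, -1)
    return tokens, attention_mask, loss_mask, position_ids

batch = collate_for_finetuning([torch.randint(1, 100, (5,)),
                                torch.randint(1, 100, (8,))])
print([t.shape for t in batch])
```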

drxmy commented 3 months ago

> Did you figure it out? I also only see a causal mask for training. Inference has padding, but the attention mask computed by `get_ltor_masks_and_position_ids` does not take padding into account.

> It turns out the default dataloader in Megatron is designed for pretraining, meaning every sample is expected to be exactly the max sequence length, so no padding is needed. If you want to do fine-tuning, I think padding code needs to be added. You may look at the code in DeepSpeed-Chat for reference.

Thank you! I was hoping I had just not found the padding code.

XLzed commented 3 months ago

The Megatron `GPTDataset` packs samples into constant-length sequences, similar to https://huggingface.co/docs/trl/sft_trainer#packing-dataset--constantlengthdataset-. I don't know the performance impact of packing compared to padding, though; perhaps `--reset-attention-mask` should be used for small datasets to avoid cross-contamination attention between packed documents.
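
Conceptually, resetting the attention mask at document boundaries restricts the causal mask to a block per packed document, which is what prevents that cross-contamination. A small sketch under assumed inputs (an assumed end-of-document token id, not Megatron's implementation):

```python
import torch

EOD = 2  # assumed end-of-document token id

def causal_mask_with_document_reset(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (seq_len,) packed token ids. Returns a (seq_len, seq_len) bool mask."""
    seq_len = tokens.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Assign each position a document id that increments after every EOD token,
    # shifted so each EOD stays with the document it ends.
    doc_id = torch.cumsum((tokens == EOD).long(), dim=0)
    doc_id = torch.cat([torch.zeros(1, dtype=torch.long), doc_id[:-1]])
    # Only allow attention within the same document (block-diagonal within causal).
    same_doc = doc_id.unsqueeze(0) == doc_id.unsqueeze(1)
    return causal & same_doc

packed = torch.tensor([5, 7, EOD, 9, 4, 6, EOD, 8])
print(causal_mask_with_document_reset(packed).int())
```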

github-actions[bot] commented 1 month ago

Marking as stale. No activity in 60 days.