Open kisseternity opened 8 months ago
Marking as stale. No activity in 60 days.
Did you figure it out? I also only see a causal mask for training. Inference does use padding, but the attention mask computed by get_ltor_masks_and_position_ids does not account for it.
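For context, a purely causal mask (of the kind discussed above) only blocks future positions; padding positions are left visible. This is a minimal standalone sketch, not Megatron's actual implementation, using the convention that `True` means "masked out":

```python
def causal_mask(seq_len):
    # masked[q][k] is True when query position q must NOT attend to key k,
    # i.e. when k lies in the future relative to q (lower-triangular visibility).
    return [[k > q for k in range(seq_len)] for q in range(seq_len)]

mask = causal_mask(4)
# Row q can see keys 0..q. Note that pad positions are NOT masked here,
# which is exactly the gap being pointed out in this thread.
```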
It turns out the default dataloader in Megatron is designed for pretraining, meaning every sample is expected to be exactly the max sequence length, so no padding is needed. If you want to do fine-tuning, I think padding code needs to be added. You may look at the code in DeepSpeed-Chat for reference.
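For fine-tuning with variable-length samples, the padding code alluded to above would need to combine the causal mask with a padding mask. This is a hypothetical sketch (the function name, `pad_id` value, and `True`-means-masked convention are assumptions, not Megatron code):

```python
def combined_mask(token_ids, pad_id=0):
    # Build a mask for one right-padded sequence: position q may not attend
    # to key k if k is in the future (causal) OR k is a pad token (padding mask).
    n = len(token_ids)
    is_pad = [t == pad_id for t in token_ids]
    return [[k > q or is_pad[k] for k in range(n)] for q in range(n)]

# Example: a 4-slot sequence whose last slot is padding.
mask = combined_mask([5, 6, 7, 0])
# Every row now masks key position 3, so no real token attends to padding.
```

In practice this per-sample mask would be stacked into a [batch, 1, seq, seq] tensor before being fed to the attention layers.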
Thank you! I was hoping I had simply overlooked the padding code.
The Megatron GPTDataset is implemented with batch packing, like https://huggingface.co/docs/trl/sft_trainer#packing-dataset--constantlengthdataset-, which produces samples of constant length. I don't know the performance effect of packing compared to padding, though; perhaps --reset-attention-mask should be used for small datasets to avoid cross-contamination attention between packed documents.
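The effect of --reset-attention-mask on a packed sample can be sketched as a block-diagonal causal mask: attention stays causal and is additionally restricted to the current document, with document boundaries inferred from end-of-document (EOD) tokens. A minimal illustration, assuming an `eod_id` sentinel and the `True`-means-masked convention (the exact boundary semantics of the EOD token itself may differ in Megatron):

```python
def packed_mask(token_ids, eod_id=-1):
    # Assign each position a document index; the index increments
    # after every EOD token.
    n = len(token_ids)
    doc, d = [], 0
    for t in token_ids:
        doc.append(d)
        if t == eod_id:
            d += 1
    # Mask key k for query q if k is in the future (causal) or belongs
    # to a different packed document (prevents cross-contamination).
    return [[k > q or doc[k] != doc[q] for k in range(n)] for q in range(n)]

# Two packed documents: [1, EOD] followed by [2, 3].
mask = packed_mask([1, -1, 2, 3])
# Positions 2 and 3 cannot attend back into the first document.
```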
Your question: Hello, as far as I know about Megatron, I've only seen a padding mask in the BERT implementation. Yet in the Hugging Face Transformers library, the Llama model also takes a padding mask, while Megatron's attention mask is just the causal mask. Am I right, or am I missing something? Please take a look, thanks.