"Mask is silently ignored due to the use of a custom kernel" with pretrain_gpt_single_node.sh

tianjianjiang commented 2 years ago

Update

The issue turned out to be DeepSpeed usages of pretrain_gpt_single_node.sh. I will make a pull request soon.

Original Report

Please let me know what details I shall provide, thank you!

Python 3.7.12 pt=1.11.0+cu113, cuda=11.3

Using pretrain_gpt_single_node.sh with the instructions on README:

[...]/megatron/model/fused_softmax.py", line 191, in forward_fused_softmax
    assert mask is None, "Mask is silently ignored due to the use of a custom kernel"

Also tried --no-masked-softmax-fusion but no differences except the line.

[...]/megatron/model/fused_softmax.py", line 218, in forward_torch_softmax
    assert mask is None

gordicaleksa commented 1 year ago

Any update on this one? Hitting the same error.

edit: commenting out the assert solved it lol, might have changed the behavior but it works for me...

tianjianjiang commented 1 year ago

Hi @gordicaleksa,

Any update on this one? Hitting the same error.

edit: commenting out the assert solved it lol, might have changed the behavior but it works for me...

Sorry for the belated update, I haven't really done the PR as promised, but let me share a simple fix for that, since the PR would be really small and probably too simple to be a significant PR anyway.

Two caveats:

I guess turning off that assert should be alright.
I'm writing this based on my vague recollection and speculations, many tech details can be wrong.

Quick-and-dirty solution

Change to torch.distributed.launch (or torchrun) with --nnodes=1 --nproc_per_node=$N_GPUS instead of deepspeed --num_gpus $N_GPUS
- It seemed to me that DeepSpeed might expect something more since a certain revision (I tried to find that version but failed). For this pretrain_gpt_single_node.sh, there's no --deepspeed --deepspeed_config ds_config.json, which I'm not sure it is truly mandatory.
- I ended up imitating pretrain_gpt_distributed.sh and SLURM scripts in https://github.com/bigscience-workshop/bigscience/tree/master/train without multi-node parts of them.

Speculation

Since fused kernel is from Megatron, and we have #306 with some custom steps, I suspect a bare deepspeed runner may not really like the custom part, but I am not confident with fairseq, DeepSpeed, and Megatron customizations.
- For example (which might be entirely irrelevant), we got --checkpoint-activations that I presumed it was from fairseq, and yet we might also have --deepspeed-activation-checkpointing from DeepSpeed for Megatron, which isn't in this pretrain_gpt_single_node.sh script, but almost always in https://github.com/bigscience-workshop/bigscience/tree/master/train.
One thing may or may not be related is that, the modification of pretrain_gpt_single_node.sh is before #306.

liutaocode commented 1 year ago

Commenting works for me too

dajiji commented 1 year ago

It seemed to me that DeepSpeed might expect something more since a certain revision (I tried to find that version but failed). For this pretrain_gpt_single_node.sh, there's no --deepspeed --deepspeed_config ds_config.json, which I'm not sure it is truly mandatory.

It was deepspeed 0.8.3

bigscience-workshop / Megatron-DeepSpeed