bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

"Mask is silently ignored due to the use of a custom kernel" with pretrain_gpt_single_node.sh #320

Open tianjianjiang opened 2 years ago

tianjianjiang commented 2 years ago

Update

The issue turned out to be how pretrain_gpt_single_node.sh uses DeepSpeed. I will make a pull request soon.

Original Report

Please let me know what further details I should provide, thank you!

Python 3.7.12, PyTorch 1.11.0+cu113, CUDA 11.3

Running pretrain_gpt_single_node.sh following the instructions in the README gives:

```
File "[...]/megatron/model/fused_softmax.py", line 191, in forward_fused_softmax
    assert mask is None, "Mask is silently ignored due to the use of a custom kernel"
```

I also tried --no-masked-softmax-fusion, but there was no difference except for which line raises:

```
File "[...]/megatron/model/fused_softmax.py", line 218, in forward_torch_softmax
    assert mask is None
```
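To illustrate why toggling the flag only moves the failure from one assert to the other, here is a minimal, self-contained sketch — an approximation, not the actual megatron/model/fused_softmax.py — of the two code paths. If I read the asserts right, both paths expect the mask argument to be None and apply the causal mask themselves.

```python
# Minimal sketch (NOT the real megatron/model/fused_softmax.py) of the two paths
# hit in the tracebacks above. In both, an explicit mask tensor is rejected and
# an upper-triangular (causal) mask is applied internally instead.
import torch


def forward_fused_softmax(scores: torch.Tensor, mask=None) -> torch.Tensor:
    # Stand-in for the custom-kernel path (line 191 in the traceback).
    assert mask is None, "Mask is silently ignored due to the use of a custom kernel"
    causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    return torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)


def forward_torch_softmax(scores: torch.Tensor, mask=None) -> torch.Tensor:
    # Stand-in for the fallback reached with --no-masked-softmax-fusion
    # (line 218 in the traceback); it makes the same demand.
    assert mask is None
    causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    return torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)


if __name__ == "__main__":
    scores = torch.randn(1, 2, 4, 4)           # [batch, heads, seq, seq]
    forward_torch_softmax(scores, mask=None)   # works
    try:
        forward_torch_softmax(scores, mask=torch.zeros(1, 2, 4, 4, dtype=torch.bool))
    except AssertionError:
        print("explicit mask rejected, as in the report")
```

So either way the model ends up being handed an explicit attention mask that neither path is willing to consume.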
gordicaleksa commented 1 year ago

Any update on this one? Hitting the same error.

edit: commenting out the assert solved it lol, might have changed the behavior but it works for me...
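Concretely, the edit amounts to something like this (line number taken from the traceback above; I can't promise it doesn't change the numerics, since the explicit mask is then just dropped):

```python
# megatron/model/fused_softmax.py, around line 191 (and likewise line 218 for the
# non-fused fallback): neutralize the check so the mask argument is simply
# ignored; the custom kernel ignores it anyway, per its own error message.

# assert mask is None, "Mask is silently ignored due to the use of a custom kernel"
```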

tianjianjiang commented 1 year ago

Hi @gordicaleksa,

> Any update on this one? Hitting the same error.
>
> edit: commenting out the assert solved it lol, might have changed the behavior but it works for me...

Sorry for the belated update. I never made the PR as promised, but let me share a simple fix here, since the change would be very small and probably too simple to be a significant PR anyway.

Two caveats:

Quick-and-dirty solution

Speculation

liutaocode commented 1 year ago

Commenting it out works for me too.

dajiji commented 1 year ago
> It seemed to me that DeepSpeed might expect something more since a certain revision (I tried to find that version but failed). For this pretrain_gpt_single_node.sh, there's no --deepspeed --deepspeed_config ds_config.json, which I'm not sure is truly mandatory.

It was DeepSpeed 0.8.3.
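For reference, a sketch of what explicitly wiring DeepSpeed into pretrain_gpt_single_node.sh could look like. The variable names (GPT_ARGS, OUTPUT_ARGS) and config values below are placeholders, not the script's actual contents:

```bash
# Sketch only, not the script's verbatim contents: generate a minimal
# ds_config.json and hand it to the launcher with --deepspeed/--deepspeed_config.
# GPT_ARGS / OUTPUT_ARGS stand for the argument groups the script already builds
# (names assumed), and the config values are placeholders.
DS_CONFIG=ds_config.json

cat > "$DS_CONFIG" <<'EOF'
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 1,
  "fp16": { "enabled": true }
}
EOF

# Keep these values consistent with --micro-batch-size / --global-batch-size
# passed to pretrain_gpt.py, otherwise the two sides disagree about batch sizes.
deepspeed pretrain_gpt.py \
  $GPT_ARGS \
  $OUTPUT_ARGS \
  --deepspeed \
  --deepspeed_config "$DS_CONFIG"
```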