Open tianjianjiang opened 2 years ago
Hi @gordicaleksa,
Any update on this one? I'm hitting the same error.
Edit: commenting out the assert solved it, lol; it might have changed the behavior, but it works for me.
Sorry for the belated update. I haven't made the PR as promised, but let me share a simple fix here, since the PR would be really small and probably too trivial to be worth submitting anyway.
Two caveats:
1. Use torch.distributed.launch (or torchrun) with --nnodes=1 --nproc_per_node=$N_GPUS instead of deepspeed --num_gpus $N_GPUS. In this pretrain_gpt_single_node.sh, there's no --deepspeed --deepspeed_config ds_config.json, and I'm not sure whether that is truly mandatory.
2. For reference, see pretrain_gpt_distributed.sh and the SLURM scripts in https://github.com/bigscience-workshop/bigscience/tree/master/train, minus their multi-node parts. The deepspeed runner may not really like the custom part, but I am not confident about the fairseq, DeepSpeed, and Megatron customizations.
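For concreteness, the launcher swap described above can be sketched as a shell snippet. This is an assumption-laden illustration: the GPU count and the pretrain_gpt.py entry point are placeholders, not taken from the repo.

```shell
# Assumed single-node setup with 4 GPUs; pretrain_gpt.py stands in for
# whatever training entry point the wrapper script actually invokes.
N_GPUS=4

# Instead of launching through the DeepSpeed runner:
#   deepspeed --num_gpus $N_GPUS pretrain_gpt.py ...
# launch through torchrun (the successor of torch.distributed.launch):
echo torchrun --nnodes=1 --nproc_per_node=$N_GPUS pretrain_gpt.py
```

The echo is only there so the command is visible on a machine without GPUs; drop it to actually launch.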
There's --checkpoint-activations, which I presumed came from fairseq, and yet we might also want --deepspeed-activation-checkpointing from DeepSpeed for Megatron; it isn't in this pretrain_gpt_single_node.sh script, but it is almost always present in https://github.com/bigscience-workshop/bigscience/tree/master/train. This pretrain_gpt_single_node.sh is from before #306.
Commenting works for me too
- It seemed to me that DeepSpeed might expect something more since a certain revision (I tried to find that version but failed). For this pretrain_gpt_single_node.sh, there's no --deepspeed --deepspeed_config ds_config.json, which I'm not sure is truly mandatory.
It was DeepSpeed 0.8.3.
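For reference, a minimal ds_config.json sketch under assumptions: train_batch_size and fp16.enabled are standard DeepSpeed config keys, but the values here are placeholders, not what this repo's scripts actually expect.

```shell
# Write a minimal, hypothetical DeepSpeed config; the values are guesses
# for illustration only.
cat > ds_config.json <<'EOF'
{
  "train_batch_size": 8,
  "fp16": { "enabled": true }
}
EOF

# The flags under discussion would then be passed explicitly, e.g.:
echo --deepspeed --deepspeed_config ds_config.json
```

Whether these flags are mandatory for this script is exactly the open question above.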
Update
The issue turned out to be the DeepSpeed usages in pretrain_gpt_single_node.sh. I will make a pull request soon.
Original Report
Please let me know what details I shall provide, thank you!
Python 3.7.12, PyTorch 1.11.0+cu113, CUDA 11.3.
Using pretrain_gpt_single_node.sh with the instructions in the README.
Also tried --no-masked-softmax-fusion, but no difference except that line.