Convert DeepSpeed Checkpoint to Megatron Checkpoint
args = Namespace(for_release=False, input_folder='checkpoints/tr11b-1B3-ml/checkpoints/main/global_step1/', output_folder='./trans_checkpoints', target_pp=1, target_tp=1)
Converting DeepSpeed checkpoint in checkpoints/tr11b-1B3-ml/checkpoints/main/global_step1/ to Megatron checkpoint in ./trans_checkpoints
Traceback (most recent call last):
  File "tools/convert_checkpoint/deepspeed_to_megatron.py", line 187, in <module>
    main()
  File "tools/convert_checkpoint/deepspeed_to_megatron.py", line 173, in main
    ds_checkpoint = DeepSpeedCheckpoint(args.input_folder, args.target_tp,
  File "/data/anaconda3/envs/ds/lib/python3.8/site-packages/deepspeed/checkpoint/deepspeed_checkpoint.py", line 72, in __init__
    self.zero_checkpoint = ZeROCheckpoint(dir)
  File "/data/anaconda3/envs/ds/lib/python3.8/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 26, in __init__
    assert self.num_files > 0, f'No ZeRO files found in {dir}'
AssertionError: No ZeRO files found in checkpoints/tr11b-1B3-ml/checkpoints/main/global_step1/
No ZeRO files were written when the checkpoint was saved during pretraining.
I ran into the same problem. My guess is that ZERO_STAGE=0 and --fp16 do not work together, since with that combination no ZeRO files are generated.
I don't know how to solve it, though.
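To confirm that the saved checkpoint really contains no ZeRO files before running the converter, a small check like the one below may help. It is only a sketch: `find_zero_files` is a hypothetical helper, not part of DeepSpeed, and the filename prefixes are an assumption about how DeepSpeed names its ZeRO optimizer-state files, so compare them with `deepspeed/checkpoint/constants.py` in your installed version.

```python
import os

# ASSUMPTION: ZeRO optimizer-state files start with one of these prefixes.
# Verify against deepspeed/checkpoint/constants.py in your environment.
ASSUMED_ZERO_PREFIXES = ("zero_pp_rank_", "fp16_zero_pp_rank_", "bf16_zero_pp_rank_")


def find_zero_files(folder, prefixes=ASSUMED_ZERO_PREFIXES):
    """Return the sorted names of regular files in `folder` that start with
    one of `prefixes`. Returns an empty list if `folder` does not exist."""
    if not os.path.isdir(folder):
        return []
    prefixes = tuple(prefixes)  # str.startswith needs a str or a tuple of str
    return sorted(
        name
        for name in os.listdir(folder)
        if name.startswith(prefixes)
        and os.path.isfile(os.path.join(folder, name))
    )


if __name__ == "__main__":
    # Path taken from the log above; adjust to your own checkpoint folder.
    step_dir = "checkpoints/tr11b-1B3-ml/checkpoints/main/global_step1/"
    zero_files = find_zero_files(step_dir)
    if zero_files:
        print(f"Found {len(zero_files)} ZeRO file(s) in {step_dir}")
    else:
        print(f"No ZeRO files in {step_dir}; the converter's assertion will fail")
```

If this reports no files, the problem is on the saving side (the training configuration), not in the conversion script itself.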