bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

About reshape deepspeed checkpoint #343

Open henan991201 opened 2 years ago

henan991201 commented 2 years ago

I want to ask why we cannot make the TP and PP of a checkpoint bigger. For example, making tp=4 when its original tp is 2. I tried to do this but got an assertion error. Thanks for any reply.
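For background (a toy illustration only, not the repo's actual reshaping code): growing TP from 2 to 4 means every existing TP shard of every partitioned weight has to be re-split along its partition dimension, and the matching optimizer state slices have to be re-partitioned the same way, which is what the reshaping tooling would need to support.

import torch

# Toy example: a column-parallel weight already split across tp=2 ranks along dim 0.
full = torch.randn(8, 4)
tp2_shards = list(torch.chunk(full, 2, dim=0))   # what a tp=2 checkpoint stores

# Growing to tp=4: each existing shard must itself be split in two ...
tp4_shards = [piece for shard in tp2_shards for piece in torch.chunk(shard, 2, dim=0)]

# ... and the result matches a direct tp=4 split of the full weight.
assert all(torch.equal(a, b) for a, b in zip(tp4_shards, torch.chunk(full, 4, dim=0)))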

misska1 commented 2 years ago

I need to finetune a 7B model on 4x V100 (32GB), so I can only use zero_stage 0 and convert transformers checkpoints into Megatron models with TP & PP of 4. Looking forward to your work on a script to increase the TP & PP size of a DeepSpeed checkpoint @stas00

stas00 commented 2 years ago

Please see: https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#checkpoint-reshaping
I know it's not generic at the moment; please let me know if you run into any difficulties following those instructions while adapting them to your situation.

@tjruwase, it's probably a good idea to document the topology reshaping process as part of the functionality. Do you have plans to do that? You can totally re-use the steps I shared above. I know you were planning to move all of it into deepspeed.

tjruwase commented 2 years ago

@stas00, thanks for helping with this issue. Yes, documentation is on my TODO. I think linking to your recipe is a great solution for now.

thies1006 commented 1 year ago

I tried to create a universal checkpoint of the 3B model (following the instructions in the link) and I get KeyError: 'param_slice_mappings'. It seems that this key is created by the BF16 optimizer and (if I'm not mistaken) the 3B model was trained in FP16 format. Is it possible to create the universal checkpoint for those models as well? Thank you! @stas00
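One way to check up front whether a checkpoint carries the mapping that ds_to_universal.py expects (a sketch; the shard file name and the 'optimizer_state_dict' nesting are assumptions based on typical DeepSpeed ZeRO checkpoints, so adjust to your layout):

import torch

# Load one ZeRO optimizer-state shard of the checkpoint (file name assumed).
path = "global_step337250/zero_pp_rank_0_mp_rank_00_optim_states.pt"
sd = torch.load(path, map_location="cpu", weights_only=False)  # pickled objects inside

optim_sd = sd.get("optimizer_state_dict", sd)  # nesting assumed
if "param_slice_mappings" in optim_sd:
    print("found param_slice_mappings -> ds_to_universal.py should be able to use it")
else:
    print("no param_slice_mappings -> checkpoint predates universal-checkpoint support")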

tjruwase commented 1 year ago

@thies1006, there is ongoing work to support other models. Can you please try the following combination?

Megatron-DeepSpeed: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/349
DeepSpeed: https://github.com/microsoft/DeepSpeed/pull/2284

thies1006 commented 1 year ago

I tried with those PRs and it seems that the problem stays the same.

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "ds_to_universal.py", line 108, in extract_zero_shards
    param_slice_mappings = optim_sd[PARAM_SLICE_MAPPINGS]
KeyError: 'param_slice_mappings'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "ds_to_universal.py", line 290, in <module>
    main()
  File "ds_to_universal.py", line 269, in main
    _extract_zero_shard_files(args, ds_checkpoint, slice_shapes, temp_dir)
  File "ds_to_universal.py", line 231, in _extract_zero_shard_files
    _do_parallel_work(do_work, work_chunks, args.num_extract_workers)
  File "ds_to_universal.py", line 220, in _do_parallel_work
    pool.map(do_work, batch)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
KeyError: 'param_slice_mappings'

tjruwase commented 1 year ago

Can you confirm that the model was originally trained using zero stage 1?

thies1006 commented 1 year ago

I'm referring to this model: https://huggingface.co/bigscience/bloom-3b-optimizer-states/tree/main/global_step337250
I cannot say for sure whether it was trained with ZeRO stage 1; maybe @stas00 can confirm?

stas00 commented 1 year ago

Only the 176B model was trained on A100s and thus in bf16 (z?); everything else was trained on V100s, thus fp16, thus z1.

tjruwase commented 1 year ago

Got it. In that case, the model was trained before we started adding universal checkpoint support to z1. Resolving this could require regenerating the checkpoints with the new z1 code. Is this feasible, @stas00?

stas00 commented 1 year ago

I wasn't part of this training. @TevenLeScao, do you by chance know who did the bloom-3b training, and whether it's possible to update to deepspeed@master in the conda env, run one more iteration, save the new checkpoint, and update https://huggingface.co/bigscience/bloom-3b-optimizer-states/tree/main/global_step337250?

heraclex12 commented 1 year ago

Hi, any updates? Has anyone overcome this issue?

TevenLeScao commented 1 year ago

Hey, sorry for missing this thread - @stas00 I'm trying that fix.

ShinoharaHare commented 1 year ago

Hi, any updates?

wptoux commented 1 year ago

I tried to generate a universal checkpoint for bloomz-7b on 8 x A100 40GB using the following method.

  1. Using deepspeed_to_deepspeed.py, convert the data-parallel degree of the checkpoint to 8.
  2. In the training code, before starting training, insert the save_checkpoint_and_time(iteration, model, optimizer, lr_scheduler) call so that it can generate a new version of the checkpoint, and have the run load the checkpoint produced in step 1. The reason for saving the checkpoint before training is that the program will OOM once training starts (see the sketch after this list). https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/e52bdabbde3c6895aceb76c1bced295c2646121f/megatron/training.py#L889
  3. Use ds_to_universal.py to convert the checkpoint generated in step 2 into a universal checkpoint.
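A minimal sketch of step 2 (assuming the names at the linked spot in megatron/training.py; the exact insertion point and the args.save guard are assumptions):

# megatron/training.py -- after the model, optimizer and the step-1 checkpoint have
# been loaded, and before the training loop starts, write the checkpoint back out
# with the current code so it lands on disk in the new format.
# save_checkpoint_and_time() already exists in this file.
if args.save:
    save_checkpoint_and_time(iteration, model, optimizer, lr_scheduler)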

LiinXemmon commented 1 year ago

> I tried to generate a universal checkpoint for bloomz-7b on 8 x A100 40GB using the following method.
>
>   1. Using deepspeed_to_deepspeed.py, convert the data-parallel degree of the checkpoint to 8.
>   2. In the training code, before starting training, insert the save_checkpoint_and_time(iteration, model, optimizer, lr_scheduler) call so that it can generate a new version of the checkpoint, and have the run load the checkpoint produced in step 1. The reason for saving the checkpoint before training is that the program will OOM once training starts. https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/e52bdabbde3c6895aceb76c1bced295c2646121f/megatron/training.py#L889
>   3. Use ds_to_universal.py to convert the checkpoint generated in step 2 into a universal checkpoint.

I am attempting to convert bigscience/bloom-7b1-optimizer-states to a universal checkpoint. At your step 3, the temp files generated by ds_to_universal.py fill up my 3.5TB disk. Have you run into this problem?

stas00 commented 1 year ago

7B params * (8 bytes of Adam optimizer states + 4 bytes of fp32 weights) = 84GB, so you should end up with a roughly 84GB universal checkpoint.

Can you check with du where the bloat comes from?

e.g.

du -ahd1 /path | sort -rh

will help you sort the entries by size so the large files show up first.
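For reference, the size estimate above in code form (7.1e9 is an approximate parameter count for bloom-7b1):

# Per parameter: 4 bytes of fp32 master weights + 8 bytes of Adam optimizer states
# (exp_avg + exp_avg_sq), all kept in fp32 in the universal checkpoint.
n_params = 7.1e9
expected_gb = n_params * (4 + 8) / 1e9
print(f"expected universal checkpoint size: ~{expected_gb:.0f} GB")  # ~85 GB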

LiinXemmon commented 1 year ago

> 7B params * (8 bytes of Adam optimizer states + 4 bytes of fp32 weights) = 84GB, so you should end up with a roughly 84GB universal checkpoint.
>
> Can you check with du where the bloat comes from?
>
> e.g.
>
> du -ahd1 /path | sort -rh
>
> will help you sort the entries by size so the large files show up first.

Sorry, I did not make it clear. I first use deepspeed_to_deepspeed.py to convert the bigscience/bloom-7b1-optimizer-states checkpoint to (TP=1, PP=1, DP=1). The converted checkpoint layout is shown in the attached screenshot.

Then I load the checkpoint, train for 1 iteration, and save the checkpoint. After that I run ds_to_universal.py. I never get to the final universal checkpoint files, because during the first stage, "Extracting ZeRO fragments", it generates some very large files under /output_folder/tmp/ (see the attached screenshot; those are only part of them). The total size can grow beyond 3.5TB and the conversion ends with "no more disk space". I have no idea why they get that big. Do I just need a larger SSD, or is something wrong?

stas00 commented 1 year ago

Thank you for the details, @LiinXemmon

@tjruwase, is it possible we are hitting the tensor bloat here as well, the same issue you're fixing in https://github.com/microsoft/DeepSpeed/pull/3348?

@LiinXemmon, could you please confirm if one of these huge files shrinks to a much smaller size if you run this script? https://github.com/stas00/toolbox/blob/master/pytorch/torch-checkpoint-shrink.py

You could run something like:

./torch-checkpoint-shrink.py --checkpoint_dir /output_folder/tmp/ --patterns 'tied*weight'

to try it on the first file from your last listing, for example. Thank you!
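The underlying issue a shrink pass addresses is that torch.save serializes the entire storage a tensor references, so a small view of a huge flat buffer gets written out in full unless it is cloned first. A small diagnostic along the same lines (a sketch; the file name is a placeholder and the storage API assumes PyTorch >= 2.0):

import torch

# Report tensors whose backing storage is much larger than the tensor itself.
state = torch.load("output_folder/tmp/SOME_HUGE_FILE", map_location="cpu", weights_only=False)

def report_bloat(obj, prefix=""):
    if torch.is_tensor(obj):
        storage_bytes = obj.untyped_storage().nbytes()
        tensor_bytes = obj.numel() * obj.element_size()
        if storage_bytes > 2 * tensor_bytes:
            print(f"{prefix}: tensor {tensor_bytes / 2**20:.1f} MiB, "
                  f"storage {storage_bytes / 2**20:.1f} MiB")
    elif isinstance(obj, dict):
        for k, v in obj.items():
            report_bloat(v, f"{prefix}.{k}" if prefix else str(k))
    elif isinstance(obj, (list, tuple)):
        for i, v in enumerate(obj):
            report_bloat(v, f"{prefix}[{i}]")

report_bloat(state)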

tjruwase commented 1 year ago

@stas00, yes, it is likely the same issue. The outcome of running torch-checkpoint-shrink.py would be a useful data point.