bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Question about downloading checkpoints of 6.3B,2.5B,1.3B #323

Open misska1 opened 2 years ago

misska1 commented 2 years ago

Does BigScience also provide the original BLOOM checkpoints (without conversion to Hugging Face 🤗)? I am working on fine-tuning BLOOM (6.3B, 2.5B, 1.3B) and I need those checkpoint files. See issues/315.

In https://github.com/bigscience-workshop/bigscience/tree/master/train/tr1-13B-base I found some URLs, but they are all offline.


I created 4 repos at https://huggingface.co/bigscience/ and now we can clone them as the directories the data will be output into:

cd $six_ALL_CCFRSCRATCH/checkpoints/tr1-13B
git clone https://huggingface.co/bigscience/tr1-13B-checkpoints checkpoints
git clone https://huggingface.co/bigscience/tr1-13B-tensorboard tensorboard
git clone https://huggingface.co/bigscience/tr1-13B-codecarbon codecarbon
git clone https://huggingface.co/bigscience/tr1-13B-logs logs
malteos commented 2 years ago

You can convert the HF checkpoints back to Megatron-DeepSpeed. See this (a bit hacky) script: https://gist.github.com/malteos/c194368594e16439c101b7bf27195fd1

misska1 commented 2 years ago

You can convert the HF checkpoints back to Megatron-DeepSpeed. See this (a bit hacky) script: https://gist.github.com/malteos/c194368594e16439c101b7bf27195fd1

@malteos Thank you for your answer! However, in your code I need to specify a DeepSpeed checkpoint:

        "checkpoint_dir",
        type=str,
        help="Path to the DeepSpeed checkpoint directory",

But I do not have a DeepSpeed checkpoint for (6.3B,2.5B,1.3B) to compare.
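For reference, that positional argument comes from a standard argparse setup. A minimal, runnable sketch of the CLI follows; only `checkpoint_dir` and its help text are taken from the excerpt above, the rest (function name, description) is assumed:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Minimal stand-in for the gist's CLI; only "checkpoint_dir" and its
    # help string come from the excerpt above, everything else is assumed.
    parser = argparse.ArgumentParser(
        description="Overwrite a DeepSpeed checkpoint with HF weights (sketch)"
    )
    parser.add_argument(
        "checkpoint_dir",
        type=str,
        help="Path to the DeepSpeed checkpoint directory",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.checkpoint_dir)
```

So the script always expects an existing DeepSpeed checkpoint directory on disk to modify, which is exactly the missing piece here.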

malteos commented 2 years ago

The script updates the weights of a DeepSpeed checkpoint directly on disk with the weights from an HF checkpoint. So you just need to save an untrained DS checkpoint and then update it. You can use the existing Slurm scripts for that and simply set the number of train steps to 1.