AnaRhisT94 opened this issue 2 years ago
Hi, I am not sure, but in the original Megatron code there was an argument (I don't remember the name) that resets the optimizer, dataloader, etc., which you could use for finetuning. I am not sure whether it is present or works in this repo.
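For context, a flag like this typically loads the model weights from a pretraining checkpoint while discarding the optimizer state and iteration counter. The sketch below illustrates that behavior in plain PyTorch; the function name and checkpoint keys are hypothetical, not taken from Megatron's actual code.

```python
# Hypothetical sketch of a Megatron-style "finetune" option: load model
# weights from a pretraining checkpoint, but discard the optimizer state
# and iteration counter so finetuning restarts fresh.
import torch

def load_for_finetuning(model, optimizer, ckpt_path, finetune=True):
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    if not finetune:
        # Resuming pretraining: restore optimizer state and step count too.
        optimizer.load_state_dict(ckpt["optimizer"])
        return ckpt.get("iteration", 0)
    # Finetuning: keep the freshly initialized optimizer, start at step 0.
    return 0
```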
Hey @mayank31398, just wondering: is pretrain_gpt.py used for pretraining BLOOM models? If yes, are the architectures for GPT and BLOOM the same? I see different implementations for GPT and BLOOM in Hugging Face Transformers.
Also, I am trying to finetune the StarCoder model using Megatron-DeepSpeed 3D parallelism. Can you give some idea of how it can be done?
This is the script used for launching 176B: https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/tr11-176B-ml.slurm
The architectures are not the same, since BLOOM uses ALiBi while GPT uses absolute position embeddings.
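To make the architectural difference concrete, here is a minimal sketch (not the actual BLOOM or GPT code) contrasting the two schemes: ALiBi adds a per-head linear distance penalty directly to the attention scores and uses no position embeddings at all, while GPT-style models add a learned embedding per absolute position to the token embeddings at the input layer.

```python
# Minimal illustrative sketch, not the production implementation.
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # ALiBi (BLOOM): each head gets a slope from the geometric sequence
    # 2^(-8/n), 2^(-16/n), ..., 2^(-8), and adds slope * (key_pos - query_pos)
    # to its raw attention scores. No position embeddings are used.
    start = 2.0 ** (-8.0 / n_heads)
    slopes = torch.tensor([start ** (i + 1) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]            # (seq, seq)
    return slopes[:, None, None] * distance[None]     # (heads, seq, seq)

def absolute_position_embeddings(seq_len: int, d_model: int) -> torch.Tensor:
    # GPT-2 style: a learned embedding table indexed by absolute position,
    # added once to the token embeddings at the input.
    wpe = torch.nn.Embedding(seq_len, d_model)
    return wpe(torch.arange(seq_len))                 # (seq, d_model)
```

Because the ALiBi bias depends only on relative distance, it is one reason BLOOM extrapolates to sequence lengths beyond those seen in training, which absolute embeddings cannot do.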
For StarCoder, 4D parallelism is used: tensor parallel, pipeline parallel, sequence parallel, and data parallel. This is the repo used for StarCoder and SantaCoder training: https://github.com/bigcode-project/Megatron-LM
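A quick note on how those degrees combine: the sketch below shows the back-of-the-envelope GPU count. The example sizes are illustrative, not the exact StarCoder configuration.

```python
# Back-of-the-envelope sketch of how Megatron-style parallelism degrees
# multiply into the total GPU count. Example numbers are illustrative.
def world_size(tensor_parallel: int, pipeline_parallel: int, data_parallel: int) -> int:
    # Sequence parallelism shards activations along the sequence dimension
    # across the *same* ranks as tensor parallelism, so it does not add a
    # fourth factor to the product.
    return tensor_parallel * pipeline_parallel * data_parallel

# e.g. TP=4, PP=12, DP=8 -> 384 GPUs
```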
Thank You.
Hi,
What's the process for finetuning BLOOM? Did anyone succeed and is willing to share the code?
Thanks!