OpenThaiGPT / openthaigpt-pretraining


[Super AI] Complete training pipeline #179

Open · ArthurMinovsky opened this issue 1 year ago

ArthurMinovsky commented 1 year ago

Setup:

  1. Make sure GPT-J, LLaMA, and Falcon can be trained on the current pipeline with (see the training-setup sketch after this list):
    • [x] Flash Attention enabled (prefer PyTorch 2.0)
    • [ ] torch.compile works
    • [x] Gradient checkpointing enabled
    • [x] DeepSpeed enabled
    • [x] LoRA enabled (no need for LLaMA)
    • [x] WandB enabled
    • [x] Training can be resumed if an error occurs during training
    • [x] Can load weights from Hugging Face (no need for LLaMA)
  2. Make sure GPT-J and Falcon can be used with the following optimizers (see the optimizer sketch after this list):
    • [ ] 8-bit Adam
    • [ ] 1-bit Adam
    • [x] Lion
    • [ ] 8-bit Adam with DeepSpeed stage 2
    • [ ] 1-bit Adam with DeepSpeed stage 2
    • [x] Lion with DeepSpeed stage 2
  3. Make sure our training pipeline contains the following (see the tokenizer sketch after this list):
    • [x] Script to train a tokenizer from a single Hugging Face dataset
    • [x] Script to convert a Hugging Face dataset to a tokenized Hugging Face dataset
    • [x] Script to train from a tokenized Hugging Face dataset
  4. Make sure the tokenizer merging code works with Falcon, MPT, and GPT-J (also covered by the tokenizer sketch)
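
A rough sketch of how the item-1 toggles could fit together around the Hugging Face Trainer. The model ID, tokenized dataset path, DeepSpeed config filename, and LoRA hyperparameters are placeholders, not our actual scripts:

```python
# Sketch only: all item-1 toggles in one place; names and paths are placeholders.
import torch
from datasets import load_from_disk
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_ID = "EleutherAI/gpt-j-6b"  # or a LLaMA / Falcon checkpoint

# Load weights from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Gradient checkpointing: recompute activations in the backward pass to save memory.
model.gradient_checkpointing_enable()

# LoRA: train low-rank adapters instead of all weights.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# torch.compile (PyTorch 2.0); Flash Attention is assumed to come from PyTorch 2.0's
# scaled_dot_product_attention path when the model's attention uses it.
model = torch.compile(model)

args = TrainingArguments(
    output_dir="ckpts",
    per_device_train_batch_size=4,
    bf16=True,
    deepspeed="ds_config_stage2.json",  # DeepSpeed enabled via a JSON config
    report_to="wandb",                  # WandB logging
    save_steps=500,                     # periodic checkpoints make resuming possible
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=load_from_disk("tokenized_dataset"),  # output of the item-3 script
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Resume from the last checkpoint in output_dir if a previous run crashed.
trainer.train(resume_from_checkpoint=True)
```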
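
For item 2, a sketch of one way to build the optimizers outside of DeepSpeed, plus an illustrative ZeRO stage-2 config. It assumes bitsandbytes for 8-bit Adam and the lion-pytorch package for Lion; learning rates and batch sizes are placeholders:

```python
# Sketch only: optimizer construction for item 2; hyperparameters are illustrative.
import bitsandbytes as bnb
from lion_pytorch import Lion

def build_optimizer(model, name: str, lr: float = 1e-4):
    params = [p for p in model.parameters() if p.requires_grad]
    if name == "adam8bit":
        # 8-bit Adam keeps optimizer state in 8-bit to cut memory.
        return bnb.optim.Adam8bit(params, lr=lr)
    if name == "lion":
        # Lion is usually run with a smaller lr and larger weight decay than AdamW.
        return Lion(params, lr=lr / 3, weight_decay=0.1)
    raise ValueError(f"unknown optimizer: {name}")

# Minimal ZeRO stage-2 config to pair with the optimizer above.
# 1-bit Adam would instead be selected inside the DeepSpeed config itself
# ("optimizer": {"type": "OneBitAdam", ...}); whether that combines cleanly with
# stage 2 is exactly what the unchecked boxes above are meant to verify.
ds_config_stage2 = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}
```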
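
For items 3-4, a sketch of training a Thai tokenizer from a Hugging Face dataset and doing a naive vocabulary merge into an existing tokenizer. The dataset/model names and vocab size are examples, not final choices, and a proper merge of BPE merge rules is more involved:

```python
# Sketch only: tokenizer training + naive vocabulary merge; names are examples.
from datasets import load_dataset
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")          # Falcon / MPT / GPT-J
ds = load_dataset("oscar", "unshuffled_deduplicated_th", split="train")

def batch_iterator(batch_size=1000):
    for i in range(0, len(ds), batch_size):
        yield ds[i:i + batch_size]["text"]

# Reuses the base tokenizer's pipeline (normalizer, pre-tokenizer, BPE model)
# but learns a fresh vocabulary from the Thai corpus.
thai_tok = base.train_new_from_iterator(batch_iterator(), vocab_size=32000)
thai_tok.save_pretrained("thai-bpe-tokenizer")

# Naive merge: append the Thai tokens missing from the base vocabulary.
new_tokens = [t for t in thai_tok.get_vocab() if t not in base.get_vocab()]
base.add_tokens(new_tokens)
# model.resize_token_embeddings(len(base))  # must be called before training
```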

Things to do

  1. Continue pretraining
    1.1 Wait for the Pantip + OSCAR dataset to be ready before starting training
    1.2 Train a BPE tokenizer from the Pantip + OSCAR dataset: https://huggingface.co/flax-community/gpt2-base-thai/blob/main/train_tokenizer.py
    1.3 Merge the BPE tokenizer with the Falcon, MPT, and GPT-J tokenizers
    1.4 Train with LoRA enabled on all 50B tokens for the three models (for MPT, wait for #180 to finish first)
  2. Train from scratch
    2.1 Train LLaMA 100M on 4 GPUs with the mC4 or OSCAR Thai dataset to reduce risk (LoRA disabled)
      • [ ] Check that we can resume training if training crashes
      • [ ] Check that we can generate proper results from LLaMA 100M
    2.2 Train GPT-NeoX with the Pythia-1B-Deduped configuration https://github.com/EleutherAI/pythia (refer to the "Reproducing Training" section); we need to change the batch size to match our GPU limitations (see the GPT-NeoX config sketch after this list)
      • [ ] Train for 1-10B tokens and check whether the validation score is similar to the Pythia paper https://arxiv.org/pdf/2304.01373.pdf

    2.3 Train a SentencePiece tokenizer with the Pantip + OSCAR dataset (wait for 1.1; see the SentencePiece sketch after this list)
    2.4 (After 2.1 and 2.2 finish) Train LLaMA 1B for 100B tokens (The Pile 50B + Pantip + OSCAR 50B) (LoRA disabled)
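
For 2.2, a sketch of pulling the Pythia-1B-Deduped architecture and trading per-device batch size for gradient accumulation; the per-device batch size and GPU count are placeholders for our setup:

```python
# Sketch only: GPT-NeoX from scratch with the published Pythia-1B-Deduped config.
from transformers import AutoConfig, AutoTokenizer, GPTNeoXForCausalLM

config = AutoConfig.from_pretrained("EleutherAI/pythia-1b-deduped")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b-deduped")

# Constructing from the config (not from_pretrained) gives random weights,
# i.e. training from scratch with the same architecture.
model = GPTNeoXForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")

# Pythia used a global batch of 1024 sequences x 2048 tokens; keep the same
# token budget by trading per-device batch size for gradient accumulation.
n_gpus = 4
per_device_batch = 8
grad_accum = 1024 // (per_device_batch * n_gpus)
```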
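
For 2.3, a minimal SentencePiece training sketch; the exported text file, vocab size, and model type are placeholders to be decided once the dataset is ready:

```python
# Sketch only: SentencePiece tokenizer training for 2.3; inputs are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="pantip_oscar_th.txt",    # plain-text export of the Pantip + OSCAR dataset
    model_prefix="thai_sp",
    vocab_size=32000,
    model_type="unigram",           # or "bpe" to mirror 1.2
    character_coverage=0.9995,      # keep rare Thai characters
)

sp = spm.SentencePieceProcessor(model_file="thai_sp.model")
print(sp.encode("สวัสดีครับ", out_type=str))
```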

Note: Upload all models and tokenizers to DVC.