OpenThaiGPT / openthaigpt-pretraining


[Super AI] Complete training pipeline #179

Open · ArthurMinovsky opened this issue 1 year ago

ArthurMinovsky commented 1 year ago

Setup:

  1. Make sure GPT-J, LLaMA, and Falcon can be trained on the current pipeline with (see the training-setup sketch after this list):
    • [x] Flash Attention enabled (prefer PyTorch 2.0)
    • [ ] torch.compile works
    • [x] Gradient checkpointing enabled
    • [x] DeepSpeed enabled
    • [x] LoRA enabled (no need for LLaMA)
    • [x] WandB enabled
    • [x] Training can be resumed if an error occurs during training
    • [x] Can load weights from Hugging Face (no need for LLaMA)
  2. Make sure GPT-J and Falcon can be used with the following optimizers (see the optimizer sketch after this list):
    • [ ] 8-bit Adam
    • [ ] 1-bit Adam
    • [x] Lion
    • [ ] 8-bit Adam with DeepSpeed stage 2
    • [ ] 1-bit Adam with DeepSpeed stage 2
    • [x] Lion with DeepSpeed stage 2
  3. Make sure our training pipeline contains the following (see the tokenizer sketch after this list):
    • [x] Script to train a tokenizer from a single Hugging Face dataset
    • [x] Script to convert a Hugging Face dataset to a tokenized Hugging Face dataset
    • [x] Script to train from a tokenized Hugging Face dataset
  4. Make sure the tokenizer merging code works with Falcon, MPT, and GPT-J (also covered by the tokenizer sketch)
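
A rough sketch of how the item-1 toggles could fit together around the Hugging Face Trainer. The model ID, tokenized dataset path, DeepSpeed config filename, and LoRA hyperparameters are placeholders, not our actual scripts:

```python
# Sketch only: all item-1 toggles in one place; names and paths are placeholders.
import torch
from datasets import load_from_disk
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_ID = "EleutherAI/gpt-j-6b"  # or a LLaMA / Falcon checkpoint

# Load weights from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Gradient checkpointing: recompute activations in the backward pass to save memory.
model.gradient_checkpointing_enable()

# LoRA: train low-rank adapters instead of all weights.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# torch.compile (PyTorch 2.0); Flash Attention is assumed to come from PyTorch 2.0's
# scaled_dot_product_attention path when the model's attention uses it.
model = torch.compile(model)

args = TrainingArguments(
    output_dir="ckpts",
    per_device_train_batch_size=4,
    bf16=True,
    deepspeed="ds_config_stage2.json",  # DeepSpeed enabled via a JSON config
    report_to="wandb",                  # WandB logging
    save_steps=500,                     # periodic checkpoints make resuming possible
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=load_from_disk("tokenized_dataset"),  # output of the item-3 script
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Resume from the last checkpoint in output_dir if a previous run crashed.
trainer.train(resume_from_checkpoint=True)
```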
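
For item 2, a sketch of one way to build the optimizers outside of DeepSpeed, plus an illustrative ZeRO stage-2 config. It assumes bitsandbytes for 8-bit Adam and the lion-pytorch package for Lion; learning rates and batch sizes are placeholders:

```python
# Sketch only: optimizer construction for item 2; hyperparameters are illustrative.
import bitsandbytes as bnb
from lion_pytorch import Lion

def build_optimizer(model, name: str, lr: float = 1e-4):
    params = [p for p in model.parameters() if p.requires_grad]
    if name == "adam8bit":
        # 8-bit Adam keeps optimizer state in 8-bit to cut memory.
        return bnb.optim.Adam8bit(params, lr=lr)
    if name == "lion":
        # Lion is usually run with a smaller lr and larger weight decay than AdamW.
        return Lion(params, lr=lr / 3, weight_decay=0.1)
    raise ValueError(f"unknown optimizer: {name}")

# Minimal ZeRO stage-2 config to pair with the optimizer above.
# 1-bit Adam would instead be selected inside the DeepSpeed config itself
# ("optimizer": {"type": "OneBitAdam", ...}); whether that combines cleanly with
# stage 2 is exactly what the unchecked boxes above are meant to verify.
ds_config_stage2 = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}
```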
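
For items 3-4, a sketch of training a Thai tokenizer from a Hugging Face dataset and doing a naive vocabulary merge into an existing tokenizer. The dataset/model names and vocab size are examples, not final choices, and a proper merge of BPE merge rules is more involved:

```python
# Sketch only: tokenizer training + naive vocabulary merge; names are examples.
from datasets import load_dataset
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")          # Falcon / MPT / GPT-J
ds = load_dataset("oscar", "unshuffled_deduplicated_th", split="train")

def batch_iterator(batch_size=1000):
    for i in range(0, len(ds), batch_size):
        yield ds[i:i + batch_size]["text"]

# Reuses the base tokenizer's pipeline (normalizer, pre-tokenizer, BPE model)
# but learns a fresh vocabulary from the Thai corpus.
thai_tok = base.train_new_from_iterator(batch_iterator(), vocab_size=32000)
thai_tok.save_pretrained("thai-bpe-tokenizer")

# Naive merge: append the Thai tokens missing from the base vocabulary.
new_tokens = [t for t in thai_tok.get_vocab() if t not in base.get_vocab()]
base.add_tokens(new_tokens)
# model.resize_token_embeddings(len(base))  # must be called before training
```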

Things to do

  1. Continue pretraining
    1.1 Wait for the Pantip + OSCAR dataset to be ready before starting training
    1.2 Train a BPE tokenizer from the Pantip + OSCAR dataset: https://huggingface.co/flax-community/gpt2-base-thai/blob/main/train_tokenizer.py
    1.3 Merge the BPE tokenizer with the Falcon, MPT, and GPT-J tokenizers
    1.4 Train with LoRA enabled on all 50B tokens for the three models (for MPT, wait for #180 to finish first)
  2. Train from scratch
    2.1 Train LLaMA 100M on 4 GPUs with the mC4 or OSCAR Thai dataset to reduce risk (LoRA disabled)
      • [ ] Check that we can resume training if training crashes
      • [ ] Check that we can generate proper results from LLaMA 100M
    2.2 Train GPT-NeoX with the Pythia-1B-Deduped configuration https://github.com/EleutherAI/pythia (refer to the "Reproducing Training" section); we need to change the batch size to match our GPU limitations (see the GPT-NeoX config sketch after this list)
      • [ ] Train for 1-10B tokens and check whether the validation score is similar to the Pythia paper https://arxiv.org/pdf/2304.01373.pdf

    2.3 Train a SentencePiece tokenizer with the Pantip + OSCAR dataset (wait for 1.1; see the SentencePiece sketch after this list)
    2.4 (After 2.1 and 2.2 finish) Train LLaMA 1B for 100B tokens (The Pile 50B + Pantip + OSCAR 50B) (LoRA disabled)
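
For 2.2, a sketch of pulling the Pythia-1B-Deduped architecture and trading per-device batch size for gradient accumulation; the per-device batch size and GPU count are placeholders for our setup:

```python
# Sketch only: GPT-NeoX from scratch with the published Pythia-1B-Deduped config.
from transformers import AutoConfig, AutoTokenizer, GPTNeoXForCausalLM

config = AutoConfig.from_pretrained("EleutherAI/pythia-1b-deduped")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b-deduped")

# Constructing from the config (not from_pretrained) gives random weights,
# i.e. training from scratch with the same architecture.
model = GPTNeoXForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")

# Pythia used a global batch of 1024 sequences x 2048 tokens; keep the same
# token budget by trading per-device batch size for gradient accumulation.
n_gpus = 4
per_device_batch = 8
grad_accum = 1024 // (per_device_batch * n_gpus)
```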
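
For 2.3, a minimal SentencePiece training sketch; the exported text file, vocab size, and model type are placeholders to be decided once the dataset is ready:

```python
# Sketch only: SentencePiece tokenizer training for 2.3; inputs are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="pantip_oscar_th.txt",    # plain-text export of the Pantip + OSCAR dataset
    model_prefix="thai_sp",
    vocab_size=32000,
    model_type="unigram",           # or "bpe" to mirror 1.2
    character_coverage=0.9995,      # keep rare Thai characters
)

sp = spm.SentencePieceProcessor(model_file="thai_sp.model")
print(sp.encode("สวัสดีครับ", out_type=str))
```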

Note: Upload all models and tokenizers to DVC.