Make sure GPT-J, LLaMA, and Falcon can be trained on the current pipeline with (a combined sketch follows this checklist):
[x] Flash Attention enabled (prefer PyTorch 2.0)
[ ] torch.compile works
[x] Gradient Checkpointing enabled
[x] DeepSpeed enabled
[x] LoRA enabled (not needed for LLaMA)
[x] WandB enabled
[x] Training can be resumed if an error occurs mid-run
[x] Can load weights from Hugging Face (not needed for LLaMA)
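A minimal sketch of what the checklist above means in code, assuming a Hugging Face `Trainer`-based pipeline; the model name, dataset path, DeepSpeed config path, and hyperparameters are placeholders, and the exact attention flag depends on the installed transformers version:

```python
# Sketch only: Falcon with PyTorch 2.0 SDPA/flash attention, gradient checkpointing,
# LoRA, DeepSpeed, WandB logging, and checkpoint resume. All paths/names are placeholders.
import torch
from datasets import load_from_disk
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",                      # load weights from Hugging Face
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",              # PyTorch 2.0 scaled-dot-product (flash) attention
)
model.gradient_checkpointing_enable()        # gradient checkpointing

lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["query_key_value"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)      # LoRA

args = TrainingArguments(
    output_dir="checkpoints/falcon-lora",
    per_device_train_batch_size=4,
    deepspeed="configs/ds_stage2.json",      # DeepSpeed (placeholder config path)
    report_to="wandb",                       # WandB logging
    save_steps=500,
)

# Labels/data collator omitted for brevity; the dataset comes from the tokenization script.
trainer = Trainer(model=model, args=args, train_dataset=load_from_disk("data/tokenized"))
trainer.train(resume_from_checkpoint=True)   # resume from the last checkpoint after a crash
```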
Make sure GPT-J and Falcon can be used with the following optimizers (see the sketch after this list):
[ ] 8-Bit Adam
[ ] 1-Bit Adam
[x] Lion
[ ] 8-Bit Adam with DeepSpeed stage 2
[ ] 1-Bit Adam with DeepSpeed stage 2
[x] Lion with DeepSpeed stage 2
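A minimal sketch of how these optimizers could be wired in, assuming the bitsandbytes and lion-pytorch packages and the `model` from the sketch above; 1-bit Adam and ZeRO stage 2 are set through the DeepSpeed JSON config rather than a Python class, and all hyperparameters are placeholders:

```python
# Sketch only: optimizer alternatives for GPT-J/Falcon. `model` is assumed to exist
# (see the training sketch above); learning rates and other values are placeholders.
import bitsandbytes as bnb
from lion_pytorch import Lion

adam_8bit = bnb.optim.Adam8bit(model.parameters(), lr=1e-4, betas=(0.9, 0.95))  # 8-bit Adam
lion = Lion(model.parameters(), lr=1e-4, weight_decay=0.01)                     # Lion

# 1-bit Adam with ZeRO stage 2 goes in the DeepSpeed config instead (placeholder values):
ds_config = {
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "OneBitAdam", "params": {"lr": 1e-4, "freeze_step": 400}},
}
```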
Make sure our training pipeline contains the following
[x] Script to train a tokenizer from a single Hugging Face dataset
[x] Script to convert one Hugging Face dataset to a tokenized Hugging Face dataset (see the sketch below)
[x] Script to train from a tokenized Hugging Face dataset
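A minimal sketch of the dataset-conversion step, assuming the OSCAR Thai split as the corpus; the tokenizer path, sequence length, and output path are placeholders:

```python
# Sketch only: convert one Hugging Face dataset into a tokenized dataset on disk.
from datasets import load_dataset
from transformers import AutoTokenizer

raw = load_dataset("oscar", "unshuffled_deduplicated_th", split="train")  # placeholder corpus
tokenizer = AutoTokenizer.from_pretrained("tokenizers/thai-bpe")          # placeholder path

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
tokenized.save_to_disk("data/tokenized")   # consumed by the training script
```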
Make sure the tokenizer merging code works with Falcon, MPT, and GPT-J (one possible approach is sketched below)
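One possible merging strategy, sketched as an assumption rather than our actual merging code: add the Thai tokenizer's vocabulary to the base tokenizer and resize the embedding matrix (all paths are placeholders):

```python
# Sketch only: naive vocabulary merge via add_tokens; new embedding rows start random.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
thai_tok = AutoTokenizer.from_pretrained("tokenizers/thai-bpe")   # placeholder path

new_tokens = [t for t in thai_tok.get_vocab() if t not in base_tok.get_vocab()]
num_added = base_tok.add_tokens(new_tokens)

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")
model.resize_token_embeddings(len(base_tok))   # make room for the added Thai tokens
base_tok.save_pretrained("tokenizers/falcon-thai-merged")
print(f"added {num_added} tokens; new vocab size {len(base_tok)}")
```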
Things to do
Continue pretraining
1.1 Wait for Pantip + OSCAR dataset to be ready before starting training
1.2 Train a BPE tokenizer from the Pantip + OSCAR dataset https://huggingface.co/flax-community/gpt2-base-thai/blob/main/train_tokenizer.py (see the sketch after this list)
1.3 Merge BPE tokenizer with Falcon, MPT, GPT-J tokenizer
1.4 Train all three models with LoRA enabled on the full 50B tokens (for MPT, wait for #180 to finish first)
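A minimal sketch of step 1.2, loosely following the linked train_tokenizer.py; the corpus name, vocab size, and output path are placeholders:

```python
# Sketch only: train a byte-level BPE tokenizer from a Hugging Face dataset.
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

dataset = load_dataset("oscar", "unshuffled_deduplicated_th", split="train")  # placeholder corpus

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=50257,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save("thai-bpe-tokenizer.json")
```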
Train from scratch
2.1 Train LLaMA 100M on 4 GPUs with the mC4 or OSCAR Thai dataset to reduce risk (LoRA disabled; see the config sketch after this sub-list)
[ ] Check that we can resume training if the run crashes
[ ] Check that LLaMA 100M generates reasonable outputs
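A minimal sketch of the 2.1 model definition; the dimensions below are assumptions chosen to land near 100M parameters (about 95M with tied embeddings), not a settled architecture:

```python
# Sketch only: a roughly 100M-parameter LLaMA trained from scratch (no LoRA).
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=768,
    intermediate_size=2048,
    num_hidden_layers=10,
    num_attention_heads=12,
    max_position_embeddings=2048,
    tie_word_embeddings=True,
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```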
2.2 Train GPT-NeoX with the Pythia-1B-Deduped configuration https://github.com/EleutherAI/pythia (refer to the "Reproducing Training" section). We need to change the batch size to match our GPU limitations (see the arithmetic sketch below).
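The batch-size adjustment in 2.2 is just arithmetic: keep Pythia's reported global batch size (1024 sequences of 2048 tokens) by raising gradient accumulation to cover for fewer GPUs. The GPU count and per-GPU micro-batch below are placeholders:

```python
# Sketch only: derive gradient_accumulation_steps so the global batch matches Pythia's.
GLOBAL_BATCH = 1024   # sequences per optimizer step in the Pythia-1B config
MICRO_BATCH = 8       # sequences per GPU per forward pass (placeholder)
NUM_GPUS = 4          # our hardware limit (placeholder)

grad_accum_steps = GLOBAL_BATCH // (MICRO_BATCH * NUM_GPUS)
assert MICRO_BATCH * NUM_GPUS * grad_accum_steps == GLOBAL_BATCH
print(f"gradient_accumulation_steps = {grad_accum_steps}")   # -> 32
```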
2.3 Train a SentencePiece tokenizer with the Pantip + OSCAR dataset (wait for 1.1; see the sketch below)
2.4 (After 2.1 and 2.2 finish) Train LLaMA 1B for 100B tokens (The Pile 50B + Pantip/OSCAR 50B) (LoRA disabled)
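A minimal sketch of step 2.3, assuming the corpus is exported to a plain-text file first; the input path, vocab size, model type, and character coverage are placeholders:

```python
# Sketch only: train a SentencePiece tokenizer on the Pantip + OSCAR text dump.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/pantip_oscar.txt",      # one document/sentence per line (placeholder path)
    model_prefix="thai_sp",             # writes thai_sp.model and thai_sp.vocab
    vocab_size=32000,
    model_type="bpe",                   # or "unigram"
    character_coverage=0.9995,          # keep rare Thai characters in the vocab
)
```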
Notes: Upload all models and tokenizers to DVC.