Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Catastrophic forgetting occurs when I perform continued pre-training on Llama 3 #1517

Closed · BestJiayi closed this issue 3 months ago

BestJiayi commented 3 months ago

My question

Why does catastrophic forgetting occur when I perform continued pre-training on Llama 3? I used open-source data from BookCorpus and trained for 100,000 steps; when I then tested the trained model with litgpt chat, the output became chaotic and the original knowledge of Llama 3 was forgotten. However, there was no problem when I trained for only 200 steps. Is this a bug? Below are my commands and screenshots of the data. This problem has troubled me for three weeks, and I have tried many methods without solving it. I am eagerly looking forward to your feedback. Thanks!

My commands

```bash
litgpt pretrain Meta-Llama-3-8B-Instruct \
  --tokenizer_dir checkpoints/meta-llama/Meta-Llama-3-8B-Instruct \
  --initial_checkpoint_dir checkpoints/meta-llama/Meta-Llama-3-8B-Instruct \
  --data TextFiles \
  --data.train_data_path "/workspace/pretrain/custom_texts" \
  --train.max_tokens 2152000000 \
  --train.max_seq_length 2048 \
  --train.lr_warmup_steps 30000 \
  --train.min_lr 4e-08 \
  --train.global_batch_size 8 \
  --train.micro_batch_size 1 \
  --train.log_interval 10 \
  --train.save_interval 100 \
  --eval.interval 50 \
  --eval.final_validation true \
  --out_dir out/custom-model
```

My data

[Two screenshots showing samples of the custom training data]

My litgpt chat result

When testing with litgpt chat, for example with the prompt "hi", the model starts to generate nonsense continuously and does not stop. [Screenshot of the litgpt chat output]

BestJiayi commented 3 months ago

Does the data for continued pre-training have to be in the TextFiles format? Can't it be in JSON format?

rasbt commented 3 months ago

Thanks for the informative post and for showing these hands-on examples; that's very helpful to see. Regarding catastrophic forgetting, what may help is adding ~5% of the original training data back into the continued pretraining dataset. There was a recent paper that looked into that. I've written a high-level overview here: https://magazine.sebastianraschka.com/i/142924793/simple-and-scalable-strategies-to-continually-pre-train-large-language-models
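For illustration, here is a minimal sketch of that replay idea, assuming you already have a directory of plain-text files from a general-domain corpus (the general_corpus/ path below is a placeholder): copy a random ~5% of those files into the directory that --data.train_data_path points to, so they get mixed into the continued-pretraining data.

```bash
# Sketch only: general_corpus/ is a hypothetical directory of .txt files from a
# public general-domain corpus; custom_texts/ is the directory passed to
# --data.train_data_path in the pretrain command above.
N_TOTAL=$(ls general_corpus/*.txt | wc -l)
N_REPLAY=$(( N_TOTAL / 20 ))   # roughly 5% of the files as replay data
ls general_corpus/*.txt | shuf -n "$N_REPLAY" | xargs -I{} cp {} /workspace/pretrain/custom_texts/
```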

> Does the data for continued pre-training have to be in the TextFiles format? Can't it be in JSON format?

That's just the one I implemented because I thought it might be easier for people (I assume more people know what a text file is compared to JSON). But if you are up to it, please feel free to contribute a JSON dataset approach analogous to the TextFiles approach -- that would be welcome. (PS: We actually do support JSON for instruction finetuning.)
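In the meantime, a possible workaround is to flatten a JSON corpus into plain-text files that the existing TextFiles module can already read. A sketch, assuming a hypothetical corpus.json that is an array of objects with a "text" field and that jq is installed:

```bash
# Extract the "text" field of every record and write the result into the TextFiles
# data directory as a plain .txt file. File name and JSON structure are assumptions.
jq -r '.[].text' corpus.json > /workspace/pretrain/custom_texts/from_json.txt
```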

BestJiayi commented 3 months ago

@rasbt Thank you very much for your detailed explanation; I will study the article carefully now. As for the raw training data of Llama 3, I actually don't know where to get it, and I have been searching for it for a long time. Thanks for your answer.

rasbt commented 3 months ago

Yes, getting the original data for Llama 3 is impossible for non-Meta people, unfortunately. But I think it's safe to assume that the dataset more or less overlaps with public datasets like RefinedWeb.
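If it helps, a hedged sketch for pulling a single RefinedWeb shard from the Hugging Face Hub as a starting point for such replay data (the --include pattern is an assumption about the repo's shard naming, so check the file listing on the Hub; the shards are Parquet and would still need to be converted to plain .txt files before the TextFiles module can use them):

```bash
# Download one shard of the tiiuae/falcon-refinedweb dataset (requires the
# huggingface_hub CLI); adjust the --include pattern to the actual shard names.
huggingface-cli download tiiuae/falcon-refinedweb \
  --repo-type dataset \
  --include "data/train-00000-*" \
  --local-dir /workspace/pretrain/refinedweb_raw
```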

BestJiayi commented 3 months ago

@rasbt Thanks! I'll download RefinedWeb to test its effectiveness and observe whether training solely on RefinedWeb will result in catastrophic forgetting.

rasbt commented 3 months ago

Good idea, this would be a good control experiment.

BestJiayi commented 3 months ago

Thanks!

Andrei-Aksionov commented 3 months ago

What if, instead of continued pretraining, you do fine-tuning with LoRA? I believe there is a paper showing that LoRA forgets less (I saw Sebastian's post on LinkedIn about it). Although the paper compared LoRA against a proper full fine-tuning, I think there is no reason not to at least try it for continued pretraining as well (which is basically just a different name for the same thing).
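For reference, a rough sketch of what that could look like with litgpt's LoRA entry point (exact flag names can differ between litgpt versions, and this path expects instruction-style data such as the JSON module, so applying it to raw-text continued pretraining would need some adaptation; my_dataset.json is a placeholder):

```bash
# Sketch only: LoRA fine-tuning of the same base checkpoint on a JSON dataset.
litgpt finetune_lora checkpoints/meta-llama/Meta-Llama-3-8B-Instruct \
  --data JSON \
  --data.json_path my_dataset.json \
  --data.val_split_fraction 0.1 \
  --lora_r 16 \
  --lora_alpha 32 \
  --out_dir out/lora-model
```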

BestJiayi commented 3 months ago

@Andrei-Aksionov Indeed, I agree that the LoRA fine-tuning approach can definitely be applied to continued pre-training. I will carefully read this paper. Thanks!

Andrei-Aksionov commented 3 months ago

In addition, you can take a look at the Studio template. There, @awaelchli discusses the topic of the learning rate and forgetting (see the "Warmup" chapter).

BestJiayi commented 3 months ago

@Andrei-Aksionov Okay, thank you for sharing. I will check out the Studio template to learn more.