Closed: BestJiayi closed this issue 3 months ago
Does the format of continued pre-training have to be TextFiles? Can't it be in JSON format?
Thanks for the informative post and showing these hands-on examples, that's very helpful to see. Regarding catastrophic forgetting, what may help is adding ~5% of the original training data back into the continued pretraining dataset. There was a recent paper that looked into that. I've written a high-level overview here: https://magazine.sebastianraschka.com/i/142924793/simple-and-scalable-strategies-to-continually-pre-train-large-language-models
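In code, that replay-mixing step could look roughly like this (a minimal sketch; the directory names and the exact 5% ratio are illustrative assumptions, not taken from the paper):

```python
import random
from pathlib import Path

random.seed(42)

# New-domain documents used for continued pretraining
new_docs = list(Path("custom_texts").glob("*.txt"))

# Documents resembling the original pretraining distribution
# (assumed to contain at least n_replay files)
original_docs = list(Path("original_corpus").glob("*.txt"))

# Add back ~5% (relative to the new data) of original-distribution documents
n_replay = max(1, int(0.05 * len(new_docs)))
mixed = new_docs + random.sample(original_docs, k=n_replay)
random.shuffle(mixed)

# Write the mixed corpus to a directory usable with --data TextFiles
out_dir = Path("mixed_texts")
out_dir.mkdir(exist_ok=True)
for i, src in enumerate(mixed):
    (out_dir / f"{i:06d}.txt").write_text(src.read_text())
```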
> Does the format of continued pre-training have to be TextFiles? Can't it be in JSON format?
That's just the one I implemented because I thought it might be easier for people (I assume more people know what a text file is compared to JSON). But if you are up to it, please feel free to contribute a JSON dataset approach analogous to the TextFiles approach -- that'd be welcome. (PS: We actually do support JSON for instruction finetuning.)
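Until then, one workaround is to convert a JSON corpus into plain-text files that the existing TextFiles loader accepts. A minimal sketch, assuming the JSON file is a list of objects with a "text" field:

```python
import json
from pathlib import Path

# Assumed input format: [{"text": "first document ..."}, {"text": "..."}]
records = json.loads(Path("my_corpus.json").read_text())

out_dir = Path("custom_texts")
out_dir.mkdir(exist_ok=True)

# One document per .txt file, as expected by --data TextFiles
for i, record in enumerate(records):
    (out_dir / f"doc_{i:06d}.txt").write_text(record["text"])
```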
@rasbt Thank you very much for your detailed explanation; I will study the article carefully now. As for the raw training data of Llama 3, I actually don't know where to get it, and I have been searching for it for a long time. Thanks for your answer.
Yes, getting the original data for Llama 3 is impossible for people outside of Meta, unfortunately. But I think it's safe to assume that the dataset more or less overlaps with public datasets like RefinedWeb.
@rasbt Thanks! I'll download RefinedWeb to test its effectiveness and observe whether training solely on RefinedWeb will result in catastrophic forgetting.
Good idea, this would be a good control experiment.
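If it helps, here is a rough sketch for streaming a subset of RefinedWeb into text files for the TextFiles loader (a minimal sketch, assuming the tiiuae/falcon-refinedweb dataset on the Hugging Face Hub with its "content" column; the sample count is arbitrary):

```python
from pathlib import Path
from datasets import load_dataset

# Stream instead of downloading the full multi-terabyte dataset
ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

out_dir = Path("refinedweb_texts")
out_dir.mkdir(exist_ok=True)

# Write an arbitrary subset; adjust n_docs to the desired token budget
n_docs = 10_000
for i, sample in enumerate(ds):
    if i >= n_docs:
        break
    (out_dir / f"rw_{i:06d}.txt").write_text(sample["content"])
```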
Thanks!
What if, instead of continued pretraining, you do finetuning with LoRA? I believe there is a paper showing that LoRA forgets less (I saw Sebastian's post on LinkedIn about it). Though in that paper they compared LoRA against proper full finetuning, I think there is no reason not to at least try it for pretraining as well (which is basically just a different name for the same procedure).
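As a starting point, something along these lines might work (a hedged sketch: litgpt's LoRA entry point is built around instruction finetuning, so the subcommand and flags below are assumptions that may differ across litgpt versions):

```bash
# Assumed subcommand and flags; verify against your litgpt version.
# litgpt's LoRA path expects an instruction-style dataset, so raw-text
# continued pretraining may require a custom data module.
litgpt finetune_lora checkpoints/meta-llama/Meta-Llama-3-8B-Instruct \
  --lora_r 16 \
  --lora_alpha 32 \
  --data JSON \
  --data.json_path my_data.json \
  --train.max_seq_length 2048 \
  --out_dir out/custom-model-lora
```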
@Andrei-Aksionov Indeed, I agree that the LoRA fine-tuning approach can definitely be applied to continued pre-training. I will carefully read this paper. Thanks!
In addition, you can take a look at the Studio template, where @awaelchli discussed the topic of learning rate and forgetting (in the Warm-up chapter).
@Andrei-Aksionov Okay, thank you for sharing. I will check out the Studio template to learn more.
My question
Why does catastrophic forgetting occur when I perform continued pre-training on Llama 3? I used open-source data from BookCorpus and trained for 100,000 steps; when I then test the trained model with litgpt chat, the output becomes chaotic and the original knowledge of Llama 3 is forgotten. However, there was no problem when I trained for only 200 steps. Is this a bug? Below are my commands and screenshots of the data. This problem has troubled me for three weeks, and I have tried many methods without solving it. I am eagerly looking forward to your feedback. Thanks!
My commands
```bash
litgpt pretrain Meta-Llama-3-8B-Instruct \
  --tokenizer_dir checkpoints/meta-llama/Meta-Llama-3-8B-Instruct \
  --initial_checkpoint_dir checkpoints/meta-llama/Meta-Llama-3-8B-Instruct \
  --data TextFiles \
  --data.train_data_path "/workspace/pretrain/custom_texts" \
  --train.max_tokens 2152000000 \
  --train.max_seq_length 2048 \
  --train.lr_warmup_steps 30000 \
  --train.min_lr 4e-08 \
  --train.global_batch_size 8 \
  --train.micro_batch_size 1 \
  --train.log_interval 10 \
  --train.save_interval 100 \
  --eval.interval 50 \
  --eval.final_validation true \
  --out_dir out/custom-model
```
My data
My litgpt chat result
When testing with litgpt chat, for example with the prompt "hi", the model starts to generate nonsense continuously and does not stop.