YixinSong-e closed this issue 3 months ago
What training configuration did you use? We trained using 4xA100 40G. Below is our training configuration.
```python
train_config = {
    "lr": 3e-5,
    "bs": 4,
    "gradient_accumulation_steps": 1,
    "datapath": f"{args.tmpdir}",
    "is_warmup": True,
    "num_epochs": 200,
    "num_warmup_steps": 2000,
    "total_steps": 800000,
    "p_w": 0.1,
    "v_w": 1.0,
    "head_w": 0.1,
    "num_workers": 2,
    "embeding": True,
    "act": "No",
    "data_noise": True,
    "noise": "uniform",
    "mean": 0.0,
    "std": 0.2,
    "residual": "true,norm",
    "max_len": 1200,
    "config_path": "config.json",
    "b1": 0.9,
    "b2": 0.95,
    "grad_clip": 0.5,
}
```
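For reference, a minimal sketch (not the repo's actual code; the helper names are mine) of how a couple of these hyperparameters interact: the linear warmup implied by `is_warmup`/`num_warmup_steps`, and the effective batch size across the 4 GPUs.

```python
def warmup_lr(step, base_lr=3e-5, num_warmup_steps=2000):
    """Linear warmup to base_lr, then constant, as is_warmup=True suggests."""
    if step < num_warmup_steps:
        return base_lr * step / num_warmup_steps
    return base_lr

def effective_batch_size(bs=4, gradient_accumulation_steps=1, num_gpus=4):
    """Samples contributing to each optimizer step across all GPUs."""
    return bs * gradient_accumulation_steps * num_gpus

print(warmup_lr(1000))         # halfway through warmup -> 1.5e-05
print(effective_batch_size())  # 4 * 1 * 4 -> 16
```

With `bs=4` on 4 GPUs and no accumulation, each optimizer step sees 16 samples; raising `gradient_accumulation_steps` keeps that number while shrinking per-step memory.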
I face the same error when training BlueLM-7B-Chat on 4 A100 GPUs with 80G memory each. OOM happens when running accelerator.backward(); below is the output of torch.cuda.memory_summary():
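Not from this repo, but a few generic mitigations that are commonly tried when the backward pass OOMs (the allocator flag is a standard PyTorch caching-allocator option; the config keys refer to the train_config above):

```shell
# Reduce allocator fragmentation before launching training
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# In train_config: lower "bs" or "max_len", or raise
# "gradient_accumulation_steps" to keep the same effective
# batch size while shrinking per-step activation memory.
```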
I suspect it might be due to the following reasons:
Thanks for the answer, nice! The training process is running ~^_^
Hope the training finishes smoothly.
I have a fine-tuned Llama-70B model, but I can't run this project due to OOM, even though I have 8 80G A100s.