YixinSong-e closed this issue 3 months ago
What training configuration did you use? We trained using 4xA100 40G. Below is our training configuration.
```python
train_config = {
    "lr": 3e-5,
    "bs": 4,
    "gradient_accumulation_steps": 1,
    "datapath": f"{args.tmpdir}",
    "is_warmup": True,
    "num_epochs": 200,
    "num_warmup_steps": 2000,
    "total_steps": 800000,
    "p_w": 0.1,
    "v_w": 1.0,
    "head_w": 0.1,
    "num_workers": 2,
    "embeding": True,
    "act": "No",
    "data_noise": True,
    "noise": "uniform",
    "mean": 0.0,
    "std": 0.2,
    "residual": "true,norm",
    "max_len": 1200,
    "config_path": "config.json",
    "b1": 0.9,
    "b2": 0.95,
    "grad_clip": 0.5,
}
```
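For reference, a minimal sketch (not the repo's actual code; the helper names are mine) of how a couple of these hyperparameters interact: the linear warmup implied by `is_warmup`/`num_warmup_steps`, and the effective batch size across the 4 GPUs.

```python
def warmup_lr(step, base_lr=3e-5, num_warmup_steps=2000):
    """Linear warmup to base_lr, then constant, as is_warmup=True suggests."""
    if step < num_warmup_steps:
        return base_lr * step / num_warmup_steps
    return base_lr

def effective_batch_size(bs=4, gradient_accumulation_steps=1, num_gpus=4):
    """Samples contributing to each optimizer step across all GPUs."""
    return bs * gradient_accumulation_steps * num_gpus

print(warmup_lr(1000))         # halfway through warmup -> 1.5e-05
print(effective_batch_size())  # 4 * 1 * 4 -> 16
```

With `bs=4` on 4 GPUs and no accumulation, each optimizer step sees 16 samples; raising `gradient_accumulation_steps` keeps that number while shrinking per-step memory.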
I face the same error when training BlueLM-7B-Chat on 4 A100 GPUs with 80G memory each. OOM happens when running accelerator.backward(); below is the output of torch.cuda.memory_summary():
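Not from this repo, but a few generic mitigations that are commonly tried when the backward pass OOMs (the allocator flag is a standard PyTorch caching-allocator option; the config keys refer to the train_config above):

```shell
# Reduce allocator fragmentation before launching training
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# In train_config: lower "bs" or "max_len", or raise
# "gradient_accumulation_steps" to keep the same effective
# batch size while shrinking per-step activation memory.
```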
I suspect it might be due to the following reasons:
Thanks for the answer, nice! The training process is running ~^_^
Hope the training finishes smoothly.
I have a fine-tuned Llama-70B model, but I can't run this project due to OOM, even though I have 8 80G A100s.