Closed janelu9 closed 3 months ago
So magical. I guess there may be something wrong with the optimizer
Hello~, I might have encountered the same issue. What is your problem background: pretraining or further pretraining? The Llama family? What is the loss in the following steps?
The loss decreased only slowly in the following steps. I load llama3-8b's weights with my custom method; the model seemed to load the weights successfully, but I think the model's weights were all reset randomly at the second step.
Did you ever train llama2 or 3 for long enough? What is your loss now? Does the loss decrease to near zero? That is what is happening to me.
No. When I trained llama3 with another engine, the loss never increased this much on the same dataset. Maybe there are some bugs in my custom method.
Probably we all need someone more professional to tackle it.
I really believe that the params are reset randomly at the second iteration.
I further pretrained llama2 for some steps on NVIDIA GPUs, which was fine, but encountered a similar issue on another type of GPU.
I set the learning rate as small as 1e-9, and the loss still increased exponentially at the second iteration, which confirmed my guess.
Well, I have solved my problem by loading the pretrained weights before the optimizer is built.
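For context, here is my guess as to why the order matters (my own hypothesis, not confirmed in this thread): mixed-precision optimizers like Megatron's keep fp32 master copies of the parameters, captured once when the optimizer is built. If you overwrite the model weights after that, the first `optimizer.step()` writes the stale master copies back into the model, which looks exactly like the weights being "reset" at the second iteration. A minimal pure-Python sketch of that failure mode (the class is a toy stand-in, not Megatron's actual optimizer):

```python
class MixedPrecisionOptimizer:
    """Toy optimizer that, like Megatron's, keeps master copies
    of the parameters snapshotted at construction time."""

    def __init__(self, model_params):
        self.model_params = model_params
        # Master copies are captured here, exactly once.
        self.master_params = list(model_params)

    def step(self, grads, lr=0.1):
        # Update the master copies, then copy them back into the model.
        self.master_params = [m - lr * g
                              for m, g in zip(self.master_params, grads)]
        self.model_params[:] = self.master_params


# Wrong order: build the optimizer first, then load pretrained weights.
params = [0.0, 0.0]                 # random init (simplified)
opt = MixedPrecisionOptimizer(params)
params[:] = [1.0, 2.0]              # "load" pretrained weights too late
opt.step(grads=[0.0, 0.0])          # first step, zero gradients
print(params)                       # -> [0.0, 0.0]: weights "reset"!

# Right order: load weights before the optimizer is built.
params = [1.0, 2.0]
opt = MixedPrecisionOptimizer(params)
opt.step(grads=[0.0, 0.0])
print(params)                       # -> [1.0, 2.0]: weights preserved
```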
Can you tell me the details, such as the lines of code you changed? Much appreciated!!
```python
def setup_model_and_optimizer(model_provider_func,
                              model_type,
                              no_wd_decay_cond=None,
                              scale_lr_cond=None,
                              lr_mult=1.0):
    args = get_args()
    timers = get_timers()
    model = get_model(model_provider_func, model_type)
    # if args.load_hf_model:
    #     load_hf_model(model, args.model, args.cache_model)
    #     if args.only_cache_model:
    #         sys.exit()
    unwrapped_model = unwrap_model(model)
    kwargs = {}
    for f in dataclasses.fields(OptimizerConfig):
        if hasattr(args, f.name):
            kwargs[f.name] = getattr(args, f.name)
    config = OptimizerConfig(**kwargs)
    config.timers = timers
    optimizer = get_megatron_optimizer(config, model, no_wd_decay_cond,
                                       scale_lr_cond, lr_mult)
    opt_param_scheduler = get_optimizer_param_scheduler(optimizer)
    ...
```
I modified setup_model_and_optimizer in megatron.training.training as shown in the commented-out lines. load_hf_model is my custom weight-loading method, which can load weights directly from HuggingFace.
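load_hf_model itself isn't shown in the thread. As a rough illustration of what such a loader has to do, the core job is renaming HuggingFace checkpoint keys to Megatron's parameter layout before copying tensors in. Every name in the mapping below is an illustrative assumption; the real mapping depends on the Megatron version and is not the actual code:

```python
# Hypothetical HF -> Megatron key remap; the exact names depend on the
# Megatron-LM version and are NOT taken from this thread.
HF_TO_MEGATRON = {
    "model.embed_tokens.weight": "embedding.word_embeddings.weight",
    "model.norm.weight": "decoder.final_layernorm.weight",
    "lm_head.weight": "output_layer.weight",
}

def remap_hf_state_dict(hf_state_dict):
    """Rename HuggingFace parameter keys to the (assumed) Megatron
    keys, leaving any unmapped keys untouched."""
    return {HF_TO_MEGATRON.get(k, k): v for k, v in hf_state_dict.items()}

hf_sd = {"model.embed_tokens.weight": "W_emb", "lm_head.weight": "W_out"}
print(remap_hf_state_dict(hf_sd))
# -> {'embedding.word_embeddings.weight': 'W_emb',
#     'output_layer.weight': 'W_out'}
```

A real loader additionally has to handle per-layer keys, tensor-parallel sharding, and QKV weight interleaving, which is why a custom method like this is easy to get subtly wrong.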
```python
    if args.load is not None or args.pretrained_checkpoint is not None:
        timers('load-checkpoint', log_level=0).start(barrier=True)
        args.iteration, args.num_floating_point_operations_so_far = load_checkpoint(
            model, optimizer, opt_param_scheduler)
        timers('load-checkpoint').stop(barrier=True)
        timers.log(['load-checkpoint'])
    else:
        args.iteration = 0
        args.num_floating_point_operations_so_far = 0
```
Thanks, but how should I deal with the following code?
```python
if args.load is not None or args.pretrained_checkpoint is not None:
    timers('load-checkpoint', log_level=0).start(barrier=True)
    args.iteration, args.num_floating_point_operations_so_far = load_checkpoint(
        model, optimizer, opt_param_scheduler)
    timers('load-checkpoint').stop(barrier=True)
    timers.log(['load-checkpoint'])
else:
    args.iteration = 0
    args.num_floating_point_operations_so_far = 0
```
This code doesn't conflict with the change; it loads weights from a checkpoint. I don't have any checkpoint, I just want to train the model from HuggingFace's weights.
OK, Got you. Thanks for your help~
@janelu9 Sorry to bother you, just one last question: if you turn off flash attention, does the loss at the first step of further pretraining change?
Maybe, depending on the micro-batch data.
why?