Closed janelu9 closed 3 months ago
So magical. I guess there may be something wrong with the optimizer
Hello~, I might have encountered the same issue. What is your problem background: pretraining or further pretraining? The Llama family? What is the loss in the following steps?
The loss decreased only slowly in the following steps. I load llama3-8b's weights with my custom method; the model seemed to load the weights successfully, but I think the model's weights were all reset randomly at the second step.
Did you ever train llama2 or 3 for long enough? What is your loss now? Does the loss decrease to near zero? That is what is happening to me.
No. When I trained llama3 with another engine, the loss never increased this much on the same dataset. Maybe there are some bugs in my custom method.
Probably we all need someone more professional to tackle it.
I really believe that the params are reset randomly at the second iteration.
I further pretrained llama2 for some steps on NVIDIA GPUs, which was fine, but encountered a similar issue on another type of GPU.
I set the learning rate as small as 1e-9, and the loss still increased exponentially at the second iteration, which confirmed my guess.
Well, I have solved my problem by loading the pretrained weights before the optimizer is built.
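For context, here is my guess as to why the order matters (my own hypothesis, not confirmed in this thread): mixed-precision optimizers like Megatron's keep fp32 master copies of the parameters, captured once when the optimizer is built. If you overwrite the model weights after that, the first `optimizer.step()` writes the stale master copies back into the model, which looks exactly like the weights being "reset" at the second iteration. A minimal pure-Python sketch of that failure mode (the class is a toy stand-in, not Megatron's actual optimizer):

```python
class MixedPrecisionOptimizer:
    """Toy optimizer that, like Megatron's, keeps master copies
    of the parameters snapshotted at construction time."""

    def __init__(self, model_params):
        self.model_params = model_params
        # Master copies are captured here, exactly once.
        self.master_params = list(model_params)

    def step(self, grads, lr=0.1):
        # Update the master copies, then copy them back into the model.
        self.master_params = [m - lr * g
                              for m, g in zip(self.master_params, grads)]
        self.model_params[:] = self.master_params


# Wrong order: build the optimizer first, then load pretrained weights.
params = [0.0, 0.0]                 # random init (simplified)
opt = MixedPrecisionOptimizer(params)
params[:] = [1.0, 2.0]              # "load" pretrained weights too late
opt.step(grads=[0.0, 0.0])          # first step, zero gradients
print(params)                       # -> [0.0, 0.0]: weights "reset"!

# Right order: load weights before the optimizer is built.
params = [1.0, 2.0]
opt = MixedPrecisionOptimizer(params)
opt.step(grads=[0.0, 0.0])
print(params)                       # -> [1.0, 2.0]: weights preserved
```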
Can you tell me the details, such as the lines of code you changed? Much appreciated!!
```python
def setup_model_and_optimizer(model_provider_func,
                              model_type,
                              no_wd_decay_cond=None,
                              scale_lr_cond=None,
                              lr_mult=1.0):
    args = get_args()
    timers = get_timers()
    model = get_model(model_provider_func, model_type)
    # if args.load_hf_model:
    #     load_hf_model(model, args.model, args.cache_model)
    #     if args.only_cache_model:
    #         sys.exit()
    unwrapped_model = unwrap_model(model)
    kwargs = {}
    for f in dataclasses.fields(OptimizerConfig):
        if hasattr(args, f.name):
            kwargs[f.name] = getattr(args, f.name)
    config = OptimizerConfig(**kwargs)
    config.timers = timers
    optimizer = get_megatron_optimizer(config, model, no_wd_decay_cond,
                                       scale_lr_cond, lr_mult)
    opt_param_scheduler = get_optimizer_param_scheduler(optimizer)
    ...
```
I modified setup_model_and_optimizer in megatron.training.training as shown in the commented-out lines. load_hf_model is my custom weight-loading method, which can load weights directly from HuggingFace.
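load_hf_model itself isn't shown in the thread. As a rough illustration of what such a loader has to do, the core job is renaming HuggingFace checkpoint keys to Megatron's parameter layout before copying tensors in. Every name in the mapping below is an illustrative assumption; the real mapping depends on the Megatron version and is not the actual code:

```python
# Hypothetical HF -> Megatron key remap; the exact names depend on the
# Megatron-LM version and are NOT taken from this thread.
HF_TO_MEGATRON = {
    "model.embed_tokens.weight": "embedding.word_embeddings.weight",
    "model.norm.weight": "decoder.final_layernorm.weight",
    "lm_head.weight": "output_layer.weight",
}

def remap_hf_state_dict(hf_state_dict):
    """Rename HuggingFace parameter keys to the (assumed) Megatron
    keys, leaving any unmapped keys untouched."""
    return {HF_TO_MEGATRON.get(k, k): v for k, v in hf_state_dict.items()}

hf_sd = {"model.embed_tokens.weight": "W_emb", "lm_head.weight": "W_out"}
print(remap_hf_state_dict(hf_sd))
# -> {'embedding.word_embeddings.weight': 'W_emb',
#     'output_layer.weight': 'W_out'}
```

A real loader additionally has to handle per-layer keys, tensor-parallel sharding, and QKV weight interleaving, which is why a custom method like this is easy to get subtly wrong.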
```python
    if args.load is not None or args.pretrained_checkpoint is not None:
        timers('load-checkpoint', log_level=0).start(barrier=True)
        args.iteration, args.num_floating_point_operations_so_far = load_checkpoint(
            model, optimizer, opt_param_scheduler)
        timers('load-checkpoint').stop(barrier=True)
        timers.log(['load-checkpoint'])
    else:
        args.iteration = 0
        args.num_floating_point_operations_so_far = 0
```
Thanks, but how should I deal with the following code?
```python
if args.load is not None or args.pretrained_checkpoint is not None:
    timers('load-checkpoint', log_level=0).start(barrier=True)
    args.iteration, args.num_floating_point_operations_so_far = load_checkpoint(
        model, optimizer, opt_param_scheduler)
    timers('load-checkpoint').stop(barrier=True)
    timers.log(['load-checkpoint'])
else:
    args.iteration = 0
    args.num_floating_point_operations_so_far = 0
```
This code doesn't conflict with the change; it loads weights from a checkpoint. I don't have any checkpoint, I just want to train the model from HuggingFace's weights.
OK, Got you. Thanks for your help~
@janelu9 Sorry to bother you, just one last question: if you turn off flash attention, does the loss at the first step of further pretraining change?
Maybe, depending on the micro-batch data.
why?