Closed dumpmemory closed 11 months ago
Could you comment on what issue this fix solves (compared to the finetuning code and scripts we provide)?
Yes. It fixes the missing function call when you add the --use-distributed-optimizer argument to the finetuning scripts.
Any update?
Hi, sorry, no update: the whole team is working on a big run right now, and changing the function signature for checkpoint loading is not something we're keen to do at the moment. We should be done in about a month.
Hi @dumpmemory, if I use --use_checkpoint_args and --use_distributed_optimizer together, an assertion error is raised in checkpointing.py because mpu is not yet initialized:
optim_name = os.path.join(
    common_path + "_%03d" % mpu.get_data_parallel_rank(),
    "optim.pt")
The root cause is that _finish_mpu_init() is called after load_args_from_checkpoint(args) in initialize.py; the code is as follows:
def initialize_megatron(extra_args_provider=None,
                        args_defaults={}):
    """Set global variables, initialize distributed, and
    set autoresume and random seeds.
    `allow_no_cuda` should not be set unless using megatron for cpu only
    data processing. In general this arg should not be set unless you know
    what you are doing.
    """
    # Make sure cuda is available.
    assert torch.cuda.is_available(), 'Megatron requires CUDA.'

    # Parse arguments
    args = megatron.arguments.parse_args(extra_args_provider)

    if args.use_checkpoint_args or args_defaults.get('use_checkpoint_args', False):
        assert args.load is not None, '--use-checkpoints-args requires --load argument'
        load_args_from_checkpoint(args)

    megatron.arguments.validate_args(args, args_defaults)

    # set global args, build tokenizer, and set adlr_autoresume,
    # tensorboard-writer, and timers.
    set_global_variables(args)

    # torch.distributed initialization
    def _finish_mpu_init():
        _initialize_distributed(args)
        # Random seeds for reproducibility.
        if args.rank == 0:
            print('> setting random seeds to {} ...'.format(args.seed))
        _set_random_seed(args.seed, args.data_parallel_random_init)

    # Megatron's MPU is the master. Complete initialization right away.
    _finish_mpu_init()
    _init_autoresume()
    # _compile_dependencies(args)

    # No continuation function
    return None
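To make the ordering problem concrete, here is a minimal sketch of one possible workaround (not the patch in this PR): guard the data-parallel-rank lookup so that, while arguments are being read from a checkpoint before _finish_mpu_init() has run, the path construction falls back to rank 0 instead of tripping the assertion. The helper name get_optim_checkpoint_name and the mpu.model_parallel_is_initialized() guard are assumptions for illustration only; another option would be to move load_args_from_checkpoint(args) after _finish_mpu_init().

    import os

    def get_optim_checkpoint_name(common_path, mpu=None):
        # Hypothetical helper: before the data-parallel group exists,
        # mpu.get_data_parallel_rank() raises the assertion shown above,
        # so fall back to rank 0 when mpu is absent or not yet initialized.
        if mpu is not None and mpu.model_parallel_is_initialized():
            dp_rank = mpu.get_data_parallel_rank()
        else:
            dp_rank = 0
        return os.path.join(common_path + "_%03d" % dp_rank, "optim.pt")

    # e.g. while loading args from a checkpoint, before _finish_mpu_init():
    #   optim_name = get_optim_checkpoint_name(common_path)
    #   -> "<common_path>_000/optim.pt"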
I have updated the code and fixed this.
Please try the new version.
hello @dumpmemory we're working on clearing the open issues and will be getting to this one soon. Thank you for your patience.
Thank you for your contribution @dumpmemory. We'll not merge this to keep our own complexity down. Sorry if this wasn't clear, but this repo is meant more as replication code for an upcoming paper than a long-lived fork from NVIDIA's megatron and we are not so interested in allocating time to adding features that we're not using. I'll add a note to the docs saying this :).
@kylematoba So does that mean the main branch doesn't support use_distributed_optimizer?
Fix issues for finetuning with the use_distributed_optimizer option