Closed dumpmemory closed 11 months ago
Could you comment on what issue this fix solves (compared to the finetuning code and scripts we provide)?
Yes. It fixes the missing function call when you add the --use-distributed-optimizer argument to the finetuning scripts.
Any update?
Hi, sorry, no update: the whole team is working on a big run right now, and changing the function signature for checkpoint loading is not something we're keen to do at the moment. We should be done in about a month.
Hi @dumpmemory, if I use --use_checkpoint_args and --use_distributed_optimizer together, an assertion error is raised in checkpointing.py because mpu is not yet initialized:
optim_name = os.path.join(
    common_path + "_%03d" % mpu.get_data_parallel_rank(),
    "optim.pt")
The root cause is that _finish_mpu_init() is called after load_args_from_checkpoint(args) in initialize.py; the code is as follows:
def initialize_megatron(extra_args_provider=None,
                        args_defaults={}):
    """Set global variables, initialize distributed, and
    set autoresume and random seeds.
    `allow_no_cuda` should not be set unless using megatron for cpu only
    data processing. In general this arg should not be set unless you know
    what you are doing.
    """
    # Make sure cuda is available.
    assert torch.cuda.is_available(), 'Megatron requires CUDA.'

    # Parse arguments
    args = megatron.arguments.parse_args(extra_args_provider)

    if args.use_checkpoint_args or args_defaults.get('use_checkpoint_args', False):
        assert args.load is not None, '--use-checkpoints-args requires --load argument'
        load_args_from_checkpoint(args)

    megatron.arguments.validate_args(args, args_defaults)

    # set global args, build tokenizer, and set adlr_autoresume,
    # tensorboard-writer, and timers.
    set_global_variables(args)

    # torch.distributed initialization
    def _finish_mpu_init():
        _initialize_distributed(args)
        # Random seeds for reproducibility.
        if args.rank == 0:
            print('> setting random seeds to {} ...'.format(args.seed))
        _set_random_seed(args.seed, args.data_parallel_random_init)

    # Megatron's MPU is the master. Complete initialization right away.
    _finish_mpu_init()
    _init_autoresume()
    # _compile_dependencies(args)

    # No continuation function
    return None
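To make the ordering problem concrete, here is a minimal sketch of one possible workaround (not the patch in this PR): guard the data-parallel-rank lookup so that, while arguments are being read from a checkpoint before _finish_mpu_init() has run, the path construction falls back to rank 0 instead of tripping the assertion. The helper name get_optim_checkpoint_name and the mpu.model_parallel_is_initialized() guard are assumptions for illustration only; another option would be to move load_args_from_checkpoint(args) after _finish_mpu_init().

    import os

    def get_optim_checkpoint_name(common_path, mpu=None):
        # Hypothetical helper: before the data-parallel group exists,
        # mpu.get_data_parallel_rank() raises the assertion shown above,
        # so fall back to rank 0 when mpu is absent or not yet initialized.
        if mpu is not None and mpu.model_parallel_is_initialized():
            dp_rank = mpu.get_data_parallel_rank()
        else:
            dp_rank = 0
        return os.path.join(common_path + "_%03d" % dp_rank, "optim.pt")

    # e.g. while loading args from a checkpoint, before _finish_mpu_init():
    #   optim_name = get_optim_checkpoint_name(common_path)
    #   -> "<common_path>_000/optim.pt"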
I have updated the code and fixed this.
Please try the new version.
hello @dumpmemory we're working on clearing the open issues and will be getting to this one soon. Thank you for your patience.
Thank you for your contribution @dumpmemory. We'll not merge this to keep our own complexity down. Sorry if this wasn't clear, but this repo is meant more as replication code for an upcoming paper than a long-lived fork from NVIDIA's megatron and we are not so interested in allocating time to adding features that we're not using. I'll add a note to the docs saying this :).
@kylematoba So does that mean the main branch doesn't support use_distributed_optimizer?
Fix issues for finetuning with the use_distributed_optimizer option