bigcode-project / Megatron-LM

Ongoing research training transformer models at scale

fix missing world_size in args_to_keep #66

Closed mayank31398 closed 2 months ago

mayank31398 commented 1 year ago

This fixes the missing world_size entry in args_to_keep in the checkpoint saver Python file. Currently, world_size is picked up from the checkpoint, and we want it to be forced to 1.
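For context, a minimal sketch of what the fix amounts to, assuming the saver follows the upstream Megatron-LM args_to_keep pattern (the Namespace objects and the trimmed-down list below are illustrative stand-ins, not this fork's exact code):

```python
from argparse import Namespace

# Hypothetical stand-ins: `margs` are the freshly parsed conversion-time args,
# `checkpoint_args` are the args stored inside the training checkpoint.
margs = Namespace(world_size=1, tensor_model_parallel_size=1, seq_length=2048)
checkpoint_args = Namespace(world_size=8, tensor_model_parallel_size=1, seq_length=2048)

# Args whose conversion-time values must survive; everything else is copied
# over from the checkpoint. 'world_size' is the entry this PR adds.
args_to_keep = ['tensor_model_parallel_size', 'world_size']

for arg, value in vars(checkpoint_args).items():
    if arg in args_to_keep:
        continue  # keep the saver's value (world_size stays 1)
    setattr(margs, arg, value)

assert margs.world_size == 1  # without 'world_size' in the list, this would be 8
```

Because the copy loop skips anything listed in args_to_keep, adding 'world_size' prevents the checkpoint's training-time value from clobbering the single-process value the saver sets.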

RaymondLi0 commented 1 year ago

This has never caused an issue on my side. Can you provide more explanation? When is this an issue?

mayank31398 commented 1 year ago

@RaymondLi0 this doesn't cause an issue if you are running a job with just 1 GPU or something, I guess. But in my case, I have a dedicated node with 8 GPUs, which throws an error saying the global batch size should be a multiple of the number of GPUs. world_size is set to 8 in this case, and we want to emulate it as 1 to unshard :)

I am surprised that this has gone unnoticed.
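A rough illustration of the failure mode, assuming Megatron-style argument validation: data-parallel size is derived from world_size, and the global batch size must divide evenly. The validate helper and its exact wording below are hypothetical stand-ins for the real divisibility check:

```python
# Hypothetical reproduction of the sanity check; the real check lives in
# Megatron's argument validation / micro-batch setup, with different wording.
def validate(world_size, tp, pp, micro_batch, global_batch):
    data_parallel = world_size // (tp * pp)
    assert global_batch % (micro_batch * data_parallel) == 0, (
        f"global batch size ({global_batch}) is not divisible by micro batch "
        f"size ({micro_batch}) times data-parallel size ({data_parallel})"
    )

# With world_size forced to 1, any global batch size passes:
validate(world_size=1, tp=1, pp=1, micro_batch=1, global_batch=1)

# If the checkpoint's world_size=8 leaks through on an 8-GPU node, it fails:
try:
    validate(world_size=8, tp=1, pp=1, micro_batch=1, global_batch=1)
except AssertionError as e:
    print(e)  # ...not divisible by...data-parallel size (8)
```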

mayank31398 commented 1 year ago

@RaymondLi0 any update on this?