mayank31398 closed this 2 months ago
This has never caused an issue on my side. Can you provide more explanation? When is this an issue?
@RaymondLi0 this doesn't cause an issue if you are running a job with just 1 GPU, I guess. But in my case I have a dedicated node with 8 GPUs, which throws an error saying the global batch size should be a multiple of the number of GPUs. world_size is set to 8 in this case, and we want to emulate it as 1 to unshard :)
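To illustrate the failure mode, here is a minimal sketch (not the repo's actual code) of the kind of Megatron-style argument check that trips: the global batch size must be divisible by the micro batch size times the data-parallel size, and the data-parallel size is derived from `world_size`. With `world_size` inherited as 8, a batch size chosen for single-process unsharding fails the assertion; emulating `world_size = 1` makes it pass trivially. The function name and signature below are illustrative assumptions.

```python
def validate_batch_size(global_batch_size: int, micro_batch_size: int,
                        world_size: int, tp: int = 1, pp: int = 1) -> None:
    # Data-parallel size is world_size divided by the model-parallel degrees.
    data_parallel_size = world_size // (tp * pp)
    assert global_batch_size % (micro_batch_size * data_parallel_size) == 0, (
        f"global batch size ({global_batch_size}) must be a multiple of "
        f"micro batch size ({micro_batch_size}) times data-parallel size "
        f"({data_parallel_size})"
    )

# With world_size=8 this raises for global_batch_size=4;
# with the emulated world_size=1 it passes.
validate_batch_size(global_batch_size=4, micro_batch_size=4, world_size=1)
```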
I am surprised that this has gone unnoticed.
@RaymondLi0 any update on this?
This fixes the missing world_size in the checkpoint-saver Python file. Currently it is picked up from the checkpoint, and we want it to be set to 1.
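A hypothetical sketch of the fix described above, assuming the saver restores a Megatron-style args namespace from the checkpoint; the names `prepare_saver_args` and `checkpoint_args` are illustrative, not the repo's actual identifiers.

```python
from argparse import Namespace

def prepare_saver_args(checkpoint_args: Namespace) -> Namespace:
    margs = checkpoint_args
    # Previously world_size was inherited from the checkpoint (e.g. 8 on an
    # 8-GPU training node); hard-set it to 1 so the saver emulates a single
    # process while unsharding.
    margs.world_size = 1
    return margs
```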