HCHCXY closed this issue 8 months ago
Hi,
The "default.yaml" file is specific to Accelerate and defines the distributed setup's config (which can be overridden by passing arguments). While similar in some parts, the "local_gpu_config.yaml" file is specific to Lamorel (i.e. most of its arguments are only understood by Lamorel). So the order is the following: Lamorel first reads its own config (local_gpu_config.yaml in the example you mention), then takes the Accelerate config (default.yaml) and launches the processes with Accelerate. In this last step, Lamorel figures out on its own the total number of processes (world size).
I kept these two config files separate to avoid copying too many of Accelerate's arguments into Lamorel's config files, as most of these arguments aren't handled by Lamorel itself. But I agree that something more user-friendly may be possible.
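As a rough illustration of "can be overridden by passing arguments", here is a minimal Python sketch of the usual precedence rule: values from the config file act as defaults, and any explicitly passed argument wins. The function name `effective_config` and the example values are hypothetical; this is not Accelerate's or Lamorel's actual code.

```python
def effective_config(file_defaults: dict, cli_overrides: dict) -> dict:
    """Merge file defaults with explicitly passed arguments (hypothetical helper)."""
    merged = dict(file_defaults)
    # Only arguments that were actually passed (not None) replace the defaults.
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged


# Illustrative values only, standing in for the contents of default.yaml.
defaults = {"num_machines": 1, "num_processes": 2, "machine_rank": 0}
# machine_rank was not passed on the command line, so it stays at its default.
overrides = {"num_machines": 2, "machine_rank": None}

config = effective_config(defaults, overrides)
print(config)
```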
Concerning num_machines=2, this is an issue related to Accelerate. By default, Accelerate uses the local_rank to set the default GPU device. With num_machines=1 on a single node with only 1 GPU, the second process gets a local_rank of 1, so Accelerate will look for a second GPU that doesn't exist. One way to avoid this is to "fool" Accelerate by making it think there are two machines and launching the two processes by hand instead of asking it to do so. In that case, each process will have a local_rank equal to 0 (and thus both will use the same GPU). I just opened a PR to avoid this.
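The local_rank reasoning above can be made concrete with a small simulation. This is only a sketch of the idea under the assumption that each process picks the GPU whose index equals its local_rank; the helper names `assign_local_ranks` and `missing_gpus` are hypothetical and not part of Accelerate.

```python
def assign_local_ranks(num_machines: int, num_processes: int) -> list[tuple[int, int]]:
    """Return (machine_rank, local_rank) pairs, spreading processes evenly."""
    per_machine = num_processes // num_machines
    return [(m, l) for m in range(num_machines) for l in range(per_machine)]


def missing_gpus(ranks: list[tuple[int, int]], gpus_per_machine: int) -> list[tuple[int, int]]:
    """Processes whose local_rank points to a GPU index that does not exist."""
    return [(m, l) for (m, l) in ranks if l >= gpus_per_machine]


# num_machines=1, two processes, one GPU: the second process asks for GPU 1.
single = assign_local_ranks(num_machines=1, num_processes=2)
single_missing = missing_gpus(single, gpus_per_machine=1)

# num_machines=2, one process per "machine": both processes have local_rank 0.
fooled = assign_local_ranks(num_machines=2, num_processes=2)
fooled_missing = missing_gpus(fooled, gpus_per_machine=1)

print(single, single_missing)
print(fooled, fooled_missing)
```

In the second case both processes map to GPU 0, which matches the workaround described above: the two processes share the single GPU.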
Hope this answers your questions.
That's a really clear answer, thanks a lot!
I noticed that in the "ppo finetuning" example, you configure the whole project with the file "local_gpu_config.yaml". However, you create another config file, "default.yaml", for Accelerate. I have the following questions: