why should we have two configs?

HCHCXY commented 9 months ago

I noticed that in the "PPO finetuning" example, you configure the whole project with the file "local_gpu_config.yaml". However, you also create another config file, "default.yaml", for Accelerate. I have the following questions:

Most of the hyperparameters in the two files are identical, so why do we need both files?
I noticed that "local_gpu_config.yaml" sets num_machines to 2. Doesn't that contradict training on a single machine?
"local_gpu_config.yaml" sets num_machines to 2, while "default.yaml" sets num_machines to 1. Which value is actually used?

ClementRomac commented 9 months ago

Hi,

The "default.yaml" file is specific to Accelerate and defines the distributed setup's config (which can be overridden by passing arguments). While similar in some parts, the "local_gpu_config.yaml" is specific to Lamorel (i.e. most arguments are only understood by Lamorel). So the order is the following:

Lamorel is launched via the launcher and fetches its config file (local_gpu_config.yaml in the example you mention).
Then, it overrides some of the config's entries with the arguments provided to the launcher.
Finally, it uses all these to override Accelerate's config (default.yaml) and launch the processes with Accelerate. In this last step, Lamorel figures out on its own the total number of processes (world size).
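
As a rough illustration of the order above (not Lamorel's actual code), the three steps can be sketched with plain dictionaries standing in for the YAML files. The function name `resolve_config`, the key `lamorel_only_option`, and the rule that only keys already present in Accelerate's defaults get overridden are assumptions made for this example.

```python
# Illustrative sketch only: dicts stand in for the two YAML files.
# Key names and values are hypothetical, not copied from the repository.

def resolve_config(lamorel_cfg, launcher_args, accelerate_defaults):
    """Return (final Lamorel config, final Accelerate config)."""
    # Steps 1-2: start from Lamorel's config file, then apply launcher arguments.
    lamorel = {**lamorel_cfg, **launcher_args}
    # Step 3: entries that Accelerate also understands override default.yaml.
    # Assumption: keys unknown to Accelerate are simply not forwarded.
    accelerate = dict(accelerate_defaults)
    for key in accelerate_defaults:
        if key in lamorel:
            accelerate[key] = lamorel[key]
    return lamorel, accelerate


lamorel_cfg = {"num_machines": 2, "lamorel_only_option": True}
launcher_args = {"machine_rank": 1}
accelerate_defaults = {"num_machines": 1, "machine_rank": 0, "mixed_precision": "no"}

lamorel, accelerate = resolve_config(lamorel_cfg, launcher_args, accelerate_defaults)
print(accelerate)
```

In this sketch the num_machines value from Lamorel's config replaces the one in Accelerate's defaults, which matches the answer to the third question: the value in "default.yaml" is only a fallback.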

I kept these two config files to avoid copying too many of Accelerate's arguments into Lamorel's config files, as most of these arguments aren't handled by Lamorel itself. But I agree that something more user-friendly may be possible.

Concerning num_machines=2, this is an issue related to Accelerate. By default, Accelerate uses the local_rank to set the default GPU device. With num_machines=1 on a single node that has only 1 GPU, Accelerate will look for a second GPU for the second process... One way to avoid this is to "fool" Accelerate by making it think there are two machines and launching the two processes by hand instead of asking it to do so. In that case, each process will have a local_rank equal to 0 (and thus both will use the same GPU). I just opened a PR to avoid this.
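
A toy model of the local_rank behaviour may make this easier to picture. It is a simplification, not Accelerate's code: it assumes each process uses the GPU whose index equals its local_rank, and that local ranks restart at 0 on every machine. The helper names `local_ranks` and `missing_gpus` are made up for this example.

```python
def local_ranks(num_machines, procs_per_machine):
    """Local rank of every process, grouped by machine (ranks restart at 0 per machine)."""
    return [list(range(procs_per_machine)) for _ in range(num_machines)]


def missing_gpus(num_machines, procs_per_machine, gpus_per_machine):
    """GPU indices that some process would request but that do not exist.

    Simplifying assumption: a process uses the GPU with index == its local_rank.
    """
    missing = set()
    for ranks in local_ranks(num_machines, procs_per_machine):
        for rank in ranks:
            if rank >= gpus_per_machine:
                missing.add(rank)
    return sorted(missing)


# One machine, two processes, one GPU: the second process asks for GPU 1.
single = missing_gpus(num_machines=1, procs_per_machine=2, gpus_per_machine=1)
# Two "machines" with one process each: every local_rank is 0, so both share GPU 0.
fooled = missing_gpus(num_machines=2, procs_per_machine=1, gpus_per_machine=1)
print(single)  # [1]
print(fooled)  # []
```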

Hope this answers your questions.

HCHCXY commented 9 months ago

That's a really clear answer, thanks a lot.

flowersteam / lamorel

why should we have two configs? #18