flowersteam / lamorel

Lamorel is a Python library designed for RL practitioners eager to use Large Language Models (LLMs).
MIT License

why should we have two configs? #18

Closed HCHCXY closed 8 months ago

HCHCXY commented 9 months ago

I notice that in the "ppo finetuning" example, you configure the whole project with the file "local_gpu_config.yaml". However, you create another config file, "default.yaml", for Accelerate. I have the following questions:

  1. Most of the hyperparameters in the two files are identical, so why do we need both files?
  2. I notice that in "local_gpu_config.yaml" you set num_machines to 2, which seems to contradict training on a single machine.
  3. In "local_gpu_config.yaml" you set num_machines to 2, while in "default.yaml" you set num_machines to 1. Which value is actually used?

ClementRomac commented 9 months ago

Hi,

The "default.yaml" file is specific to Accelerate and defines the distributed setup's config (which can be overridden by passing arguments). While similar in some parts, the "local_gpu_config.yaml" is specific to Lamorel (i.e. most arguments are only understood by Lamorel). So the order is the following:

I kept these two config files to avoid copying too many of Accelerate's arguments into Lamorel's config files, as most of these arguments aren't handled by Lamorel itself. But I agree that something more user-friendly may be possible.
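As a rough illustration of the override behaviour mentioned above for "default.yaml" (values from the file act as defaults, and explicitly passed arguments replace them), here is a minimal sketch. The function name `resolve_launch_config` and the `num_processes` value are hypothetical and only serve the example; this is not Accelerate's actual implementation.

```python
def resolve_launch_config(file_values, cli_overrides):
    """Merge launcher settings: explicitly passed arguments win over file defaults.

    Hypothetical helper for illustration only; not part of Accelerate or Lamorel.
    """
    resolved = dict(file_values)
    # An override set to None means "not passed on the command line".
    resolved.update({k: v for k, v in cli_overrides.items() if v is not None})
    return resolved


# num_machines=1 is the value reported for "default.yaml" in this thread;
# num_processes=2 is an assumed value for the example.
file_values = {"num_machines": 1, "num_processes": 2}
cli_overrides = {"num_machines": 2, "num_processes": None}

resolved = resolve_launch_config(file_values, cli_overrides)
print(resolved)
```

In this sketch, `num_machines` ends up as 2 because it was passed explicitly, while `num_processes` keeps the value from the file.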

Concerning num_machines=2, this is an issue related to Accelerate. By default, Accelerate uses the local_rank to set the default GPU device. With num_machines=1 on a single node that has only one GPU, Accelerate will look for a second GPU for the second process, which does not exist. One way to avoid this is to "fool" Accelerate by making it think there are two machines and launching the two processes by hand instead of asking it to do so. In that case, each process has a local_rank equal to 0, and thus both use the same GPU. I just opened a PR to avoid this.
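The effect described above can be reproduced with a small simulation. This is a simplified model, not Accelerate's code: it assumes that processes are split evenly across machines, that local ranks restart at 0 on each machine, and that the default device index equals the local rank. All function names are hypothetical.

```python
def local_ranks(num_machines, procs_per_machine):
    """Return (machine_rank, local_rank) for every process in the job."""
    return [
        (machine, local)
        for machine in range(num_machines)
        for local in range(procs_per_machine)
    ]


def default_device(local_rank):
    # Simplified assumption: the default GPU index follows the local rank.
    return f"cuda:{local_rank}"


def devices(num_machines, num_processes):
    """List the default device each process would pick under this model."""
    procs_per_machine = num_processes // num_machines
    return [
        default_device(local)
        for _, local in local_ranks(num_machines, procs_per_machine)
    ]


# One machine, two processes: the second process asks for a second GPU.
one_machine = devices(num_machines=1, num_processes=2)
# Two "machines", one process each: both processes get local_rank 0.
two_machines = devices(num_machines=2, num_processes=2)

print(one_machine)
print(two_machines)
```

Under these assumptions, the single-machine setup requests `cuda:0` and `cuda:1`, which fails on a node with one GPU, whereas the two-machine setup maps both processes to `cuda:0`.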

Hope this answers your questions.

HCHCXY commented 9 months ago

That's a really clear answer, thanks a lot.