FlagOpen / FlagScale

FlagScale is a large model toolkit based on open-sourced projects.

[AutoTuner] Add first version of autotuner #124

Closed Caozhou1995 closed 3 months ago

Caozhou1995 commented 3 months ago

This PR adds the autotuner module, which can be enabled in one step by setting action=auto_tune, for example: python run.py --config-path ./examples/aquila/conf --config-name config action=auto_tune. AutoTuner currently supports searching over all major parallel strategies, including:

AutoTuner is user-friendly: users can customize it by adding an auto_tuner section to the training yaml, as follows:

auto_tuner:
  space:
    num_layers_per_virtual_pipeline_stage: [1]
    use_recompute: [false]
  control:
    max_time_per_task: 300
    train_iters: 5
    max_time: 600

Currently we implement a heuristic grid search algorithm with built-in, efficient pruning strategies based on historical results. More search algorithms will be added in the future, so users don't need to worry about this part for now.
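The idea of a grid search that prunes candidates using earlier results can be sketched roughly as below. This is an illustrative sketch only, not the code in this PR: the names grid, should_prune, search and fake_run, the micro_batch_size dimension, and the specific pruning rule (skip a larger micro batch size once a smaller one has already failed with otherwise identical settings) are all assumptions made for the example.

```python
import itertools


def grid(space):
    """Yield every combination of candidate values as a dict."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def should_prune(candidate, history):
    """Hypothetical rule: if an earlier run failed, skip candidates that
    match it on every other dimension and use an equal or larger
    micro_batch_size, since they would be expected to fail as well."""
    for past, result in history:
        if result is not None:
            continue  # the past run succeeded, nothing to prune from it
        same_rest = all(
            candidate[k] == past[k] for k in candidate if k != "micro_batch_size"
        )
        if same_rest and candidate["micro_batch_size"] >= past["micro_batch_size"]:
            return True
    return False


def search(space, run_task):
    """Walk the grid, skipping pruned candidates, and track the best result."""
    history = []  # (candidate, result) for every task that was actually run
    best = None
    for candidate in grid(space):
        if should_prune(candidate, history):
            continue
        result = run_task(candidate)  # a score, or None if the task failed
        history.append((candidate, result))
        if result is not None and (best is None or result > best[1]):
            best = (candidate, result)
    return best, history


def fake_run(cfg):
    """Toy stand-in for launching a training task (not a real measurement)."""
    if cfg["micro_batch_size"] > 2 and not cfg["use_recompute"]:
        return None  # pretend the task ran out of memory
    return cfg["micro_batch_size"] * (0.8 if cfg["use_recompute"] else 1.0)


space = {"use_recompute": [False, True], "micro_batch_size": [1, 2, 4, 8]}
best, history = search(space, fake_run)
```

In this toy setup the candidate with use_recompute=False and micro_batch_size=8 is never launched, because the micro_batch_size=4 run with the same settings has already failed.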

Here, space is the search space. Users can customize the candidate values of each dimension; any dimension left undefined falls back to a default value provided by the framework. We have the following search dimensions built in:

control is used to control the search process, such as the maximum running time of each task, the number of training steps each task runs, and the maximum total running time of the autotuner.
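One way to picture how a per-task limit and an overall limit interact is the sketch below. It is not taken from this PR: run_with_budget, launch and clock are hypothetical names, and the sketch assumes that max_time_per_task and max_time are both expressed in seconds.

```python
import time


def run_with_budget(tasks, launch, max_time_per_task, max_time, clock=time.monotonic):
    """Run tasks in order until the overall time budget is used up.

    launch(task, timeout) is a caller-supplied function (hypothetical) that
    runs one task and gives up after `timeout` seconds.
    """
    start = clock()
    results = []
    for task in tasks:
        remaining = max_time - (clock() - start)
        if remaining <= 0:
            break  # overall budget exhausted, stop scheduling new tasks
        # A task never gets more time than what is left of the overall budget.
        timeout = min(max_time_per_task, remaining)
        results.append((task, launch(task, timeout)))
    return results
```

With max_time_per_task: 300 and max_time: 600 as in the example config, at most two tasks that each use their full limit would fit into the overall budget. A real launch function could, for instance, pass the timeout to subprocess.run via its timeout argument.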

While the auto tuner is running, each task gets its own log directory, and the results are summarized and sorted, so users only need to look at the csv to see the detailed data for each task.
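For readers who want to post-process such a summary themselves, a minimal sketch follows. The column names (idx, use_recompute, performance), the assumption that a higher value is better, and the sample rows are invented placeholders for illustration; the actual csv layout produced by the autotuner may differ.

```python
import csv
import io


def top_tasks(csv_text, metric="performance", n=3):
    """Return the n rows with the highest value in the `metric` column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    measured = [r for r in rows if r[metric]]  # skip tasks with no measurement
    return sorted(measured, key=lambda r: float(r[metric]), reverse=True)[:n]


# Placeholder data, not real autotuner output.
sample = """idx,use_recompute,performance
1,false,10.0
2,true,
3,true,12.5
"""

best = top_tasks(sample, n=1)
```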