[AutoTuner] Add first version of autotuner

This PR adds the AutoTuner module, which can be launched with a single command by setting action=auto_tune, for example: python run.py --config-path ./examples/aquila/conf --config-name config action=auto_tune. AutoTuner currently supports searching over all major parallel strategies, including:

data parallel
tensor parallel
pipeline parallel
context parallel
expert parallel
recompute
etc.

AutoTuner is easy to configure: users add an auto_tuner section to their existing training YAML to customize the search, for example:

auto_tuner:
  space:
    num_layers_per_virtual_pipeline_stage: [1]
    use_recompute: [false]
  control:
    max_time_per_task: 300
    train_iters: 5
    max_time: 600

AutoTuner currently implements a heuristic grid search with built-in pruning strategies that use historical results. More search algorithms will be added in the future, so users do not need to configure the search algorithm for now.
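To illustrate what history-based pruning in a grid search can look like, here is a minimal Python sketch. It is not the AutoTuner implementation. The function names (grid_search, should_prune), the run_task callback, and the pruning rule itself are assumptions chosen for illustration: a candidate is skipped if an otherwise identical configuration already ran out of memory with a smaller or equal micro_batch_size.

```python
import itertools


def should_prune(candidate, history):
    """Hypothetical pruning rule (not taken from the PR).

    Skip `candidate` if an otherwise identical config already hit an
    out-of-memory error with a smaller or equal micro_batch_size.
    """
    if "micro_batch_size" not in candidate:
        return False
    for past in history:
        if past["result"] != "oom":
            continue
        same_rest = all(
            past["config"][key] == value
            for key, value in candidate.items()
            if key != "micro_batch_size"
        )
        if same_rest and (
            past["config"]["micro_batch_size"] <= candidate["micro_batch_size"]
        ):
            return True
    return False


def grid_search(space, run_task):
    """Walk the Cartesian product of `space`, skipping pruned candidates.

    `run_task(candidate)` is a hypothetical callback that launches one
    trial and returns "ok" or "oom".
    """
    keys = list(space)
    history = []
    for values in itertools.product(*(space[key] for key in keys)):
        candidate = dict(zip(keys, values))
        if should_prune(candidate, history):
            continue
        history.append({"config": candidate, "result": run_task(candidate)})
    return history
```

Because pruning only consults results that are already recorded, the order in which candidates are visited matters: in this sketch, listing micro_batch_size values in ascending order lets a single failure rule out every larger value.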

Here, space is the search space. Users can customize the candidate values of each dimension; any dimension left undefined falls back to a default provided by the framework. The following search dimensions are built in:

data_parallel_size
use_distributed_optimizer
tensor_model_parallel_size
sequence_parallel
pipeline_model_parallel_size
num_layers_per_virtual_pipeline_stage
use_recompute
recompute_method
recompute_granularity
recompute_num_layers
micro_batch_size
context_parallel_size
expert_model_parallel_size
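As a rough sketch of how a user-defined space could be merged with defaults and expanded into candidate configurations, consider the Python example below. The DEFAULT_SPACE values and the build_candidates helper are hypothetical, since the real defaults are not listed in this PR. The example also assumes the usual Megatron-style decomposition, in which the number of GPUs equals tensor parallel size × pipeline parallel size × context parallel size × data parallel size, and it derives data_parallel_size from the other three rather than searching it directly.

```python
import itertools

# Hypothetical defaults; the real framework defaults are not stated in this PR.
DEFAULT_SPACE = {
    "tensor_model_parallel_size": [1, 2, 4, 8],
    "pipeline_model_parallel_size": [1, 2, 4, 8],
    "context_parallel_size": [1],
    "micro_batch_size": [1, 2],
    "use_recompute": [False, True],
}


def build_candidates(user_space, num_gpus):
    """Merge user overrides into the defaults and enumerate valid configs."""
    # User-provided dimensions replace the default candidate lists.
    space = {**DEFAULT_SPACE, **user_space}
    keys = list(space)
    candidates = []
    for values in itertools.product(*(space[key] for key in keys)):
        cfg = dict(zip(keys, values))
        model_parallel = (
            cfg["tensor_model_parallel_size"]
            * cfg["pipeline_model_parallel_size"]
            * cfg["context_parallel_size"]
        )
        # Assumption: the GPUs must divide evenly across model-parallel groups.
        if num_gpus % model_parallel != 0:
            continue
        cfg["data_parallel_size"] = num_gpus // model_parallel
        candidates.append(cfg)
    return candidates
```

For example, build_candidates({"use_recompute": [False]}, num_gpus=8) would restrict the recompute dimension to a single value while keeping the hypothetical defaults for everything else.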

control governs the search process, for example the maximum running time of each task (max_time_per_task), the number of training steps each task runs (train_iters), and the maximum total running time of AutoTuner (max_time).
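The interaction between the per-task and overall time limits can be sketched as follows. This is an illustrative Python example, not the scheduling code in this PR: run_with_budget and the run_task callback are hypothetical, and the limits are assumed to be in seconds. train_iters is not modeled here, since it would simply be passed through to each training run.

```python
import time


def run_with_budget(candidates, run_task, max_time, max_time_per_task,
                    clock=time.monotonic):
    """Run candidates in order until the overall budget `max_time` is spent.

    `run_task(candidate, timeout)` is a hypothetical callback expected to
    return within `timeout` seconds (for example by killing the trial).
    """
    start = clock()
    results = []
    for candidate in candidates:
        remaining = max_time - (clock() - start)
        if remaining <= 0:
            break  # overall budget exhausted: launch no further tasks
        # A task never gets more time than is left in the overall budget.
        timeout = min(max_time_per_task, remaining)
        results.append((candidate, run_task(candidate, timeout)))
    return results
```

With max_time: 600 and max_time_per_task: 300, as in the YAML above, this sketch would launch at most two tasks if each one uses its full allowance.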

While AutoTuner is running, each task writes to its own log directory. The results are summarized and sorted into a CSV file, so users only need to read the CSV to see the detailed data for each task.
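A sorted summary of this kind could be produced along the following lines. The sketch is written in Python with the standard csv module; the column names (task_id, elapsed_time_per_iter_ms) and the write_summary helper are hypothetical, because the actual CSV layout is not described in this PR.

```python
import csv
import io


def write_summary(results, out, sort_key="elapsed_time_per_iter_ms"):
    """Write task results as CSV, lowest `sort_key` first.

    Tasks without a metric (value None, e.g. failed runs) are placed last.
    Column names are hypothetical and come from the first result's keys.
    """
    ordered = sorted(
        results,
        key=lambda row: (row[sort_key] is None, row[sort_key] or 0.0),
    )
    fieldnames = list(ordered[0]) if ordered else []
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(ordered)
    return ordered


# Usage with an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_summary(
    [
        {"task_id": 1, "elapsed_time_per_iter_ms": 12.5},
        {"task_id": 2, "elapsed_time_per_iter_ms": None},
        {"task_id": 3, "elapsed_time_per_iter_ms": 9.0},
    ],
    buffer,
)
print(buffer.getvalue())
```

Sorting failed tasks to the end keeps the best configuration in the first data row, which matches the goal of reading only the CSV to compare tasks.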

FlagOpen / FlagScale #124