Metis is a system that automatically finds efficient parallelism plans for distributed deep learning training on heterogeneous GPUs. The auto-planner component of Metis is now publicly available. Please see the paper for further details.
To run this project, you need to install the required packages. Follow the steps below to install the dependencies using the 'requirements.txt' file.
Clone the repository:
git clone https://github.com/SamsungLabs/Metis.git
Navigate to the project directory:
cd ~/Metis
Install dependencies using the requirements.txt file:
pip install -r requirements.txt
Once all dependencies are installed, you are ready to run the project.
The project relies on profile data to make informed decisions about distributed learning strategies. The profile data must be collected for different combinations of device types, tensor parallelism degrees, and batch sizes.
The profile data files must be named according to the following pattern:
DeviceType.{}_tp{}_bs{}.json
Here, the placeholders are filled with the device type, tensor parallelism degree, and batch size, respectively. For example:
DeviceType.A100_tp1_bs1.json
DeviceType.A100_tp1_bs2.json
DeviceType.A100_tp1_bs4.json
...
DeviceType.H100_tp4_bs8.json
Each profile data file is a JSON file containing the following key sections:
Model information (model): the model name, number of layers, and per-layer parameter sizes in bytes.
Execution time (execution_time): total and per-layer compute times in milliseconds, along with timings for stages such as gradient all-reduce and the optimizer step.
Execution memory (execution_memory): total and per-layer memory usage in MB.
An example file:
{
"model": {
"model_name": "GPT3",
"num_layers": 10,
"parameters": {
"total_parameters_bytes": 601952256,
"parameters_per_layer_bytes": [98566144, 50601984, 50601984, 50601984, 50601984, 50601984, 50601984, 50601984, 50601984, 98570240],
"activation_parameters_bytes": [98566144, 50601984, 50601984, 50601984, 50601984, 50601984, 50601984, 50601984, 50601984, 98570240]
}
},
"execution_time": {
"total_time_ms": 1137.5594139099121,
"batch_generator_time_ms": 934.1955184936523,
"layernorm_grads_all_reduce_time_ms": 459.5518112182617,
"embedding_grads_all_reduce_time_ms": 37.360191345214844,
"optimizer_time_ms": 10814.285278320312,
"layer_compute_total_ms": [1.4263919830322266, 10.216951370239258, 10.216951370239258, 10.216951370239258, 10.216951370239258, 10.216951370239258, 10.216951370239258, 10.216951370239258, 10.216951370239258, 0.3376007080078125]
},
"execution_memory": {
"total_memory_mb": 15150.69,
"layer_memory_total_mb": [2366.8, 1195.9, 1195.9, 1195.9, 1195.9, 1195.9, 1195.9, 1195.9, 1195.9, 3216.7]
}
}
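For reference, the snippet below is a minimal sketch (not part of Metis) showing how such a profile file can be loaded and sanity-checked in Python; the file path is only an example, and the field names follow the sample above.
import json

# Minimal sketch: load one profile file and check that the per-layer lists
# match the declared number of layers. The path is an example.
with open("profile_data/DeviceType.A100_tp1_bs1.json") as f:
    profile = json.load(f)

num_layers = profile["model"]["num_layers"]
assert len(profile["execution_time"]["layer_compute_total_ms"]) == num_layers
assert len(profile["execution_memory"]["layer_memory_total_mb"]) == num_layers
print(profile["model"]["model_name"], num_layers, "layers,",
      round(profile["execution_time"]["total_time_ms"], 1), "ms total")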
Organize the profile data files in a structured directory for easy access:
/profile_data
├── DeviceType.A100_tp1_bs1.json
├── DeviceType.A100_tp1_bs2.json
├── DeviceType.A100_tp1_bs4.json
├── DeviceType.A100_tp1_bs8.json
├── DeviceType.A100_tp1_bs16.json
├── DeviceType.A100_tp2_bs1.json
├── DeviceType.A100_tp2_bs2.json
├── DeviceType.A100_tp2_bs4.json
...
├── DeviceType.H100_tp4_bs4.json
├── DeviceType.H100_tp4_bs8.json
└── DeviceType.H100_tp4_bs16.json
Once you have collected the necessary profile data, the optimizer will use these files to calculate the optimal distributed learning strategy for your model. Ensure that all relevant configurations are covered, as missing data may result in suboptimal strategy suggestions. By following this guide, you ensure the profile data is correctly formatted and useful for optimizing distributed learning strategies across different hardware and parallelism settings.
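To help verify coverage, here is a small sketch (not part of Metis) that scans a profile directory and lists the device/TP/batch-size combinations found; the directory path and regular expression follow the naming pattern above and are assumptions.
import os
import re

PROFILE_DIR = "/profile_data"  # example path; adjust to your setup
PATTERN = re.compile(r"DeviceType\.(\w+)_tp(\d+)_bs(\d+)\.json")

# Collect every (device, tp, bs) combination present in the directory.
covered = set()
for name in os.listdir(PROFILE_DIR):
    match = PATTERN.fullmatch(name)
    if match:
        covered.add((match.group(1), int(match.group(2)), int(match.group(3))))

for device, tp, bs in sorted(covered):
    print(f"{device}: tp={tp}, bs={bs}")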
This section explains how to collect the model profile data needed to find the optimal distributed training strategy. It covers measuring the model's execution time and memory usage, both of which are crucial for optimizing distributed training performance. The guide shows how to collect this data using PyTorch's hook functions, GPU memory measurement utilities, and Megatron's Timer module. Note: For more details on PyTorch hooks and memory measurement, refer to the official PyTorch documentation.
The essential profile data for optimizing distributed training are as follows:
To measure the execution time of each layer, we use PyTorch's hook functions. PyTorch provides register_forward_pre_hook and register_forward_hook, along with the backward variants register_full_backward_pre_hook and register_full_backward_hook, to register custom actions at the start and end of each layer's forward and backward passes.
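As an illustration, the sketch below times the forward pass of each submodule with forward pre/post hooks and CUDA synchronization; backward passes can be timed the same way with the backward hook variants. This is a simplified example, not the exact profiling code used by Metis.
import time
import torch
import torch.nn as nn

layer_times_ms = {}

def pre_hook(module, inputs):
    # Synchronize so the timestamp reflects completed GPU work.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    module._fwd_start = time.perf_counter()

def post_hook(module, inputs, output):
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    layer_times_ms[module._profile_name] = (time.perf_counter() - module._fwd_start) * 1000.0

# Toy model; in practice each transformer layer would be hooked.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
for name, module in model.named_children():
    module._profile_name = name
    module.register_forward_pre_hook(pre_hook)
    module.register_forward_hook(post_hook)

model(torch.randn(8, 1024))
print(layer_times_ms)  # forward time per hooked layer, in milliseconds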
Memory usage can be measured with torch.cuda.max_memory_reserved, which returns the peak GPU memory reserved by PyTorch's caching allocator. This value helps track peak memory usage during the execution of each layer.
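A minimal sketch of peak-memory measurement around one forward/backward step is shown below; resetting the peak statistics before the step is an assumption about how to isolate a single measurement, and a CUDA device is required.
import torch
import torch.nn as nn

device = "cuda"  # a CUDA device is assumed
model = nn.Linear(4096, 4096).to(device)
x = torch.randn(32, 4096, device=device)

# Reset peak statistics so the measurement covers only this step.
torch.cuda.reset_peak_memory_stats(device)

model(x).sum().backward()
torch.cuda.synchronize(device)

peak_mb = torch.cuda.max_memory_reserved(device) / (1024 ** 2)
print(f"peak reserved memory: {peak_mb:.1f} MB")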
This project used Megatron's Timer module to collect key metrics. The Timer module precisely measures the time spent in different stages of the training process, especially during parameter updates.
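Because Megatron's Timers API differs across versions, the sketch below uses a plain CUDA-synchronized timer to show the kind of measurement involved (for example, timing the optimizer step); it mimics the idea but is not the Megatron API itself.
import time
import torch

class SimpleTimer:
    """CUDA-synchronized wall-clock timer, similar in spirit to Megatron's Timer."""
    def __init__(self):
        self.elapsed_ms = 0.0
    def start(self):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        self._start = time.perf_counter()
    def stop(self):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        self.elapsed_ms += (time.perf_counter() - self._start) * 1000.0

# Example usage around a parameter update (an optimizer is assumed to exist):
# timer = SimpleTimer()
# timer.start()
# optimizer.step()
# timer.stop()
# print("optimizer_time_ms:", timer.elapsed_ms)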
Metis is a project that finds the optimal distributed training strategy based on the given cluster environment and profile data. Users must first configure the resource and environment information for each node in the cluster, and then execute the script to optimize the training strategy. Below are the instructions for preparing the necessary data and running the script.
Before running Metis, you need to prepare two inputs: a host file listing each node's IP address and number of GPUs, and a cluster information JSON file describing each node's device type, inter- and intra-node bandwidth, and memory capacity.
Host file example (one "IP num_gpus" pair per line):
IP1 8
IP2 8
IP3 8
IP4 8
Cluster information example:
{
"IP1": {
"instance_type": "V100",
"inter_bandwidth": 312500000.0,
"intra_bandwidth": 5312500000.0,
"memory": 16
},
"IP2": {
"instance_type": "P100",
"inter_bandwidth": 312500000.0,
"intra_bandwidth": 5312500000.0,
"memory": 15
},
"IP3": {
"instance_type": "T4",
"inter_bandwidth": 312500000.0,
"intra_bandwidth": 5312500000.0,
"memory": 15
},
"IP4": {
"instance_type": "A100",
"inter_bandwidth": 312500000.0,
"intra_bandwidth": 5312500000.0,
"memory": 80
}
}
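As a quick consistency check, the sketch below (file names are assumptions, not Metis defaults) loads the two inputs and verifies that every host in the host file also appears in the cluster information JSON.
import json

HOSTFILE = "hostfile"              # hypothetical name for the host file above
CLUSTER_INFO = "clusterfile.json"  # hypothetical name for the cluster info above

# Host file: one "IP num_gpus" pair per line.
hosts = {}
with open(HOSTFILE) as f:
    for line in f:
        if line.strip():
            ip, num_gpus = line.split()
            hosts[ip] = int(num_gpus)

with open(CLUSTER_INFO) as f:
    cluster = json.load(f)

missing = [ip for ip in hosts if ip not in cluster]
assert not missing, f"missing cluster info for: {missing}"
for ip, info in cluster.items():
    print(ip, hosts.get(ip), "GPUs,", info["instance_type"], f"{info['memory']} GB")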
After preparing the necessary data, you can run Metis's main script to search for the optimal distributed training strategy.
Main Script: cost_het_cluster.sh
The main parameters are passed to the script as KEY=VALUE arguments. Once all the data is prepared, you can execute the script with the following command:
cd ~/Metis/scripts
source ./cost_het_cluster.sh MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=10 GBS=128 HOME_DIR='/home/user' MAX_PROFILED_TP=4 MAX_PROFILED_BATCH_SIZE=4 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=4
This command explores the search space of parallelism plans and reports the optimal distributed training strategy based on the pre-configured node and device information and the profile data.
This work is licensed under a Creative Commons Attribution Non Commercial 4.0 International License.