FlagOpen / FlagScale

FlagScale is a large model toolkit based on open-sourced projects.

[Profiler] Add group_info output #206

Open. Opened by phoenixdong 2 weeks ago.

Description

This PR adds profiler output describing the parallel groups used during large-model execution, which helps track how work is distributed across devices at runtime.

New Functionality

parallelism_to_groups.json: Defines how tasks are grouped under each parallelism strategy (data, tensor, pipeline, etc.) for large model execution.
rank_to_parallelism_to_group_id.json: Maps each device rank to its group ID under each parallelism strategy.
rank_to_host_and_device.json: Maps each device rank to its hardware (host IP, device ID, and GPU name).
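The relationship between the first two files can be sketched in Python. The layout used here (parallelism name, then group ID, then a list of ranks) and the short names "tp" and "dp" are assumptions for illustration only; the PR description does not show the actual JSON schema, so the real files may be organized differently.

```python
from collections import defaultdict


def invert_groups(parallelism_to_groups):
    """Build rank -> parallelism -> group_id from an ASSUMED
    parallelism -> group_id -> [ranks] layout (hypothetical schema)."""
    rank_to_parallelism_to_group_id = defaultdict(dict)
    for parallelism, groups in parallelism_to_groups.items():
        for group_id, ranks in groups.items():
            for rank in ranks:
                rank_to_parallelism_to_group_id[rank][parallelism] = group_id
    return dict(rank_to_parallelism_to_group_id)


# Hypothetical 4-device layout: tensor-parallel size 2, data-parallel size 2.
example = {
    "tp": {"0": [0, 1], "1": [2, 3]},
    "dp": {"0": [0, 2], "1": [1, 3]},
}

mapping = invert_groups(example)
print(mapping[3])  # group IDs of rank 3 under each strategy
```

Under this assumed layout, every rank appears in exactly one group per strategy, so the per-rank view is a lossless rearrangement of the per-strategy view.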

Note

This PR enables the output of parallel group information for both decoder and encoder modes.

Usage Instructions

To enable the output of parallel group information during model training, add the following to your training configuration file:

system:
  ...
  analyze:
    analyze_save_dir: group_info_output_path

analyze_save_dir: Specifies the directory where the group information is saved. Replace group_info_output_path with the path where the parallel group details should be stored.
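Once a run has finished, the saved files can be inspected with a few lines of Python. This is a minimal sketch that assumes the three JSON files listed under New Functionality are written directly into the analyze_save_dir directory (not a subdirectory); the helper name load_group_info is hypothetical and not part of FlagScale.

```python
import json
from pathlib import Path

# File names taken from the PR description.
GROUP_INFO_FILES = (
    "parallelism_to_groups.json",
    "rank_to_parallelism_to_group_id.json",
    "rank_to_host_and_device.json",
)


def load_group_info(save_dir):
    """Load whichever group-info JSON files exist under save_dir.

    Assumes the files sit directly in save_dir; missing files are skipped.
    Returns a dict keyed by file name.
    """
    loaded = {}
    for name in GROUP_INFO_FILES:
        path = Path(save_dir) / name
        if path.is_file():
            with path.open(encoding="utf-8") as f:
                loaded[name] = json.load(f)
    return loaded
```

For example, load_group_info("group_info_output_path") returns an empty dict if the directory does not exist or none of the files have been written yet.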