This PR adds a plugin for mixture-of-experts training, combining FSDP with expert parallel, where the latter is borrowed from Databricks MegaBlocks. It implements the FSDP1 version of expert parallel from https://github.com/foundation-model-stack/moe-distributed-training.

## What is Expert Parallel?

Expert parallel is a form of model parallelism that applies to mixture-of-experts (MoE) models:
- It shards the experts (typically onto different devices), where they run in parallel, improving throughput.
- It requires a method to shuffle routed tokens to the devices where the corresponding experts reside (the all-to-all primitive); a minimal sketch follows this list.
- It achieves significant speedups when the experts are balanced.
- Expert parallel within a GPU: on A100, async mode is not available within a GPU, so using expert parallel within a GPU requires a dropless implementation to avoid padding or dropping tokens when managing multiple experts on a single device. Note that on H100 this can be avoided by using a new kernel for GEMM.
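To make the all-to-all step concrete, here is a minimal sketch of the token shuffle, assuming an already-initialized process group with one expert-parallel rank per device. The helper name and shapes are illustrative only, not the MegaBlocks implementation:

```python
import torch
import torch.distributed as dist

def shuffle_tokens_to_experts(tokens: torch.Tensor, dest_rank: torch.Tensor) -> torch.Tensor:
    """tokens: (num_tokens, hidden); dest_rank: (num_tokens,) rank holding each token's expert."""
    world_size = dist.get_world_size()

    # Sort tokens by destination rank so each rank's slice is contiguous.
    order = torch.argsort(dest_rank)
    tokens = tokens[order]

    # Exchange per-rank token counts so every rank knows how much it will receive.
    send_counts = torch.bincount(dest_rank, minlength=world_size)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # Exchange the token activations themselves (uneven splits are allowed).
    recv_tokens = tokens.new_empty((int(recv_counts.sum()), tokens.shape[-1]))
    dist.all_to_all_single(
        recv_tokens,
        tokens,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    # recv_tokens now holds every token routed to the experts living on this rank;
    # a second all-to-all (not shown) returns the expert outputs to their origin.
    return recv_tokens
```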
*Diagram: Data Parallel (e.g., FSDP) vs Expert Parallel*
## Performance

### Benchmark Results

#### Full-Finetuning
- A100-80GB, single node (8 devices)
- effective batch size of 128 (per-device batch size of 1, gradient accumulation of 16)
- Alpaca instruction-formatted dataset, no packing
- `torch.bfloat16` training, no mixed precision
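For reference, the effective batch size works out as 8 devices × 1 per-device × 16 gradient-accumulation steps = 128.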
| Model | GPUs | Type | mem_peak | mem_alloc | train_runtime | mem improvement | runtime improvement |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mixtral-8x7B-Instruct-v0.1 | 8 | FSDP-only | 54.8 GB | 44.0 GB | 4019 s | baseline | baseline |
| Mixtral-8x7B-Instruct-v0.1 | 8 | our plugin | 45.5 GB | 33.5 GB | 996 s | 33 % | 4.00 x |
NOTE: the train runtimes were collected with `--skip_memory_metrics=True` (the Hugging Face default); setting this to `False` was only used to benchmark the memory numbers, as it is known to result in worse runtime measurements.

NOTE: throughput was 83 and 337 tokens per second, respectively.
### Checkpoint Resumption
Checkpointing works, as evidenced by correct training resumption behavior; checkpoints use `torch.distributed.dcp` with DTensor sharding.
## Implementation Details

### Comparison with DeepSpeed MoE (DS-MoE)

DeepSpeed also has support for mixture-of-experts sharding. Noting down some points here:
- DS-MoE does not support `stage3` when also sharding MoE; this means that the non-MoE parameters cannot be parameter-sharded.
- DS-MoE uses a similar all-to-all primitive to distribute tokens to experts.
- DS-MoE requires custom parameter preparation by calling the DeepSpeed function `split_params_into_different_moe_groups_for_optimizer`. This call is not integrated into accelerate's `_prepare_deepspeed` function:
```python
from deepspeed.moe.utils import split_params_into_different_moe_groups_for_optimizer

def create_moe_param_groups(model):
    # Put all parameters into a single named group, then let DeepSpeed split
    # the MoE (expert) parameters out into their own optimizer groups.
    parameters = {'params': [p for p in model.parameters()], 'name': 'parameters'}
    return split_params_into_different_moe_groups_for_optimizer(parameters)

optimizer_grouped_parameters = create_moe_param_groups(opt_model)
```
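For illustration only (this is not part of the PR), the resulting groups would then be passed to the optimizer and handed to DeepSpeed directly, since accelerate does not do it for you; `ds_config` here is an assumed DeepSpeed configuration dict:

```python
import torch
import deepspeed

# Illustrative sketch: build the optimizer from the MoE-aware parameter groups
# produced above and hand it to DeepSpeed yourself.
optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=5e-5)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=opt_model,    # the model from the snippet above
    optimizer=optimizer,
    config=ds_config,   # assumed: your DeepSpeed configuration dict
)
```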
### Updates to `benchmark.py`

We now also:
- allow an empty `framework_config` entry, so a scenario can include the "no acceleration" case in the matrix;
- added a `slow` tag; if true, the scenario is ignored in unfiltered runs;
- added an `accelerator-config.json` to pass arguments to `Accelerator`, for example to set the `sync_each_batch` flag (a sketch of its contents follows the scenario example below).

All three are illustrated in the scenario below:
```yaml
name: accelerated-moe-megablocks
framework_config:
  -   # without acceleration. <- NEW
  - moe-megablocks
slow: True   # <- NEW: will be ignored in unfiltered runs
arguments:
  learning_rate: 5e-5
  torch_dtype: bfloat16
  accelerator_config: scripts/benchmarks/accelerator-config.json
  gradient_accumulation_steps: 16
  logging_steps: 1
  packing: False
  adam_epsilon: 1e-8
  model_name_or_path:
    - 'mistralai/Mixtral-8x7B-Instruct-v0.1'
```
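The contents of `accelerator-config.json` are not shown in this description; assuming it is consumed through HF Trainer's `accelerator_config` argument, a minimal sketch that sets the `sync_each_batch` flag could look like:

```json
{
  "gradient_accumulation_kwargs": {
    "sync_each_batch": true
  }
}
```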
## Checklist of items covered

- [x] refactored and made it easy to incorporate other HF MoE models
- [x] updated the configurability to incorporate different expert parallel dimensions
- [x] updated the scenarios bench
- [x] generally works for single node; provisions put in for multi-node (but not tested)
- [x] handles loss balancing correctly
- [x] formatting and linting
- [x] checkpointing, ensuring that `torch.distributed.dcp` is operating correctly
## Known Issues

- The `torch.concat` operation dominates the `load_sharded_experts_onto_device` function.
- It is very inefficient that, on all devices, we `concat` all the expert weights and then pass the result to `torch.distributed` to shard it; a sketch of this pattern is shown below.
- This may not scale well to larger numbers of experts in the future.
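To illustrate the pattern being flagged (a hypothetical function, not the plugin's actual `load_sharded_experts_onto_device`), every rank first materializes the full concatenated expert weights and only then shards them, sketched here with the experimental DTensor API:

```python
import torch
from torch.distributed._tensor import Shard, distribute_tensor

def naive_shard_experts(expert_weights, device_mesh):
    """expert_weights: list of per-expert (ffn_dim, hidden_dim) weight tensors."""
    # Every device pays for concatenating the full set of expert weights ...
    full = torch.concat([w.unsqueeze(0) for w in expert_weights], dim=0)
    # ... even though each rank ultimately keeps only its own shard along the
    # expert dimension once the tensor is handed to torch.distributed.
    return distribute_tensor(full, device_mesh, placements=[Shard(0)])
```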