bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

How to properly use Flops Profiler with pipeline parallelism? #385

Open flyingdown opened 1 year ago

flyingdown commented 1 year ago
  1. I train a GPT-2 model with pipeline parallelism; the Flops Profiler enabled in the ds config is useless, it outputs nothing (the config section I mean is sketched below).
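  For reference, this is roughly what the flops_profiler section of that config looks like, written here as the Python dict form that can be passed to deepspeed.initialize(config=...); the field names are DeepSpeed's, the values are only examples:

    ds_config = {
        # ... the rest of the training config ...
        "flops_profiler": {
            "enabled": True,      # turn the profiler on
            "profile_step": 8,    # which training step to profile (example value)
            "module_depth": -1,   # -1 = profile the full module tree
            "top_modules": 1,     # how many top modules to list per depth
            "detailed": True,     # print the per-module breakdown
            "output_file": None,  # None = print to stdout
        },
    }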
  2. So I added code like this:

    from deepspeed.profiling.flops_profiler import FlopsProfiler

    prof = None
    if args.deepspeed:
        # under DeepSpeed pipeline parallelism, model is a list whose first
        # element is the engine (a PipelineEngine)
        prof = FlopsProfiler(model[0])
    else:
        prof = FlopsProfiler(model)
    ...
    if iteration == profile_step:
        prof.start_profile()  # install profiling hooks before the profiled step
    train_step(forward_step_func,
               train_data_iterator,
               model,
               optimizer,
               lr_scheduler)
    ...
    if (iteration == profile_step
            and mpu.get_data_parallel_rank() == 0
            and mpu.is_pipeline_last_stage()
            and mpu.get_tensor_model_parallel_rank() == 0):
        prof.print_model_profile(profile_step=profile_step)
        prof.end_profile()

    but the output only contains fwd info, like:

    -------------------------- DeepSpeed Flops Profiler --------------------------
    Profile Summary at step 8:
    Notations:
    data parallel size (dp_size), model parallel size(mp_size),
    number of parameters (params), number of multiply-accumulate operations(MACs),
    number of floating-point operations (flops), floating-point operations per second (FLOPS),
    fwd latency (forward propagation latency), bwd latency (backward propagation latency),
    step (weights update latency), iter latency (sum of fwd, bwd and step latency)

    params per gpu:                                          463.21 M
    params of model = params per GPU * mp_size:              463.21 M
    fwd MACs per GPU:                                        242304.87 GMACs
    fwd flops per GPU:                                       484678.47 G
    fwd flops of model = fwd flops per GPU * mp_size:        484678.47 G
    fwd latency:                                             29.87 s
    fwd FLOPS per GPU = fwd flops per GPU / fwd latency:     16.22 TFLOPS
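  As a sanity check, the summary is at least internally consistent: dividing fwd flops per GPU by fwd latency reproduces the reported fwd FLOPS (numbers copied from the output above):

    fwd_flops_g = 484678.47    # fwd flops per GPU, in GFLOPs
    fwd_latency_s = 29.87      # fwd latency, in seconds
    fwd_tflops = fwd_flops_g / fwd_latency_s / 1000
    print(f"{fwd_tflops:.2f} TFLOPS")  # 16.23 TFLOPS, matching the reported
                                       # 16.22 up to rounding of the inputs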

3. How should the following data be interpreted? Depth 0 and depth 1 spend less time than depth 2, yet aren't these inclusive relationships (each depth containing the ones below it)? See the sketch after the output.

----------------------------- Aggregated Profile per GPU -----------------------------
Top 1 modules in terms of params, MACs or fwd latency at different model depths:
depth 0:
    params      - {'PipelineEngine': '463.21 M'}
    MACs        - {'PipelineEngine': '242304.87 GMACs'}
    fwd latency - {'PipelineEngine': '29.87 s'}
depth 1:
    params      - {'GPTModelPipe': '463.21 M'}
    MACs        - {'GPTModelPipe': '242304.87 GMACs'}
    fwd latency - {'GPTModelPipe': '29.87 s'}
depth 2:
    params      - {'ParallelTransformerLayerPipe': '402.91 M'}
    MACs        - {'ParallelTransformerLayerPipe': '228698.42 GMACs'}
    fwd latency - {'ParallelTransformerLayerPipe': '55.83 s'}
depth 3:
    params      - {'ParallelMLP': '268.5 M'}
    MACs        - {'ParallelMLP': '140737.49 GMACs'}
    fwd latency - {'ParallelMLP': '27.79 s'}
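My guess (an assumption about the profiler internals, not something I verified in the DeepSpeed source) is that the aggregated view sums over every module instance with the same class name at a given depth, roughly like this:

    from collections import defaultdict

    def aggregate_fwd_latency(module, depth=0, agg=None):
        """Sum fwd latency per (depth, class name) over all instances.

        Assumes the profiler hooks stored a per-module duration; the
        __duration__ attribute name here is hypothetical.
        """
        if agg is None:
            agg = defaultdict(lambda: defaultdict(float))
        agg[depth][type(module).__name__] += getattr(module, "__duration__", 0.0)
        for child in module.children():
            aggregate_fwd_latency(child, depth + 1, agg)
        return agg

If that is right, the 55.83 s at depth 2 is a sum over all ParallelTransformerLayerPipe instances on this stage, accumulated across microbatches, so it is not directly comparable to the single wall-clock 29.87 s at depth 0/1. Is that the intended reading?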

4. The same question applies to the following output.

PipelineEngine(
  463.21 M, 100.00% Params, 233508.78 GMACs, 100.00% MACs, 31.21 s, 100.00% latency, 14.97 TFLOPS,
  (module): GPTModelPipe(
    463.21 M, 100.00% Params, 233508.78 GMACs, 100.00% MACs, 31.21 s, 100.00% latency, 14.97 TFLOPS,
    (tied_modules): ModuleDict(
      60.29 M, 13.02% Params, 0 MACs, 0.00% MACs, 0, 0.00% latency, 0.0 FLOPS,
      (embed): EmbeddingPipe(
        60.29 M, 13.02% Params, 0 MACs, 0.00% MACs, 0, 0.00% latency, 0.0 FLOPS,
        (word_embeddings): VocabParallelEmbedding(51.9 M, 11.21% Params, 0 MACs, 0.00% MACs, 0, 0.00% latency, 0.0 FLOPS, )
        (position_embeddings): Embedding(8.39 M, 1.81% Params, 0 MACs, 0.00% MACs, 0, 0.00% latency, 0.0 FLOPS, 2048, 4096)
        (embedding_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 0, 0.00% latency, 0.0 FLOPS, p=0.1, inplace=False)
      )
    )
    (27): ParallelTransformerLayerPipe(
      50.36 M, 10.87% Params, 27487.79 GMACs, 11.77% MACs, 7.29 s, 23.37% latency, 7.54 TFLOPS,
      (input_layernorm): MixedFusedLayerNorm(8.19 k, 0.00% Params, 0 MACs, 0.00% MACs, 119.17 ms, 0.38% latency, 90.1 GFLOPS, )
      (self_attention): ParallelAttention(
        16.78 M, 3.62% Params, 9895.6 GMACs, 4.24% MACs, 3.0 s, 9.63% latency, 6.59 TFLOPS,
        (query_key_value): ColumnParallelLinear(12.59 M, 2.72% Params, 6597.07 GMACs, 2.83% MACs, 1.11 s, 3.57% latency, 11.86 TFLOPS, )
        (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 286.5 ms, 0.92% latency, 0.0 FLOPS, )
        (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 106.24 ms, 0.34% latency, 0.0 FLOPS, p=0.1, inplace=False)
        (dense): RowParallelLinear(4.2 M, 0.91% Params, 2199.02 GMACs, 0.94% MACs, 883.0 ms, 2.83% latency, 4.98 TFLOPS, )
      )
      (post_attention_layernorm): MixedFusedLayerNorm(8.19 k, 0.00% Params, 0 MACs, 0.00% MACs, 115.44 ms, 0.37% latency, 93.01 GFLOPS, )
      (mlp): ParallelMLP(
        33.56 M, 7.25% Params, 17592.19 GMACs, 7.53% MACs, 3.63 s, 11.63% latency, 9.69 TFLOPS,
        (dense_h_to_4h): ColumnParallelLinear(16.78 M, 3.62% Params, 8796.09 GMACs, 3.77% MACs, 1.49 s, 4.77% latency, 11.83 TFLOPS, )
        (dense_4h_to_h): RowParallelLinear(16.78 M, 3.62% Params, 8796.09 GMACs, 3.77% MACs, 2.08 s, 6.65% latency, 8.47 TFLOPS, )
      )
    )
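If it helps, I can also re-print with a wider aggregated view; print_model_profile takes module_depth and top_modules arguments (signature from the DeepSpeed flops profiler; the values below are just examples):

    # e.g. show the top-3 modules per depth, down to depth 3 of the tree
    prof.print_model_profile(
        profile_step=profile_step,
        module_depth=3,
        top_modules=3,
        detailed=True,
    )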