[Bug][MT5] Throughput is unexpected

strint commented 1 year ago

[10/19 15:36:23 lb.utils.events] eta: 7986 days, 7:30:37 iteration: 99/621340880 consumed_samples: 200 total_loss: 9.545 time: 1.1185 s/iter data_time: 0.0151 s/iter total_throughput: 1.79 samples/s lr: 1.02e-08
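
As a rough cross-check of the log line above, the sketch below recomputes the throughput from the logged time per iteration. It assumes the per-iteration batch is consumed_samples divided by the number of completed iterations (200 / 100 = 2), which is inferred from the log rather than stated in it.

```python
# Values copied from the log line above.
consumed_samples = 200
iterations_done = 100      # assumption: iteration index 99 is 0-based
sec_per_iter = 1.1185
total_iters = 621340880

# Assumption: every iteration consumes the same number of samples.
samples_per_iter = consumed_samples / iterations_done
throughput = samples_per_iter / sec_per_iter

# Naive ETA from the remaining iterations. The logger may smooth its timings,
# so this is only expected to agree with the logged ETA in order of magnitude.
eta_days = (total_iters - iterations_done) * sec_per_iter / 86400

print(f"{throughput:.2f} samples/s, roughly {eta_days:.0f} days remaining")
```

Under these assumptions the recomputed throughput rounds to the logged 1.79 samples/s, so the multi-thousand-day ETA follows directly from the tiny batch and the very large iteration count.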

xyn1201 commented 1 year ago

T5 single-node 4-GPU test

Machine: oneflow-25, single node with 4 GPUs
oneflow master https://github.com/Oneflow-Inc/oneflow/commit/93d19f3be52632cccc875c8e46011eced14249a0
libai main https://github.com/Oneflow-Inc/libai/commit/e9ca4087cb35b3ad268534ee60456db689e36063
Test case: t5_nl12_nah12_hs768_FP16_actrue_mp2_pp1_mb32_gb512_1n4g zero_stage=2
- libai: 4089 MiB / 85.83 samples/s, log: oss://oneflow-test/libai_vs_megatron/1020_t5/25/libai_zero/
- Megatron-deepspeed: 4725 MiB / 82.7 samples/s

T5 2-node test (4 GPUs per node)

Machines: oneflow-25 and oneflow-28, 2 nodes with 8 GPUs in total
oneflow master https://github.com/Oneflow-Inc/oneflow/commit/93d19f3be52632cccc875c8e46011eced14249a0
libai main https://github.com/Oneflow-Inc/libai/commit/e9ca4087cb35b3ad268534ee60456db689e36063
Test case: t5_nl12_nah12_hs768_FP16_actrue_mp2_pp1_mb16_gb512_2n4g zero_stage=2
- libai: 4530 MiB / 50.92 samples/s, log: oss://oneflow-test/libai_vs_megatron/1019_t5/libai_zero/
- Megatron-deepspeed: 4783 MiB / 164.3 samples/s

chengtbf commented 1 year ago

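The obvious problem here is that our 4-GPU 2-D parallel run beats Megatron, but the 2-node 8-GPU throughput is lower than the single-node 4-GPU throughput, whereas Megatron shows a linear speedup.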
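
To put a number on "linear speedup", the sketch below computes scaling efficiency from the throughputs reported above. The helper name `scaling_efficiency` is made up for illustration; both runs use gb512, so this is strong scaling at a fixed global batch.

```python
def scaling_efficiency(base_throughput, base_gpus, scaled_throughput, scaled_gpus):
    """Measured throughput divided by the ideal (linear) throughput."""
    ideal = base_throughput * scaled_gpus / base_gpus
    return scaled_throughput / ideal


# samples/s from the 1n4g and 2n4g runs reported above
libai_eff = scaling_efficiency(85.83, 4, 50.92, 8)
megatron_eff = scaling_efficiency(82.7, 4, 164.3, 8)

print(f"libai: {libai_eff:.0%} of linear, Megatron-deepspeed: {megatron_eff:.0%} of linear")
```

An efficiency near 1.0 means doubling the GPUs doubled the throughput; a value below 0.5 means the 8-GPU run is slower in absolute terms than the 4-GPU run.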

xiezipeng-ML commented 1 year ago

There is an issue here: libai.models.T5Model is the Megatron version, while IDEA needs the Hugging Face version of T5, which is the T5 under libai's projects (projects/T5 is the delivery project). The two model structures differ, and I have already asked yongning to add a test for projects/T5. Before delivery we also compared projects/T5 against libai.models.T5Model under pure data parallelism: here. Because the two models are different, I don't think we can simply compare performance against Megatron, since Megatron does not implement the Hugging Face version of T5.

Summary of the differences between the two T5s:

- The layernorm op differs (mt5 uses RMSLayernorm, an op composed in C++)
- The decoder has one extra embedding layer: https://github.com/Oneflow-Inc/libai/blob/b3c5ba2b90ae6debbebf8e9b96806327fb21c9c5/projects/T5/models/attention.py#L117-L120
- The dropout op differs (mt5 uses: https://github.com/Oneflow-Inc/oneflow/pull/8693)
- The lm_head of mt5 (the T5 under projects) does not share parameters with the embedding (https://github.com/Oneflow-Inc/libai/blob/9a4af263756ff6a1c8abe73e9a51a29f0d8c0533/projects/T5/models/t5_model.py#L129-L134 )
- mt5 (the T5 under projects) lacks the position_embedding that t5 (the T5 in libai.models) has, but its attention has extra position_bias computation (https://github.com/Oneflow-Inc/libai/blob/e9ca4087cb35b3ad268534ee60456db689e36063/projects/T5/models/attention.py#L272 and https://github.com/Oneflow-Inc/libai/blob/e9ca4087cb35b3ad268534ee60456db689e36063/projects/T5/models/attention.py#L320 )
- mt5 (the T5 under projects) contains no bias at all (in Linear and LayerNorm)
- Because mt5 (the T5 under projects) has to match the Hugging Face version, it does not use some of the optimizations in t5 (the T5 in libai.models), such as scale_mask_softmax_fusion (mt5: https://github.com/Oneflow-Inc/libai/blob/9a4af263756ff6a1c8abe73e9a51a29f0d8c0533/projects/T5/models/attention.py#L232-L244 t5: https://github.com/Oneflow-Inc/libai/blob/9a4af263756ff6a1c8abe73e9a51a29f0d8c0533/libai/layers/attention.py#L214-L250 )
- The MLP layer of mt5 (the T5 under projects) has one more Linear than t5 (https://github.com/Oneflow-Inc/libai/blob/main/projects/T5/models/mlp.py )
- [from chengcheng] The FuseMultiHeadAttention optimization in Attention: Megatron transposes the batch-size dimension, which normally comes first, to dimension 1. This is hard to justify semantically, but for performance it means a single transpose before the Transformer layers is enough, and the matmuls inside can run as batched GEMM. Without it, every layer needs its own transpose. This optimization alone accounts for a 10% performance gap. https://github.com/NVIDIA/Megatron-LM/blob/0bb597b42c53355a567aba2a1357cc34b9d99ddd/megatron/model/transformer.py#L395

mt5 does not use scale_mask_softmax_fusion, so it follows the else branch of the t5 attention:

import oneflow as flow
from oneflow import nn

# AttnMaskType is the enum used by libai/layers/attention.py (linked above).
from libai.layers.attention import AttnMaskType


def att_mt5(attention_scores, attention_mask):
    # mt5 path: unfused mask, softmax and dropout as separate ops.
    dropout = nn.Dropout(0)
    attention_scores = flow.mul(attention_scores, attention_mask)
    attention_scores = attention_scores - 10000.0 * (1 - attention_mask)
    attention_weights = flow.softmax(attention_scores, dim=-1)
    attention_weights = dropout(attention_weights)
    return attention_weights


def att_t5(
    attention_scores,
    attention_mask,
    scale_mask_softmax_fusion=True,
    coeff=None,
    attention_dropout_prob=0,
    use_cache=False,
    attn_mask_type=AttnMaskType.padding,
):
    dropout = nn.Dropout(0)
    if scale_mask_softmax_fusion:
        # Fused path: scale, mask, softmax and dropout in a single kernel.
        if attn_mask_type == AttnMaskType.padding:
            attention_mask = (
                attention_mask.expand_as(attention_scores) if use_cache else attention_mask
            )
            attention_weights = flow._C.fused_scale_mask_softmax_dropout(
                attention_scores,
                attention_mask,
                fill_value=-10000.0,
                scale=coeff,
                p=attention_dropout_prob,
            )[0]
        else:
            # Only the padding-mask case is reproduced in this snippet.
            raise NotImplementedError("only AttnMaskType.padding is shown here")
    else:
        # Unfused path: identical to att_mt5 apart from the optional scaling.
        if coeff is not None:
            attention_scores *= coeff
        attention_scores = flow.mul(attention_scores, attention_mask)
        attention_scores = attention_scores - 10000.0 * (1 - attention_mask)
        attention_weights = flow.softmax(attention_scores, dim=-1)
        attention_weights = dropout(attention_weights)
    return attention_weights

@chengtbf @strint @xyn1201

jackalcooper commented 1 year ago

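@xiezipeng-ML Does the Hugging Face version of T5 here mean the one in the transformers library? If so, once a OneFlow backend for the T5 in transformers is supported directly, do you think distributed training could run as is? Last week I ported inference for the CLIP model in transformers, and I don't know how much more training would require. CLIP and T5 in transformers should share some basic modules.

xiezipeng-ML commented 1 year ago

> @xiezipeng-ML Does the Hugging Face version of T5 here mean the one in the transformers library? If so, once a OneFlow backend for the T5 in transformers is supported directly, do you think distributed training could run as is? Last week I ported inference for the CLIP model in transformers, and I don't know how much more training would require. CLIP and T5 in transformers should share some basic modules.

Yes, the transformers repo. I'll ask you about it on Slack.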

chengtbf commented 1 year ago

@xyn1201 Are the nsys results for this still missing?

xyn1201 commented 1 year ago

I just ran 2-node (4 GPUs per node) tests with dp4_mp2_pp1 and dp2_mp4_pp1
- dp4_mp2_pp1: throughput is fairly normal
- dp2_mp4_pp1: this is the configuration IDEA gave us. It runs very slowly: the first iteration had not finished after 15 minutes, so I stopped waiting.

Below are the comparison results for the dp4_mp2_pp1 configuration. The libai numbers were run today and the Megatron numbers come from the earlier comment. The model parameters of the two are aligned, but the datasets differ. @xiezipeng-ML, could you explain that part?

projects/T5 single-node 4-GPU test

Machine: oneflow-28, single node with 4 GPUs
oneflow master https://github.com/Oneflow-Inc/oneflow/commit/f97f09f1d9a8668c972a12f66d77aaa19b164635
libai test_t5_time https://github.com/Oneflow-Inc/libai/commit/0002b6637c92e19728cd26830494fa33ab68efc1
Comparison:
- libai: mt5_pretrain.py mb16_gb256 dp2_mp2_pp1 zero_stage=2 (running mb32_gb512 goes OOM)
8913 MiB / 68.00 samples/s nsys, log: oss://oneflow-test/mt5_test/1021/1n4g_28/libai/
- Megatron-deepspeed: mb32_gb512 dp2_mp2_pp1 zero_stage=2
4725 MiB / 82.7 samples/s nsys

projects/T5 2-node test (4 GPUs per node)

Machines: oneflow-25 and oneflow-28, 2 nodes with 8 GPUs in total
oneflow master https://github.com/Oneflow-Inc/oneflow/commit/f97f09f1d9a8668c972a12f66d77aaa19b164635
libai test_t5_time https://github.com/Oneflow-Inc/libai/commit/0002b6637c92e19728cd26830494fa33ab68efc1
Test case:
- libai: mt5_pretrain.py mb16_gb512 dp4_mp2_pp1 zero_stage=2 8613 MiB / 63.75 samples/s 25_nsys 28_nsys, log: oss://oneflow-test/mt5_test/1021/2n4g/libai/
- Megatron-deepspeed: mb16_gb512 dp4_mp2_pp1 zero_stage=2 4783 MiB / 164.3 samples/s 25_nsys 28_nsys

chengtbf commented 1 year ago

Still missing: the nsys profiles for Megatron 1n4d and 2n4d, and the nsys profile for oneflow 1n4d.

xiezipeng-ML commented 1 year ago

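> Below are the comparison results for the dp4_mp2_pp1 configuration. The libai numbers were run today and the Megatron numbers come from the earlier comment. The model parameters of the two are aligned, but the datasets differ. @xiezipeng-ML, could you explain that part?

Last night, on libai's main branch, I swapped IDEA's dataset for Megatron's dataset and tested: throughput is the same with both datasets.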

xyn1201 commented 1 year ago

Single GPU, mb4_gb32: libai_nsys megatron_nsys @chengtbf

chengtbf commented 1 year ago

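Preliminary analysis and conclusions

The tests of the previous two days and the local tests were affected by the differences between T5 (Megatron) and MT5 (Hugging Face) and by an outdated local LiBai version, which delayed the analysis.

The preliminary conclusion is that under 2-D SBP, the SBP inference result for OneFlow T5 is inefficient: it incurs several times the communication overhead of Megatron, which makes overall throughput roughly three times slower. The problem gets worse as the batch size grows.

Two-node analysis

Megatron 2-node nsys result:

Two metrics matter most: the total forward plus backward time of a single iteration, and the share of time taken by each kernel:

A single iteration takes 357 ms, split into fw encoder (78 ms) + fw decoder (24 ms) + loss (9 ms) + bw decoder (88 ms) + bw encoder (150 ms)
NCCL communication is almost entirely allreduce and accounts for 66% of total compute time (49.8% + 12.8% + 3.4%)

Megatron 2n4d overview

LiBai MT5 2-node nsys result:

Total time is 1000 ms, about 3x Megatron's: fw encoder (360 ms) + fw decoder (94 ms) + bw decoder (175 ms) + bw encoder (339 ms)
NCCL share: the bulk is not allreduce but send/recv (SBP inference probably hit a bad case, and send/recv is very inefficient): send/recv 46.4% (probably all redundant) + allreduce 17.9% + allgather 12.9%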
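
The figures above can be tallied with a short sketch. All numbers are copied from this comment; the listed phases do not add up to the quoted totals exactly, and the remainder is not itemized here.

```python
megatron_ms = {"fw encoder": 78, "fw decoder": 24, "loss": 9,
               "bw decoder": 88, "bw encoder": 150}
libai_ms = {"fw encoder": 360, "fw decoder": 94,
            "bw decoder": 175, "bw encoder": 339}

megatron_nccl_pct = [49.8, 12.8, 3.4]  # allreduce kernels
libai_nccl_pct = {"send/recv": 46.4, "allreduce": 17.9, "allgather": 12.9}

megatron_total_ms, libai_total_ms = 357, 1000  # per-iteration totals quoted above

megatron_phase_sum = sum(megatron_ms.values())
libai_phase_sum = sum(libai_ms.values())
megatron_nccl = round(sum(megatron_nccl_pct), 1)
libai_nccl = round(sum(libai_nccl_pct.values()), 1)
slowdown = libai_total_ms / megatron_total_ms

print(f"Megatron: phases {megatron_phase_sum} ms of {megatron_total_ms} ms, NCCL {megatron_nccl}%")
print(f"LiBai:    phases {libai_phase_sum} ms of {libai_total_ms} ms, NCCL {libai_nccl}%")
print(f"LiBai iteration time is {slowdown:.1f}x Megatron's")
```

The send/recv share alone (46.4%) is larger than the gap between the two NCCL totals, which is consistent with the suspicion that the extra time comes from redundant send/recv rather than from allreduce.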

OneFlow 2n4d overview

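Single-node 4-GPU analysis

On a single node with four GPUs, oneflow is somewhat faster than Megatron: oneflow 518 ms vs Megatron 716 ms

Megatron also has large scheduling gaps between iterations, so it still has considerable room for optimization. oneflow's scheduling is close to ideal.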

Megatron 1n4d

Megatron 1n4d overview

OneFlow 1n4d

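Although OneFlow is faster than Megatron here, it is not optimal: there is still at least 12% redundant send/recv communication, mostly in the forward pass of the encoder (fw encoder).

OneFlow 1n4d overview

Single-node single-GPU comparison

OneFlow has a clear advantage over Megatron: OneFlow 88 ms vs Megatron 127 ms

OneFlow 1n1d overview

Megatron 1n1d overview

Conclusions

- On a single GPU, OneFlow is faster than Megatron
- With the 4-GPU shapes, the 2D SBP inference result is not that bad, and OneFlow is faster than Megatron
- With the 8-GPU shapes, the 2D SBP inference result is very poor: it adds several times the communication overhead, and OneFlow is about three times slower than Megatron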

xyn1201 commented 1 year ago

Throughput tests with SBP_INFER_RULE_TAG=2 and with auto-parallel

Machines: oneflow-25 and oneflow-28, 2 nodes with 8 GPUs in total
oneflow master https://github.com/Oneflow-Inc/oneflow/commit/f97f09f1d9a8668c972a12f66d77aaa19b164635
libai main https://github.com/Oneflow-Inc/libai/commit/e9ca4087cb35b3ad268534ee60456db689e36063
libai throughput, mt5_pretrain.py mb16_gb512 dp4_mp2_pp1 zero_stage=2
- export SBP_INFER_RULE_TAG=2: 11663 MiB / 61.44 samples/s, on par with the earlier numbers
- Auto-parallel: still running with no output yet. I'll look into this with Yipeng shortly

Yipeng1994 commented 1 year ago

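https://github.com/Oneflow-Inc/oneflow/pull/9288 allows auto-parallel to coexist with ZeRO at level 1, but the actual effect has not been tested. I ran it on 16 and it went OOM. Auto-parallel does not take memory into account yet, which is a little worrying. I did check, though, that the SBP of the large weights is basically (S0, S1) and never (B, B), so functionally it meets expectations: ZeRO is active and so is AutoParallel.

It also has a lot of ops, nearly 5000 not counting variables, so initializing the cost is slow and takes 20 minutes. I optimized it a bit and expect to cut that roughly in half. The search algorithm itself is fast and produces a result in half a minute.

While testing auto-parallel, I suggest first checking where the SBP in mt5's boxing deviates from expectations under semi-automatic inference, and then adding some to_global calls to control it.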

strint commented 1 year ago

Op graph log with nccl logical ops and SBP: https://oneflow-test.oss-cn-beijing.aliyuncs.com/mt5_test/2n4g_log/output.log

Search for Operator to find the start of the op graph.

xyn1201 commented 1 year ago

Auto-parallel 2n4g test

Machines: oneflow-25 and oneflow-28, 2 nodes with 8 GPUs in total
oneflow feat-auto_parallel-ZeRO branch https://github.com/Oneflow-Inc/oneflow/pull/9288/commits/54771bc917aa1b7509e758b7d5c1344ce00e7246 With this branch, compilation plus auto-parallel takes half an hour, which is indeed faster
libai main https://github.com/Oneflow-Inc/libai/commit/e9ca4087cb35b3ad268534ee60456db689e36063
To avoid OOM, I reduced batch_size and ran one comparison: mb4_gb128 dp4_mp2_pp1 zero_stage=2
- libai: 9915 MiB/60.16 samples/s
- megatron: 4285 MiB/103.9 samples/s

@Yipeng1994

leaves-zwx commented 1 year ago

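After looking at the job/plan, I found 3 problems:

Unexpected SBP change

This throws off a whole series of subsequent SBPs and leads to redundant nccl logical boxing (whether anything else also contributes to redundant nccl logical boxing still needs testing)

The cause lies at the op model.t5_model.encoder.layers.0.self_attention-reshape-29 (code location: here). It consumes the output of query_key_value (broadcast_matmul), which has shape=(N,S,H) and sbp=(S(0), S(2)). The reshape turns it into (N,S,n,h), and the SBP is expected to remain (S(0), S(2)).

In the 2n4d case, however, a System-NCCL-Logical-(*S)2(*S)-1867 op is inserted before this reshape, converting (S(0), S(2)) -> (S(0), S(1)). As a result, the following ops no longer infer their SBP as expected, which ultimately causes redundant nccl logical boxing. In the 1n4d case this spot is normal.

1n4d job snippet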

``` op { name: "model.t5_model.encoder.layers.0.self_attention-reshape-29" device_tag: "cuda" ctrl_in_op_name: "model.t5_model.decoder.layers.0.self_attention-where-922" scope_symbol_id: 728 stream_name_hint: "NCCL_COMPUTE_0" loc: "Python Stack[-2]: \'forward\' at \'/home/xuyongning/zero_test/t5_test/libai/projects/T5/models/transformer_layer.py\': line 177; Python Stack[-1]: \'forward\' at \'/home/xuyongning/zero_test/t5_test/libai/projects/T5/models/attention.py\': line 194; ... 9 more" user_conf { op_type_name: "reshape" input { key: "in" value { s: "model.t5_model.encoder.layers.0.self_attention.query_key_value-broadcast_matmul-28/out_0" } } output { key: "out" value { s: "model.t5_model.encoder.layers.0.self_attention-reshape-29/out_0" } } attr { key: "shape" value { at_shape { dim: 32 dim: 512 dim: 12 dim: 192 } } } input_order: "in" output_order: "out" } } op_name2nd_sbp_signature_conf { key: "model.t5_model.encoder.layers.0.self_attention-reshape-29" value { bn_in_op2nd_sbp { key: "in_0" value { sbp_parallel { split_parallel { axis: 0 } } sbp_parallel { split_parallel { axis: 2 } } } } bn_in_op2nd_sbp { key: "out_0" value { sbp_parallel { split_parallel { axis: 0 } } sbp_parallel { split_parallel { axis: 2 } } } } } } ```

2n4d job snippet

``` op { name: "model.t5_model.encoder.layers.0.self_attention-reshape-29" device_tag: "cuda" ctrl_in_op_name: "model.t5_model.decoder.layers.0.self_attention-reshape-903" scope_symbol_id: 728 stream_name_hint: "NCCL_COMPUTE_0" loc: "Python Stack[-2]: \'forward\' at \'/home/xuyongning/zero_test/t5_test/libai/projects/T5/models/transformer_layer.py\': line 177; Python Stack[-1]: \'forward\' at \'/home/xuyongning/zero_test/t5_test/libai/projects/T5/models/attention.py\': line 194; ... 9 more" user_conf { op_type_name: "reshape" input { key: "in" value { s: "System-NCCL-Logical-(*S)2(*S)-1867/out_0" } } output { key: "out" value { s: "model.t5_model.encoder.layers.0.self_attention-reshape-29/out_0" } } attr { key: "shape" value { at_shape { dim: 64 dim: 512 dim: 12 dim: 192 } } } input_order: "in" output_order: "out" } } op { name: "System-NCCL-Logical-(*S)2(*S)-1867" ctrl_in_op_name: "System-NCCL-Logical-(*S)2(*S)-1866" ctrl_in_op_name: "model.t5_model.decoder.layers.0.self_attention-reshape-903" scope_symbol_id: 17628 stream_name_hint: "NCCL_COMPUTE_0" user_conf { op_type_name: "_nccl_logical_2D_same_dim0_all2all" input { key: "in" value { s: "model.t5_model.encoder.layers.0.self_attention.query_key_value-broadcast_matmul-28/out_0" } } output { key: "out" value { s: "System-NCCL-Logical-(*S)2(*S)-1867/out_0" } } attr { key: "dst_reduced_nd_sbp" value { at_list_string { val: "S(0)" val: "S(1)" } } } attr { key: "src_reduced_nd_sbp" value { at_list_string { val: "S(0)" val: "S(2)" } } } input_order: "in" output_order: "out" } } op_name2nd_sbp_signature_conf { key: "model.t5_model.encoder.layers.0.self_attention-reshape-29" value { bn_in_op2nd_sbp { key: "in_0" value { sbp_parallel { split_parallel { axis: 0 } } sbp_parallel { split_parallel { axis: 1 } } } } bn_in_op2nd_sbp { key: "out_0" value { sbp_parallel { split_parallel { axis: 0 } } sbp_parallel { split_parallel { axis: 1 } } } } } } ```

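The only difference between 1n4d and 2n4d is the (global) batch size, so I suspect the cause is similar to https://github.com/Oneflow-Inc/OneTeam/issues/1721. But why does setting SBP_INFER_RULE_TAG=2 still fail to prevent this unwanted SBP conversion? That needs more debugging.

Inefficient AMP conversion

In the code here, the left operand of broadcast_add, position_bias, is computed by compute_bias and has dtype=float16. The right operand, attention_mask, is passed in from outside: it is originally dtype=bool and becomes int64 after several scalar computations. This is therefore float16 + int64, both operands are converted to float32 for the addition, and the result is converted back to float16 for the matmul that follows.

plan snippet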

``` order : 1667 , actor id : 2199025352758 name : model.t5_model.encoder.layers.0.self_attention-broadcast_add-70 thrd : 1048577 device_type : kCUDA stream_index : 1 { consume : in_ctrl : <- [ System-ZeRO-ParallelCast-model.t5_model.encoder.layers.2.mlp.wo.weight-repeat-248-242/out_ctrl_640 ] ( actor_id: 2199025356993, regst: regst_num: 1, cuda , ctrl ) consume : in : <- [ model.t5_model.encoder.layers.0.self_attention-cast-69/__out_0 ] ( actor_id: 2199025352757, regst: regst_num: 1, cuda , time_shape: (1,1,8), shape: (16,1,1,512) , dtype: kFloat ) consume : in : <- [ model.t5_model.encoder.layers.0.self_attention-expand_dims-64-out_0-cast_h2f/__out_0 ] ( actor_id: 2199025356805, regst: regst_num: 1, cuda , time_shape: (1,1,8), shape: (1,12,512,512) , dtype: kFloat ) produce : __z_0 regst: regst_num: 1, cuda , time_shape: (1,1,8), shape: (16,12,512,512) , dtype: kFloat { -> [ model.t5_model.encoder.layers.0.self_attention-broadcast_add-70-z_0-cast_f2h ] ( actor_id: 2199025356729 ) } produce : out_ctrl_4504 regst: regst_num: 1, cuda , ctrl { -> [ model.t5_model.decoder.layers.0.self_attention-transpose-938 ] ( actor_id: 2199025353448 ) } } order : 1463 , actor id : 2199025352757 name : model.t5_model.encoder.layers.0.self_attention-cast-69 thrd : 1048577 device_type : kCUDA stream_index : 1 { consume : in_ctrl : <- [ model.t5_model.encoder.layers.3.self_attention-scalar_mul-279/out_ctrl_632 ] ( actor_id: 2199025352924, regst: regst_num: 1, cuda , ctrl ) consume : in : <- [ model.t5_model.encoder.layers.0.self_attention-scalar_mul-68/__out_0 ] ( actor_id: 2199025352756, regst: regst_num: 1, cuda , time_shape: (1,1,8), shape: (16,1,1,512) , dtype: kInt64 ) produce : __out_0 regst: regst_num: 1, cuda , time_shape: (1,1,8), shape: (16,1,1,512) , dtype: kFloat { -> [ model.t5_model.encoder.layers.0.self_attention-broadcast_add-70 ] ( actor_id: 2199025352758 ) } produce : out_ctrl_3032 regst: regst_num: 1, cuda , ctrl { -> [ 
model.t5_model.encoder.layers.7.self_attention-scalar_mul-547 ] ( actor_id: 2199025353136 ) } } order : 1656 , actor id : 2199025356805 name : model.t5_model.encoder.layers.0.self_attention-expand_dims-64-out_0-cast_h2f thrd : 1048577 device_type : kCUDA stream_index : 1 { consume : in_ctrl : <- [ System-ZeRO-ParallelCast-model.t5_model.encoder.layers.2.mlp.wi_0.weight-repeat-233-241/out_ctrl_20337 ] ( actor_id: 2199025356992, regst: regst_num: 1, cuda , ctrl ) consume : in : <- [ model.t5_model.encoder.layers.0.self_attention-expand_dims-64/__out_0 ] ( actor_id: 2199025352752, regst: regst_num: 1, cuda , time_shape: (1,1,8), shape: (1,12,512,512) , dtype: kFloat16 ) produce : __out_0 regst: regst_num: 1, cuda , time_shape: (1,1,8), shape: (1,12,512,512) , dtype: kFloat { -> [ model.t5_model.encoder.layers.0.self_attention-broadcast_add-70 ] ( actor_id: 2199025352758 ) } produce : out_ctrl_4496 regst: regst_num: 1, cuda , ctrl { -> [ model.t5_model.decoder.layers.0.self_attention.relative_attention_bias-gather-937 ] ( actor_id: 2199025353447 ) } } ```

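Inefficient, redundant cast

The attention_mask above is originally a dtype=bool mask tensor that has to be passed into every transformer layer, where this is computed: (1 - attention_mask) * -1000

To perform these computations, the system first casts bool to int64, and that cast is repeated in every transformer layer.

plan snippet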

``` order : 1247 , actor id : 2199025352711 name : model.t5_model.encoder.layers.0-identity-17 thrd : 1048577 device_type : kCUDA stream_index : 1 { consume : in_ctrl : <- [ model.t5_model.decoder.layers.0-identity-888/out_ctrl_216 ] ( actor_id: 2199025353405, regst: regst_num: 1, cuda , ctrl ) consume : in : <- [ model.t5_model.extended_attn_mask-expand_dims-9/__out_0 ] ( actor_id: 2199025352706, regst: regst_num: 1, cuda , time_shape: (1,1,8), shape: (16,1,1,512) , dtype: kBool ) produce : __out_0 regst: regst_num: 1, cuda , time_shape: (1,1,8), shape: (16,1,1,512) , dtype: kBool { -> [ model.t5_model.encoder.layers.11.self_attention-cast-809 ] ( actor_id: 2199025353342 ) -> [ model.t5_model.encoder.layers.4.self_attention-cast-342 ] ( actor_id: 2199025352973 ) -> [ model.t5_model.encoder.layers.3.self_attention-cast-273 ] ( actor_id: 2199025352918 ) -> [ model.t5_model.encoder.layers.3.self_attention-cast-275 ] ( actor_id: 2199025352920 ) -> [ model.t5_model.encoder.layers.11.self_attention-cast-811 ] ( actor_id: 2199025353344 ) -> [ model.t5_model.encoder.layers.2.self_attention-cast-208 ] ( actor_id: 2199025352867 ) -> [ model.t5_model.encoder.layers.2.self_attention-cast-206 ] ( actor_id: 2199025352865 ) -> [ model.t5_model.encoder.layers.0.self_attention-cast-65 ] ( actor_id: 2199025352753 ) -> [ model.t5_model.encoder.layers.6.self_attention-cast-474 ] ( actor_id: 2199025353077 ) -> [ model.t5_model.encoder.layers.6.self_attention-cast-476 ] ( actor_id: 2199025353079 ) -> [ model.t5_model.encoder.layers.9.self_attention-cast-675 ] ( actor_id: 2199025353236 ) -> [ model.t5_model.encoder.layers.0.self_attention-cast-74 ] ( actor_id: 2199025352761 ) -> [ model.t5_model.encoder.layers.8.self_attention-cast-608 ] ( actor_id: 2199025353183 ) -> [ model.t5_model.encoder.layers.4.self_attention-cast-340 ] ( actor_id: 2199025352971 ) -> [ model.t5_model.encoder.layers.1.self_attention-cast-141 ] ( actor_id: 2199025352814 ) -> [ 
model.t5_model.encoder.layers.5.self_attention-cast-407 ] ( actor_id: 2199025353024 ) -> [ model.t5_model.encoder.layers.1.self_attention-cast-139 ] ( actor_id: 2199025352812 ) -> [ model.t5_model.encoder.layers.5.self_attention-cast-409 ] ( actor_id: 2199025353026 ) -> [ model.t5_model.encoder.layers.0.self_attention-cast-72 ] ( actor_id: 2199025352759 ) -> [ model.t5_model.encoder.layers.7.self_attention-cast-541 ] ( actor_id: 2199025353130 ) -> [ model.t5_model.encoder.layers.7.self_attention-cast-543 ] ( actor_id: 2199025353132 ) -> [ model.t5_model.encoder.layers.8.self_attention-cast-610 ] ( actor_id: 2199025353185 ) -> [ model.t5_model.encoder.layers.9.self_attention-cast-677 ] ( actor_id: 2199025353238 ) -> [ model.t5_model.encoder.layers.10.self_attention-cast-742 ] ( actor_id: 2199025353289 ) -> [ model.t5_model.encoder.layers.10.self_attention-cast-744 ] ( actor_id: 2199025353291 ) } produce : out_ctrl_208 regst: regst_num: 1, cuda , ctrl { -> [ model.t5_model.encoder.layers.0-identity-16 ] ( actor_id: 2199025352710 ) } } ```

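The (1 - attention_mask) * -1000 computation is also repeated in every layer, which should be unnecessary as well. The more sophisticated approach is to eliminate the repeated computation with compiler techniques, but there is probably no such pass at the moment. If the benchmark needs to look good, the code can be changed so that the conversion and computation of attention_mask happen outside the layers, for example by manually casting to float16 and then doing the computation above. The downside is that the code may no longer line up with PyTorch (although I see quite a few manual to_global calls already inserted, so it probably was not that aligned to begin with).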
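
A minimal pure-Python sketch of the suggested rewrite follows. It uses lists instead of tensors, and the function names are invented for illustration; it only shows that converting the mask once outside the layers gives the same result with one cast instead of one per layer.

```python
NUM_LAYERS = 12  # nl12 in the test case names; the exact value is not important


def to_additive_mask(mask, counter):
    """Compute (1 - mask) * -1000 and record that a cast was needed."""
    counter["casts"] += 1  # stands in for the bool -> numeric cast op
    return [(1.0 - float(m)) * -1000.0 for m in mask]


def forward_per_layer(scores, mask):
    """Pattern described above: every layer redoes the cast and the arithmetic."""
    counter = {"casts": 0}
    outputs = []
    for _ in range(NUM_LAYERS):
        additive = to_additive_mask(mask, counter)
        outputs.append([s + a for s, a in zip(scores, additive)])
    return outputs, counter["casts"]


def forward_hoisted(scores, mask):
    """Suggested pattern: convert the mask once, outside the layers."""
    counter = {"casts": 0}
    additive = to_additive_mask(mask, counter)
    outputs = [[s + a for s, a in zip(scores, additive)] for _ in range(NUM_LAYERS)]
    return outputs, counter["casts"]


scores = [0.5, 1.5, -2.0]
mask = [True, True, False]
per_layer_out, per_layer_casts = forward_per_layer(scores, mask)
hoisted_out, hoisted_casts = forward_hoisted(scores, mask)
print(per_layer_casts, hoisted_casts, per_layer_out == hoisted_out)
```

In the real model the hoisted tensor would also be created in the target dtype, which is what removes the float16 + int64 promotion described earlier.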

Yipeng1994 commented 1 year ago

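> In the 2n4d case, however, a System-NCCL-Logical-(*S)2(*S)-1867 op is inserted before this reshape, converting (S(0), S(2)) -> (S(0), S(1)). As a result, the following ops no longer infer their SBP as expected, which ultimately causes redundant nccl logical boxing. In the 1n4d case this spot is normal.

If reshape has only one input, it matches under any SBP rule, so the SBP should not be able to change, should it? Was a conversion forced upstream? Or is reshape perhaps still not the source of the problem?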

leaves-zwx commented 1 year ago

The op before model.t5_model.encoder.layers.0.self_attention-reshape-29 is model.t5_model.encoder.layers.0.self_attention.query_key_value-broadcast_matmul-28, and its sbp signature is identical under 1n4d and 2n4d.

Yipeng1994 commented 1 year ago

Hmm, let me take a closer look at why a different SBP was inferred.

leaves-zwx commented 1 year ago

The problem is not in the GreedilyFindMinCopyCostNdSbp function but in GetValidNdSbpSignatureList.

Yipeng1994 commented 1 year ago

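> The problem is not in the GreedilyFindMinCopyCostNdSbp function but in GetValidNdSbpSignatureList.

Oh, you are looking at this too. I fixed it today and have asked @xyn1201 to test the commit below. I'll explain everything once the results are in. https://github.com/Oneflow-Inc/oneflow/pull/9288/commits/caf344f94d3fcf66bbe914f2ef93f4c03b0086b2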

leaves-zwx commented 1 year ago

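From my debugging that does not look like the cause. Rather, the GetSbp function of reshape itself has a problem. I'll keep debugging to find out exactly what it is.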

Yipeng1994 commented 1 year ago

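> From my debugging that does not look like the cause. Rather, the GetSbp function of reshape itself has a problem. I'll keep debugging to find out exactly what it is.

Ah, the test just showed that the fix failed. I'll debug some more as well.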

leaves-zwx commented 1 year ago

Found the reason why reshape's sbp signature list is normal under 1n4d but abnormal under 2n4d:

debug log snippet

```
E20221023 15:37:22.099296 665754 reshape_user_op_util.cpp:179] [GetReshapeUserOpSbpSignatures] model.t5_model.encoder.layers.0.self_attention-reshape-29: (32,512,2304) -> (32,512,12,192), parallel_num=4
0 (origin=0) -> 0 (origin=0)
1 (origin=1) -> 1 (origin=1)
2 (origin=2) -> 2 (origin=2)
E20221023 15:37:22.099376 665754 operator.cpp:519] [GetNdSbpSignatureList] model.t5_model.encoder.layers.0.self_attention-reshape-29, sbp_sig size=5, sbp_sig_list=
(in_0) -> (out_0): [
    (S(0)) -> (S(0)),
    (S(1)) -> (S(1)),
    (S(2)) -> (S(2)),
    (P) -> (P),
    (B) -> (B),
]
E20221023 15:45:48.437899 680749 reshape_user_op_util.cpp:179] [GetReshapeUserOpSbpSignatures] model.t5_model.encoder.layers.0.self_attention-reshape-29: (64,512,2304) -> (64,512,12,192), parallel_num=8
0 (origin=0) -> 0 (origin=0)
1 (origin=1) -> 1 (origin=1)
E20221023 15:45:48.437943 680749 operator.cpp:519] [GetNdSbpSignatureList] model.t5_model.encoder.layers.0.self_attention-reshape-29, sbp_sig size=4, sbp_sig_list=
(in_0) -> (out_0): [
    (S(0)) -> (S(0)),
    (S(1)) -> (S(1)),
    (P) -> (P),
    (B) -> (B),
]
```

The code at https://github.com/Oneflow-Inc/oneflow/blob/22eabed6a2432085cd4aa7bf7bf98464d30e9cba/oneflow/user/ops/reshape_user_op_util.cpp#L131-L132 decides whether the current dimension can be split by checking % parallel_num.

Under 1n4d, reshape (32,512,2304) to (32,512,12,192), parallel_num=4: dim(2) == 12 is considered splittable
Under 2n4d, reshape (64,512,2304) to (64,512,12,192), parallel_num=8: dim(2) == 12 is considered not splittable

So under 2n4d the debug output shows that reshape's sbp signature list has no S(2) -> S(2) entry. The dimension is in fact splittable: with 4dp + 2mp, S(2) is split either 4 ways or 2 ways (depending on whether S(2) is the 1st or the 2nd dimension of nd_sbp), and both 12 % 4 == 0 and 12 % 2 == 0 hold.

That is how the situation described in https://github.com/Oneflow-Inc/libai/issues/406#issuecomment-1287831082 arises. The correct approach is to decide splittability against a particular dimension of the device mesh, not against parallel_num alone.

There is a difficulty for now, though: at GetSbp time we do not know which device mesh dimension the inferred sbp signature will be applied to. The only option is to add every split(num_axes) here and filter later, in FilterNdSbpSignatureListByLogicalShape or somewhere else.
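
The difference between the two checks can be sketched in a few lines of Python. The function names are invented for illustration, and the 1n4d hierarchy (2, 2) is an assumption derived from the dp2_mp2 configuration; (4, 2) for 2n4d matches the hierarchy printed in the logs of this thread.

```python
from math import prod


def splittable_by_parallel_num(dim_size, hierarchy):
    """The check described above: divide by the total device count."""
    return dim_size % prod(hierarchy) == 0


def splittable_on_some_mesh_dim(dim_size, hierarchy):
    """Alternative: keep the axis if at least one mesh dimension divides it."""
    return any(dim_size % n == 0 for n in hierarchy if n > 1)


num_heads = 12  # dim(2) of the reshape output (.., .., 12, 192)
results = {}
for name, hierarchy in {"1n4d": (2, 2), "2n4d": (4, 2)}.items():
    results[name] = (
        splittable_by_parallel_num(num_heads, hierarchy),
        splittable_on_some_mesh_dim(num_heads, hierarchy),
    )
    print(name, hierarchy, results[name])
```

With the total-device check, 12 % 8 != 0 drops S(2) on 8 GPUs even though both mesh dimensions (4 and 2) divide 12. The second function is deliberately permissive, in line with the idea of adding all splits first and filtering by shape later.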

Yipeng1994 commented 1 year ago

Yes, the cause is pretty much what Wenxiao described. It shows up in the printed log:

op: `model.t5_model.encoder.layers.0.self_attention-reshape-29` can't find available sbp signature.
candidate nd sbp signature are: (in_0) -> (out_0): [
    ((S(0), S(0))) -> ((S(0), S(0))),
    ((S(0), S(1))) -> ((S(0), S(1))),
    ((S(0), P)) -> ((S(0), P)),
    ((S(0), B)) -> ((S(0), B)),
    ((S(1), S(0))) -> ((S(1), S(0))),
    ((S(1), S(1))) -> ((S(1), S(1))),
    ((S(1), P)) -> ((S(1), P)),
    ((S(1), B)) -> ((S(1), B)),
    ((P, S(0))) -> ((P, S(0))),
    ((P, S(1))) -> ((P, S(1))),
    ((P, P)) -> ((P, P)),
    ((P, B)) -> ((P, B)),
    ((B, S(0))) -> ((B, S(0))),
    ((B, S(1))) -> ((B, S(1))),
    ((B, P)) -> ((B, P)),
    ((B, B)) -> ((B, B)),
], but inputs sbp are: in_0: (S(0), S(2));
select idx: 1

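The candidate strategies contain no S(2), because 12 is not divisible by 8. I refactored this part of reshape's SBP code earlier. parallel num is used because reshape's GetSbp function only infers 1D SBP, and 1D SBP inference normally does not involve shape at all, for example for matmul or add/sub ops. A Filter that runs later is what selects SBPs according to shape. But reshape itself is tightly coupled to shape, which is why this filter has to exist here.

2D SBP is obtained as the Cartesian product of the 1D SBPs. The 1D SBP step filtered out S(2), so (S0, S2) can never be selected afterwards. How can this be fixed? Adding every split does not work: reshape's splits have to be partitioned into corresponding groups, and a split is allowed only when the head of the group is divisible. For example, for (32,512,2304,100) to (32,512,12,192,100), the group heads correspond as 32 -> 32, 512 -> 512, 2304 -> 12, 100 -> 100, that is S0 -> S0, S1 -> S1, S2 -> S2, S3 -> S4. The output SBP cannot contain S3, because output dimension 3 is not a group head.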
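
The group-head rule can be sketched as follows. This is an illustrative reimplementation, not the code in reshape_user_op_util.cpp; it assumes a valid reshape with no size-1 dimensions, and the function names are invented.

```python
def group_heads(in_shape, out_shape):
    """Map each input group-head axis to the matching output group-head axis."""
    heads = {}
    i = j = 0
    while i < len(in_shape) and j < len(out_shape):
        head_in, head_out = i, j
        prod_in, prod_out = in_shape[i], out_shape[j]
        # Grow whichever side is smaller until both groups hold the same
        # number of elements; the first axis of each group is its head.
        while prod_in != prod_out:
            if prod_in < prod_out:
                i += 1
                prod_in *= in_shape[i]
            else:
                j += 1
                prod_out *= out_shape[j]
        heads[head_in] = head_out
        i += 1
        j += 1
    return heads


def valid_splits(in_shape, out_shape, num_parts):
    """(input axis, output axis) pairs whose group heads are divisible by num_parts."""
    return [
        (a, b)
        for a, b in group_heads(in_shape, out_shape).items()
        if in_shape[a] % num_parts == 0 and out_shape[b] % num_parts == 0
    ]


example = group_heads((32, 512, 2304, 100), (32, 512, 12, 192, 100))
print(example)
print(valid_splits((64, 512, 2304), (64, 512, 12, 192), 8))
print(valid_splits((64, 512, 2304), (64, 512, 12, 192), 2))
```

For the 2n4d reshape, dividing by 8 keeps only the S(0) and S(1) pairs, matching the debug log, while dividing by a single mesh dimension such as 2 also keeps S(2) -> S(2).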

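Yesterday I attempted a fix, https://github.com/Oneflow-Inc/oneflow/pull/9288/commits/caf344f94d3fcf66bbe914f2ef93f4c03b0086b2: when selecting 1D SBPs, the hierarchy keeps only its smallest value greater than 1. For example [2, 4] -> [2, 1], [4, 2, 2] -> [1, 2, 1], and [16, 4, 8] -> [1, 4, 1]. That way GetSbp can produce a complete 1D SBP candidate list whenever a dimension is divisible by this smallest value. Why use the smallest value rather than simply 1? Because some ops might return just B for a hierarchy whose parallel num is 1.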
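
The hierarchy reduction can be sketched as below. The function name is invented, and keeping the first occurrence of the smallest value is an assumption read off the [4, 2, 2] -> [1, 2, 1] example rather than taken from the commit.

```python
def reduce_hierarchy(hierarchy):
    """Keep the smallest entry greater than 1 and replace every other entry with 1."""
    candidates = [n for n in hierarchy if n > 1]
    if not candidates:
        return list(hierarchy)  # nothing to keep, e.g. [1, 1]
    smallest = min(candidates)
    keep = hierarchy.index(smallest)  # assumption: first occurrence wins
    return [smallest if idx == keep else 1 for idx in range(len(hierarchy))]


for h in ([2, 4], [4, 2, 2], [16, 4, 8]):
    print(h, "->", reduce_hierarchy(h))
```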

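However, the test results show the bug was not fixed as hoped. @xyn1201 ran the tests: auto-parallel off 5331 MiB / 41.46 samples/s, auto-parallel on 9915 MiB / 59.27 samples/s

Clicking through the throughput links shows the logs: S2 still does not appear. The reason is unknown, but a small fix today should take care of it. In short, the root of the problem has been found and it should be fairly easy to fix.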

xyn1201 commented 1 year ago

debug_reshape_sbp_signature branch

https://github.com/Oneflow-Inc/oneflow/commit/4b04b25f521ab2d7727235347c057e3aa584350b
2n4g mb16_gb512
- libai: 11525 MiB/116.95 samples/s
- megatron: 4783 MiB/164.3 samples/s

refactor-GetSbpSignature branch

https://github.com/Oneflow-Inc/oneflow/pull/9304/commits/195b0ea149c77374737751356b97f6bf2da240ff
2n4g mb4_gb128
- auto-parallel off: 5283 MiB / 71.42 samples/s
- auto-parallel on: 7281 MiB / 90.50 samples/s
- megatron: 4285 MiB / 103.9 samples/s
2n4g mb16_gb512
- auto-parallel off: 11505 MiB / 118.37 samples/s
- megatron: 4783 MiB / 164.3 samples/s

Throughput on both branches nearly doubled, but it is still below Megatron
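
For reference, the ratios behind "nearly doubled" can be computed from the 2n4g mb16_gb512 numbers in this thread. The baseline of 63.75 samples/s is the earlier libai dp4_mp2_pp1 result; choosing it as the baseline is an assumption of this sketch.

```python
before = 63.75  # samples/s, earlier libai 2n4g mb16_gb512 dp4_mp2_pp1 run
after = {
    "debug_reshape_sbp_signature": 116.95,
    "refactor-GetSbpSignature": 118.37,
}
megatron = 164.3  # samples/s, Megatron figure quoted alongside these runs

speedup = {branch: tput / before for branch, tput in after.items()}
vs_megatron = {branch: tput / megatron for branch, tput in after.items()}

for branch in after:
    print(f"{branch}: {speedup[branch]:.2f}x the earlier run, "
          f"{vs_megatron[branch]:.0%} of Megatron")
```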

Yipeng1994 commented 1 year ago

Oh, please also test refactor-GetSbpSignature with 2n4g mb16_gb512 to see whether memory blows up, and output the boxing log as well @xyn1201

Yipeng1994 commented 1 year ago

refactor-GetSbpSignature branch

Producer (S(0), S(2)), placement: hierarchy: (4,2), device: cuda
Shape: (16,512,2304)
idx: 0, sbp: (S(0), S(0)), placement: hierarchy: (4,2), device: cuda
idx: 1, sbp: (S(0), S(2)), placement: hierarchy: (4,2), device: cuda
op: `model.t5_model.encoder.layers.0.self_attention-reshape-29` can't find available sbp signature.
candidate nd sbp signature are: (in_0) -> (out_0): [
    ((S(0), S(0))) -> ((S(0), S(0))),
    ((S(0), S(2))) -> ((S(0), S(2))),
    ((S(0), S(1))) -> ((S(0), S(1))),
    ((S(0), P)) -> ((S(0), P)),
    ((S(0), B)) -> ((S(0), B)),
    ((S(2), S(0))) -> ((S(2), S(0))),
    ((S(2), S(2))) -> ((S(2), S(2))),
    ((S(2), S(1))) -> ((S(2), S(1))),
    ((S(2), P)) -> ((S(2), P)),
    ((S(2), B)) -> ((S(2), B)),
    ((S(1), S(0))) -> ((S(1), S(0))),
    ((S(1), S(2))) -> ((S(1), S(2))),
    ((S(1), S(1))) -> ((S(1), S(1))),
    ((S(1), P)) -> ((S(1), P)),
    ((S(1), B)) -> ((S(1), B)),
    ((P, S(0))) -> ((P, S(0))),
    ((P, S(2))) -> ((P, S(2))),
    ((P, S(1))) -> ((P, S(1))),
    ((P, P)) -> ((P, P)),
    ((P, B)) -> ((P, B)),
    ((B, S(0))) -> ((B, S(0))),
    ((B, S(2))) -> ((B, S(2))),
    ((B, S(1))) -> ((B, S(1))),
    ((B, P)) -> ((B, P)),
    ((B, B)) -> ((B, B)),
], but inputs sbp are: in_0: (S(0), S(2));
select idx: 1

S2 is back, but throughput is only about 70% of Megatron's, so other problems still need to be tracked down.
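
Since the 2D candidates are the Cartesian product of the 1D candidates, the sizes of the two candidate lists printed in this thread (16 without S(2), 25 with it) can be reproduced with a short sketch. The SBP labels are plain strings here, not OneFlow objects.

```python
from itertools import product


def nd_candidates(one_d):
    """All 2D signatures that can be formed from a list of 1D SBP labels."""
    return list(product(one_d, repeat=2))


without_s2 = nd_candidates(["S(0)", "S(1)", "P", "B"])
with_s2 = nd_candidates(["S(0)", "S(2)", "S(1)", "P", "B"])

print(len(without_s2), ("S(0)", "S(2)") in without_s2)
print(len(with_s2), ("S(0)", "S(2)") in with_s2)
```

Once S(2) is missing from the 1D list, no 2D signature can contain it, so the producer's (S(0), S(2)) has no match and a boxing op has to be inserted.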

xyn1201 commented 1 year ago

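My mistake: the Megatron numbers tested above were taken with zero turned off. I re-ran Megatron with zero on and collected the current comparison results below

Test with zero on

oneflow debug_reshape_sbp_signature branch https://github.com/Oneflow-Inc/oneflow/commit/4b04b25f521ab2d7727235347c057e3aa584350b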
export SBP_INFER_RULE_TAG=2
libai main https://github.com/Oneflow-Inc/libai/commit/e9ca4087cb35b3ad268534ee60456db689e36063
2n4g mb16_gb512
- libai: 11525 MiB/116.95 samples/s nsys profiler_nsys log: oss://oneflow-test/mt5_test/debug_reshape_sbp_signature/4b04b25/log_path/log/
- megatron: 3653 MiB/124.6 samples/s nsys

Yipeng1994 commented 1 year ago

OK, that is much easier to accept

xyn1201 commented 1 year ago

Without activation checkpointing

oneflow debug_reshape_sbp_signature branch https://github.com/Oneflow-Inc/oneflow/commit/4b04b25f521ab2d7727235347c057e3aa584350b
2n4g mb16_gb512
- libai: 11525 MiB/116.95 samples/s 25_nsys 28_nsys profiler_nsys log: oss://oneflow-test/mt5_test/debug_reshape_sbp_signature/4b04b25/log_path/log/
- megatron: 7957 MiB/153.9 samples/s 25_nsys 28_nsys

leaves-zwx commented 1 year ago

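If the ZeRO communication for parameters is done in place (right before each parameter is consumed), there is a problem: the GPUs do not run at exactly the same pace (especially across machines), so they frequently end up waiting on one another, and the symptom is that the send-recv bars on the timeline are longer than expected. send-recv requires synchronization, and when it alternates with computation, the two sides wait on each other.

Megatron instead gathers the ZeRO communication for all parameters and does it together, which reduces the mutual waiting. Note that parameters on different GPUs must be exchanged in the same order.

This is my guess at one reason why send-recv is so long. Even after switching to allgather, the problem would probably remain.

For example, the very long send-recv in the figure below is very short on the other machine: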

leaves-zwx commented 1 year ago

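I suggest running the following tests next:

- A comparison with zero turned off on both sides: to find the performance gaps that remain once zero is excluded, such as unfused ops and repeated casts
- A test of 4dp + 2mp (+ zero) on a 1n8d machine: GPUs in the same machine usually run at a similar pace (this is what we have observed), which keeps the wait time of the nccl kernels to a minimum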

xyn1201 commented 1 year ago

Test with zero off and checkpointing off

oneflow debug_reshape_sbp_signature branch https://github.com/Oneflow-Inc/oneflow/commit/4b04b25f521ab2d7727235347c057e3aa584350b
2n4g mb4_gb128 (batch_size reduced from before)
- libai: 6854 MiB/110.46 samples/s 25_nsys 28_nsys log: oss://oneflow-test/mt5_test/debug_reshape_sbp_signature/4b04b25/again_log/25/log/
- megatron: 5001 MiB/129.9 samples/s 25_nsys 28_nsys

chengtbf commented 1 year ago

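Please run one more round for oneflow:

with the results for this branch

Regarding "2n4g mb4_gb128 (batch_size reduced from before)": was bsz reduced because oneflow goes OOM when checkpointing is off?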

xyn1201 commented 1 year ago

> Please run one more round for oneflow:

https://github.com/Oneflow-Inc/oneflow/tree/refactor-GetSbpSignature

Oneflow-Inc/oneflow@f7d29d1

zero off, checkpointing off

2n4g mb4_gb128
- libai: 6858 MiB/109.42 samples/s 25_nsys 28_nsys
- megatron: 5001 MiB/129.9 samples/s 25_nsys 28_nsys

chengtbf commented 1 year ago

Please test once more with this branch. Turn off ZeRO and turn off checkpointing.

leaves-zwx commented 1 year ago

encoder layer profile

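This compares the encoder layer performance of libai and Megatron. Because libai t5 is actually mt5, there are differences at the op level, so the comparison is not a benchmark. It is only meant as a reference for future optimization.

libai t5 encoder layer nsys snapshot

megatron t5 encoder layer nsys snapshot

I dumped all kernels in the nsys profiles above into a spreadsheet, encoder_layer_pofile.xlsx, where sheet1 is Megatron and sheet2 is libai.

Below are some examples of the differences that can be observed from that spreadsheet:

LayerNorm differences

libai t5 uses RMSLayerNorm, which is currently composed of small ops and takes 69.32 μs. It also contains inefficient conversions such as cast_f2h/cast_h2f (because of the two-stage reduce).

libai t5 layer norm stat snapshot

Megatron uses LayerNorm, which takes 36.83 μs.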

megatron t5 layer norm stat snapshot

self_attention query_key_value

libai t5 self_attention query_key_value takes 59.2 μs.

libai t5 self_attention query_key_value stat snapshot

megatron t5 self_attention query_key_value takes 59.59 μs.

megatron t5 self_attention query_key_value stat snapshot

The matmul takes the same time on both sides, as expected. libai t5 has no bias add, and Megatron uses a cutlass kernel.

softmax

libai t5 self_attention softmax stat snapshot

megatron t5 self_attention softmax stat snapshot

The mask algorithms differ between the two.

dropout

libai t5 self_attention softmax stat snapshot

megatron t5 self_attention softmax stat snapshot

view operations

libai t5 view stat snapshot

megatron t5 view stat snapshot

Because oneflow does not support lazy view yet, transpose and slice both incur a copy, whereas Megatron has a single copy, caused by its one contiguous call.

gelu

libai t5 gelu_tanh stat snapshot

megatron t5 gelu stat snapshot

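gelu_tanh in libai is composed of small ops, so it is fragmented and carries more overhead. gelu in Megatron is a custom kernel.

libai self_attention still contains an unknown SendRecv

libai t5 send recv stat snapshot

It is not clear what this is for. The job/plan needs a closer look.

AllReduce

This comparison is not meaningful: the allreduce on one rank has to wait for the other ranks to run it together, so a long bar on the timeline may only mean a long wait (because the ranks run at different paces).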

xyn1201 commented 1 year ago

debug_reshape_sbp_signature branch with zero off and checkpointing off: summary of test results

https://github.com/Oneflow-Inc/oneflow/commit/4b04b25f521ab2d7727235347c057e3aa584350b
1n1g mb4_gb32
- libai: 11265 MiB/44.91 samples/s
- megatron: 8361 MiB/45.6 samples/s
1n4g mb4_gb64
- libai: 6849 MiB/58.48 samples/s
- megatron: 4989 MiB/78.1 samples/s
2n4g mb4_gb128
- libai: 6854 MiB/110.46 samples/s 25_nsys 28_nsys log: oss://oneflow-test/mt5_test/debug_reshape_sbp_signature/4b04b25/again_log/25/log/
- megatron: 5001 MiB/129.9 samples/s 25_nsys 28_nsys

leaves-zwx commented 1 year ago

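The unknown SendRecv in SelfAttention mentioned above is necessary. It comes from the code here. Megatron does not have it because the algorithm differs: the t5 in Megatron has no position_bias.

At position_bias here, position_bias (S(0), B) has to be combined with attention_scores (S(0), S(1)), which requires a (S(0), B) -> (S(0), S(1)) conversion. 2D SBP currently implements it with SendRecv, but it could be implemented with SameDim0AllScatter (no communication overhead).

The (S(0), B) -> (S(0), S(1)) conversion does not need to happen in every layer, however. position_bias is computed in layer 0 by compute_bias and all later layers reuse layer 0's position_bias, so the conversion only needs to be done once. Before being added to attention_scores, position_bias must first be added to attention_mask (S(0), B) (see here), after which the SBP of position_bias is (S(0), B) as well.

We only need to move the line position_bias = position_bias.to_global(placement=attention_scores.placement) into the preceding if scope, after position_bias = position_bias + (1 - attention_mask) * -1000, so that the (S(0), B) -> (S(0), S(1)) conversion is done only once.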
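
The effect of moving the conversion can be illustrated with a toy model that only counts SBP conversions. `FakeTensor` and `to_sbp` are stand-ins invented for this sketch; they are not OneFlow APIs and do not move any data.

```python
class FakeTensor:
    """Stand-in for a global tensor that only tracks its SBP label."""

    conversions = 0  # number of SBP changes performed so far

    def __init__(self, sbp):
        self.sbp = sbp

    def to_sbp(self, sbp):
        """Hypothetical helper playing the role of to_global in the real code."""
        if sbp != self.sbp:
            FakeTensor.conversions += 1
        return FakeTensor(sbp)


def run_layers(num_layers, convert_once):
    FakeTensor.conversions = 0
    position_bias = None
    for _ in range(num_layers):
        if position_bias is None:
            # Layer 0: compute_bias plus the mask add leaves the bias as (S(0), B).
            position_bias = FakeTensor("(S(0), B)")
            if convert_once:
                # Proposed change: convert inside the if scope, right after the mask add.
                position_bias = position_bias.to_sbp("(S(0), S(1))")
        # Every layer needs (S(0), S(1)) before adding to attention_scores.
        position_bias.to_sbp("(S(0), S(1))")
    return FakeTensor.conversions


print(run_layers(12, convert_once=False), run_layers(12, convert_once=True))
```

With the conversion kept outside the if scope, every layer converts the shared layer-0 bias again; with it moved inside, later layers already receive the bias in the SBP they need.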

Yipeng1994 commented 1 year ago

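> At position_bias here, position_bias (S(0), B) has to be combined with attention_scores (S(0), S(1)), which requires a (S(0), B) -> (S(0), S(1)) conversion. 2D SBP currently implements it with SendRecv, but it could be implemented with SameDim0AllScatter (no communication overhead).

The generalized basic transfer has no communication overhead either, does it? It copies the data directly from local memory.

Yipeng1994 commented 1 year ago

> The (S(0), B) -> (S(0), S(1)) conversion does not need to happen in every layer, however. position_bias is computed in layer 0 by compute_bias and all later layers reuse layer 0's position_bias, so the conversion only needs to be done once. Before being added to attention_scores, position_bias must first be added to attention_mask (S(0), B) (see here), after which the SBP of position_bias is (S(0), B) as well.
>
> We only need to move the line position_bias = position_bias.to_global(placement=attention_scores.placement) into the preceding if scope, after position_bias = position_bias + (1 - attention_mask) * -1000, so that the (S(0), B) -> (S(0), S(1)) conversion is done only once.

Auto-parallel handles this kind of scenario perfectly, because it knows what comes later and knows that communicating earlier would multiply the communication cost. We ran into such cases with auto-parallel on different models before, because we had Acc enabled, and it handled them perfectly.

So once zero is turned off, you can simply turn on auto-parallel, without even touching the original code.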

xyn1201 commented 1 year ago

> Please test once more with this branch. Turn off ZeRO and turn off checkpointing.

https://github.com/Oneflow-Inc/oneflow/tree/dev_cc_mt5_bench

Oneflow-Inc/oneflow@251a2ed

2n4g mb4_gb128
- libai: 6062 MiB/100.43 samples/s 25_nsys 28_nsys
- megatron: 5001 MiB/129.9 samples/s 25_nsys 28_nsys

leaves-zwx commented 1 year ago

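> So once zero is turned off, you can simply turn on auto-parallel, without even touching the original code.

Can auto-parallel override the to_global calls written by the user?

chengtbf commented 1 year ago

> So once zero is turned off, you can simply turn on auto-parallel, without even touching the original code.
>
> Can auto-parallel override the to_global calls written by the user?

I don't remember: does auto-parallel remove parallel cast? As I recall, one of the levels does @Yipeng1994 @wyg1997

Yipeng1994 commented 1 year ago

> I don't remember: does auto-parallel remove parallel cast? As I recall, one of the levels does @Yipeng1994 @wyg1997

Level 2 does, and it is selectable. At level 1, auto-parallel only handles the parts the user has not configured. Level 2 ignores the user's configuration.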

xyn1201 commented 1 year ago

release/mt5_opt branch test

https://github.com/Oneflow-Inc/oneflow/pull/9318/commits/5b3a585384146cd72c681fec0b280445d0d65dfe
libai main https://github.com/Oneflow-Inc/libai/commit/e9ca4087cb35b3ad268534ee60456db689e36063
2n4g, zero on, checkpointing off, dp2_mp4_pp1 (the same configuration as IDEA), mb8_gb128
- libai: 5097 MiB/46.59 samples/s 25_nsys 28_nsys
- megatron: 3889 MiB/61.5 samples/s 25_nsys 28_nsys
