Use fuse multi head att

Oneflow-Inc / libai

LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training

Apache License 2.0

391 stars, 55 forks

| 1n1g | use_fuse_multi_head_att = False | use_fuse_multi_head_att = True |
| --- | --- | --- |
| Throughput | total_throughput: 151.70 samples/s | total_throughput: 155.41 samples/s |
| GPU Memory | 3147MiB | 3129MiB |
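
To put the two columns in relative terms, the differences can be worked out directly from the reported numbers. This is only a small arithmetic sketch; the variable names are illustrative and not part of LiBai.

```python
# Numbers copied from the 1n1g benchmark above.
throughput_unfused = 151.70  # samples/s, use_fuse_multi_head_att = False
throughput_fused = 155.41    # samples/s, use_fuse_multi_head_att = True
memory_unfused_mib = 3147
memory_fused_mib = 3129

# Relative throughput gain of the fused path, in percent.
throughput_gain_pct = (throughput_fused / throughput_unfused - 1) * 100

# Absolute and relative GPU memory saving of the fused path.
memory_saved_mib = memory_unfused_mib - memory_fused_mib
memory_saved_pct = memory_saved_mib / memory_unfused_mib * 100

print(f"throughput gain: {throughput_gain_pct:.1f}%")
print(f"memory saved: {memory_saved_mib} MiB ({memory_saved_pct:.2f}%)")
```

On this single-node, single-GPU run the fused path is roughly 2.4% faster and uses 18 MiB less GPU memory.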

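This PR hacks the model a little: both self_att and cross_att now use fuse_multi_head_att, and the attention layer defaults to fuse_multi_head_att. Only three extra transposes are required: one on the output of encode_embedding, one on the output of decoder_embedding, and one on the logits passed to the loss.
If data processing produces inputs directly in the shape [seq_len, batch_size], these three transposes can be dropped.
The unit test in this PR confirms that the modified model still matches the Hugging Face implementation: tests/model_utils/test_mt5_loader_2.py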
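
The layout change behind those three transposes can be illustrated with plain Python lists. This is a minimal sketch, not LiBai code: `swap_batch_and_seq` is a hypothetical helper, and in the actual model the same axis swap would be a tensor transpose on the embedding outputs and the logits.

```python
from typing import List, TypeVar

T = TypeVar("T")


def swap_batch_and_seq(x: List[List[T]]) -> List[List[T]]:
    """Swap the first two axes of a rectangular nested list.

    Any deeper axes (for example a hidden dimension) are left untouched.
    """
    return [list(column) for column in zip(*x)]


# Hypothetical token ids with shape [batch_size=2, seq_len=3].
batch_first = [
    [11, 12, 13],
    [21, 22, 23],
]

# Sequence-first layout, shape [seq_len=3, batch_size=2].
seq_first = swap_batch_and_seq(batch_first)

# Swapping again restores the batch-first layout, which is what the
# transpose on the logits does before they reach the loss.
restored = swap_batch_and_seq(seq_first)

print(seq_first)
print(restored == batch_first)
```

If the data pipeline already emitted the `seq_first` layout, the first two swaps would be unnecessary, which is the point made above about dropping the transposes.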

@chengtbf @CPFLAME @strint @ouyangyu
