NVIDIA / FasterTransformer

Transformer-related optimization, including BERT, GPT
Apache License 2.0

Does Fused Multi-head Attention support self-defined attention-mask? #714

Open zhanghaoie opened 1 year ago

zhanghaoie commented 1 year ago

Branch/Tag/Commit

main

Docker Image Version

nvcr.io/nvidia/pytorch:21.04-py3

GPU name

3090

CUDA Driver

525.89.02

Reproduced Steps

Run a BERT model with a self-defined attention mask;
modify the input parameters in FusedAttentionLayer.cu to the following line, but the final result doesn't change (see the sketch after these steps):
dispatcher_fp16->run(qkv_buf_, attention_mask, padding_offset, attn_workspace_, qkv_buf_2_, stream_);
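
For context, the fused multi-head attention kernels behind FusedAttentionLayer appear to derive their mask internally from sequence lengths / `padding_offset`, so an `attention_mask` buffer passed through `dispatcher_fp16->run` may simply never be read, which would explain the unchanged output. Below is a minimal, hypothetical CUDA sketch of how an unfused attention path typically consumes an arbitrary mask: the mask is turned into an additive bias on the scaled QK^T scores before softmax. The kernel name and buffer layouts here are assumptions for illustration, not FasterTransformer's actual API.

```cuda
// Hypothetical sketch: apply an arbitrary attention mask as an additive bias
// on the scaled QK^T scores, the way an unfused attention path would.
// Assumed layouts: qk_buf is [batch, head_num, seq_len, seq_len],
// attention_mask is [batch, seq_len, seq_len] with 1 = attend, 0 = masked.
#include <cuda_fp16.h>

__global__ void apply_attention_mask(half*       qk_buf,
                                     const half* attention_mask,
                                     const int   batch_size,
                                     const int   head_num,
                                     const int   seq_len,
                                     const float scale)
{
    // One thread per (query, key) score element.
    const int idx   = blockIdx.x * blockDim.x + threadIdx.x;
    const int total = batch_size * head_num * seq_len * seq_len;
    if (idx >= total) {
        return;
    }

    const int key_id   = idx % seq_len;
    const int query_id = (idx / seq_len) % seq_len;
    const int batch_id = idx / (seq_len * seq_len * head_num);

    const int   mask_idx = (batch_id * seq_len + query_id) * seq_len + key_id;
    const float mask_val = __half2float(attention_mask[mask_idx]);

    // Masked positions get a large negative bias so softmax drives them to ~0.
    const float score = __half2float(qk_buf[idx]) * scale
                        + (1.0f - mask_val) * -10000.0f;
    qk_buf[idx] = __float2half(score);
}
```

A softmax over the key dimension would follow this step. The fused kernels perform all of these steps inside a single kernel, so a custom mask only takes effect there if that kernel is actually written to read it.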
zhanghaoie commented 1 year ago

Also, is UnFusedAttentionLayer's performance much slower than FusedAttentionLayer's?