NVIDIA / FasterTransformer

Transformer-related optimization, including BERT, GPT
Apache License 2.0

Why is the ROUGE score from Hugging Face different from the ROUGE score from FasterTransformer? #489

Open lkm2835 opened 1 year ago

lkm2835 commented 1 year ago

Why is the ROUGE score from Hugging Face different from the ROUGE score from FasterTransformer, even though the same weights are used?

https://github.com/NVIDIA/FasterTransformer/blob/main/docs/t5_guide.md#running-t5-v11

Hugging Face (total latency: 21.826529 sec)
rouge1 : 10.786476875527406
rouge2 : 1.8231246974441166
rougeL : 8.652689713627165
rougeLsum : 10.326607305635523
FasterTransformer (total latency: 7.036808000000001 sec)
rouge1 : 10.91735083630513
rouge2 : 1.8454654301092783
rougeL : 8.76872604148143
rougeLsum : 10.453229536094794

I want to know why they differ.

Is there a way to make the results 100% identical?
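For context, here is a minimal sketch, using the `rouge_score` package, of how corpus-level ROUGE scores like the ones above can be computed from reference and generated summaries. This is an illustrative snippet, not the exact script from the T5 guide; the reference and prediction lists are placeholders.

```python
# Illustrative sketch (not the exact script from the T5 guide): computing
# corpus-level ROUGE F-measures with the rouge_score package.
from rouge_score import rouge_scorer

# Placeholder data; in practice these hold the reference summaries and the
# summaries decoded by Hugging Face or FasterTransformer.
references = ["the cat sat on the mat"]
predictions = ["a cat was sitting on the mat"]

rouge_types = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
scorer = rouge_scorer.RougeScorer(rouge_types, use_stemmer=True)

totals = {t: 0.0 for t in rouge_types}
for ref, pred in zip(references, predictions):
    scores = scorer.score(ref, pred)  # dict of Score(precision, recall, fmeasure)
    for t in rouge_types:
        totals[t] += scores[t].fmeasure

for t in rouge_types:
    print(f"{t} : {100.0 * totals[t] / len(references)}")
```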

byshiue commented 1 year ago

> Is there a way to make the results 100% identical?

It is almost impossible. Different GEMM algorithms and different kernel fusions lead to different computation orders, so there are inevitably small numerical differences in the final outputs of the transformer. For a generation model, these small gaps accumulate step by step and eventually lead to different output ids.
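To illustrate why a different computation order alone already breaks bit-exactness, here is a small sketch in plain NumPy (not FasterTransformer code). It shows that floating-point summation order changes the result, and that a nearly tied pair of logits can then flip the greedy argmax, i.e. the next output id; all values here are made up for demonstration.

```python
# Plain NumPy sketch (not FasterTransformer code): floating-point addition is
# not associative, so a different reduction order gives a slightly different
# sum, and a nearly tied pair of logits can then flip the greedy argmax.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

seq_sum = np.float32(0.0)
for v in x:
    seq_sum += v                               # strictly sequential order
tree_sum = x.reshape(-1, 2).sum(axis=1).sum()  # pairwise/tree-style order

print(seq_sum, tree_sum, seq_sum == tree_sum)  # typically not bit-identical

# If two logits are almost tied, a perturbation of this size can change the
# argmax, i.e. the next generated token id; the divergence then compounds
# over the remaining decoding steps.
logits = np.array([5.000001, 5.000000], dtype=np.float32)
perturbed = logits + np.float32(1e-6) * np.array([-1.0, 1.0], dtype=np.float32)
print(int(np.argmax(logits)), int(np.argmax(perturbed)))  # can differ: 0 vs 1
```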