NVIDIA / FasterTransformer

Transformer-related optimization, including BERT, GPT
Apache License 2.0

Why is the ROUGE score from Hugging Face different from the ROUGE score from FasterTransformer? #489

Open lkm2835 opened 1 year ago

lkm2835 commented 1 year ago

Why is the ROUGE score from Hugging Face different from the ROUGE score from FasterTransformer, even though the same weights are used?

https://github.com/NVIDIA/FasterTransformer/blob/main/docs/t5_guide.md#running-t5-v11

Hugging Face (total latency: 21.826529 sec)
rouge1 : 10.786476875527406
rouge2 : 1.8231246974441166
rougeL : 8.652689713627165
rougeLsum : 10.326607305635523
FasterTransformer (total latency: 7.036808000000001 sec)
rouge1 : 10.91735083630513
rouge2 : 1.8454654301092783
rougeL : 8.76872604148143
rougeLsum : 10.453229536094794

I want to know why they differ.

Is there a way to make the results 100% identical?
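For context, here is a minimal sketch, using the `rouge_score` package, of how corpus-level ROUGE scores like the ones above can be computed from reference and generated summaries. This is an illustrative snippet, not the exact script from the T5 guide; the reference and prediction lists are placeholders.

```python
# Illustrative sketch (not the exact script from the T5 guide): computing
# corpus-level ROUGE F-measures with the rouge_score package.
from rouge_score import rouge_scorer

# Placeholder data; in practice these hold the reference summaries and the
# summaries decoded by Hugging Face or FasterTransformer.
references = ["the cat sat on the mat"]
predictions = ["a cat was sitting on the mat"]

rouge_types = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
scorer = rouge_scorer.RougeScorer(rouge_types, use_stemmer=True)

totals = {t: 0.0 for t in rouge_types}
for ref, pred in zip(references, predictions):
    scores = scorer.score(ref, pred)  # dict of Score(precision, recall, fmeasure)
    for t in rouge_types:
        totals[t] += scores[t].fmeasure

for t in rouge_types:
    print(f"{t} : {100.0 * totals[t] / len(references)}")
```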

byshiue commented 1 year ago

> Is there a way to make the results 100% identical?

It is almost impossible. Different GEMM algorithms and different kernel fusions lead to different computation orders, so there are inevitably small numerical differences in the final outputs of the transformer. For a generation model, these small gaps accumulate step by step and eventually lead to different output ids.
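To illustrate why a different computation order alone already breaks bit-exactness, here is a small sketch in plain NumPy (not FasterTransformer code). It shows that floating-point summation order changes the result, and that a nearly tied pair of logits can then flip the greedy argmax, i.e. the next output id; all values here are made up for demonstration.

```python
# Plain NumPy sketch (not FasterTransformer code): floating-point addition is
# not associative, so a different reduction order gives a slightly different
# sum, and a nearly tied pair of logits can then flip the greedy argmax.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

seq_sum = np.float32(0.0)
for v in x:
    seq_sum += v                               # strictly sequential order
tree_sum = x.reshape(-1, 2).sum(axis=1).sum()  # pairwise/tree-style order

print(seq_sum, tree_sum, seq_sum == tree_sum)  # typically not bit-identical

# If two logits are almost tied, a perturbation of this size can change the
# argmax, i.e. the next generated token id; the divergence then compounds
# over the remaining decoding steps.
logits = np.array([5.000001, 5.000000], dtype=np.float32)
perturbed = logits + np.float32(1e-6) * np.array([-1.0, 1.0], dtype=np.float32)
print(int(np.argmax(logits)), int(np.argmax(perturbed)))  # can differ: 0 vs 1
```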