FlagOpen / FlagGems

FlagGems is an operator library for large language models implemented in Triton Language.
Apache License 2.0

[Operator] Add repeat_interleave_self_tensor #230

Open zfu82 opened 3 weeks ago

zfu82 commented 3 weeks ago
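
For orientation, the operator name suggests the single-argument form of `torch.repeat_interleave(repeats)`, where the output holds index `i` repeated `repeats[i]` times. Assuming that is what this PR targets, the semantics can be sketched in pure Python as below. This is a reference illustration only, not the Triton kernel from the PR; the function name `repeat_interleave_indices` is hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def repeat_interleave_indices(repeats):
    """Reference semantics (assumed): index i appears repeats[i] times, in order."""
    if any(r < 0 for r in repeats):
        raise ValueError("repeats must be non-negative")
    # Inclusive prefix sums: ends[i] is one past the last output slot owned by index i.
    ends = list(accumulate(repeats))
    total = ends[-1] if ends else 0
    # Each output position looks up its source index independently of the others,
    # which is the property a parallel implementation could exploit.
    return [bisect_right(ends, pos) for pos in range(total)]


if __name__ == "__main__":
    print(repeat_interleave_indices([1, 2, 3]))
```

The prefix-sum plus binary-search formulation is one possible design; it is used here because every output element can be computed without reading any other output element.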

Performance

Tested on an NVIDIA A100 GPU.

Operator repeat_interleave_self_tensor Performance Test (dtype=torch.float16, mode=cuda)
Size    Torch Latency (ms)    Gems Latency (ms)    Gems Speedup
---------------------------------------------------------------
1024              0.336896              20.0387          0.0168
6144               1.53498              20.9459          0.0733
11264               2.7648              20.4554           0.135
16384              4.12979              21.6965            0.19
21504                5.376              21.4784            0.25
26624              7.02874              22.4543           0.313
31744              8.03123              22.8055           0.352
36864              8.09677              22.5853           0.358
41984              10.2134              23.1731           0.441
47104              11.3715              23.2704           0.489
52224              12.6669              24.6088           0.515
57344              13.7267              25.2928           0.543
62464              15.0774              25.3972           0.594
67584              15.2904              24.5217           0.624
72704              16.7752              24.8955           0.674
77824              17.6722              26.4264           0.669

Operator repeat_interleave_self_tensor Performance Test (dtype=torch.float32, mode=cuda)
Size    Torch Latency (ms)    Gems Latency (ms)    Gems Speedup
---------------------------------------------------------------
1024              0.338944              19.7806          0.0171
6144               1.54419              20.1861          0.0765
11264              3.11194              21.5511           0.144
16384              4.07859              21.1241           0.193
21504              5.98528               22.484           0.266
26624              7.27859              22.9478           0.317
31744              8.11418              22.4348           0.362
36864              8.45619              23.6575           0.357
41984              10.6609              23.8254           0.447
47104               11.732              24.4019           0.481
52224              13.4359              25.1873           0.533
57344              13.9284               25.385           0.549
62464              15.7348              26.5933           0.592
67584              15.9037               26.751           0.595
72704               17.792              27.5722           0.645
77824              19.1754              28.2491           0.679

Operator repeat_interleave_self_tensor Performance Test (dtype=torch.bfloat16, mode=cuda)
Size    Torch Latency (ms)    Gems Latency (ms)    Gems Speedup
---------------------------------------------------------------
1024                0.3328              20.1001          0.0166
6144               1.49504              20.1185          0.0743
11264              2.75968              20.3551           0.136
16384              3.95776              20.4063           0.194
21504               5.3545              21.1671           0.253
26624              6.84954              21.8993           0.313
31744              7.75066              21.7375           0.357
36864              8.07424              22.0928           0.365
41984              10.1786              22.6376            0.45
47104              11.1913              23.1301           0.484
52224              12.1201               23.682           0.512
57344              13.2219              24.4797            0.54
62464              14.9258               24.705           0.604
67584              14.8818              25.1003           0.593
72704              16.7035              25.7403           0.649
77824              17.9456              27.0879           0.662
PASSED
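
Reading the tables: the "Gems Speedup" column appears to be Torch latency divided by Gems latency, so values below 1 mean the Gems operator is slower than the Torch baseline at that size. The short check below recomputes the ratio for the first and last float16 rows. The helper name `gems_speedup` is hypothetical, and the formula is inferred from the reported numbers rather than taken from the benchmark source.

```python
def gems_speedup(torch_ms: float, gems_ms: float) -> float:
    """Assumed definition of the table's last column: baseline latency / Gems latency."""
    return torch_ms / gems_ms


# (size, Torch latency in ms, Gems latency in ms), copied from the float16 table.
rows = [
    (1024, 0.336896, 20.0387),
    (77824, 17.6722, 26.4264),
]

if __name__ == "__main__":
    for size, torch_ms, gems_ms in rows:
        # Three significant digits, matching the precision shown in the tables.
        print(f"{size:>6}  speedup={gems_speedup(torch_ms, gems_ms):.3g}")
```

In all three dtypes the Gems latency is already about 20 ms at size 1024 and rises to only about 26 to 28 ms at size 77824, while the Torch latency grows from about 0.34 ms to 18 to 19 ms over the same range. The speedup therefore stays below 1 across the tested sizes.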