Closed by gaetansnl 2 years ago
benchmarks
❯ pytest test/test_torchdynamo_bert.py -k "benchmark" --benchmark-group-by fullfunc,param:shape
===================================================================================================== test session starts =====================================================================================================
platform linux -- Python 3.9.15, pytest-7.1.3, pluggy-1.0.0
rootdir: /mnt/workspace/kernl
collected 572 items / 11 deselected / 561 selected
test/test_torchdynamo_bert.py .......................................................................................................ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss................................ [ 32%]
....................................................................................................................................................................................................................... [ 71%]
.........ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss [100%]
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 128)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
---------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-1x128-bert-base-uncased] 7.9872 (2.57) 7.9899 (2.58) 7.7998 (2.63) 8.1408 (2.56) 8.2159 (2.52) 8.2771 (2.52) 8.0689 (2.56) 9.0158 (2.38)
test_benchmark_implementations[baseline-1x128-sentence-transformers/all-MiniLM-L6-v2] 4.0152 (5.12) 4.0139 (5.14) 3.9148 (5.25) 4.177 (4.99) 4.2852 (4.83) 4.3102 (4.85) 4.198 (4.92) 4.9206 (4.36)
test_benchmark_implementations[baseline-1x128-t5-small] 13.8598 (1.48) 14.0985 (1.46) 13.3765 (1.54) 15.0268 (1.39) 13.7171 (1.51) 13.9161 (1.5) 13.4444 (1.54) 15.0441 (1.43)
test_benchmark_implementations[dynamo-1x128-bert-base-uncased] 6.997 (2.94) 7.1029 (2.9) 6.0877 (3.37) 8.0609 (2.59) 7.2228 (2.87) 7.3178 (2.86) 6.9524 (2.97) 8.5486 (2.51)
test_benchmark_implementations[dynamo-1x128-sentence-transformers/all-MiniLM-L6-v2] 3.3751 (6.09) 3.3769 (6.11) 3.2266 (6.37) 3.629 (5.75) 3.8232 (5.41) 3.923 (5.33) 3.7851 (5.46) 4.2954 (5.0)
test_benchmark_implementations[dynamo-1x128-t5-small] 12.075 (1.7) 12.0892 (1.71) 11.9195 (1.72) 12.3597 (1.69) 12.1862 (1.7) 12.3291 (1.69) 12.1373 (1.7) 13.1296 (1.63)
test_benchmark_implementations[dynamo_cuda_graphs-1x128-bert-base-uncased] 1.7746 (11.59) 1.7744 (11.63) 1.7705 (11.6) 1.7818 (11.7) 1.6138 (12.82) 1.6166 (12.92) 1.6096 (12.83) 1.6971 (12.64)
test_benchmark_implementations[dynamo_cuda_graphs-1x128-sentence-transformers/all-MiniLM-L6-v2] 0.6246 (32.92) 0.6246 (33.03) 0.6226 (32.99) 0.6267 (33.27) 0.6046 (34.23) 0.606 (34.47) 0.6023 (34.29) 0.6949 (30.88)
test_benchmark_implementations[dynamo_cuda_graphs-1x128-t5-small] 1.5135 (13.59) 1.6067 (12.84) 1.4981 (13.71) 1.7234 (12.1) 1.5702 (13.18) 1.5728 (13.28) 1.5673 (13.18) 1.675 (12.81)
test_benchmark_implementations[dynamo_no_dropout-1x128-bert-base-uncased] 6.868 (2.99) 6.8942 (2.99) 6.5167 (3.15) 7.4383 (2.8) 7.1646 (2.89) 7.236 (2.89) 6.891 (3.0) 7.9592 (2.7)
test_benchmark_implementations[dynamo_no_dropout-1x128-sentence-transformers/all-MiniLM-L6-v2] 3.2524 (6.32) 3.2494 (6.35) 3.0158 (6.81) 3.4847 (5.98) 3.4674 (5.97) 3.5189 (5.94) 3.3634 (6.14) 3.923 (5.47)
test_benchmark_implementations[dynamo_no_dropout-1x128-t5-small] 12.1487 (1.69) 12.1261 (1.7) 11.7545 (1.75) 12.4109 (1.68) 13.1879 (1.57) 13.306 (1.57) 12.8454 (1.61) 13.7882 (1.56)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x128-bert-base-uncased] 3.6014 (5.71) 3.6048 (5.72) 3.4939 (5.88) 3.7251 (5.6) 3.9364 (5.26) 3.9292 (5.32) 3.8066 (5.43) 4.1656 (5.15)
test_benchmark_implementations[dynamo_optimized-1x128-bert-base-uncased] 14.4248 (1.43) 14.4 (1.43) 14.2694 (1.44) 14.4476 (1.44) 14.7625 (1.4) 14.9087 (1.4) 14.6598 (1.41) 15.7277 (1.36)
test_benchmark_implementations[dynamo_optimized-1x128-sentence-transformers/all-MiniLM-L6-v2] 7.3882 (2.78) 7.3925 (2.79) 7.3298 (2.8) 7.4888 (2.78) 7.7379 (2.67) 7.7562 (2.69) 7.6219 (2.71) 8.138 (2.64)
test_benchmark_implementations[dynamo_optimized-1x128-t5-small] 20.564 (1.0) 20.6306 (1.0) 20.5384 (1.0) 20.8508 (1.0) 20.6954 (1.0) 20.8929 (1.0) 20.652 (1.0) 21.4561 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128-bert-base-uncased] 1.6302 (12.61) 1.5937 (12.95) 1.4336 (14.33) 1.6333 (12.77) 1.4812 (13.97) 1.4839 (14.08) 1.4773 (13.98) 1.5775 (13.6)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128-sentence-transformers/all-MiniLM-L6-v2] 0.4045 (50.84) 0.4286 (48.13) 0.3994 (51.43) 0.4649 (44.85) 0.4557 (45.41) 0.4573 (45.69) 0.4536 (45.53) 0.5474 (39.19)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128-t5-small] 1.8033 (11.4) 1.8031 (11.44) 1.8012 (11.4) 1.8063 (11.54) 1.6396 (12.62) 1.6422 (12.72) 1.6358 (12.62) 1.7353 (12.36)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x128-bert-base-uncased] 1.6947 (12.13) 1.6216 (12.72) 1.4807 (13.87) 1.6978 (12.28) 1.5388 (13.45) 1.5416 (13.55) 1.5351 (13.45) 1.6359 (13.12)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x128-sentence-transformers/all-MiniLM-L6-v2] 0.4618 (44.53) 0.4614 (44.71) 0.4598 (44.67) 0.4628 (45.05) 0.4594 (45.05) 0.4606 (45.36) 0.4563 (45.26) 0.5515 (38.91)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x128-t5-small] 1.8053 (11.39) 1.8049 (11.43) 1.8033 (11.39) 1.8074 (11.54) 1.6434 (12.59) 1.6468 (12.69) 1.6405 (12.59) 1.7386 (12.34)
test_benchmark_implementations[onnx-1x128-bert-base-uncased] 3.582 (5.74) 3.5913 (5.74) 3.2358 (6.35) 4.0489 (5.15) 3.2617 (6.35) 3.3188 (6.3) 3.2244 (6.4) 3.6164 (5.93)
test_benchmark_implementations[onnx_optim_fp16-1x128-bert-base-uncased] 2.8641 (7.18) 2.8701 (7.19) 2.7812 (7.38) 2.9768 (7.0) 2.8838 (7.18) 2.9133 (7.17) 2.8205 (7.32) 3.4263 (6.26)
test_benchmark_implementations[onnx_optim_fp32-1x128-bert-base-uncased] 3.5852 (5.74) 3.6226 (5.69) 3.5543 (5.78) 4.0428 (5.16) 3.2514 (6.37) 3.2901 (6.35) 3.2305 (6.39) 3.6178 (5.93)
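
Note on reading the table above: judging from the numbers themselves, the value in parentheses appears to be the largest entry in that column divided by the row's entry, so the slowest row shows (1.0) and larger factors mean faster runs. This is inferred from the data, not taken from the benchmark plugin's documentation. A minimal Python sketch that reproduces two of the factors and derives a speedup over the baseline, using Median (CUDA) values copied from the shape=(1, 128) table (the helper name `relative_factor` is made up for illustration):

```python
# Median (CUDA) values copied from the shape=(1, 128) table above.
reference_median = 20.564      # dynamo_optimized, t5-small: slowest row, shown as (1.0)
baseline_bert = 7.9872         # baseline, bert-base-uncased
cuda_graphs_bert = 1.6302      # dynamo_optimized_cuda_graphs, bert-base-uncased
cuda_graphs_minilm = 0.4045    # dynamo_optimized_cuda_graphs, all-MiniLM-L6-v2


def relative_factor(reference: float, value: float) -> float:
    """Hypothetical helper: ratio of the column's slowest entry to this entry."""
    return round(reference / value, 2)


# These match the parenthesized factors printed in the table: (2.57) and (50.84).
print(relative_factor(reference_median, baseline_bert))
print(relative_factor(reference_median, cuda_graphs_minilm))

# Speedup of the optimized + CUDA graphs variant over the plain baseline
# for bert-base-uncased, which the table does not print directly.
speedup_vs_baseline = baseline_bert / cuda_graphs_bert
print(f"{speedup_vs_baseline:.2f}x")
```

The same division works for any other column or shape, as long as both values come from the same column of the same table.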
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 16)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
--------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-1x16-bert-base-uncased] 7.682 (2.69) 7.7118 (2.69) 7.595 (2.72) 7.9278 (2.64) 8.0521 (2.59) 8.1305 (2.59) 7.9709 (2.61) 8.6675 (2.59)
test_benchmark_implementations[baseline-1x16-sentence-transformers/all-MiniLM-L6-v2] 3.927 (5.26) 3.9452 (5.25) 3.8758 (5.32) 4.0827 (5.13) 4.2252 (4.94) 4.2415 (4.97) 4.1482 (5.02) 4.6139 (4.86)
test_benchmark_implementations[baseline-1x16-t5-small] 12.3279 (1.67) 12.331 (1.68) 12.2624 (1.68) 12.3924 (1.69) 13.4639 (1.55) 13.4829 (1.56) 12.6558 (1.65) 14.4626 (1.55)
test_benchmark_implementations[dynamo-1x16-bert-base-uncased] 6.6396 (3.11) 6.6549 (3.11) 6.4492 (3.2) 6.8536 (3.06) 6.9179 (3.02) 6.9599 (3.03) 6.8426 (3.04) 7.3644 (3.04)
test_benchmark_implementations[dynamo-1x16-sentence-transformers/all-MiniLM-L6-v2] 3.2606 (6.33) 3.2628 (6.35) 3.2043 (6.44) 3.366 (6.22) 3.6475 (5.72) 3.6921 (5.71) 3.5418 (5.88) 4.0715 (5.51)
test_benchmark_implementations[dynamo-1x16-t5-small] 11.0203 (1.87) 11.0692 (1.87) 10.9312 (1.89) 11.2497 (1.86) 15.0719 (1.38) 16.2501 (1.3) 12.3431 (1.69) 22.4205 (1.0)
test_benchmark_implementations[dynamo_cuda_graphs-1x16-bert-base-uncased] 1.1244 (18.36) 1.1929 (17.37) 1.1223 (18.38) 1.6261 (12.88) 1.0677 (19.54) 1.0733 (19.63) 1.0613 (19.62) 1.2177 (18.41)
test_benchmark_implementations[dynamo_cuda_graphs-1x16-sentence-transformers/all-MiniLM-L6-v2] 0.4588 (45.0) 0.4411 (46.97) 0.4045 (50.99) 0.467 (44.86) 0.4638 (44.98) 0.4657 (45.23) 0.4613 (45.14) 0.5556 (40.36)
test_benchmark_implementations[dynamo_cuda_graphs-1x16-t5-small] 1.4705 (14.04) 1.5432 (13.43) 1.4674 (14.05) 2.0019 (10.47) 1.4545 (14.34) 1.4977 (14.06) 1.4449 (14.41) 1.7826 (12.58)
test_benchmark_implementations[dynamo_no_dropout-1x16-bert-base-uncased] 6.741 (3.06) 6.6996 (3.09) 6.3355 (3.26) 7.082 (2.96) 6.7989 (3.07) 6.813 (3.09) 6.6161 (3.15) 7.0404 (3.18)
test_benchmark_implementations[dynamo_no_dropout-1x16-sentence-transformers/all-MiniLM-L6-v2] 3.2256 (6.4) 3.2316 (6.41) 2.9266 (7.05) 3.4714 (6.04) 3.629 (5.75) 3.6681 (5.74) 3.5464 (5.87) 4.0347 (5.56)
test_benchmark_implementations[dynamo_no_dropout-1x16-t5-small] 10.7037 (1.93) 10.7682 (1.92) 10.5984 (1.95) 10.9875 (1.91) 11.5583 (1.8) 11.6896 (1.8) 11.4648 (1.82) 12.3367 (1.82)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x16-bert-base-uncased] 3.3597 (6.15) 3.3867 (6.12) 3.2041 (6.44) 3.5635 (5.88) 3.8301 (5.45) 3.8313 (5.5) 3.6893 (5.65) 4.1744 (5.37)
test_benchmark_implementations[dynamo_optimized-1x16-bert-base-uncased] 14.4681 (1.43) 14.5149 (1.43) 14.3852 (1.43) 14.6831 (1.43) 14.7698 (1.41) 14.8412 (1.42) 14.7237 (1.41) 15.1918 (1.48)
test_benchmark_implementations[dynamo_optimized-1x16-sentence-transformers/all-MiniLM-L6-v2] 7.478 (2.76) 7.56 (2.74) 7.3749 (2.8) 8.5381 (2.45) 7.728 (2.7) 7.7702 (2.71) 7.6352 (2.73) 8.3301 (2.69)
test_benchmark_implementations[dynamo_optimized-1x16-t5-small] 20.6459 (1.0) 20.7201 (1.0) 20.6234 (1.0) 20.951 (1.0) 20.8613 (1.0) 21.0644 (1.0) 20.8263 (1.0) 21.4085 (1.05)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16-bert-base-uncased] 0.6472 (31.9) 0.6471 (32.02) 0.6451 (31.97) 0.6554 (31.97) 0.6424 (32.47) 0.6451 (32.65) 0.6399 (32.55) 0.8005 (28.01)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16-sentence-transformers/all-MiniLM-L6-v2] 0.3348 (61.66) 0.3196 (64.82) 0.297 (69.45) 0.3369 (62.19) 0.355 (58.76) 0.359 (58.68) 0.3519 (59.18) 0.4687 (47.83)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16-t5-small] 1.1653 (17.72) 1.1529 (17.97) 1.0363 (19.9) 1.1704 (17.9) 1.106 (18.86) 1.1076 (19.02) 1.0983 (18.96) 1.1985 (18.71)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x16-bert-base-uncased] 0.6533 (31.6) 0.6534 (31.71) 0.6513 (31.67) 0.6554 (31.97) 0.6448 (32.35) 0.6465 (32.58) 0.6422 (32.43) 0.7361 (30.46)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x16-sentence-transformers/all-MiniLM-L6-v2] 0.3369 (61.28) 0.3266 (63.45) 0.297 (69.45) 0.4987 (42.01) 0.3543 (58.88) 0.3557 (59.22) 0.3521 (59.14) 0.465 (48.21)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x16-t5-small] 1.0629 (19.42) 1.105 (18.75) 1.0598 (19.46) 1.1715 (17.88) 1.1341 (18.39) 1.1366 (18.53) 1.1297 (18.44) 1.2494 (17.95)
test_benchmark_implementations[onnx-1x16-bert-base-uncased] 2.6032 (7.93) 2.6188 (7.91) 2.5181 (8.19) 2.9164 (7.18) 2.6243 (7.95) 2.6575 (7.93) 2.549 (8.17) 3.0429 (7.37)
test_benchmark_implementations[onnx_optim_fp16-1x16-bert-base-uncased] 2.8223 (7.32) 2.8197 (7.35) 2.7535 (7.49) 2.8529 (7.34) 2.7543 (7.57) 2.7912 (7.55) 2.685 (7.76) 3.1783 (7.05)
test_benchmark_implementations[onnx_optim_fp32-1x16-bert-base-uncased] 2.5416 (8.12) 2.553 (8.12) 2.4945 (8.27) 2.6429 (7.93) 2.5792 (8.09) 2.6097 (8.07) 2.5529 (8.16) 3.0423 (7.37)
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 256)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
---------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-1x256-bert-base-uncased] 7.7384 (2.67) 7.7547 (2.66) 7.6209 (2.7) 7.932 (2.61) 8.1248 (2.56) 8.1814 (2.57) 8.0003 (2.6) 8.8167 (2.46)
test_benchmark_implementations[baseline-1x256-sentence-transformers/all-MiniLM-L6-v2] 3.9608 (5.21) 3.9564 (5.22) 3.8779 (5.3) 4.1032 (5.05) 4.2653 (4.88) 4.3008 (4.89) 4.1955 (4.96) 5.0456 (4.29)
test_benchmark_implementations[baseline-1x256-t5-small] 12.5911 (1.64) 12.6226 (1.63) 12.4385 (1.65) 13.0109 (1.59) 13.0027 (1.6) 13.2045 (1.59) 12.8765 (1.62) 14.2936 (1.51)
test_benchmark_implementations[dynamo-1x256-bert-base-uncased] 6.8168 (3.03) 6.8896 (2.99) 6.5802 (3.12) 7.633 (2.71) 7.1286 (2.92) 7.2463 (2.9) 7.0758 (2.94) 7.7586 (2.79)
test_benchmark_implementations[dynamo-1x256-sentence-transformers/all-MiniLM-L6-v2] 3.3761 (6.11) 3.3897 (6.09) 3.3147 (6.2) 3.5451 (5.85) 3.6673 (5.67) 3.6892 (5.7) 3.5946 (5.79) 4.152 (5.21)
test_benchmark_implementations[dynamo-1x256-t5-small] 11.8589 (1.74) 11.7864 (1.75) 11.5589 (1.78) 12.1098 (1.71) 12.2947 (1.69) 12.9873 (1.62) 11.5175 (1.81) 17.1832 (1.26)
test_benchmark_implementations[dynamo_cuda_graphs-1x256-bert-base-uncased] 2.2804 (9.05) 2.2143 (9.32) 2.0654 (9.95) 2.4095 (8.6) 2.0906 (9.95) 2.0882 (10.07) 2.0591 (10.1) 2.1556 (10.04)
test_benchmark_implementations[dynamo_cuda_graphs-1x256-sentence-transformers/all-MiniLM-L6-v2] 0.6871 (30.02) 0.715 (28.86) 0.681 (30.17) 0.769 (26.95) 0.7329 (28.39) 0.7362 (28.55) 0.7285 (28.56) 0.8266 (26.19)
test_benchmark_implementations[dynamo_cuda_graphs-1x256-t5-small] 2.5068 (8.23) 2.507 (8.23) 2.5037 (8.21) 2.5119 (8.25) 2.2687 (9.17) 2.2714 (9.25) 2.2654 (9.18) 2.3533 (9.2)
test_benchmark_implementations[dynamo_no_dropout-1x256-bert-base-uncased] 6.8588 (3.01) 6.9061 (2.99) 6.3703 (3.23) 7.3277 (2.83) 6.8653 (3.03) 6.8913 (3.05) 6.6599 (3.12) 7.3825 (2.93)
test_benchmark_implementations[dynamo_no_dropout-1x256-sentence-transformers/all-MiniLM-L6-v2] 3.3987 (6.07) 3.9296 (5.25) 3.0751 (6.68) 5.4364 (3.81) 3.4937 (5.96) 3.5125 (5.98) 3.4188 (6.09) 3.9612 (5.46)
test_benchmark_implementations[dynamo_no_dropout-1x256-t5-small] 10.837 (1.9) 10.8245 (1.91) 10.6906 (1.92) 10.9415 (1.89) 11.0068 (1.89) 11.0652 (1.9) 10.864 (1.91) 11.581 (1.87)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x256-bert-base-uncased] 3.1785 (6.49) 3.2125 (6.42) 3.0638 (6.71) 3.4642 (5.98) 3.526 (5.9) 3.5533 (5.92) 3.369 (6.18) 3.8534 (5.62)
test_benchmark_implementations[dynamo_optimized-1x256-bert-base-uncased] 14.4538 (1.43) 14.4792 (1.43) 14.3471 (1.43) 14.6166 (1.42) 14.6074 (1.42) 14.7507 (1.42) 14.5543 (1.43) 15.1575 (1.43)
test_benchmark_implementations[dynamo_optimized-1x256-sentence-transformers/all-MiniLM-L6-v2] 7.4332 (2.78) 7.4325 (2.78) 7.3431 (2.8) 7.5163 (2.76) 7.6991 (2.7) 7.7543 (2.71) 7.6478 (2.72) 8.2078 (2.64)
test_benchmark_implementations[dynamo_optimized-1x256-t5-small] 20.6275 (1.0) 20.6331 (1.0) 20.5455 (1.0) 20.7227 (1.0) 20.8079 (1.0) 21.0179 (1.0) 20.8036 (1.0) 21.6455 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256-bert-base-uncased] 2.0316 (10.15) 2.0319 (10.15) 2.0285 (10.13) 2.0357 (10.18) 1.8586 (11.2) 1.8807 (11.18) 1.8278 (11.38) 2.1787 (9.94)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256-sentence-transformers/all-MiniLM-L6-v2] 0.682 (30.25) 0.6819 (30.26) 0.6799 (30.22) 0.6902 (30.03) 0.6559 (31.73) 0.6572 (31.98) 0.6528 (31.87) 0.7459 (29.02)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256-t5-small] 2.6624 (7.75) 2.6623 (7.75) 2.6583 (7.73) 2.6675 (7.77) 2.4036 (8.66) 2.4065 (8.73) 2.3983 (8.67) 2.5116 (8.62)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x256-bert-base-uncased] 1.9558 (10.55) 2.0252 (10.19) 1.9517 (10.53) 2.1217 (9.77) 1.9243 (10.81) 1.9558 (10.75) 1.9064 (10.91) 2.1052 (10.28)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x256-sentence-transformers/all-MiniLM-L6-v2] 0.6851 (30.11) 0.6851 (30.12) 0.683 (30.08) 0.6871 (30.16) 0.6562 (31.71) 0.659 (31.89) 0.653 (31.86) 0.8251 (26.24)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x256-t5-small] 2.6665 (7.74) 2.6694 (7.73) 2.433 (8.44) 3.0024 (6.9) 2.4055 (8.65) 2.4088 (8.73) 2.4004 (8.67) 2.4967 (8.67)
test_benchmark_implementations[onnx-1x256-bert-base-uncased] 4.3646 (4.73) 4.2084 (4.9) 3.9598 (5.19) 4.4165 (4.69) 3.9879 (5.22) 4.0841 (5.15) 3.944 (5.27) 4.689 (4.62)
test_benchmark_implementations[onnx_optim_fp16-1x256-bert-base-uncased] 2.816 (7.33) 2.8224 (7.31) 2.8099 (7.31) 2.858 (7.25) 2.5851 (8.05) 2.5952 (8.1) 2.5621 (8.12) 2.9207 (7.41)
test_benchmark_implementations[onnx_optim_fp32-1x256-bert-base-uncased] 4.3717 (4.72) 4.4046 (4.68) 3.9823 (5.16) 4.9562 (4.18) 3.9519 (5.27) 3.97 (5.29) 3.9409 (5.28) 4.2826 (5.05)
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 33)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
--------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-1x33-bert-base-uncased] 8.4726 (2.46) 8.412 (2.48) 7.935 (2.62) 9.004 (2.33) 8.7561 (2.42) 8.9365 (2.37) 8.5405 (2.45) 9.6138 (2.24)
test_benchmark_implementations[baseline-1x33-sentence-transformers/all-MiniLM-L6-v2] 4.0204 (5.19) 4.0388 (5.17) 3.8953 (5.34) 4.2906 (4.9) 5.5619 (3.81) 5.625 (3.77) 5.2705 (3.97) 6.7621 (3.19)
test_benchmark_implementations[baseline-1x33-t5-small] 11.9523 (1.75) 12.0013 (1.74) 11.8917 (1.75) 12.2962 (1.71) 12.4312 (1.7) 12.4971 (1.7) 12.1976 (1.71) 13.2715 (1.62)
test_benchmark_implementations[dynamo-1x33-bert-base-uncased] 7.0502 (2.96) 7.0396 (2.97) 6.7227 (3.09) 7.3626 (2.85) 7.3552 (2.88) 7.3792 (2.88) 7.2699 (2.88) 7.741 (2.78)
test_benchmark_implementations[dynamo-1x33-sentence-transformers/all-MiniLM-L6-v2] 3.3865 (6.16) 3.3818 (6.18) 3.2597 (6.38) 3.5219 (5.96) 3.8123 (5.56) 3.9556 (5.36) 3.6183 (5.78) 4.7159 (4.57)
test_benchmark_implementations[dynamo-1x33-t5-small] 11.1677 (1.87) 11.1976 (1.87) 11.0172 (1.89) 11.3838 (1.85) 11.3337 (1.87) 11.5102 (1.84) 11.2035 (1.87) 12.0374 (1.79)
test_benchmark_implementations[dynamo_cuda_graphs-1x33-bert-base-uncased] 1.3107 (15.92) 1.3467 (15.51) 1.1827 (17.58) 1.9927 (10.54) 1.2184 (17.39) 1.2385 (17.13) 1.2133 (17.23) 1.5788 (13.65)
test_benchmark_implementations[dynamo_cuda_graphs-1x33-sentence-transformers/all-MiniLM-L6-v2] 0.4874 (42.81) 0.4884 (42.77) 0.4864 (42.75) 0.4977 (42.21) 0.4904 (43.2) 0.4956 (42.81) 0.4882 (42.83) 0.6356 (33.91)
test_benchmark_implementations[dynamo_cuda_graphs-1x33-t5-small] 1.7326 (12.04) 1.7326 (12.06) 1.7306 (12.02) 1.7347 (12.11) 1.5957 (13.28) 1.5979 (13.28) 1.5935 (13.12) 1.6772 (12.85)
test_benchmark_implementations[dynamo_no_dropout-1x33-bert-base-uncased] 6.6377 (3.14) 6.7421 (3.1) 6.4174 (3.24) 7.3035 (2.88) 7.4625 (2.84) 7.5605 (2.81) 7.0708 (2.96) 8.5945 (2.51)
test_benchmark_implementations[dynamo_no_dropout-1x33-sentence-transformers/all-MiniLM-L6-v2] 3.2287 (6.46) 3.2227 (6.48) 3.0938 (6.72) 3.3413 (6.29) 3.6559 (5.79) 3.6962 (5.74) 3.6189 (5.78) 4.0549 (5.32)
test_benchmark_implementations[dynamo_no_dropout-1x33-t5-small] 10.452 (2.0) 10.4648 (2.0) 10.1745 (2.04) 10.8237 (1.94) 11.6557 (1.82) 11.7381 (1.81) 11.2991 (1.85) 12.5264 (1.72)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x33-bert-base-uncased] 3.5103 (5.94) 3.5325 (5.91) 3.4396 (6.05) 3.7274 (5.64) 3.8989 (5.43) 3.9899 (5.32) 3.6709 (5.7) 4.8065 (4.48)
test_benchmark_implementations[dynamo_optimized-1x33-bert-base-uncased] 14.4466 (1.44) 14.5096 (1.44) 14.4005 (1.44) 14.6831 (1.43) 14.6474 (1.45) 14.8191 (1.43) 14.6261 (1.43) 15.1702 (1.42)
test_benchmark_implementations[dynamo_optimized-1x33-sentence-transformers/all-MiniLM-L6-v2] 8.0547 (2.59) 8.3413 (2.5) 7.892 (2.63) 10.3977 (2.02) 7.7632 (2.73) 7.8094 (2.72) 7.67 (2.73) 8.3359 (2.59)
test_benchmark_implementations[dynamo_optimized-1x33-t5-small] 20.8681 (1.0) 20.8873 (1.0) 20.7933 (1.0) 21.0063 (1.0) 21.1847 (1.0) 21.2159 (1.0) 20.9065 (1.0) 21.5517 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x33-bert-base-uncased] 0.8387 (24.88) 0.7997 (26.12) 0.7465 (27.85) 0.8499 (24.72) 0.7931 (26.71) 0.7952 (26.68) 0.7905 (26.45) 0.8854 (24.34)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x33-sentence-transformers/all-MiniLM-L6-v2] 0.3523 (59.24) 0.3521 (59.33) 0.3502 (59.37) 0.3594 (58.44) 0.3685 (57.48) 0.3703 (57.3) 0.3659 (57.14) 0.4649 (46.36)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x33-t5-small] 1.2564 (16.61) 1.2639 (16.53) 1.2544 (16.58) 1.281 (16.4) 1.1839 (17.89) 1.1978 (17.71) 1.1797 (17.72) 1.4164 (15.22)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x33-bert-base-uncased] 0.8684 (24.03) 0.8682 (24.06) 0.8663 (24.0) 0.8704 (24.13) 0.8148 (26.0) 0.8173 (25.96) 0.8118 (25.75) 0.905 (23.81)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x33-sentence-transformers/all-MiniLM-L6-v2] 0.341 (61.2) 0.3454 (60.48) 0.34 (61.16) 0.3574 (58.78) 0.3721 (56.93) 0.3759 (56.44) 0.3662 (57.09) 0.4813 (44.78)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x33-t5-small] 1.2749 (16.37) 1.2744 (16.39) 1.2728 (16.34) 1.2769 (16.45) 1.1796 (17.96) 1.1818 (17.95) 1.1773 (17.76) 1.2763 (16.89)
test_benchmark_implementations[onnx-1x33-bert-base-uncased] 2.688 (7.76) 2.7224 (7.67) 2.5805 (8.06) 3.2737 (6.42) 2.7011 (7.84) 2.7256 (7.78) 2.6018 (8.04) 3.2574 (6.62)
test_benchmark_implementations[onnx_optim_fp16-1x33-bert-base-uncased] 2.8836 (7.24) 2.8825 (7.25) 2.8262 (7.36) 2.9584 (7.1) 2.9294 (7.23) 2.945 (7.2) 2.875 (7.27) 3.4319 (6.28)
test_benchmark_implementations[onnx_optim_fp32-1x33-bert-base-uncased] 2.818 (7.41) 2.831 (7.38) 2.561 (8.12) 3.1826 (6.6) 2.6491 (8.0) 2.677 (7.93) 2.612 (8.0) 3.0799 (7.0)
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 384)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
---------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-1x384-bert-base-uncased] 7.9555 (2.61) 7.938 (2.62) 7.7855 (2.65) 8.1316 (2.57) 8.3062 (2.51) 8.3782 (2.51) 8.1626 (2.55) 9.1931 (2.35)
test_benchmark_implementations[baseline-1x384-sentence-transformers/all-MiniLM-L6-v2] 3.884 (5.35) 3.9057 (5.32) 3.8195 (5.39) 4.1197 (5.07) 4.2459 (4.9) 4.2788 (4.92) 4.1714 (4.99) 4.8257 (4.47)
test_benchmark_implementations[baseline-1x384-t5-small] 12.8348 (1.62) 12.9254 (1.61) 12.6566 (1.63) 13.5947 (1.54) 13.29 (1.57) 13.4307 (1.57) 13.1264 (1.58) 14.6564 (1.47)
test_benchmark_implementations[dynamo-1x384-bert-base-uncased] 6.7546 (3.07) 6.7684 (3.07) 6.7011 (3.07) 6.868 (3.04) 7.6694 (2.71) 7.5896 (2.77) 7.1185 (2.92) 8.5041 (2.54)
test_benchmark_implementations[dynamo-1x384-sentence-transformers/all-MiniLM-L6-v2] 3.2543 (6.38) 3.2663 (6.36) 3.2184 (6.4) 3.3833 (6.17) 3.6118 (5.76) 3.6322 (5.79) 3.5544 (5.85) 4.142 (5.21)
test_benchmark_implementations[dynamo-1x384-t5-small] 11.4944 (1.81) 11.5435 (1.8) 11.4033 (1.81) 11.774 (1.77) 11.7263 (1.77) 11.8464 (1.78) 11.7131 (1.78) 12.3298 (1.75)
test_benchmark_implementations[dynamo_cuda_graphs-1x384-bert-base-uncased] 3.0925 (6.71) 3.0999 (6.7) 3.0863 (6.67) 3.3239 (6.28) 2.913 (7.15) 2.8961 (7.26) 2.8325 (7.34) 2.9464 (7.32)
test_benchmark_implementations[dynamo_cuda_graphs-1x384-sentence-transformers/all-MiniLM-L6-v2] 1.0249 (20.26) 1.0291 (20.18) 1.0218 (20.16) 1.4561 (14.34) 0.9923 (20.98) 0.996 (21.12) 0.9784 (21.25) 1.0726 (20.11)
test_benchmark_implementations[dynamo_cuda_graphs-1x384-t5-small] 3.2676 (6.35) 3.3032 (6.29) 3.2645 (6.31) 3.6833 (5.67) 3.009 (6.92) 3.0296 (6.94) 2.9767 (6.99) 3.4194 (6.31)
test_benchmark_implementations[dynamo_no_dropout-1x384-bert-base-uncased] 6.5331 (3.18) 6.5535 (3.17) 6.441 (3.2) 6.7174 (3.11) 6.8065 (3.06) 6.8306 (3.08) 6.6748 (3.12) 7.2409 (2.98)
test_benchmark_implementations[dynamo_no_dropout-1x384-sentence-transformers/all-MiniLM-L6-v2] 3.1345 (6.62) 3.1288 (6.64) 3.0578 (6.74) 3.2307 (6.46) 3.4546 (6.02) 3.4759 (6.05) 3.3908 (6.13) 3.8693 (5.58)
test_benchmark_implementations[dynamo_no_dropout-1x384-t5-small] 10.8165 (1.92) 10.9941 (1.89) 10.6876 (1.93) 11.3063 (1.85) 11.1824 (1.86) 11.2628 (1.87) 11.1333 (1.87) 11.6667 (1.85)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x384-bert-base-uncased] 3.4193 (6.07) 3.4691 (5.99) 3.115 (6.61) 4.2035 (4.97) 3.8611 (5.39) 3.8632 (5.44) 3.659 (5.68) 4.1334 (5.22)
test_benchmark_implementations[dynamo_optimized-1x384-bert-base-uncased] 14.4079 (1.44) 14.412 (1.44) 14.2766 (1.44) 14.5265 (1.44) 14.6538 (1.42) 14.7368 (1.43) 14.5943 (1.42) 15.1598 (1.42)
test_benchmark_implementations[dynamo_optimized-1x384-sentence-transformers/all-MiniLM-L6-v2] 7.4004 (2.81) 7.4217 (2.8) 7.3614 (2.8) 7.5151 (2.78) 7.6889 (2.71) 7.752 (2.71) 7.6611 (2.71) 8.1893 (2.63)
test_benchmark_implementations[dynamo_optimized-1x384-t5-small] 20.7627 (1.0) 20.7665 (1.0) 20.5988 (1.0) 20.8773 (1.0) 20.8141 (1.0) 21.0309 (1.0) 20.7955 (1.0) 21.5749 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384-bert-base-uncased] 2.1473 (9.67) 2.1955 (9.46) 2.1053 (9.78) 2.2794 (9.16) 2.1314 (9.77) 2.1101 (9.97) 2.0529 (10.13) 2.1454 (10.06)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384-sentence-transformers/all-MiniLM-L6-v2] 0.9185 (22.6) 0.9186 (22.61) 0.9155 (22.5) 0.9226 (22.63) 0.8923 (23.33) 0.8929 (23.55) 0.8829 (23.55) 0.977 (22.08)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384-t5-small] 3.4908 (5.95) 3.4918 (5.95) 3.4847 (5.91) 3.5011 (5.96) 3.1636 (6.58) 3.1668 (6.64) 3.1573 (6.59) 3.2561 (6.63)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x384-bert-base-uncased] 2.4115 (8.61) 2.4114 (8.61) 2.4084 (8.55) 2.4187 (8.63) 2.2461 (9.27) 2.2355 (9.41) 2.1826 (9.53) 2.2883 (9.43)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x384-sentence-transformers/all-MiniLM-L6-v2] 0.9257 (22.43) 0.926 (22.43) 0.9226 (22.33) 0.9318 (22.4) 0.8971 (23.2) 0.8973 (23.44) 0.8874 (23.43) 0.9833 (21.94)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x384-t5-small] 3.4222 (6.07) 3.2859 (6.32) 3.0433 (6.77) 3.4273 (6.09) 3.092 (6.73) 3.0949 (6.8) 3.0864 (6.74) 3.1863 (6.77)
test_benchmark_implementations[onnx-1x384-bert-base-uncased] 5.33 (3.9) 5.4046 (3.84) 5.3199 (3.87) 6.0908 (3.43) 5.0104 (4.15) 4.9829 (4.22) 4.8268 (4.31) 5.3708 (4.02)
test_benchmark_implementations[onnx_optim_fp16-1x384-bert-base-uncased] 3.1867 (6.52) 3.1863 (6.52) 3.159 (6.52) 3.2482 (6.43) 3.2173 (6.47) 3.2763 (6.42) 3.2061 (6.49) 3.6311 (5.94)
test_benchmark_implementations[onnx_optim_fp32-1x384-bert-base-uncased] 5.333 (3.89) 5.2543 (3.95) 5.0278 (4.1) 5.3865 (3.88) 5.0063 (4.16) 4.9813 (4.22) 4.8585 (4.28) 5.2087 (4.14)
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 512)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
---------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-1x512-bert-base-uncased] 7.6657 (2.7) 7.6953 (2.72) 7.6001 (2.71) 7.8735 (2.74) 7.9618 (2.67) 8.0523 (2.66) 7.8233 (2.65) 8.9374 (2.44)
test_benchmark_implementations[baseline-1x512-sentence-transformers/all-MiniLM-L6-v2] 3.9485 (5.24) 3.9815 (5.26) 3.8789 (5.31) 4.2179 (5.12) 4.9473 (4.3) 4.9967 (4.29) 4.5362 (4.58) 5.614 (3.89)
test_benchmark_implementations[baseline-1x512-t5-small] 12.9823 (1.59) 13.0233 (1.61) 12.7786 (1.61) 13.6499 (1.58) 13.1497 (1.62) 13.3599 (1.6) 12.9257 (1.61) 14.9242 (1.46)
test_benchmark_implementations[dynamo-1x512-bert-base-uncased] 6.57 (3.15) 6.5985 (3.17) 6.4276 (3.2) 6.8362 (3.16) 6.7924 (3.13) 6.8549 (3.12) 6.706 (3.1) 7.3642 (2.97)
test_benchmark_implementations[dynamo-1x512-sentence-transformers/all-MiniLM-L6-v2] 3.3341 (6.21) 3.3528 (6.24) 3.2809 (6.27) 3.5145 (6.14) 3.6213 (5.87) 3.6569 (5.86) 3.5384 (5.87) 4.0563 (5.39)
test_benchmark_implementations[dynamo-1x512-t5-small] 11.4955 (1.8) 11.5166 (1.82) 11.3582 (1.81) 11.642 (1.85) 11.8769 (1.79) 11.9532 (1.79) 11.7719 (1.76) 12.4745 (1.75)
test_benchmark_implementations[dynamo_cuda_graphs-1x512-bert-base-uncased] 4.693 (4.41) 4.779 (4.38) 4.6879 (4.39) 5.5327 (3.9) 4.3944 (4.84) 4.4765 (4.78) 4.3294 (4.8) 5.2297 (4.18)
test_benchmark_implementations[dynamo_cuda_graphs-1x512-sentence-transformers/all-MiniLM-L6-v2] 1.4346 (14.42) 1.4737 (14.2) 1.3629 (15.1) 2.0562 (10.49) 1.4035 (15.15) 1.4126 (15.16) 1.3977 (14.85) 1.6794 (13.01)
test_benchmark_implementations[dynamo_cuda_graphs-1x512-t5-small] 4.7473 (4.36) 4.7029 (4.45) 4.4892 (4.58) 4.778 (4.52) 4.9807 (4.27) 4.9594 (4.32) 4.4888 (4.63) 5.5508 (3.94)
test_benchmark_implementations[dynamo_no_dropout-1x512-bert-base-uncased] 6.4298 (3.22) 6.4775 (3.23) 6.103 (3.37) 7.0461 (3.06) 7.142 (2.98) 7.3254 (2.92) 6.8822 (3.02) 8.1106 (2.69)
test_benchmark_implementations[dynamo_no_dropout-1x512-sentence-transformers/all-MiniLM-L6-v2] 3.0854 (6.71) 3.1061 (6.74) 3.0423 (6.76) 3.258 (6.62) 3.4413 (6.18) 3.4595 (6.19) 3.3729 (6.16) 3.885 (5.62)
test_benchmark_implementations[dynamo_no_dropout-1x512-t5-small] 10.9752 (1.89) 11.0247 (1.9) 10.8145 (1.9) 11.1974 (1.93) 11.4524 (1.86) 11.5261 (1.86) 11.2704 (1.84) 12.0367 (1.81)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x512-bert-base-uncased] 3.4764 (5.95) 3.4948 (5.99) 3.3434 (6.16) 3.6445 (5.92) 3.8591 (5.51) 3.9029 (5.49) 3.7247 (5.57) 4.2731 (5.11)
test_benchmark_implementations[dynamo_optimized-1x512-bert-base-uncased] 14.4579 (1.43) 14.4821 (1.44) 14.3903 (1.43) 14.592 (1.48) 14.7253 (1.44) 14.8562 (1.44) 14.6028 (1.42) 15.4563 (1.41)
test_benchmark_implementations[dynamo_optimized-1x512-sentence-transformers/all-MiniLM-L6-v2] 8.0947 (2.56) 8.1073 (2.58) 8.0305 (2.56) 8.1736 (2.64) 7.7937 (2.73) 7.8728 (2.72) 7.7057 (2.69) 8.3344 (2.62)
test_benchmark_implementations[dynamo_optimized-1x512-t5-small] 20.6938 (1.0) 20.9265 (1.0) 20.5784 (1.0) 21.5747 (1.0) 21.2585 (1.0) 21.4147 (1.0) 20.7622 (1.0) 21.8442 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512-bert-base-uncased] 3.2481 (6.37) 3.2485 (6.44) 3.2451 (6.34) 3.2522 (6.63) 2.9354 (7.24) 2.9284 (7.31) 2.8956 (7.17) 2.9821 (7.33)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512-sentence-transformers/all-MiniLM-L6-v2] 1.321 (15.67) 1.3208 (15.84) 1.3169 (15.63) 1.3261 (16.27) 1.2974 (16.39) 1.2989 (16.49) 1.2902 (16.09) 1.3913 (15.7)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512-t5-small] 4.3223 (4.79) 4.3241 (4.84) 4.3192 (4.76) 4.3315 (4.98) 3.9595 (5.37) 3.9636 (5.4) 3.9517 (5.25) 4.0497 (5.39)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x512-bert-base-uncased] 3.4355 (6.02) 3.3516 (6.24) 3.1119 (6.61) 3.4417 (6.27) 3.0944 (6.87) 3.0966 (6.92) 3.0725 (6.76) 3.1647 (6.9)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x512-sentence-transformers/all-MiniLM-L6-v2] 1.3343 (15.51) 1.3184 (15.87) 1.2687 (16.22) 1.3394 (16.11) 1.3084 (16.25) 1.3116 (16.33) 1.3006 (15.96) 1.3979 (15.63)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x512-t5-small] 4.1032 (5.04) 4.0161 (5.21) 3.711 (5.55) 4.1093 (5.25) 3.7299 (5.7) 3.7354 (5.73) 3.7247 (5.57) 3.828 (5.71)
test_benchmark_implementations[onnx-1x512-bert-base-uncased] 7.9278 (2.61) 7.9378 (2.64) 7.9084 (2.6) 7.9862 (2.7) 7.3769 (2.88) 7.3306 (2.92) 7.1625 (2.9) 7.4711 (2.92)
test_benchmark_implementations[onnx_optim_fp16-1x512-bert-base-uncased] 4.2711 (4.85) 4.2818 (4.89) 4.2506 (4.84) 4.4483 (4.85) 3.9707 (5.35) 3.9784 (5.38) 3.8987 (5.33) 4.2722 (5.11)
test_benchmark_implementations[onnx_optim_fp32-1x512-bert-base-uncased] 7.4834 (2.77) 7.7587 (2.7) 7.4639 (2.76) 8.3671 (2.58) 7.3705 (2.88) 7.3416 (2.92) 7.1789 (2.89) 7.4897 (2.92)
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 128)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
----------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-32x128-bert-base-uncased] 20.1103 (1.91) 20.222 (1.96) 20.098 (1.91) 20.4564 (2.0) 20.2393 (1.84) 20.3365 (1.85) 19.242 (1.94) 21.4758 (1.77)
test_benchmark_implementations[baseline-32x128-sentence-transformers/all-MiniLM-L6-v2] 4.7729 (8.06) 4.8271 (8.22) 4.7647 (8.07) 5.5798 (7.32) 4.9798 (7.49) 5.0319 (7.49) 4.8581 (7.68) 5.558 (6.84)
test_benchmark_implementations[baseline-32x128-t5-small] 17.5053 (2.2) 17.6603 (2.25) 17.4887 (2.2) 18.0183 (2.27) 17.8109 (2.09) 17.8566 (2.11) 17.6484 (2.11) 18.141 (2.1)
test_benchmark_implementations[dynamo-32x128-bert-base-uncased] 20.1605 (1.91) 20.3031 (1.95) 20.1329 (1.91) 20.4626 (2.0) 19.6522 (1.9) 19.755 (1.91) 19.2226 (1.94) 20.1006 (1.89)
test_benchmark_implementations[dynamo-32x128-sentence-transformers/all-MiniLM-L6-v2] 4.9705 (7.74) 5.0075 (7.92) 4.9633 (7.75) 5.7559 (7.1) 4.9375 (7.55) 4.9243 (7.65) 4.8092 (7.76) 5.0393 (7.55)
test_benchmark_implementations[dynamo-32x128-t5-small] 17.5852 (2.19) 17.663 (2.25) 17.5104 (2.2) 18.0675 (2.26) 17.7046 (2.11) 17.6579 (2.13) 17.4258 (2.14) 17.7565 (2.14)
test_benchmark_implementations[dynamo_cuda_graphs-32x128-bert-base-uncased] 21.0074 (1.83) 21.6282 (1.83) 20.9603 (1.84) 22.3089 (1.83) 18.9083 (1.97) 19.4129 (1.94) 18.8095 (1.98) 20.0227 (1.9)
test_benchmark_implementations[dynamo_cuda_graphs-32x128-sentence-transformers/all-MiniLM-L6-v2] 4.7831 (8.04) 4.8748 (8.14) 4.778 (8.05) 6.2177 (6.57) 4.6992 (7.94) 4.7629 (7.91) 4.5362 (8.22) 5.222 (7.28)
test_benchmark_implementations[dynamo_cuda_graphs-32x128-t5-small] 18.2067 (2.11) 17.9569 (2.21) 17.1796 (2.24) 18.262 (2.24) 17.9897 (2.07) 18.012 (2.09) 17.788 (2.1) 18.2229 (2.09)
test_benchmark_implementations[dynamo_no_dropout-32x128-bert-base-uncased] 20.1708 (1.91) 20.2898 (1.96) 20.096 (1.91) 20.4513 (2.0) 19.7122 (1.89) 19.8281 (1.9) 19.143 (1.95) 20.256 (1.88)
test_benchmark_implementations[dynamo_no_dropout-32x128-sentence-transformers/all-MiniLM-L6-v2] 4.951 (7.77) 4.9535 (8.01) 4.9377 (7.79) 4.9744 (8.22) 4.947 (7.54) 4.9294 (7.64) 4.8045 (7.76) 5.0619 (7.51)
test_benchmark_implementations[dynamo_no_dropout-32x128-t5-small] 17.5483 (2.19) 17.5473 (2.26) 17.4694 (2.2) 17.6302 (2.32) 17.7158 (2.11) 17.6569 (2.13) 17.4036 (2.14) 17.8327 (2.13)
test_benchmark_implementations[dynamo_nvfuser_ofi-32x128-bert-base-uncased] 17.4756 (2.2) 17.5356 (2.26) 17.4582 (2.2) 17.8002 (2.3) 17.4646 (2.14) 17.1627 (2.19) 16.5188 (2.26) 17.4839 (2.18)
test_benchmark_implementations[dynamo_optimized-32x128-bert-base-uncased] 14.4835 (2.66) 14.5314 (2.73) 14.3811 (2.67) 14.806 (2.76) 14.9588 (2.49) 15.0727 (2.5) 14.8731 (2.51) 15.4737 (2.46)
test_benchmark_implementations[dynamo_optimized-32x128-sentence-transformers/all-MiniLM-L6-v2] 7.5981 (5.06) 7.6072 (5.21) 7.5574 (5.09) 7.6791 (5.32) 7.9115 (4.71) 7.9729 (4.72) 7.8787 (4.73) 8.5487 (4.45)
test_benchmark_implementations[dynamo_optimized-32x128-t5-small] 20.5292 (1.87) 20.5788 (1.93) 20.5005 (1.88) 20.6438 (1.98) 20.7852 (1.79) 20.9264 (1.8) 20.6732 (1.8) 21.3877 (1.78)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128-bert-base-uncased] 13.4666 (2.86) 13.4672 (2.95) 13.4615 (2.86) 13.4707 (3.03) 13.3705 (2.79) 13.1356 (2.87) 12.5117 (2.98) 13.5314 (2.81)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128-sentence-transformers/all-MiniLM-L6-v2] 3.84 (10.02) 3.8404 (10.33) 3.8359 (10.03) 3.8451 (10.63) 3.823 (9.76) 3.7664 (10.0) 3.6189 (10.31) 3.8349 (9.92)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128-t5-small] 14.8449 (2.59) 14.8485 (2.67) 14.8398 (2.59) 14.8593 (2.75) 13.9983 (2.66) 14.1218 (2.67) 13.9111 (2.68) 14.2875 (2.66)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x128-bert-base-uncased] 15.0342 (2.56) 15.0459 (2.64) 14.7067 (2.62) 15.3201 (2.67) 14.9044 (2.5) 14.7121 (2.56) 13.8642 (2.69) 15.3747 (2.47)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x128-sentence-transformers/all-MiniLM-L6-v2] 3.8973 (9.87) 3.8972 (10.18) 3.8902 (9.89) 3.9076 (10.46) 3.8645 (9.65) 3.7958 (9.92) 3.6589 (10.19) 3.8817 (9.8)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x128-t5-small] 14.1865 (2.71) 14.3324 (2.77) 14.121 (2.72) 14.7098 (2.78) 13.8863 (2.69) 13.9945 (2.69) 13.806 (2.7) 14.1615 (2.69)
test_benchmark_implementations[onnx-32x128-bert-base-uncased] 38.4668 (1.0) 39.6668 (1.0) 38.4668 (1.0) 40.8668 (1.0) 37.3022 (1.0) 37.6702 (1.0) 37.3022 (1.0) 38.0383 (1.0)
test_benchmark_implementations[onnx_optim_fp16-32x128-bert-base-uncased] 18.4924 (2.08) 18.7878 (2.11) 18.4033 (2.09) 20.0591 (2.04) 19.0283 (1.96) 18.9233 (1.99) 18.2747 (2.04) 19.5136 (1.95)
test_benchmark_implementations[onnx_optim_fp32-32x128-bert-base-uncased] 38.0561 (1.01) 38.1012 (1.04) 38.0561 (1.01) 38.1462 (1.07) 37.2961 (1.0) 37.589 (1.0) 37.2961 (1.0) 37.8818 (1.0)
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 16)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
---------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-32x16-bert-base-uncased] 8.0949 (2.53) 8.0867 (2.54) 7.9974 (2.56) 8.1654 (2.52) 8.6161 (2.42) 8.7201 (2.41) 8.3377 (2.49) 9.3768 (2.29)
test_benchmark_implementations[baseline-32x16-sentence-transformers/all-MiniLM-L6-v2] 4.1523 (4.94) 4.1759 (4.91) 4.0335 (5.08) 4.395 (4.68) 4.5378 (4.6) 4.5921 (4.57) 4.4281 (4.68) 5.2063 (4.12)
test_benchmark_implementations[baseline-32x16-t5-small] 13.4267 (1.53) 13.5531 (1.51) 13.2096 (1.55) 14.3955 (1.43) 13.4661 (1.55) 13.6877 (1.53) 13.4007 (1.55) 14.9987 (1.43)
test_benchmark_implementations[dynamo-32x16-bert-base-uncased] 6.8046 (3.01) 6.8248 (3.0) 6.6857 (3.06) 7.038 (2.92) 7.1556 (2.92) 7.178 (2.93) 7.0615 (2.93) 7.4938 (2.87)
test_benchmark_implementations[dynamo-32x16-sentence-transformers/all-MiniLM-L6-v2] 3.4652 (5.91) 3.475 (5.9) 3.4099 (6.0) 3.5994 (5.71) 3.803 (5.49) 3.8381 (5.47) 3.7357 (5.55) 4.1657 (5.15)
test_benchmark_implementations[dynamo-32x16-t5-small] 11.99 (1.71) 11.9973 (1.71) 11.9204 (1.72) 12.0719 (1.7) 12.2362 (1.71) 12.323 (1.7) 12.0694 (1.72) 12.8124 (1.68)
test_benchmark_implementations[dynamo_cuda_graphs-32x16-bert-base-uncased] 3.4601 (5.92) 3.5151 (5.83) 3.458 (5.92) 4.2936 (4.79) 3.1745 (6.57) 3.2583 (6.45) 3.0927 (6.7) 3.6324 (5.91)
test_benchmark_implementations[dynamo_cuda_graphs-32x16-sentence-transformers/all-MiniLM-L6-v2] 0.8243 (24.86) 0.837 (24.49) 0.8223 (24.9) 1.2984 (15.82) 0.7872 (26.5) 0.8048 (26.09) 0.7792 (26.59) 1.065 (20.16)
test_benchmark_implementations[dynamo_cuda_graphs-32x16-t5-small] 2.8672 (7.15) 2.9102 (7.04) 2.8641 (7.15) 3.3137 (6.2) 2.5922 (8.05) 2.6265 (8.0) 2.5846 (8.02) 3.0319 (7.08)
test_benchmark_implementations[dynamo_no_dropout-32x16-bert-base-uncased] 6.5608 (3.12) 6.6115 (3.1) 6.4276 (3.18) 7.2261 (2.84) 6.6954 (3.12) 6.7312 (3.12) 6.5988 (3.14) 7.1048 (3.02)
test_benchmark_implementations[dynamo_no_dropout-32x16-sentence-transformers/all-MiniLM-L6-v2] 3.2236 (6.36) 3.2352 (6.34) 3.1693 (6.46) 3.4174 (6.01) 3.9266 (5.31) 3.9418 (5.33) 3.866 (5.36) 4.3407 (4.95)
test_benchmark_implementations[dynamo_no_dropout-32x16-t5-small] 11.3603 (1.8) 11.3777 (1.8) 11.2723 (1.82) 11.4729 (1.79) 11.6763 (1.79) 11.7824 (1.78) 11.5762 (1.79) 12.345 (1.74)
test_benchmark_implementations[dynamo_nvfuser_ofi-32x16-bert-base-uncased] 3.8779 (5.28) 3.9158 (5.24) 3.5236 (5.81) 4.4534 (4.61) 4.2331 (4.93) 4.2567 (4.93) 4.0975 (5.06) 4.6687 (4.6)
test_benchmark_implementations[dynamo_optimized-32x16-bert-base-uncased] 14.4568 (1.42) 14.4705 (1.42) 14.3647 (1.43) 14.5621 (1.41) 14.937 (1.4) 15.0749 (1.39) 14.7323 (1.41) 15.496 (1.39)
test_benchmark_implementations[dynamo_optimized-32x16-sentence-transformers/all-MiniLM-L6-v2] 7.5889 (2.7) 7.6036 (2.7) 7.5305 (2.72) 7.7322 (2.66) 7.9664 (2.62) 8.0316 (2.61) 7.8605 (2.64) 8.4617 (2.54)
test_benchmark_implementations[dynamo_optimized-32x16-t5-small] 20.4933 (1.0) 20.5025 (1.0) 20.4718 (1.0) 20.5476 (1.0) 20.8647 (1.0) 21.0003 (1.0) 20.7213 (1.0) 21.4716 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16-bert-base-uncased] 2.5938 (7.9) 2.4931 (8.22) 2.3726 (8.63) 2.5989 (7.91) 2.3739 (8.79) 2.3617 (8.89) 2.3197 (8.93) 2.4097 (8.91)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16-sentence-transformers/all-MiniLM-L6-v2] 0.7096 (28.88) 0.7103 (28.86) 0.7076 (28.93) 0.7178 (28.62) 0.6732 (30.99) 0.6749 (31.12) 0.6704 (30.91) 0.771 (27.85)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16-t5-small] 2.0572 (9.96) 2.0577 (9.96) 2.0552 (9.96) 2.0603 (9.97) 1.8831 (11.08) 1.8862 (11.13) 1.8638 (11.12) 1.9654 (10.92)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x16-bert-base-uncased] 2.603 (7.87) 2.9186 (7.02) 2.3747 (8.62) 3.7663 (5.46) 2.3915 (8.72) 2.522 (8.33) 2.3258 (8.91) 3.1608 (6.79)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x16-sentence-transformers/all-MiniLM-L6-v2] 0.6656 (30.79) 0.6941 (29.54) 0.6636 (30.85) 1.3199 (15.57) 0.6755 (30.89) 0.6917 (30.36) 0.671 (30.88) 1.1886 (18.06)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x16-t5-small] 2.0613 (9.94) 2.0225 (10.14) 1.876 (10.91) 2.4381 (8.43) 1.9003 (10.98) 1.9104 (10.99) 1.8736 (11.06) 2.3291 (9.22)
test_benchmark_implementations[onnx-32x16-bert-base-uncased] 5.718 (3.58) 5.8492 (3.51) 5.6607 (3.62) 6.2648 (3.28) 5.6611 (3.69) 5.6351 (3.73) 5.5059 (3.76) 5.9045 (3.64)
test_benchmark_implementations[onnx_optim_fp16-32x16-bert-base-uncased] 4.0346 (5.08) 4.0609 (5.05) 3.6199 (5.66) 4.9091 (4.19) 3.317 (6.29) 3.3535 (6.26) 3.2826 (6.31) 3.7392 (5.74)
test_benchmark_implementations[onnx_optim_fp32-32x16-bert-base-uncased] 5.7446 (3.57) 5.9467 (3.45) 5.6535 (3.62) 7.0769 (2.9) 5.6742 (3.68) 5.6544 (3.71) 5.5423 (3.74) 5.886 (3.65)
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 256)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
----------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-32x256-bert-base-uncased] 43.8989 (1.72) 43.9153 (1.72) 43.8989 (1.72) 43.9316 (1.72) 44.0885 (1.73) 44.9013 (1.7) 44.0885 (1.73) 45.714 (1.67)
test_benchmark_implementations[baseline-32x256-sentence-transformers/all-MiniLM-L6-v2] 11.9368 (6.33) 11.95 (6.32) 11.9265 (6.34) 12.0136 (6.29) 12.0605 (6.33) 11.9829 (6.37) 11.6743 (6.54) 12.1666 (6.27)
test_benchmark_implementations[baseline-32x256-t5-small] 38.7185 (1.95) 39.3805 (1.92) 38.7185 (1.95) 40.0425 (1.89) 38.3135 (1.99) 38.6135 (1.98) 38.3135 (1.99) 38.9135 (1.96)
test_benchmark_implementations[dynamo-32x256-bert-base-uncased] 44.1713 (1.71) 44.1718 (1.71) 44.1713 (1.71) 44.1723 (1.71) 44.2216 (1.73) 44.7548 (1.7) 44.2216 (1.73) 45.2881 (1.68)
test_benchmark_implementations[dynamo-32x256-sentence-transformers/all-MiniLM-L6-v2] 12.1252 (6.23) 12.1284 (6.23) 12.1201 (6.24) 12.1405 (6.23) 11.9394 (6.39) 11.9082 (6.41) 11.6552 (6.55) 12.0576 (6.33)
test_benchmark_implementations[dynamo-32x256-t5-small] 38.826 (1.95) 38.8352 (1.95) 38.826 (1.95) 38.8444 (1.95) 38.9082 (1.96) 39.5505 (1.93) 38.9082 (1.96) 40.1928 (1.9)
test_benchmark_implementations[dynamo_cuda_graphs-32x256-bert-base-uncased] 43.777 (1.73) 43.7857 (1.73) 43.777 (1.73) 43.7944 (1.73) 43.785 (1.74) 44.0495 (1.73) 43.785 (1.74) 44.314 (1.72)
test_benchmark_implementations[dynamo_cuda_graphs-32x256-sentence-transformers/all-MiniLM-L6-v2] 11.7975 (6.41) 11.8012 (6.4) 11.7862 (6.41) 11.8282 (6.39) 11.8569 (6.43) 11.7862 (6.47) 11.4683 (6.65) 11.9419 (6.39)
test_benchmark_implementations[dynamo_cuda_graphs-32x256-t5-small] 38.8639 (1.94) 38.8797 (1.94) 38.8639 (1.94) 38.8956 (1.94) 37.8844 (2.01) 38.4157 (1.99) 37.8844 (2.01) 38.9469 (1.96)
test_benchmark_implementations[dynamo_no_dropout-32x256-bert-base-uncased] 44.1754 (1.71) 44.18 (1.71) 44.1754 (1.71) 44.1846 (1.71) 44.6561 (1.71) 58.9414 (1.29) 44.6561 (1.71) 73.2268 (1.04)
test_benchmark_implementations[dynamo_no_dropout-32x256-sentence-transformers/all-MiniLM-L6-v2] 12.1272 (6.23) 12.1317 (6.23) 12.119 (6.24) 12.1498 (6.22) 11.8996 (6.41) 11.9185 (6.4) 11.6985 (6.52) 12.0558 (6.33)
test_benchmark_implementations[dynamo_no_dropout-32x256-t5-small] 38.8813 (1.94) 39.5628 (1.91) 38.8813 (1.94) 40.2442 (1.88) 39.0476 (1.95) 39.3913 (1.94) 39.0476 (1.95) 39.7349 (1.92)
test_benchmark_implementations[dynamo_nvfuser_ofi-32x256-bert-base-uncased] 36.1165 (2.09) 36.1262 (2.09) 36.1165 (2.09) 36.1359 (2.09) 34.5393 (2.21) 35.392 (2.16) 34.5393 (2.21) 36.2447 (2.1)
test_benchmark_implementations[dynamo_optimized-32x256-bert-base-uncased] 28.6556 (2.64) 28.6587 (2.64) 28.6546 (2.64) 28.6659 (2.64) 27.053 (2.82) 27.1582 (2.81) 26.309 (2.9) 28.1126 (2.71)
test_benchmark_implementations[dynamo_optimized-32x256-sentence-transformers/all-MiniLM-L6-v2] 10.1489 (7.45) 10.1628 (7.44) 10.1386 (7.45) 10.1898 (7.42) 10.3971 (7.34) 10.3376 (7.38) 9.9587 (7.66) 10.4419 (7.31)
test_benchmark_implementations[dynamo_optimized-32x256-t5-small] 34.1545 (2.21) 34.1647 (2.21) 34.1545 (2.21) 34.175 (2.21) 34.2774 (2.23) 34.4148 (2.22) 34.2774 (2.23) 34.5523 (2.21)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256-bert-base-uncased] 27.2364 (2.78) 27.2667 (2.77) 27.2087 (2.78) 27.3551 (2.76) 27.3652 (2.79) 26.7719 (2.85) 25.4413 (3.0) 27.5092 (2.77)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256-sentence-transformers/all-MiniLM-L6-v2] 9.7085 (7.79) 9.7077 (7.79) 9.7034 (7.79) 9.7106 (7.78) 9.5923 (7.95) 9.5918 (7.95) 9.3268 (8.18) 9.7865 (7.8)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256-t5-small] 33.3466 (2.27) 33.3496 (2.27) 33.3466 (2.27) 33.3527 (2.27) 32.9591 (2.31) 33.1959 (2.3) 32.9591 (2.31) 33.4327 (2.28)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x256-bert-base-uncased] 28.3853 (2.66) 28.3856 (2.66) 28.3812 (2.66) 28.3904 (2.66) 27.9692 (2.73) 27.7448 (2.75) 26.5484 (2.87) 28.7167 (2.66)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x256-sentence-transformers/all-MiniLM-L6-v2] 9.8796 (7.65) 9.8846 (7.65) 9.8693 (7.66) 9.9133 (7.62) 9.8898 (7.71) 9.7855 (7.8) 9.5171 (8.02) 9.9586 (7.66)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x256-t5-small] 30.9391 (2.44) 30.9391 (2.44) 30.9289 (2.44) 30.9494 (2.44) 30.9829 (2.46) 30.8543 (2.47) 30.3163 (2.52) 31.2637 (2.44)
test_benchmark_implementations[onnx-32x256-bert-base-uncased] 75.5272 (1.0) 75.5272 (1.0) 75.5272 (1.0) 75.5272 (1.0) 76.2924 (1.0) 76.2924 (1.0) 76.2924 (1.0) 76.2924 (1.0)
test_benchmark_implementations[onnx_optim_fp16-32x256-bert-base-uncased] 36.7002 (2.06) 38.138 (1.98) 36.7002 (2.06) 39.5759 (1.91) 35.0286 (2.18) 35.7077 (2.14) 35.0286 (2.18) 36.3868 (2.1)
test_benchmark_implementations[onnx_optim_fp32-32x256-bert-base-uncased] 75.5815 (1.0) 75.5815 (1.0) 75.5815 (1.0) 75.5815 (1.0) 75.7002 (1.01) 75.7002 (1.01) 75.7002 (1.01) 75.7002 (1.01)
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 32)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
---------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-32x32-bert-base-uncased] 8.318 (3.03) 8.3861 (3.07) 8.2217 (3.04) 8.9518 (3.03) 8.6713 (2.96) 8.7514 (2.97) 8.5516 (2.86) 9.6215 (2.89)
test_benchmark_implementations[baseline-32x32-sentence-transformers/all-MiniLM-L6-v2] 4.3563 (5.79) 4.5901 (5.62) 3.8052 (6.56) 6.6755 (4.07) 4.6476 (5.52) 4.6881 (5.54) 4.6157 (5.29) 5.3856 (5.17)
test_benchmark_implementations[baseline-32x32-t5-small] 13.2076 (1.91) 13.4256 (1.92) 12.8911 (1.94) 15.0006 (1.81) 17.6551 (1.45) 17.6652 (1.47) 14.2071 (1.72) 20.7045 (1.35)
test_benchmark_implementations[dynamo-32x32-bert-base-uncased] 6.7973 (3.71) 6.9412 (3.71) 6.5833 (3.79) 7.4977 (3.62) 7.5334 (3.41) 7.4935 (3.47) 7.0981 (3.44) 7.9453 (3.51)
test_benchmark_implementations[dynamo-32x32-sentence-transformers/all-MiniLM-L6-v2] 3.5197 (7.17) 3.5462 (7.27) 3.3772 (7.39) 3.7745 (7.2) 3.9039 (6.57) 3.9479 (6.58) 3.7399 (6.53) 4.547 (6.13)
test_benchmark_implementations[dynamo-32x32-t5-small] 11.5804 (2.18) 11.6471 (2.21) 11.3123 (2.21) 11.9941 (2.26) 11.7353 (2.19) 11.9069 (2.18) 11.5649 (2.11) 13.1716 (2.11)
test_benchmark_implementations[dynamo_cuda_graphs-32x32-bert-base-uncased] 6.2792 (4.02) 6.1798 (4.17) 5.7436 (4.35) 6.2843 (4.32) 5.8772 (4.37) 5.8711 (4.43) 5.6244 (4.34) 6.3794 (4.37)
test_benchmark_implementations[dynamo_cuda_graphs-32x32-sentence-transformers/all-MiniLM-L6-v2] 1.2687 (19.88) 1.2685 (20.33) 1.2657 (19.72) 1.2739 (21.32) 1.1954 (21.46) 1.2193 (21.31) 1.1768 (20.76) 1.4892 (18.7)
test_benchmark_implementations[dynamo_cuda_graphs-32x32-t5-small] 4.2322 (5.96) 4.196 (6.14) 3.9004 (6.4) 4.2363 (6.41) 3.8989 (6.58) 4.0423 (6.43) 3.821 (6.39) 4.4923 (6.2)
test_benchmark_implementations[dynamo_no_dropout-32x32-bert-base-uncased] 6.57 (3.84) 6.4922 (3.97) 5.9689 (4.18) 6.8342 (3.97) 6.92 (3.71) 7.0632 (3.68) 6.8101 (3.59) 8.0956 (3.44)
test_benchmark_implementations[dynamo_no_dropout-32x32-sentence-transformers/all-MiniLM-L6-v2] 3.2974 (7.65) 3.3093 (7.79) 3.2494 (7.68) 3.4437 (7.89) 3.6381 (7.05) 3.6652 (7.09) 3.5759 (6.83) 4.0713 (6.84)
test_benchmark_implementations[dynamo_no_dropout-32x32-t5-small] 10.7336 (2.35) 10.7423 (2.4) 10.6424 (2.35) 10.8104 (2.51) 11.035 (2.33) 11.1114 (2.34) 10.9449 (2.23) 11.5263 (2.42)
test_benchmark_implementations[dynamo_nvfuser_ofi-32x32-bert-base-uncased] 5.5591 (4.54) 5.559 (4.64) 5.5542 (4.49) 5.5656 (4.88) 5.1721 (4.96) 5.1773 (5.02) 5.1374 (4.76) 5.3707 (5.19)
test_benchmark_implementations[dynamo_optimized-32x32-bert-base-uncased] 15.8034 (1.6) 16.0256 (1.61) 15.7696 (1.58) 16.9861 (1.6) 16.9422 (1.51) 17.0062 (1.53) 16.772 (1.46) 17.4459 (1.6)
test_benchmark_implementations[dynamo_optimized-32x32-sentence-transformers/all-MiniLM-L6-v2] 8.7665 (2.88) 8.7274 (2.95) 8.1459 (3.06) 9.3266 (2.91) 9.0715 (2.83) 9.1803 (2.83) 8.6944 (2.81) 10.191 (2.73)
test_benchmark_implementations[dynamo_optimized-32x32-t5-small] 25.223 (1.0) 25.7836 (1.0) 24.9641 (1.0) 27.1636 (1.0) 25.66 (1.0) 25.983 (1.0) 24.4351 (1.0) 27.8538 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x32-bert-base-uncased] 4.6356 (5.44) 4.6359 (5.56) 4.6316 (5.39) 4.6408 (5.85) 4.4108 (5.82) 4.3622 (5.96) 4.1856 (5.84) 4.7117 (5.91)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x32-sentence-transformers/all-MiniLM-L6-v2] 1.0209 (24.71) 0.9912 (26.01) 0.9349 (26.7) 1.026 (26.47) 0.9732 (26.37) 0.9721 (26.73) 0.9546 (25.6) 1.052 (26.48)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x32-t5-small] 3.4007 (7.42) 3.4007 (7.58) 3.3966 (7.35) 3.4099 (7.97) 3.1713 (8.09) 3.1506 (8.25) 3.0793 (7.94) 3.1925 (8.72)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x32-bert-base-uncased] 4.6418 (5.43) 4.6423 (5.55) 4.6397 (5.38) 4.6459 (5.85) 4.3077 (5.96) 4.3155 (6.02) 4.1212 (5.93) 4.7883 (5.82)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x32-sentence-transformers/all-MiniLM-L6-v2] 1.025 (24.61) 1.0171 (25.35) 0.94 (26.56) 1.0332 (26.29) 0.9681 (26.5) 0.9678 (26.85) 0.9531 (25.64) 1.0507 (26.51)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x32-t5-small] 3.3997 (7.42) 3.3993 (7.59) 3.3935 (7.36) 3.4017 (7.99) 3.1266 (8.21) 3.1117 (8.35) 3.0536 (8.0) 3.1578 (8.82)
test_benchmark_implementations[onnx-32x32-bert-base-uncased] 12.1283 (2.08) 11.9173 (2.16) 11.2282 (2.22) 12.1641 (2.23) 11.0919 (2.31) 11.2471 (2.31) 11.0335 (2.21) 11.8642 (2.35)
test_benchmark_implementations[onnx_optim_fp16-32x32-bert-base-uncased] 6.2607 (4.03) 6.2683 (4.11) 6.2525 (3.99) 6.3244 (4.3) 5.6954 (4.51) 5.7099 (4.55) 5.6174 (4.35) 5.9653 (4.67)
test_benchmark_implementations[onnx_optim_fp32-32x32-bert-base-uncased] 12.0381 (2.1) 12.1289 (2.13) 11.4872 (2.17) 12.7601 (2.13) 10.9387 (2.35) 10.9769 (2.37) 10.7914 (2.26) 11.3731 (2.45)
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 33)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
---------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-32x33-bert-base-uncased] 7.9391 (2.61) 7.9654 (2.61) 7.8582 (2.63) 8.1449 (2.56) 8.2707 (2.54) 8.3515 (2.52) 8.1364 (2.56) 9.1183 (2.36)
test_benchmark_implementations[baseline-32x33-sentence-transformers/all-MiniLM-L6-v2] 4.1667 (4.97) 4.1697 (4.98) 4.055 (5.1) 4.3868 (4.76) 4.6014 (4.56) 4.6691 (4.51) 4.4708 (4.65) 5.3424 (4.03)
test_benchmark_implementations[baseline-32x33-t5-small] 12.9792 (1.6) 13.0531 (1.59) 12.8471 (1.61) 13.5711 (1.54) 13.1205 (1.6) 13.2024 (1.6) 12.8459 (1.62) 13.8136 (1.56)
test_benchmark_implementations[dynamo-32x33-bert-base-uncased] 6.9765 (2.97) 6.9768 (2.97) 6.9734 (2.96) 6.9837 (2.99) 7.3883 (2.84) 7.3851 (2.85) 7.0931 (2.93) 7.9083 (2.72)
test_benchmark_implementations[dynamo-32x33-sentence-transformers/all-MiniLM-L6-v2] 3.6854 (5.62) 3.6974 (5.61) 3.6652 (5.64) 3.8186 (5.47) 4.0408 (5.19) 4.0671 (5.18) 3.9996 (5.2) 4.4164 (4.88)
test_benchmark_implementations[dynamo-32x33-t5-small] 11.3664 (1.82) 11.4128 (1.82) 11.2906 (1.83) 11.5908 (1.8) 11.6923 (1.79) 11.8049 (1.79) 11.6673 (1.78) 12.3657 (1.74)
test_benchmark_implementations[dynamo_cuda_graphs-32x33-bert-base-uncased] 6.1112 (3.39) 6.1855 (3.36) 6.0754 (3.4) 6.6611 (3.13) 6.0649 (3.46) 6.0383 (3.49) 5.9497 (3.5) 6.0727 (3.55)
test_benchmark_implementations[dynamo_cuda_graphs-32x33-sentence-transformers/all-MiniLM-L6-v2] 1.4387 (14.4) 1.4381 (14.43) 1.4346 (14.41) 1.4408 (14.49) 1.3425 (15.62) 1.3409 (15.72) 1.3239 (15.72) 1.4131 (15.24)
test_benchmark_implementations[dynamo_cuda_graphs-32x33-t5-small] 4.6172 (4.49) 4.6785 (4.44) 4.3049 (4.8) 5.4149 (3.86) 4.2414 (4.94) 4.2678 (4.94) 4.1524 (5.01) 4.6688 (4.61)
test_benchmark_implementations[dynamo_no_dropout-32x33-bert-base-uncased] 6.9683 (2.97) 6.9692 (2.98) 6.9652 (2.97) 6.9765 (2.99) 6.739 (3.11) 6.8518 (3.08) 6.6609 (3.12) 7.5908 (2.84)
test_benchmark_implementations[dynamo_no_dropout-32x33-sentence-transformers/all-MiniLM-L6-v2] 3.533 (5.87) 3.5313 (5.88) 3.3126 (6.24) 3.922 (5.32) 3.8736 (5.41) 3.9135 (5.39) 3.6641 (5.68) 4.3592 (4.94)
test_benchmark_implementations[dynamo_no_dropout-32x33-t5-small] 11.1022 (1.87) 11.1308 (1.86) 11.0541 (1.87) 11.2077 (1.86) 11.3968 (1.84) 11.462 (1.84) 11.1947 (1.86) 12.069 (1.78)
test_benchmark_implementations[dynamo_nvfuser_ofi-32x33-bert-base-uncased] 6.0856 (3.41) 6.0854 (3.41) 6.0805 (3.4) 6.0897 (3.43) 5.6613 (3.7) 5.6507 (3.73) 5.5204 (3.77) 6.1203 (3.52)
test_benchmark_implementations[dynamo_optimized-32x33-bert-base-uncased] 14.378 (1.44) 14.4856 (1.43) 14.3494 (1.44) 14.7958 (1.41) 14.7295 (1.42) 14.8722 (1.42) 14.6572 (1.42) 15.2575 (1.41)
test_benchmark_implementations[dynamo_optimized-32x33-sentence-transformers/all-MiniLM-L6-v2] 7.7089 (2.69) 7.7434 (2.68) 7.6689 (2.7) 7.8264 (2.67) 8.0326 (2.61) 8.0712 (2.61) 7.9361 (2.62) 8.6762 (2.48)
test_benchmark_implementations[dynamo_optimized-32x33-t5-small] 20.7227 (1.0) 20.7548 (1.0) 20.6756 (1.0) 20.8745 (1.0) 20.9669 (1.0) 21.0806 (1.0) 20.8102 (1.0) 21.5374 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x33-bert-base-uncased] 4.649 (4.46) 4.6502 (4.46) 4.6459 (4.45) 4.6551 (4.48) 4.3411 (4.83) 4.2777 (4.93) 4.1309 (5.04) 4.3813 (4.92)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x33-sentence-transformers/all-MiniLM-L6-v2] 1.1807 (17.55) 1.1812 (17.57) 1.1786 (17.54) 1.1848 (17.62) 1.1131 (18.84) 1.1067 (19.05) 1.0855 (19.17) 1.1879 (18.13)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x33-t5-small] 3.454 (6.0) 3.4544 (6.01) 3.4519 (5.99) 3.457 (6.04) 3.2238 (6.5) 3.1918 (6.6) 3.1085 (6.69) 3.2375 (6.65)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x33-bert-base-uncased] 4.6797 (4.43) 4.9238 (4.22) 4.3407 (4.76) 5.8522 (3.57) 4.384 (4.78) 4.5645 (4.62) 4.1499 (5.01) 5.4244 (3.97)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x33-sentence-transformers/all-MiniLM-L6-v2] 1.1827 (17.52) 1.2674 (16.38) 1.1796 (17.53) 1.833 (11.39) 1.114 (18.82) 1.1178 (18.86) 1.0848 (19.18) 1.3429 (16.04)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x33-t5-small] 3.4662 (5.98) 3.4655 (5.99) 3.458 (5.98) 3.4724 (6.01) 3.2283 (6.49) 3.2041 (6.58) 3.1188 (6.67) 3.2473 (6.63)
test_benchmark_implementations[onnx-32x33-bert-base-uncased] 12.2227 (1.7) 12.2232 (1.7) 12.2073 (1.69) 12.2544 (1.7) 11.2614 (1.86) 11.2064 (1.88) 10.9628 (1.9) 11.3057 (1.91)
test_benchmark_implementations[onnx_optim_fp16-32x33-bert-base-uncased] 6.2362 (3.32) 6.3345 (3.28) 6.1748 (3.35) 6.7328 (3.1) 6.1569 (3.41) 6.1572 (3.42) 6.0005 (3.47) 6.411 (3.36)
test_benchmark_implementations[onnx_optim_fp32-32x33-bert-base-uncased] 12.2184 (1.7) 12.3203 (1.68) 12.2102 (1.69) 13.0181 (1.6) 11.2924 (1.86) 11.2386 (1.88) 10.9766 (1.9) 11.2972 (1.91)
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 128)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
---------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-8x128-bert-base-uncased] 8.1203 (2.54) 8.1921 (2.54) 7.8879 (2.61) 8.8381 (2.38) 8.2766 (2.5) 8.4215 (2.52) 8.2159 (2.52) 9.2964 (2.45)
test_benchmark_implementations[baseline-8x128-sentence-transformers/all-MiniLM-L6-v2] 4.137 (4.99) 4.1388 (5.02) 4.0591 (5.07) 4.3092 (4.89) 4.444 (4.66) 4.4867 (4.73) 4.3781 (4.72) 5.1076 (4.45)
test_benchmark_implementations[baseline-8x128-t5-small] 13.356 (1.55) 13.4459 (1.54) 12.93 (1.59) 14.3576 (1.47) 12.9359 (1.6) 13.2989 (1.6) 12.83 (1.61) 15.2652 (1.49)
test_benchmark_implementations[dynamo-8x128-bert-base-uncased] 7.4363 (2.78) 7.4624 (2.78) 7.1414 (2.88) 7.9821 (2.64) 7.522 (2.75) 7.7256 (2.75) 7.3428 (2.82) 8.9776 (2.53)
test_benchmark_implementations[dynamo-8x128-sentence-transformers/all-MiniLM-L6-v2] 3.5717 (5.78) 3.5895 (5.79) 3.5013 (5.88) 3.7069 (5.68) 3.8943 (5.32) 3.9003 (5.44) 3.7813 (5.47) 4.2609 (5.34)
test_benchmark_implementations[dynamo-8x128-t5-small] 11.4719 (1.8) 11.4641 (1.81) 11.305 (1.82) 11.6306 (1.81) 11.7661 (1.76) 11.965 (1.77) 11.6804 (1.77) 12.9159 (1.76)
test_benchmark_implementations[dynamo_cuda_graphs-8x128-bert-base-uncased] 6.8219 (3.03) 6.8443 (3.03) 6.8188 (3.02) 7.1229 (2.96) 6.2375 (3.32) 6.2265 (3.41) 6.1514 (3.36) 6.2684 (3.63)
test_benchmark_implementations[dynamo_cuda_graphs-8x128-sentence-transformers/all-MiniLM-L6-v2] 1.4193 (14.54) 1.4608 (14.22) 1.4049 (14.66) 1.5206 (13.84) 1.4519 (14.27) 1.4546 (14.59) 1.4198 (14.56) 1.713 (13.28)
test_benchmark_implementations[dynamo_cuda_graphs-8x128-t5-small] 4.7987 (4.3) 4.8896 (4.25) 4.7964 (4.29) 5.7682 (3.65) 4.4247 (4.68) 4.4865 (4.73) 4.3511 (4.75) 4.9224 (4.62)
test_benchmark_implementations[dynamo_no_dropout-8x128-bert-base-uncased] 7.1619 (2.88) 7.1627 (2.9) 7.1557 (2.88) 7.1731 (2.93) 6.8799 (3.01) 6.9156 (3.07) 6.8199 (3.03) 7.2595 (3.13)
test_benchmark_implementations[dynamo_no_dropout-8x128-sentence-transformers/all-MiniLM-L6-v2] 3.3219 (6.21) 3.3435 (6.21) 3.2667 (6.31) 3.5164 (5.99) 3.664 (5.65) 3.6773 (5.77) 3.5991 (5.75) 4.0736 (5.58)
test_benchmark_implementations[dynamo_no_dropout-8x128-t5-small] 10.8165 (1.91) 10.8773 (1.91) 10.7531 (1.92) 11.092 (1.9) 11.0337 (1.88) 11.2074 (1.89) 10.9651 (1.89) 12.0296 (1.89)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x128-bert-base-uncased] 5.8419 (3.53) 5.9064 (3.52) 5.3985 (3.82) 6.8055 (3.09) 5.5713 (3.72) 5.7365 (3.7) 5.3772 (3.85) 6.6491 (3.42)
test_benchmark_implementations[dynamo_optimized-8x128-bert-base-uncased] 14.2602 (1.45) 14.3465 (1.45) 14.2397 (1.45) 14.5295 (1.45) 14.6202 (1.42) 14.7196 (1.44) 14.5206 (1.42) 15.2607 (1.49)
test_benchmark_implementations[dynamo_optimized-8x128-sentence-transformers/all-MiniLM-L6-v2] 7.5756 (2.72) 7.5906 (2.74) 7.5316 (2.73) 7.6564 (2.75) 7.9116 (2.62) 7.9924 (2.65) 7.8833 (2.62) 8.4753 (2.68)
test_benchmark_implementations[dynamo_optimized-8x128-t5-small] 20.6418 (1.0) 20.768 (1.0) 20.5969 (1.0) 21.0504 (1.0) 20.7112 (1.0) 21.2185 (1.0) 20.6785 (1.0) 22.7448 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128-bert-base-uncased] 4.8148 (4.29) 4.8148 (4.31) 4.8118 (4.28) 4.8189 (4.37) 4.4394 (4.67) 4.3925 (4.83) 4.2787 (4.83) 4.4503 (5.11)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128-sentence-transformers/all-MiniLM-L6-v2] 1.2646 (16.32) 1.2649 (16.42) 1.2616 (16.33) 1.2677 (16.61) 1.2102 (17.11) 1.2082 (17.56) 1.1909 (17.36) 1.289 (17.65)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128-t5-small] 4.0622 (5.08) 4.0625 (5.11) 4.0591 (5.07) 4.0673 (5.18) 3.782 (5.48) 3.7692 (5.63) 3.716 (5.56) 3.8102 (5.97)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x128-bert-base-uncased] 4.9582 (4.16) 4.9095 (4.23) 4.6244 (4.45) 4.9623 (4.24) 4.5776 (4.52) 4.5196 (4.69) 4.4196 (4.68) 4.587 (4.96)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x128-sentence-transformers/all-MiniLM-L6-v2] 1.2759 (16.18) 1.2756 (16.28) 1.2728 (16.18) 1.279 (16.46) 1.2191 (16.99) 1.2175 (17.43) 1.199 (17.25) 1.298 (17.52)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x128-t5-small] 4.0038 (5.16) 3.9597 (5.24) 3.7468 (5.5) 4.0131 (5.25) 3.722 (5.56) 3.7044 (5.73) 3.6494 (5.67) 3.7499 (6.07)
test_benchmark_implementations[onnx-8x128-bert-base-uncased] 12.0638 (1.71) 12.3603 (1.68) 11.4914 (1.79) 13.5772 (1.55) 11.147 (1.86) 11.2763 (1.88) 10.8386 (1.91) 11.9962 (1.9)
test_benchmark_implementations[onnx_optim_fp16-8x128-bert-base-uncased] 6.0877 (3.39) 6.2458 (3.33) 6.0283 (3.42) 6.5434 (3.22) 6.0125 (3.44) 6.0374 (3.51) 5.9049 (3.5) 6.259 (3.63)
test_benchmark_implementations[onnx_optim_fp32-8x128-bert-base-uncased] 12.1438 (1.7) 12.5041 (1.66) 12.0809 (1.7) 13.5057 (1.56) 11.1081 (1.86) 11.1031 (1.91) 10.948 (1.89) 11.3194 (2.01)
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 16)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
--------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-8x16-bert-base-uncased] 8.7224 (2.33) 8.718 (2.34) 8.404 (2.41) 8.9489 (2.3) 8.8036 (2.35) 8.7788 (2.39) 8.456 (2.45) 9.1029 (2.37)
test_benchmark_implementations[baseline-8x16-sentence-transformers/all-MiniLM-L6-v2] 4.565 (4.45) 4.5537 (4.48) 4.2424 (4.78) 4.779 (4.32) 4.8182 (4.3) 5.0353 (4.16) 4.4338 (4.66) 7.6707 (2.81)
test_benchmark_implementations[baseline-8x16-t5-small] 14.3473 (1.42) 14.3997 (1.42) 14.2295 (1.43) 14.5941 (1.41) 14.7764 (1.4) 15.1517 (1.38) 14.716 (1.41) 16.7848 (1.28)
test_benchmark_implementations[dynamo-8x16-bert-base-uncased] 7.1987 (2.82) 7.1913 (2.84) 6.8987 (2.94) 7.3953 (2.79) 7.6135 (2.72) 7.6507 (2.74) 7.4793 (2.77) 8.1916 (2.63)
test_benchmark_implementations[dynamo-8x16-sentence-transformers/all-MiniLM-L6-v2] 3.7687 (5.39) 3.7786 (5.4) 3.6424 (5.57) 3.9138 (5.27) 3.9267 (5.28) 3.9399 (5.32) 3.7995 (5.44) 4.361 (4.94)
test_benchmark_implementations[dynamo-8x16-t5-small] 12.5153 (1.62) 12.5816 (1.62) 12.4129 (1.63) 12.9044 (1.6) 12.8798 (1.61) 13.0716 (1.6) 12.5823 (1.64) 14.1337 (1.53)
test_benchmark_implementations[dynamo_cuda_graphs-8x16-bert-base-uncased] 1.8063 (11.25) 1.8214 (11.21) 1.8033 (11.25) 2.2733 (9.07) 1.6404 (12.63) 1.6599 (12.62) 1.6358 (12.64) 1.9514 (11.05)
test_benchmark_implementations[dynamo_cuda_graphs-8x16-sentence-transformers/all-MiniLM-L6-v2] 0.6112 (33.25) 0.6371 (32.04) 0.6083 (33.36) 1.2943 (15.93) 0.6269 (33.06) 0.6447 (32.49) 0.6207 (33.32) 0.9829 (21.94)
test_benchmark_implementations[dynamo_cuda_graphs-8x16-t5-small] 1.7664 (11.51) 1.7932 (11.38) 1.7635 (11.51) 2.2415 (9.2) 1.6102 (12.87) 1.6216 (12.92) 1.6061 (12.88) 1.8734 (11.51)
test_benchmark_implementations[dynamo_no_dropout-8x16-bert-base-uncased] 6.8354 (2.97) 6.8532 (2.98) 6.6929 (3.03) 7.0113 (2.94) 7.1375 (2.9) 7.1628 (2.92) 7.0734 (2.92) 7.4398 (2.9)
test_benchmark_implementations[dynamo_no_dropout-8x16-sentence-transformers/all-MiniLM-L6-v2] 3.4007 (5.98) 4.0904 (4.99) 3.2236 (6.3) 5.718 (3.61) 3.7136 (5.58) 3.7337 (5.61) 3.5135 (5.89) 4.0979 (5.26)
test_benchmark_implementations[dynamo_no_dropout-8x16-t5-small] 11.7873 (1.72) 11.8327 (1.73) 11.7268 (1.73) 12.0095 (1.72) 12.0854 (1.71) 12.2367 (1.71) 11.9702 (1.73) 12.9334 (1.67)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x16-bert-base-uncased] 3.8023 (5.35) 3.8162 (5.35) 3.6887 (5.5) 4.3407 (4.75) 4.1225 (5.03) 4.1361 (5.06) 3.9613 (5.22) 4.485 (4.81)
test_benchmark_implementations[dynamo_optimized-8x16-bert-base-uncased] 14.6123 (1.39) 14.6671 (1.39) 14.4056 (1.41) 14.9719 (1.38) 14.962 (1.38) 14.9677 (1.4) 14.765 (1.4) 15.3162 (1.41)
test_benchmark_implementations[dynamo_optimized-8x16-sentence-transformers/all-MiniLM-L6-v2] 7.51 (2.71) 7.5329 (2.71) 7.4373 (2.73) 7.7158 (2.67) 7.8431 (2.64) 7.8966 (2.65) 7.7832 (2.66) 8.3786 (2.57)
test_benchmark_implementations[dynamo_optimized-8x16-t5-small] 20.3244 (1.0) 20.4122 (1.0) 20.2926 (1.0) 20.6244 (1.0) 20.7217 (1.0) 20.9481 (1.0) 20.6834 (1.0) 21.5624 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16-bert-base-uncased] 1.3107 (15.51) 1.3905 (14.68) 1.2974 (15.64) 1.4879 (13.86) 1.3557 (15.29) 1.3583 (15.42) 1.3521 (15.3) 1.4497 (14.87)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16-sentence-transformers/all-MiniLM-L6-v2] 0.4751 (42.78) 0.4754 (42.94) 0.4741 (42.8) 0.4782 (43.13) 0.4719 (43.91) 0.4737 (44.22) 0.4695 (44.06) 0.5652 (38.15)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16-t5-small] 1.3916 (14.6) 1.3916 (14.67) 1.3896 (14.6) 1.3937 (14.8) 1.2807 (16.18) 1.2827 (16.33) 1.2785 (16.18) 1.3723 (15.71)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x16-bert-base-uncased] 1.5032 (13.52) 1.5038 (13.57) 1.5012 (13.52) 1.5073 (13.68) 1.3748 (15.07) 1.3766 (15.22) 1.3705 (15.09) 1.4674 (14.69)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x16-sentence-transformers/all-MiniLM-L6-v2] 0.4792 (42.41) 0.4688 (43.54) 0.4178 (48.57) 0.4813 (42.85) 0.4752 (43.61) 0.4768 (43.93) 0.4732 (43.71) 0.569 (37.89)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x16-t5-small] 1.3967 (14.55) 1.3966 (14.62) 1.3947 (14.55) 1.3988 (14.74) 1.285 (16.13) 1.2873 (16.27) 1.2821 (16.13) 1.383 (15.59)
test_benchmark_implementations[onnx-8x16-bert-base-uncased] 2.9376 (6.92) 3.0992 (6.59) 2.8931 (7.01) 3.966 (5.2) 2.9486 (7.03) 2.9634 (7.07) 2.9395 (7.04) 3.281 (6.57)
test_benchmark_implementations[onnx_optim_fp16-8x16-bert-base-uncased] 2.9757 (6.83) 3.0455 (6.7) 2.7936 (7.26) 3.5707 (5.78) 3.0224 (6.86) 3.0868 (6.79) 2.8295 (7.31) 3.8465 (5.61)
test_benchmark_implementations[onnx_optim_fp32-8x16-bert-base-uncased] 2.9368 (6.92) 3.0415 (6.71) 2.9075 (6.98) 3.2737 (6.3) 2.9635 (6.99) 2.9903 (7.01) 2.952 (7.01) 3.327 (6.48)
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 256)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
---------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-8x256-bert-base-uncased] 13.6847 (1.65) 13.8643 (1.63) 13.3089 (1.7) 14.4476 (1.59) 13.4402 (1.7) 13.644 (1.68) 13.3807 (1.67) 14.4749 (1.64)
test_benchmark_implementations[baseline-8x256-sentence-transformers/all-MiniLM-L6-v2] 4.0038 (5.66) 4.0237 (5.63) 3.9238 (5.77) 4.1677 (5.5) 4.418 (5.16) 4.4704 (5.12) 4.2906 (5.2) 5.0355 (4.73)
test_benchmark_implementations[baseline-8x256-t5-small] 13.388 (1.69) 13.3952 (1.69) 13.1994 (1.72) 13.5334 (1.69) 13.6238 (1.67) 13.7468 (1.67) 13.4358 (1.66) 14.3929 (1.65)
test_benchmark_implementations[dynamo-8x256-bert-base-uncased] 14.4138 (1.57) 14.3384 (1.58) 13.9315 (1.63) 14.4271 (1.59) 13.4756 (1.69) 13.4557 (1.7) 13.3306 (1.67) 13.5068 (1.76)
test_benchmark_implementations[dynamo-8x256-sentence-transformers/all-MiniLM-L6-v2] 3.8441 (5.89) 3.8464 (5.89) 3.841 (5.89) 3.8636 (5.94) 3.8452 (5.93) 3.8556 (5.94) 3.8107 (5.86) 4.0681 (5.85)
test_benchmark_implementations[dynamo-8x256-t5-small] 11.891 (1.9) 11.8879 (1.91) 11.7198 (1.93) 12.0156 (1.91) 12.1549 (1.88) 12.2303 (1.87) 11.9982 (1.86) 12.6792 (1.88)
test_benchmark_implementations[dynamo_cuda_graphs-8x256-bert-base-uncased] 14.079 (1.61) 14.0268 (1.62) 13.4543 (1.68) 14.3749 (1.6) 13.0296 (1.75) 13.3418 (1.72) 12.8918 (1.73) 14.1746 (1.68)
test_benchmark_implementations[dynamo_cuda_graphs-8x256-sentence-transformers/all-MiniLM-L6-v2] 3.5973 (6.29) 3.6112 (6.28) 3.5041 (6.46) 3.7079 (6.19) 3.5405 (6.44) 3.57 (6.42) 3.4835 (6.41) 4.1018 (5.8)
test_benchmark_implementations[dynamo_cuda_graphs-8x256-t5-small] 10.881 (2.08) 10.8159 (2.1) 10.454 (2.17) 11.0469 (2.08) 10.8309 (2.11) 10.8968 (2.1) 10.1213 (2.21) 11.6755 (2.04)
test_benchmark_implementations[dynamo_no_dropout-8x256-bert-base-uncased] 13.5404 (1.67) 13.8593 (1.64) 13.4615 (1.68) 14.4343 (1.59) 13.5687 (1.68) 13.5402 (1.69) 13.435 (1.66) 13.615 (1.75)
test_benchmark_implementations[dynamo_no_dropout-8x256-sentence-transformers/all-MiniLM-L6-v2] 3.842 (5.89) 3.8406 (5.9) 3.7652 (6.01) 3.8564 (5.95) 3.8274 (5.96) 3.843 (5.96) 3.8148 (5.85) 4.062 (5.86)
test_benchmark_implementations[dynamo_no_dropout-8x256-t5-small] 12.1416 (1.87) 12.2602 (1.85) 11.8313 (1.91) 12.6351 (1.82) 11.6905 (1.95) 11.8176 (1.94) 11.5396 (1.93) 12.3948 (1.92)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x256-bert-base-uncased] 11.2394 (2.01) 11.6337 (1.95) 10.3997 (2.18) 15.1091 (1.52) 10.7301 (2.13) 11.0985 (2.06) 10.2936 (2.17) 11.8069 (2.02)
test_benchmark_implementations[dynamo_optimized-8x256-bert-base-uncased] 14.2684 (1.59) 14.3024 (1.58) 14.2336 (1.59) 14.3841 (1.59) 14.5641 (1.57) 14.6802 (1.56) 14.5321 (1.54) 15.0488 (1.58)
test_benchmark_implementations[dynamo_optimized-8x256-sentence-transformers/all-MiniLM-L6-v2] 7.6207 (2.97) 7.6133 (2.98) 7.5541 (3.0) 7.6792 (2.99) 7.9215 (2.88) 7.9443 (2.88) 7.8443 (2.85) 8.357 (2.85)
test_benchmark_implementations[dynamo_optimized-8x256-t5-small] 20.8108 (1.09) 20.878 (1.09) 20.794 (1.09) 21.0186 (1.09) 21.1782 (1.08) 21.3574 (1.07) 21.1219 (1.06) 21.8378 (1.09)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256-bert-base-uncased] 8.1285 (2.79) 8.131 (2.79) 8.1234 (2.79) 8.1439 (2.82) 7.7371 (2.95) 7.7023 (2.97) 7.3249 (3.05) 7.8931 (3.02)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256-sentence-transformers/all-MiniLM-L6-v2] 2.9235 (7.75) 2.9235 (7.75) 2.9194 (7.76) 2.9276 (7.83) 2.8704 (7.95) 2.8488 (8.04) 2.7876 (8.01) 2.8891 (8.24)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256-t5-small] 9.0317 (2.51) 8.8762 (2.55) 8.6149 (2.63) 9.0665 (2.53) 8.5792 (2.66) 8.5419 (2.68) 8.4395 (2.64) 8.6266 (2.76)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x256-bert-base-uncased] 8.3773 (2.7) 8.3766 (2.71) 8.3681 (2.71) 8.3825 (2.74) 8.113 (2.81) 7.9841 (2.87) 7.5708 (2.95) 8.6934 (2.74)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x256-sentence-transformers/all-MiniLM-L6-v2] 2.9696 (7.63) 2.9705 (7.63) 2.9655 (7.63) 2.9839 (7.69) 2.887 (7.9) 2.8677 (7.99) 2.8146 (7.93) 2.9235 (8.14)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x256-t5-small] 8.873 (2.55) 9.1498 (2.48) 8.5852 (2.64) 10.0649 (2.28) 8.1125 (2.81) 8.5707 (2.67) 8.0004 (2.79) 9.3767 (2.54)
test_benchmark_implementations[onnx-8x256-bert-base-uncased] 22.5005 (1.01) 22.6105 (1.0) 22.5004 (1.01) 22.9376 (1.0) 22.7464 (1.0) 22.908 (1.0) 22.227 (1.0) 23.8073 (1.0)
test_benchmark_implementations[onnx_optim_fp16-8x256-bert-base-uncased] 11.3633 (1.99) 11.672 (1.94) 11.3524 (1.99) 13.2784 (1.73) 11.3428 (2.01) 11.298 (2.03) 10.9969 (2.03) 11.3985 (2.09)
test_benchmark_implementations[onnx_optim_fp32-8x256-bert-base-uncased] 22.6447 (1.0) 22.665 (1.0) 22.6406 (1.0) 22.6888 (1.01) 22.8176 (1.0) 22.8726 (1.0) 22.3209 (1.0) 23.4825 (1.01)
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 33)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
--------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-8x33-bert-base-uncased] 7.9947 (2.59) 7.9786 (2.6) 7.8438 (2.63) 8.0671 (2.58) 8.3793 (2.5) 8.4398 (2.5) 8.2132 (2.55) 8.9543 (2.41)
test_benchmark_implementations[baseline-8x33-sentence-transformers/all-MiniLM-L6-v2] 4.2332 (4.89) 4.2425 (4.89) 4.1247 (5.01) 4.4278 (4.7) 4.5448 (4.61) 4.562 (4.63) 4.4476 (4.71) 5.1313 (4.2)
test_benchmark_implementations[baseline-8x33-t5-small] 14.1097 (1.47) 14.1911 (1.46) 14.0401 (1.47) 14.4538 (1.44) 14.387 (1.46) 14.5887 (1.45) 13.8695 (1.51) 15.8505 (1.36)
test_benchmark_implementations[dynamo-8x33-bert-base-uncased] 7.2182 (2.87) 8.1296 (2.55) 6.9775 (2.96) 11.778 (1.77) 7.5161 (2.79) 7.5627 (2.79) 7.1822 (2.92) 8.1118 (2.66)
test_benchmark_implementations[dynamo-8x33-sentence-transformers/all-MiniLM-L6-v2] 3.4949 (5.93) 3.5182 (5.9) 3.4058 (6.07) 3.7304 (5.58) 3.8358 (5.47) 3.8773 (5.44) 3.746 (5.59) 4.3868 (4.92)
test_benchmark_implementations[dynamo-8x33-t5-small] 12.5061 (1.66) 12.5413 (1.65) 12.4356 (1.66) 12.6556 (1.64) 13.0027 (1.61) 13.0229 (1.62) 12.7834 (1.64) 13.3064 (1.62)
test_benchmark_implementations[dynamo_cuda_graphs-8x33-bert-base-uncased] 2.2723 (9.12) 2.2435 (9.25) 2.0603 (10.03) 2.2784 (9.13) 2.0596 (10.18) 2.069 (10.2) 2.0401 (10.26) 2.3373 (9.23)
test_benchmark_implementations[dynamo_cuda_graphs-8x33-sentence-transformers/all-MiniLM-L6-v2] 0.7404 (27.98) 0.7407 (28.01) 0.7383 (27.99) 0.7455 (27.92) 0.7109 (29.49) 0.7331 (28.8) 0.7035 (29.76) 1.171 (18.42)
test_benchmark_implementations[dynamo_cuda_graphs-8x33-t5-small] 2.5313 (8.18) 2.428 (8.55) 2.2139 (9.33) 2.6737 (7.78) 2.2875 (9.17) 2.2899 (9.22) 2.2828 (9.17) 2.3705 (9.1)
test_benchmark_implementations[dynamo_no_dropout-8x33-bert-base-uncased] 6.6079 (3.14) 6.6007 (3.14) 6.4626 (3.2) 6.8393 (3.04) 6.9077 (3.04) 6.9296 (3.05) 6.8133 (3.07) 7.1822 (3.0)
test_benchmark_implementations[dynamo_no_dropout-8x33-sentence-transformers/all-MiniLM-L6-v2] 3.286 (6.31) 3.2915 (6.3) 3.2215 (6.41) 3.4304 (6.07) 3.6226 (5.79) 3.6388 (5.8) 3.5641 (5.88) 3.9248 (5.5)
test_benchmark_implementations[dynamo_no_dropout-8x33-t5-small] 11.5364 (1.8) 11.5661 (1.79) 11.4534 (1.8) 11.7617 (1.77) 11.7576 (1.78) 11.8652 (1.78) 11.6882 (1.79) 12.515 (1.72)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x33-bert-base-uncased] 3.8031 (5.45) 3.7956 (5.47) 3.6372 (5.68) 3.9352 (5.29) 4.1047 (5.11) 4.1173 (5.13) 3.9667 (5.28) 4.5495 (4.74)
test_benchmark_implementations[dynamo_optimized-8x33-bert-base-uncased] 14.6094 (1.42) 14.5993 (1.42) 14.4302 (1.43) 14.7282 (1.41) 14.7255 (1.42) 14.8756 (1.42) 14.6691 (1.43) 15.3102 (1.41)
test_benchmark_implementations[dynamo_optimized-8x33-sentence-transformers/all-MiniLM-L6-v2] 7.6042 (2.72) 7.6208 (2.72) 7.5018 (2.75) 7.7885 (2.67) 7.8886 (2.66) 7.9282 (2.66) 7.8061 (2.68) 8.3879 (2.57)
test_benchmark_implementations[dynamo_optimized-8x33-t5-small] 20.7186 (1.0) 20.7488 (1.0) 20.6643 (1.0) 20.8128 (1.0) 20.9668 (1.0) 21.1115 (1.0) 20.939 (1.0) 21.5692 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x33-bert-base-uncased] 1.8545 (11.17) 1.8336 (11.32) 1.7213 (12.0) 1.8586 (11.2) 1.7389 (12.06) 1.7244 (12.24) 1.6692 (12.54) 1.7698 (12.19)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x33-sentence-transformers/all-MiniLM-L6-v2] 0.6615 (31.32) 0.6448 (32.18) 0.5806 (35.59) 0.6717 (30.98) 0.6342 (33.06) 0.6358 (33.2) 0.6313 (33.17) 0.7258 (29.72)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x33-t5-small] 2.0224 (10.24) 2.0223 (10.26) 2.0204 (10.23) 2.0244 (10.28) 1.8329 (11.44) 1.836 (11.5) 1.8303 (11.44) 1.9283 (11.19)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x33-bert-base-uncased] 1.8627 (11.12) 1.8624 (11.14) 1.8586 (11.12) 1.8657 (11.16) 1.7455 (12.01) 1.728 (12.22) 1.6781 (12.48) 1.7733 (12.16)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x33-sentence-transformers/all-MiniLM-L6-v2] 0.6636 (31.22) 0.6633 (31.28) 0.5878 (35.16) 0.6656 (31.27) 0.6348 (33.03) 0.6365 (33.17) 0.632 (33.13) 0.7326 (29.44)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x33-t5-small] 2.0255 (10.23) 2.0255 (10.24) 2.0234 (10.21) 2.0275 (10.27) 1.8349 (11.43) 1.8374 (11.49) 1.8319 (11.43) 1.9282 (11.19)
test_benchmark_implementations[onnx-8x33-bert-base-uncased] 4.308 (4.81) 4.3083 (4.82) 4.2906 (4.82) 4.3346 (4.8) 3.96 (5.29) 3.9579 (5.33) 3.8867 (5.39) 4.2064 (5.13)
test_benchmark_implementations[onnx_optim_fp16-8x33-bert-base-uncased] 2.8897 (7.17) 2.9532 (7.03) 2.8488 (7.25) 3.5533 (5.86) 2.739 (7.66) 2.7596 (7.65) 2.6207 (7.99) 3.2262 (6.69)
test_benchmark_implementations[onnx_optim_fp32-8x33-bert-base-uncased] 4.0632 (5.1) 4.1558 (4.99) 4.0489 (5.1) 4.8374 (4.3) 3.9397 (5.32) 3.9653 (5.32) 3.9061 (5.36) 4.436 (4.86)
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 384)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
---------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-8x384-bert-base-uncased] 19.7489 (1.59) 19.8177 (1.62) 19.5164 (1.61) 20.2269 (1.65) 19.2371 (1.63) 19.4843 (1.6) 18.9864 (1.6) 20.4589 (1.54)
test_benchmark_implementations[baseline-8x384-sentence-transformers/all-MiniLM-L6-v2] 5.5487 (5.67) 5.6055 (5.72) 5.5204 (5.69) 6.4666 (5.17) 5.7862 (5.43) 5.8479 (5.32) 5.6624 (5.36) 6.6094 (4.77)
test_benchmark_implementations[baseline-8x384-t5-small] 19.4161 (1.62) 19.703 (1.63) 19.3772 (1.62) 20.1943 (1.65) 19.7093 (1.59) 20.1023 (1.55) 19.4844 (1.56) 20.9538 (1.51)
test_benchmark_implementations[dynamo-8x384-bert-base-uncased] 19.9598 (1.57) 19.961 (1.61) 19.9464 (1.58) 19.9823 (1.67) 19.3285 (1.62) 19.1226 (1.63) 18.7697 (1.62) 19.3344 (1.63)
test_benchmark_implementations[dynamo-8x384-sentence-transformers/all-MiniLM-L6-v2] 9.7649 (3.22) 9.0292 (3.55) 5.3801 (5.84) 10.1202 (3.3) 5.7692 (5.44) 5.8264 (5.34) 5.6272 (5.39) 6.4468 (4.9)
test_benchmark_implementations[dynamo-8x384-t5-small] 19.5359 (1.61) 19.7155 (1.63) 19.4601 (1.61) 20.5711 (1.62) 19.822 (1.58) 19.9808 (1.56) 19.5762 (1.55) 20.6391 (1.53)
test_benchmark_implementations[dynamo_cuda_graphs-8x384-bert-base-uncased] 20.5046 (1.53) 20.5494 (1.56) 18.8764 (1.66) 22.0915 (1.51) 20.5107 (1.53) 20.4222 (1.52) 19.6579 (1.54) 21.1532 (1.49)
test_benchmark_implementations[dynamo_cuda_graphs-8x384-sentence-transformers/all-MiniLM-L6-v2] 5.5112 (5.7) 5.579 (5.75) 5.3975 (5.82) 6.4911 (5.15) 5.4612 (5.75) 5.4127 (5.75) 5.2997 (5.72) 5.498 (5.74)
test_benchmark_implementations[dynamo_cuda_graphs-8x384-t5-small] 19.1488 (1.64) 19.2 (1.67) 19.1242 (1.64) 19.4335 (1.72) 19.2215 (1.63) 19.3734 (1.61) 18.8843 (1.61) 20.0447 (1.57)
test_benchmark_implementations[dynamo_no_dropout-8x384-bert-base-uncased] 19.1908 (1.64) 19.1947 (1.67) 19.1867 (1.64) 19.2113 (1.74) 19.3104 (1.63) 19.1886 (1.62) 18.8783 (1.61) 19.3726 (1.63)
test_benchmark_implementations[dynamo_no_dropout-8x384-sentence-transformers/all-MiniLM-L6-v2] 6.2966 (4.99) 6.3509 (5.05) 6.0867 (5.16) 6.8485 (4.88) 6.1923 (5.07) 6.2429 (4.98) 5.7114 (5.31) 6.8333 (4.62)
test_benchmark_implementations[dynamo_no_dropout-8x384-t5-small] 19.6413 (1.6) 19.9964 (1.6) 19.6116 (1.6) 20.3827 (1.64) 19.6893 (1.59) 19.7231 (1.58) 19.5565 (1.55) 19.8602 (1.59)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x384-bert-base-uncased] 15.9099 (1.98) 15.9111 (2.01) 15.9007 (1.98) 15.9171 (2.1) 15.8495 (1.98) 15.7446 (1.98) 14.4961 (2.09) 16.8897 (1.87)
test_benchmark_implementations[dynamo_optimized-8x384-bert-base-uncased] 14.4302 (2.18) 14.522 (2.21) 14.3892 (2.18) 14.7292 (2.27) 14.8241 (2.12) 14.88 (2.09) 14.7356 (2.06) 15.2772 (2.07)
test_benchmark_implementations[dynamo_optimized-8x384-sentence-transformers/all-MiniLM-L6-v2] 7.5244 (4.18) 7.5324 (4.26) 7.4775 (4.2) 7.5971 (4.4) 7.856 (4.0) 7.9508 (3.91) 7.7704 (3.9) 8.4444 (3.74)
test_benchmark_implementations[dynamo_optimized-8x384-t5-small] 20.8898 (1.5) 20.9199 (1.53) 20.8159 (1.51) 21.0135 (1.59) 20.9417 (1.5) 21.1092 (1.47) 20.9324 (1.45) 21.5351 (1.47)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384-bert-base-uncased] 11.1729 (2.81) 11.1749 (2.87) 11.1698 (2.81) 11.1852 (2.99) 10.6662 (2.94) 10.8444 (2.87) 10.4129 (2.91) 11.2005 (2.82)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384-sentence-transformers/all-MiniLM-L6-v2] 4.9152 (6.4) 4.9154 (6.52) 4.9121 (6.4) 4.9203 (6.79) 4.8918 (6.42) 4.8515 (6.41) 4.7233 (6.42) 4.9082 (6.43)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384-t5-small] 16.2724 (1.93) 16.2714 (1.97) 16.1966 (1.94) 16.3471 (2.04) 16.2496 (1.93) 16.2216 (1.92) 16.106 (1.88) 16.3134 (1.93)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x384-bert-base-uncased] 12.1231 (2.59) 12.04 (2.66) 11.7811 (2.67) 12.1405 (2.75) 12.1685 (2.58) 11.9373 (2.61) 11.0808 (2.74) 12.5671 (2.51)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x384-sentence-transformers/all-MiniLM-L6-v2] 5.0207 (6.26) 5.0464 (6.35) 5.0156 (6.26) 5.4467 (6.13) 4.978 (6.31) 4.9337 (6.3) 4.8074 (6.31) 5.0126 (6.3)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x384-t5-small] 14.8664 (2.11) 14.8558 (2.16) 14.806 (2.12) 14.892 (2.24) 14.3267 (2.19) 14.2535 (2.18) 14.0668 (2.16) 14.448 (2.18)
test_benchmark_implementations[onnx-8x384-bert-base-uncased] 31.4112 (1.0) 32.0522 (1.0) 31.3354 (1.0) 33.41 (1.0) 31.3964 (1.0) 31.0975 (1.0) 30.338 (1.0) 31.5582 (1.0)
test_benchmark_implementations[onnx_optim_fp16-8x384-bert-base-uncased] 16.1556 (1.95) 16.2058 (1.98) 16.126 (1.95) 16.3482 (2.04) 16.2167 (1.94) 16.2208 (1.92) 16.0692 (1.89) 16.4339 (1.92)
test_benchmark_implementations[onnx_optim_fp32-8x384-bert-base-uncased] 31.435 (1.0) 31.9598 (1.0) 31.4194 (1.0) 33.025 (1.01) 31.1871 (1.01) 30.9656 (1.0) 30.317 (1.0) 31.3926 (1.01)
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 512)
Name Median (CUDA) Mean (CUDA) Min (CUDA) Max (CUDA) Median Mean Min Max
---------------------------------------------------------------------------------------------------------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
test_benchmark_implementations[baseline-8x512-bert-base-uncased] 27.775 (1.69) 27.7722 (1.73) 27.7647 (1.69) 27.777 (1.77) 27.7892 (1.66) 27.6091 (1.72) 27.2374 (1.69) 27.8005 (1.75)
test_benchmark_implementations[baseline-8x512-sentence-transformers/all-MiniLM-L6-v2] 8.8074 (5.33) 8.8136 (5.45) 8.8013 (5.34) 8.83 (5.55) 8.9817 (5.14) 9.1505 (5.19) 8.771 (5.26) 11.3256 (4.3)
test_benchmark_implementations[baseline-8x512-t5-small] 30.7333 (1.53) 30.7756 (1.56) 30.722 (1.53) 30.8716 (1.59) 31.5683 (1.46) 31.7023 (1.5) 31.5614 (1.46) 31.9773 (1.52)
test_benchmark_implementations[dynamo-8x512-bert-base-uncased] 27.7924 (1.69) 27.7958 (1.73) 27.7862 (1.69) 27.8088 (1.76) 27.9149 (1.65) 28.9118 (1.64) 27.8333 (1.66) 30.9872 (1.57)
test_benchmark_implementations[dynamo-8x512-sentence-transformers/all-MiniLM-L6-v2] 8.8484 (5.31) 8.8535 (5.42) 8.8402 (5.31) 8.8801 (5.52) 8.9348 (5.17) 8.9323 (5.31) 8.7845 (5.25) 9.196 (5.3)
test_benchmark_implementations[dynamo-8x512-t5-small] 30.7855 (1.53) 30.7801 (1.56) 30.763 (1.53) 30.7917 (1.59) 30.8503 (1.5) 31.3577 (1.51) 30.8374 (1.5) 32.3853 (1.51)
test_benchmark_implementations[dynamo_cuda_graphs-8x512-bert-base-uncased] 27.6142 (1.7) 27.6143 (1.74) 27.606 (1.7) 27.6226 (1.77) 27.2353 (1.69) 27.1803 (1.75) 26.6198 (1.73) 27.6856 (1.76)
test_benchmark_implementations[dynamo_cuda_graphs-8x512-sentence-transformers/all-MiniLM-L6-v2] 8.8044 (5.34) 8.9263 (5.38) 8.6139 (5.45) 9.2867 (5.28) 8.726 (5.29) 8.7863 (5.4) 8.5524 (5.4) 9.2835 (5.25)
test_benchmark_implementations[dynamo_cuda_graphs-8x512-t5-small] 30.5582 (1.54) 30.5575 (1.57) 30.548 (1.54) 30.5664 (1.6) 30.5396 (1.51) 30.4841 (1.56) 30.2891 (1.52) 30.6236 (1.59)
test_benchmark_implementations[dynamo_no_dropout-8x512-bert-base-uncased] 27.8845 (1.68) 27.8849 (1.72) 27.8733 (1.69) 27.8968 (1.76) 27.919 (1.65) 27.7994 (1.71) 27.4494 (1.68) 28.0299 (1.74)
test_benchmark_implementations[dynamo_no_dropout-8x512-sentence-transformers/all-MiniLM-L6-v2] 8.8371 (5.32) 8.8631 (5.42) 8.8064 (5.33) 8.9569 (5.47) 8.9738 (5.14) 8.9631 (5.29) 8.8468 (5.22) 9.0522 (5.38)
test_benchmark_implementations[dynamo_no_dropout-8x512-t5-small] 30.9012 (1.52) 31.0699 (1.54) 30.8961 (1.52) 31.4122 (1.56) 31.0318 (1.49) 31.4921 (1.51) 30.9202 (1.49) 32.5244 (1.5)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x512-bert-base-uncased] 22.6529 (2.07) 23.6692 (2.03) 21.9116 (2.14) 26.9885 (1.82) 20.891 (2.21) 21.1572 (2.24) 20.503 (2.25) 21.9491 (2.22)
test_benchmark_implementations[dynamo_optimized-8x512-bert-base-uncased] 16.3942 (2.87) 16.3891 (2.93) 16.3 (2.88) 16.4352 (2.98) 16.1713 (2.85) 15.9108 (2.98) 15.0802 (3.06) 16.2421 (3.0)
test_benchmark_implementations[dynamo_optimized-8x512-sentence-transformers/all-MiniLM-L6-v2] 8.108 (5.79) 8.1148 (5.92) 8.1029 (5.8) 8.1377 (6.02) 8.5872 (5.38) 8.5455 (5.55) 8.3134 (5.55) 8.7594 (5.56)
test_benchmark_implementations[dynamo_optimized-8x512-t5-small] 25.1023 (1.87) 25.1221 (1.91) 25.089 (1.87) 25.175 (1.95) 25.6481 (1.8) 25.6533 (1.85) 25.6255 (1.8) 25.6864 (1.9)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512-bert-base-uncased] 15.2392 (3.08) 15.2839 (3.14) 15.1798 (3.09) 15.4184 (3.18) 15.3804 (3.0) 15.0315 (3.16) 14.3092 (3.23) 15.4026 (3.16)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512-sentence-transformers/all-MiniLM-L6-v2] 7.7394 (6.07) 7.7577 (6.19) 7.7363 (6.07) 7.8213 (6.27) 7.8003 (5.92) 7.7552 (6.12) 7.5767 (6.09) 7.8334 (6.22)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512-t5-small] 24.2033 (1.94) 24.2017 (1.98) 24.1572 (1.94) 24.2268 (2.02) 24.1703 (1.91) 24.2008 (1.96) 24.0959 (1.92) 24.2979 (2.01)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x512-bert-base-uncased] 15.7542 (2.98) 15.7588 (3.05) 15.7512 (2.98) 15.7809 (3.11) 15.8499 (2.91) 15.61 (3.04) 14.9463 (3.09) 15.9837 (3.05)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x512-sentence-transformers/all-MiniLM-L6-v2] 7.9503 (5.91) 7.9638 (6.03) 7.9462 (5.91) 8.0384 (6.1) 8.3171 (5.55) 8.5048 (5.58) 7.7766 (5.94) 9.4272 (5.17)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x512-t5-small] 21.0954 (2.23) 21.1128 (2.27) 21.0934 (2.23) 21.1354 (2.32) 21.2004 (2.18) 21.5404 (2.2) 20.9173 (2.21) 22.7782 (2.14)
test_benchmark_implementations[onnx-8x512-bert-base-uncased] 46.8429 (1.0) 46.8429 (1.02) 46.8429 (1.0) 46.8429 (1.05) 45.5672 (1.01) 46.3242 (1.02) 45.5672 (1.01) 47.0812 (1.04)
test_benchmark_implementations[onnx_optim_fp16-8x512-bert-base-uncased] 21.3678 (2.2) 21.4095 (2.24) 21.3636 (2.2) 21.4538 (2.29) 20.8857 (2.21) 21.0481 (2.25) 20.451 (2.26) 21.4421 (2.27)
test_benchmark_implementations[onnx_optim_fp32-8x512-bert-base-uncased] 46.9763 (1.0) 48.0017 (1.0) 46.9763 (1.0) 49.0271 (1.0) 46.1597 (1.0) 47.4509 (1.0) 46.1597 (1.0) 48.7421 (1.0)
====================================================================================================== warnings summary =======================================================================================================
../../../home/geantvert/.local/share/virtualenvs/kernl/lib/python3.9/site-packages/onnxruntime/transformers/float16.py:78: 299 warnings
/home/geantvert/.local/share/virtualenvs/kernl/lib/python3.9/site-packages/onnxruntime/transformers/models/gpt2/../../float16.py:78: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
float32_list = np.fromstring(tensor.raw_data, dtype="float32")
../../../home/geantvert/.local/share/virtualenvs/kernl/lib/python3.9/site-packages/onnxruntime/transformers/float16.py:82: 299 warnings
/home/geantvert/.local/share/virtualenvs/kernl/lib/python3.9/site-packages/onnxruntime/transformers/models/gpt2/../../float16.py:82: DeprecationWarning: tostring() is deprecated. Use tobytes() instead.
tensor.raw_data = float16_list.tostring()
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================= 425 passed, 136 skipped, 11 deselected, 598 warnings in 4136.99s (1:08:56) ==========================================================================
This PR adds support for T5 models. As a first step, it only replaces kernels; no other optimizations are applied.
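For reading the benchmark tables above: the value in parentheses appears to be the ratio to the slowest entry of each shape group (for example, 20.3244 / 1.3107 ≈ 15.51 in the `shape=(8, 16)` group), not the speedup over the `baseline` run of the same model. Below is a minimal sketch, not part of this PR, that recomputes the speedup against `baseline` from the `Median (CUDA)` column. The names `ROW`, `parse_rows`, `speedup_vs_baseline` and `SAMPLE` are hypothetical, and the sample rows are copied from the output above and truncated to the first two columns.

```python
import re

# One result row: test id "[<implementation>-<batch>x<seq>-<model>]" followed by
# the first numeric column, which is "Median (CUDA)" in the tables above.
# Assumption: implementation names contain no hyphen (true for the ids shown here).
ROW = re.compile(
    r"^test_benchmark_implementations\["
    r"(?P<impl>[^-\]]+)-(?P<shape>\d+x\d+)-(?P<model>[^\]]+)\]\s+"
    r"(?P<median_cuda>\d+(?:\.\d+)?)"
)


def parse_rows(text):
    """Map (implementation, shape, model) -> Median (CUDA) for every matching line."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            key = (match["impl"], match["shape"], match["model"])
            rows[key] = float(match["median_cuda"])
    return rows


def speedup_vs_baseline(rows):
    """Divide the baseline median by each implementation's median (same shape and model)."""
    speedups = {}
    for (impl, shape, model), median in rows.items():
        base = rows.get(("baseline", shape, model))
        if base is not None:  # skip rows that have no matching baseline
            speedups[(impl, shape, model)] = base / median
    return speedups


# Two rows copied from the shape=(8, 16) table above, truncated after the second column.
SAMPLE = """\
test_benchmark_implementations[baseline-8x16-bert-base-uncased] 8.7224 (2.33) 8.718 (2.34)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16-bert-base-uncased] 1.3107 (15.51) 1.3905 (14.68)
"""

if __name__ == "__main__":
    for key, value in sorted(speedup_vs_baseline(parse_rows(SAMPLE)).items()):
        print(key, f"{value:.2f}x")
```

The same functions can be fed the full captured output, since header, separator and warning lines do not match `ROW` and are skipped.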