anijain2305 commented 1 year ago

(The next two comments cover the max-autotune, warm-start run.)

AMP RUN

Pass rate (percentage, passed/total)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 90%, 53/59 | 100%, 45/45 | 68%, 41/60  |
|       aot_eager        | 88%, 52/59 | 100%, 45/45 | 92%, 55/60  |
|        inductor        | 78%, 46/59 | 84%, 38/45  | 93%, 56/60  |
| inductor_no_cudagraphs | 78%, 46/59 | 84%, 38/45  | 95%, 57/60  |
+------------------------+------------+-------------+-------------+
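Each cell in the table above pairs a rounded percentage with a passed/total count. As a minimal sketch of how such a cell can be reproduced from the two counts (the function name `pass_rate` is hypothetical and is not taken from the benchmark tooling):

```python
def pass_rate(passed: int, total: int) -> str:
    """Format a pass-rate cell as '<percent>%, <passed>/<total>'."""
    if total <= 0:
        raise ValueError("total must be positive")
    # ':.0%' multiplies by 100 and rounds to a whole percent.
    return f"{passed / total:.0%}, {passed}/{total}"


# Counts taken from the torchbench column of the table above.
print(pass_rate(53, 59))  # eager
print(pass_rate(46, 59))  # inductor
```

With the counts from the table, this prints the same cells shown for eager and inductor on torchbench.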

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.00x    |    1.00x    |
|       aot_eager        |   1.00x    |    1.00x    |    1.00x    |
|        inductor        |   1.58x    |    1.66x    |    1.38x    |
| inductor_no_cudagraphs |   1.57x    |    1.65x    |    1.39x    |
+------------------------+------------+-------------+-------------+
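For readers who want to aggregate per-model speedups themselves, here is a minimal sketch of a geometric mean. It assumes that failed models (reported as 0.0 or nan in the per-model tables) are excluded before averaging; whether the dashboard applies exactly this filter is an assumption, and `geometric_mean_speedup` is a hypothetical helper name.

```python
import math


def geometric_mean_speedup(speedups):
    """Geometric mean of the positive, finite entries in `speedups`.

    Entries that are 0.0 or NaN are treated as failed runs and skipped
    (an assumption, not confirmed dashboard behavior).
    Returns NaN when no valid entry remains.
    """
    valid = [s for s in speedups if not math.isnan(s) and s > 0]
    if not valid:
        return float("nan")
    # exp(mean(log(x))) avoids overflow from multiplying many values.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# A few inductor speedups from the per-model table, plus one failed run (0.0).
sample = [3.6666, 3.1791, 2.7868, 0.0]
print(f"{geometric_mean_speedup(sample):.2f}x")
```

The geometric mean is the usual choice for ratios such as speedups because a 2x gain and a 0.5x loss cancel to 1.0x, which an arithmetic mean would not show.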

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    4.75    |    7.37     |    5.95     |
|       aot_eager        |    9.38    |    16.06    |    12.68    |
|        inductor        |   228.90   |   199.68    |   334.49    |
| inductor_no_cudagraphs |   30.31    |    50.32    |    43.99    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    0.97x    |    1.00x    |
|       aot_eager        |   0.86x    |    0.89x    |    0.89x    |
|        inductor        |   0.75x    |    0.91x    |    0.91x    |
| inductor_no_cudagraphs |   0.88x    |    0.91x    |    0.92x    |
+------------------------+------------+-------------+-------------+
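The compression ratio can be read as eager peak memory divided by compiled peak memory, so a value below 1.00x means the compiled run used more memory than eager. That definition is an assumption that matches the "higher is better" label; the sketch below uses a hypothetical function name and made-up memory figures, not measurements from this run.

```python
def compression_ratio(eager_peak_bytes: float, compiled_peak_bytes: float) -> float:
    """Peak-memory compression ratio, assumed to be eager / compiled."""
    if compiled_peak_bytes <= 0:
        raise ValueError("compiled_peak_bytes must be positive")
    return eager_peak_bytes / compiled_peak_bytes


GIB = 1024 ** 3
# Hypothetical figures: eager peaks at 8 GiB, the compiled run at 10 GiB.
ratio = compression_ratio(8 * GIB, 10 * GIB)
print(f"{ratio:.2f}x")  # below 1.00x: the compiled run needed more memory
```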

anijain2305 commented 1 year ago

Performance speedup

+-----------------------------------+------+--------+-----------+----------+------------------------+
|               name                |  bs  | eager  | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------+------------------------+
|       functorch_dp_cifar10        |  64  | 0.9745 |   0.925   |  3.6666  |         3.6217         |
|           BERT_pytorch            |  16  | 0.9975 |  0.7999   |  3.1791  |         3.248          |
|            densenet121            |  4   | 0.9888 |  0.6947   |  2.7868  |         2.7862         |
|            hf_T5_large            |  2   | 0.9806 |   0.806   |  2.3425  |         2.262          |
|             hf_Albert             |  8   | 0.9963 |  0.9603   |  2.3376  |         2.3399         |
|              hf_Bart              |  4   | 0.9801 |  0.7934   |  2.1449  |         2.4193         |
|   pytorch_CycleGAN_and_pix2pix    |  1   | 0.9748 |  0.8967   |  2.0857  |         1.8595         |
|         phlippe_densenet          | 128  | 0.9853 |  0.7714   |  2.0062  |         2.0183         |
|           squeezenet1_1           |  32  | 0.9843 |  0.9261   |  2.0043  |         1.8585         |
|        mobilenet_v3_large         |  32  | 0.9958 |  0.7796   |  1.997   |         2.0592         |
|              hf_GPT2              |  4   | 0.9953 |  0.9565   |  1.9265  |         1.9259         |
|               hf_T5               |  8   | 0.9868 |  0.8503   |  1.9215  |         1.9342         |
|              hf_Bert              |  4   | 0.9975 |  0.8397   |  1.8429  |         1.8416         |
|           hf_Longformer           |  2   | 0.9252 |  0.5851   |  1.8011  |         1.8042         |
|          phlippe_resnet           | 128  | 0.9781 |  0.7561   |  1.8006  |         1.8106         |
|          pytorch_struct           | 200  | 0.9518 |  0.7782   |  1.7996  |         1.7666         |
|        speech_transformer         |  32  | 0.9826 |  0.7931   |  1.7197  |         1.7331         |
|          resnext50_32x4d          |  8   | 0.9882 |  0.7072   |  1.7009  |         1.6915         |
|      timm_vision_transformer      |  32  | 0.9836 |  0.8443   |  1.7005  |         1.9707         |
|            mnasnet1_0             |  32  | 0.9905 |  0.7353   |  1.6732  |         1.6549         |
| attention_is_all_you_need_pytorch | 256  | 0.9887 |  0.8359   |  1.6465  |         1.6313         |
|           fastNLP_Bert            |  6   | 0.9847 |  0.8539   |  1.635   |         1.6491         |
|           hf_Bert_large           |  4   | 1.0021 |  0.8623   |  1.6239  |         1.6323         |
|             resnet18              |  16  | 0.9895 |  0.7542   |  1.5738  |         1.5531         |
|        shufflenet_v2_x1_0         | 128  | 0.9938 |  0.7535   |  1.5633  |         1.5206         |
|               dcgan               |  32  | 0.8862 |  0.7092   |  1.4916  |         1.5106         |
|           mobilenet_v2            |  96  | 0.997  |  0.7779   |  1.4766  |         1.4746         |
|           hf_DistilBert           |  8   | 0.9836 |  0.9375   |  1.4722  |         1.4475         |
|            timm_nfnet             | 128  | 0.9864 |  0.9842   |  1.4585  |         1.4648         |
|           timm_resnest            |  32  | 0.9928 |  0.8523   |  1.4553  |         1.4551         |
|                drq                |  1   | 0.9672 |  0.7538   |  1.4447  |         1.4735         |
|           lennard_jones           | 1000 | 0.8676 |  0.7663   |  1.4389  |         1.4672         |
|         timm_efficientnet         |  32  | 0.9317 |  0.6227   |  1.3717  |         1.3928         |
|          LearningToPaint          |  96  | 0.9873 |  0.7763   |  1.2759  |         1.2733         |
|               vgg16               |  64  | 0.9994 |   0.998   |  1.2434  |         1.2439         |
|          pytorch_stargan          |  16  | 0.9948 |  0.8039   |  1.2292  |         1.2232         |
|            Super_SloMo            |  6   | 0.9977 |  0.1781   |  1.2182  |         1.2192         |
|         soft_actor_critic         | 256  | 0.7797 |  0.6707   |  1.2117  |         1.0491         |
|           pytorch_unet            |  1   | 0.9969 |  0.2047   |  1.1718  |         1.1721         |
|        Background_Matting         |  4   | 0.9985 |  0.1371   |  1.1714  |         1.1721         |
|             resnet152             |  32  | 0.9948 |  0.7447   |  1.1616  |         1.2227         |
|             resnet50              |  32  | 0.9955 |  0.7607   |  1.139   |         1.1372         |
|              yolov3               |  16  | 0.9967 |  0.8074   |  1.1153  |         1.1159         |
|              demucs               |  4   | 0.9995 |  1.0006   |  1.0262  |         1.0292         |
|            tts_angular            |  64  | 0.9531 |  0.9167   |  0.9745  |         0.9877         |
|            timm_regnet            |  32  | 0.9145 |  0.7756   |  0.9357  |         0.9334         |
|      nvidia_deeprecommender       | 256  | 0.9991 |  0.9984   |  0.9351  |         0.9353         |
|            timm_vovnet            |  32  |  0.86  |  0.7083   |  0.9249  |         0.9187         |
|   timm_vision_transformer_large   |  32  | 0.9982 |    0.0    |   0.0    |          0.0           |
|             tacotron2             |  0   |  0.0   |    0.0    |   0.0    |          0.0           |
|       doctr_reco_predictor        |  0   |  0.0   |    0.0    |   0.0    |          0.0           |
|        doctr_det_predictor        |  0   |  0.0   |    0.0    |   0.0    |          0.0           |
|               moco                |  32  | 0.9764 |    0.0    |   0.0    |          0.0           |
|           hf_GPT2_large           |  4   | 0.9843 |  0.9721   |   0.0    |         1.7378         |
|            hf_BigBird             |  2   | 0.9753 |  0.7838   |   0.0    |          0.0           |
|               dlrm                | 1024 | 0.9529 |  0.8453   |   0.0    |          0.0           |
|            hf_Reformer            |  4   | 0.9928 |  0.9501   |   0.0    |          0.0           |
|              alexnet              | 128  | 0.9991 |  0.9974   |   0.0    |          0.0           |
|           torchrec_dlrm           |  0   |  0.0   |    0.0    |   0.0    |          0.0           |
+-----------------------------------+------+--------+-----------+----------+------------------------+

Accuracy

+-----------------------------------+-----+------------------+------------------+------------------+------------------------+
|               name                | bs  |      eager       |    aot_eager     |     inductor     | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------------+
|           hf_GPT2_large           |  4  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |    pass_due_to_skip    |
|   timm_vision_transformer_large   |  4  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |    pass_due_to_skip    |
|            hf_T5_large            |  4  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |    pass_due_to_skip    |
|           squeezenet1_1           |  4  |       pass       |       pass       |       pass       |          pass          |
|          pytorch_stargan          | 16  |       pass       |       pass       |       pass       |          pass          |
|          pytorch_struct           | 200 |       pass       |       pass       |       pass       |          pass          |
|             resnet152             |  4  |       pass       |       pass       |       pass       |          pass          |
|             resnet18              |  4  |       pass       |       pass       |       pass       |          pass          |
|             resnet50              |  4  |       pass       |       pass       |       pass       |          pass          |
|          resnext50_32x4d          |  4  |       pass       |       pass       |       pass       |          pass          |
|        shufflenet_v2_x1_0         |  4  |       pass       |       pass       |       pass       |          pass          |
|         soft_actor_critic         | 256 |       pass       |       pass       |       pass       |          pass          |
|        speech_transformer         |  4  |       pass       |       pass       |       pass       |          pass          |
|         timm_efficientnet         |  4  |       pass       |       pass       |       pass       |          pass          |
|         phlippe_densenet          |  4  |       pass       |       pass       |       pass       |          pass          |
|            timm_nfnet             |  4  |       pass       |       pass       |       pass       |          pass          |
|            timm_regnet            |  4  |       pass       |       pass       |       pass       |          pass          |
|           timm_resnest            |  4  |       pass       |       pass       |       pass       |          pass          |
|      timm_vision_transformer      |  4  |       pass       |       pass       |       pass       |          pass          |
|            timm_vovnet            |  4  |       pass       |       pass       |       pass       |          pass          |
|            tts_angular            |  4  |       pass       |       pass       |       pass       |          pass          |
|               vgg16               |  4  |       pass       |       pass       |       pass       |          pass          |
|              yolov3               |  4  |       pass       |       pass       |       pass       |          pass          |
|           BERT_pytorch            |  4  |  fail_accuracy   |       pass       |       pass       |          pass          |
|        mobilenet_v3_large         |  4  |       pass       |       pass       |       pass       |     fail_accuracy      |
|   pytorch_CycleGAN_and_pix2pix    |  1  |       pass       |       pass       |       pass       |          pass          |
|           pytorch_unet            |  2  |       pass       |       pass       |       pass       |          pass          |
|      nvidia_deeprecommender       |  4  |       pass       |       pass       |       pass       |          pass          |
|             hf_Albert             |  4  |       pass       |       pass       |       pass       |          pass          |
|          LearningToPaint          |  4  |       pass       |       pass       |       pass       |          pass          |
|            Super_SloMo            |  4  |       pass       |       pass       |       pass       |          pass          |
|              alexnet              |  4  |       pass       |       pass       |       pass       |          pass          |
| attention_is_all_you_need_pytorch |  4  |       pass       |       pass       |       pass       |          pass          |
|               dcgan               |  4  |       pass       |       pass       |       pass       |          pass          |
|              demucs               |  4  |       pass       |       pass       |       pass       |          pass          |
|           mobilenet_v2            |  4  |       pass       |       pass       |       pass       |          pass          |
|                drq                |  1  |       pass       |       pass       |       pass       |          pass          |
|           fastNLP_Bert            |  4  |       pass       |       pass       |       pass       |          pass          |
|       functorch_dp_cifar10        |  4  |       pass       |       pass       |       pass       |          pass          |
|            densenet121            |  4  |       pass       |       pass       |       pass       |          pass          |
|              hf_Bart              |  4  |       pass       |       pass       |       pass       |          pass          |
|           hf_Bert_large           |  4  |       pass       |       pass       |       pass       |          pass          |
|           hf_DistilBert           |  4  |       pass       |       pass       |       pass       |          pass          |
|              hf_GPT2              |  2  |       pass       |       pass       |       pass       |          pass          |
|           hf_Longformer           |  4  |       pass       |       pass       |       pass       |          pass          |
|            hf_Reformer            |  4  |       pass       |       pass       |       pass       |          pass          |
|               hf_T5               |  4  |       pass       |       pass       |       pass       |          pass          |
|            hf_T5_base             |  4  |       pass       |       pass       |       pass       |          pass          |
|           lennard_jones           |  4  |       pass       |       pass       |       pass       |          pass          |
|            mnasnet1_0             |  4  |       pass       |       pass       |       pass       |          pass          |
|              hf_Bert              |  4  |       pass       |       pass       |       pass       |          pass          |
|               moco                |  4  |       pass       |   fail_to_run    |   fail_to_run    |      fail_to_run       |
|            hf_BigBird             |  4  |       pass       |       pass       |   fail_to_run    |      fail_to_run       |
|               dlrm                |  4  |       pass       |       pass       |   fail_to_run    |      fail_to_run       |
|          phlippe_resnet           |  4  |       pass       |       pass       |  fail_accuracy   |     fail_accuracy      |
|        Background_Matting         |  4  | eager_variation  | eager_variation  | eager_variation  |    eager_variation     |
|          vision_maskrcnn          |  4  | eager_variation  | eager_variation  | eager_variation  |    eager_variation     |
|             tacotron2             |  4  |   fail_to_run    |   fail_to_run    |      0.0000      |         0.0000         |
|        doctr_det_predictor        |  0  |      0.0000      |      0.0000      |      0.0000      |         0.0000         |
|       doctr_reco_predictor        |  0  |      0.0000      |      0.0000      |      0.0000      |         0.0000         |
|               llama               |  0  |      0.0000      |      0.0000      |      0.0000      |         0.0000         |
|           torchrec_dlrm           |  0  |      0.0000      |      0.0000      |      0.0000      |         0.0000         |
+-----------------------------------+-----+------------------+------------------+------------------+------------------------+

Compilation latency (sec)

+-----------------------------------+------+---------+-----------+----------+------------------------+
|               name                |  bs  |  eager  | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------+------------------------+
|        speech_transformer         |  32  | 5.9379  |  13.5706  | 804.1871 |        42.8845         |
| attention_is_all_you_need_pytorch | 256  |  4.324  |  10.7867  | 689.7678 |        38.4134         |
|            hf_T5_large            |  2   | 26.2442 |  54.7406  | 506.3967 |        148.4014        |
|      timm_vision_transformer      |  32  | 3.3655  |  7.1652   | 436.4946 |        26.1915         |
|             hf_Albert             |  8   | 2.4513  |  8.5711   | 422.7417 |         28.404         |
|         phlippe_densenet          | 128  | 3.2458  |  6.9257   | 416.1867 |        25.5266         |
|           fastNLP_Bert            |  6   | 4.9636  |  11.1295  | 398.4251 |        34.5511         |
|          pytorch_struct           | 200  | 0.7813  |  1.3378   | 354.4235 |         6.9333         |
|           BERT_pytorch            |  16  | 4.7902  |  11.449   | 350.6787 |        34.5791         |
|           mobilenet_v2            |  96  |  3.094  |  6.9056   | 321.413  |        24.9613         |
|           hf_Bert_large           |  4   | 10.1418 |  20.7939  | 314.3734 |        60.6028         |
|            mnasnet1_0             |  32  | 3.0959  |  6.6976   | 314.3265 |        23.5121         |
|            densenet121            |  4   | 7.4234  |  17.9994  | 309.2972 |        61.1015         |
|               hf_T5               |  8   | 5.6153  |  13.4608  | 291.5986 |        39.5658         |
|        mobilenet_v3_large         |  32  | 3.3994  |  7.5884   | 256.7481 |        27.1516         |
|                drq                |  1   | 0.6686  |  1.0099   | 253.8761 |         6.3746         |
|      nvidia_deeprecommender       | 256  | 0.4823  |   0.766   | 249.458  |         5.9622         |
|           hf_Longformer           |  2   | 11.2554 |  31.1595  | 243.8476 |        123.5476        |
|              yolov3               |  16  |  4.812  |  10.4255  | 227.5451 |        36.3119         |
|              hf_GPT2              |  4   | 4.6244  |  9.6041   | 218.6052 |        29.5181         |
|        shufflenet_v2_x1_0         | 128  | 3.4342  |  7.6127   | 215.3652 |        26.8468         |
|         timm_efficientnet         |  32  | 4.9429  |  10.0025  | 207.2363 |        30.7292         |
|            timm_nfnet             | 128  | 5.7487  |  10.9879  | 206.3631 |         32.057         |
|            timm_vovnet            |  32  | 3.5961  |  6.2972   | 199.301  |        22.1326         |
|         soft_actor_critic         | 256  | 0.4404  |  0.6177   | 179.8936 |         5.4369         |
|              hf_Bart              |  4   | 10.8484 |  18.0336  | 179.5238 |        49.6952         |
|            timm_regnet            |  32  | 6.6018  |  12.1995  | 179.196  |        33.0846         |
|          LearningToPaint          |  96  | 1.4753  |  2.8955   | 167.1134 |         12.25          |
|             resnet152             |  32  | 8.8693  |  20.1297  | 163.4345 |        58.4397         |
|               vgg16               |  64  | 0.6332  |  1.1205   | 160.4233 |         7.4845         |
|          resnext50_32x4d          |  8   | 3.1743  |  7.4339   | 158.7375 |        22.7869         |
|           lennard_jones           | 1000 | 0.3987  |  0.6209   | 143.0381 |         4.5367         |
|        Background_Matting         |  4   | 3.2032  |  11.4127  | 131.4817 |        26.8162         |
|             resnet18              |  16  |  1.338  |  2.7724   | 128.2701 |         12.183         |
|           pytorch_unet            |  1   | 1.5283  |  4.4352   | 121.5264 |        13.9278         |
|       functorch_dp_cifar10        |  64  | 1.1992  |  2.5475   | 117.7511 |        12.8811         |
|          phlippe_resnet           | 128  |  1.349  |  2.7318   | 113.7247 |        10.8462         |
|              hf_Bert              |  4   | 4.9301  |  10.3482  | 110.8388 |         32.425         |
|   pytorch_CycleGAN_and_pix2pix    |  1   | 1.2026  |  2.8968   | 89.0985  |        12.3657         |
|           timm_resnest            |  32  |  1.822  |  3.8811   | 78.4203  |        16.7527         |
|            Super_SloMo            |  6   | 2.7734  |  9.7645   | 73.1308  |        25.4957         |
|              demucs               |  4   | 1.4955  |  2.2725   | 71.8464  |         9.6114         |
|           hf_DistilBert           |  8   | 2.3655  |  5.6075   |  61.993  |        19.3363         |
|          pytorch_stargan          |  16  | 1.1848  |  3.2111   | 46.9079  |        10.7557         |
|           squeezenet1_1           |  32  | 1.0332  |  1.7378   | 44.4823  |         8.5951         |
|             resnet50              |  32  | 3.1836  |  7.4252   | 23.9834  |        23.0489         |
|               dcgan               |  32  | 0.4331  |  0.7077   | 16.4177  |         5.1875         |
|            tts_angular            |  64  | 0.4423  |  0.5108   |  4.7125  |         3.838          |
|           hf_GPT2_large           |  4   | 14.8619 |  29.6938  |   nan    |        84.9245         |
|            hf_BigBird             |  2   | 12.8484 |  39.0664  |   nan    |          nan           |
|            hf_Reformer            |  4   | 4.1752  |  6.3515   |   nan    |          nan           |
|               dlrm                | 1024 |  0.374  |  0.7853   |   nan    |          nan           |
|              alexnet              | 128  | 0.5032  |  0.7703   |   nan    |          nan           |
|               moco                |  32  | 27.3074 |    nan    |   nan    |          nan           |
|   timm_vision_transformer_large   |  32  | 9.3266  |    nan    |   nan    |          nan           |
|        doctr_det_predictor        |  0   |   nan   |    nan    |   nan    |          nan           |
|       doctr_reco_predictor        |  0   |   nan   |    nan    |   nan    |          nan           |
|             tacotron2             |  0   |   nan   |    nan    |   nan    |          nan           |
|           torchrec_dlrm           |  0   |   nan   |    nan    |   nan    |          nan           |
+-----------------------------------+------+---------+-----------+----------+------------------------+

Peak Memory Compression Ratio

+-----------------------------------+------+--------+-----------+----------+------------------------+
|               name                |  bs  | eager  | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------+------------------------+
|            Super_SloMo            |  6   | 1.0014 |   0.822   |  1.1588  |         1.208          |
|             hf_Albert             |  8   | 0.9599 |  0.9008   |  1.0399  |         1.0863         |
|           mobilenet_v2            |  96  | 0.9864 |  0.7651   |  1.0107  |         1.0572         |
|               hf_T5               |  8   | 0.9507 |  0.8891   |  0.9988  |         1.0163         |
|           fastNLP_Bert            |  6   | 1.0003 |  0.8878   |  0.9953  |         1.052          |
|            tts_angular            |  64  | 0.9957 |  0.9957   |  0.9852  |         0.9852         |
| attention_is_all_you_need_pytorch | 256  | 0.9648 |  0.9066   |  0.9693  |         1.0269         |
|            timm_nfnet             | 128  | 0.907  |  0.8752   |  0.9619  |         0.9678         |
|           BERT_pytorch            |  16  | 1.0003 |  0.8671   |  0.9428  |         0.9428         |
|              hf_Bert              |  4   | 0.9645 |  0.8353   |  0.9421  |         0.9421         |
|              hf_GPT2              |  4   | 0.9357 |  0.8198   |  0.9317  |         0.9319         |
|           hf_Bert_large           |  4   | 0.9845 |  0.8521   |  0.9138  |         0.9401         |
|         timm_efficientnet         |  32  | 0.9865 |   0.819   |  0.874   |         1.072          |
|              yolov3               |  16  | 0.9923 |  0.8257   |  0.8711  |         0.8705         |
|        shufflenet_v2_x1_0         | 128  | 0.9549 |  0.8395   |  0.8621  |         0.8979         |
|        speech_transformer         |  32  | 0.9915 |    0.9    |  0.8583  |         1.0773         |
|            timm_regnet            |  32  | 0.995  |  0.8499   |  0.8501  |         0.8484         |
|           hf_DistilBert           |  8   | 0.9262 |  0.8146   |  0.8456  |         0.8517         |
|             resnet50              |  32  | 0.9922 |  0.8613   |  0.8365  |         0.8344         |
|      timm_vision_transformer      |  32  | 0.9907 |  0.9299   |  0.8357  |         0.9369         |
|        Background_Matting         |  4   | 1.0125 |  0.6487   |  0.834   |         0.8484         |
|             resnet152             |  32  | 0.9959 |  0.8916   |  0.8319  |         0.8684         |
|           timm_resnest            |  32  | 0.9888 |  0.8973   |  0.8297  |         0.9564         |
|            hf_T5_large            |  2   | 0.9831 |  0.8302   |  0.8201  |         0.8201         |
|         phlippe_densenet          | 128  | 0.9983 |  0.9982   |  0.7988  |         1.0061         |
|           pytorch_unet            |  1   | 0.9953 |  0.7154   |  0.7734  |         0.8554         |
|           squeezenet1_1           |  32  | 0.9674 |  0.9309   |  0.773   |         1.0247         |
|          pytorch_stargan          |  16  | 0.9914 |   0.969   |  0.7715  |         0.9248         |
|              demucs               |  4   | 0.9663 |  0.9659   |  0.7661  |         0.7734         |
|              hf_Bart              |  4   | 0.9084 |   0.843   |  0.7545  |         0.7546         |
|            timm_vovnet            |  32  | 0.9892 |  0.8166   |  0.7428  |         0.8185         |
|          pytorch_struct           | 200  | 0.9992 |  0.5168   |  0.7338  |         0.9955         |
|               vgg16               |  64  | 0.9922 |  0.7246   |  0.723   |         0.7231         |
|            mnasnet1_0             |  32  | 0.9819 |  0.8641   |  0.7201  |         0.8596         |
|            densenet121            |  4   | 0.9956 |  0.9802   |  0.7085  |         0.9766         |
|        mobilenet_v3_large         |  32  | 0.9801 |  0.8396   |  0.6992  |         0.9037         |
|      nvidia_deeprecommender       | 256  | 0.9176 |  0.8055   |  0.6585  |         0.6585         |
|          resnext50_32x4d          |  8   | 0.9947 |  0.8438   |  0.6561  |         0.7855         |
|          LearningToPaint          |  96  | 0.9192 |  0.7116   |  0.597   |         0.7089         |
|   pytorch_CycleGAN_and_pix2pix    |  1   | 0.9965 |  0.8796   |  0.5458  |         0.8393         |
|             resnet18              |  16  | 0.983  |  0.8055   |  0.5409  |         0.7792         |
|           hf_Longformer           |  2   | 0.8565 |  0.8296   |  0.4206  |         0.4205         |
|       functorch_dp_cifar10        |  64  | 0.9953 |  0.8396   |  0.3991  |         0.7086         |
|          phlippe_resnet           | 128  | 0.9881 |   0.864   |  0.3272  |         0.8517         |
|                drq                |  1   | 0.9877 |  0.8852   |  0.1818  |         0.6379         |
|               dcgan               |  32  | 0.9647 |  0.7957   |  0.1811  |         0.7821         |
|         soft_actor_critic         | 256  | 0.9995 |  0.9255   |  0.1109  |         0.6066         |
|           lennard_jones           | 1000 | 0.9996 |  0.9997   |  0.0648  |         0.7073         |
|           hf_GPT2_large           |  4   | 0.9663 |  0.8303   |   nan    |         0.8905         |
|               dlrm                | 1024 | 0.9995 |  0.9944   |   nan    |          nan           |
|            hf_BigBird             |  2   | 0.9493 |  0.9268   |   nan    |          nan           |
|            hf_Reformer            |  4   | 0.8004 |  0.8004   |   nan    |          nan           |
|              alexnet              | 128  | 0.9452 |  0.7935   |   nan    |          nan           |
|   timm_vision_transformer_large   |  32  | 0.9992 |    nan    |   nan    |          nan           |
|               moco                |  32  | 0.9958 |    nan    |   nan    |          nan           |
|        doctr_det_predictor        |  0   |  nan   |    nan    |   nan    |          nan           |
|       doctr_reco_predictor        |  0   |  nan   |    nan    |   nan    |          nan           |
|             tacotron2             |  0   |  nan   |    nan    |   nan    |          nan           |
|           torchrec_dlrm           |  0   |  nan   |    nan    |   nan    |          nan           |
+-----------------------------------+------+--------+-----------+----------+------------------------+

Absolute latency (ms)

+-----------------------------------+------+----------+-----------+----------+------------------------+
|               name                |  bs  |  eager   | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+----------+-----------+----------+------------------------+
|        Background_Matting         |  4   | 126.0957 | 918.6078  | 107.5371 |        107.5328        |
|            hf_T5_large            |  2   | 269.1613 | 273.5731  | 98.1109  |        97.8314         |
|               hf_T5               |  8   | 183.1776 | 210.9367  | 93.3605  |         93.504         |
|            timm_nfnet             | 128  | 119.6264 | 120.2675  | 80.7577  |        80.7667         |
|            Super_SloMo            |  6   | 79.8167  | 446.6856  | 65.4352  |        65.1566         |
|           hf_Longformer           |  2   |  122.75  | 193.0691  | 62.2797  |        62.0164         |
|              yolov3               |  16  | 68.7648  |  84.7996  | 61.4977  |         61.413         |
|            timm_regnet            |  32  | 61.4621  |  71.7881  | 59.6524  |        60.1548         |
|             resnet152             |  32  | 63.4651  |  87.8524  | 54.3377  |        55.1513         |
|               vgg16               |  64  | 66.2892  |  66.3835  | 53.3402  |        53.3047         |
|              demucs               |  4   | 53.5993  |  53.4955  | 52.2167  |        52.3699         |
|           hf_Bert_large           |  4   | 83.6988  |  94.6253  | 50.9415  |        50.8037         |
|           pytorch_unet            |  1   | 39.9741  | 194.5111  | 34.0153  |        33.9792         |
|        speech_transformer         |  32  | 59.5544  |  84.1112  | 33.4183  |        33.0468         |
|           fastNLP_Bert            |  6   | 57.0606  |  60.7527  |  33.011  |        31.1786         |
| attention_is_all_you_need_pytorch | 256  | 58.3394  |  68.5091  | 32.8929  |        32.8822         |
|              hf_Bart              |  4   |  71.723  |  86.4912  | 32.6812  |        33.0355         |
|           mobilenet_v2            |  96  | 47.1521  |  60.4154  | 31.7936  |        31.8657         |
|             hf_Albert             |  8   |  68.645  |  72.3887  | 29.6789  |        29.6467         |
|            timm_vovnet            |  32  | 28.8224  |  35.1916  | 26.8202  |        26.7869         |
|              hf_GPT2              |  4   | 49.3612  |  50.6168  | 25.3235  |        25.2676         |
|         timm_efficientnet         |  32  |  34.564  |  51.7456  | 23.5489  |        23.2529         |
|             resnet50              |  32  | 26.2812  |  37.0821  | 22.8879  |        22.8241         |
|              hf_Bert              |  4   | 40.7494  |  48.2982  | 22.4111  |        22.4317         |
|           hf_DistilBert           |  8   | 32.1005  |  35.7249  | 22.0729  |        22.0321         |
|            densenet121            |  4   | 60.8842  |  86.2346  | 20.9685  |        18.8171         |
|        shufflenet_v2_x1_0         | 128  | 32.1105  |  40.0698  | 19.7102  |        19.7218         |
|           BERT_pytorch            |  16  | 53.4104  |  66.8912  | 17.0935  |        17.0782         |
|      timm_vision_transformer      |  32  |  33.391  |  33.5448  | 16.7177  |        16.6983         |
|           timm_resnest            |  32  | 24.2609  |  28.3734  | 16.6174  |        16.6065         |
|        mobilenet_v3_large         |  32  | 28.9221  |  36.5796  | 13.3713  |        13.8347         |
|            mnasnet1_0             |  32  | 23.6481  |  31.9652  | 13.2494  |        13.1868         |
|          pytorch_stargan          |  16  | 14.7275  |  18.1919  | 11.9033  |        11.8483         |
|         phlippe_densenet          | 128  | 23.9166  |  30.3144  | 11.9014  |         11.541         |
|          resnext50_32x4d          |  8   | 22.3265  |  30.6785  | 11.8135  |        11.6231         |
|      nvidia_deeprecommender       | 256  | 10.2265  |  10.2372  | 10.9236  |        10.9303         |
|          LearningToPaint          |  96  | 12.0771  |  14.3152  |  8.7481  |         8.7039         |
|   pytorch_CycleGAN_and_pix2pix    |  1   | 13.8376  |  15.1046  |  7.2182  |         7.1309         |
|            tts_angular            |  64  |  6.5685  |  6.8807   |  6.5165  |         6.3773         |
|             resnet18              |  16  |  9.8001  |   12.82   |  5.7904  |         5.7424         |
|           squeezenet1_1           |  32  | 10.3157  |  11.8552  |  5.4993  |         5.4342         |
|          phlippe_resnet           | 128  |  9.2808  |  12.0101  |  5.0985  |         5.0227         |
|       functorch_dp_cifar10        |  64  | 10.5392  |  12.2784  |  2.8779  |         2.8591         |
|                drq                |  1   |  3.3757  |   4.345   |  2.8418  |         3.0024         |
|          pytorch_struct           | 200  |  5.596   |  6.1243   |  2.7861  |         2.688          |
|               dcgan               |  32  |  2.3663  |  3.0443   |  1.4545  |         1.4307         |
|         soft_actor_critic         | 256  |  2.6689  |   3.419   |  1.2898  |         1.8432         |
|           lennard_jones           | 1000 |  1.7473  |  2.3537   |  1.0843  |         1.0323         |
|           hf_GPT2_large           |  4   | 213.7892 | 214.8782  |   nan    |        120.237         |
|            hf_BigBird             |  2   | 197.2802 | 275.9741  |   nan    |          nan           |
|            hf_Reformer            |  4   | 81.6483  |  86.0763  |   nan    |          nan           |
|              alexnet              | 128  |  9.8389  |  9.8565   |   nan    |          nan           |
|               dlrm                | 1024 |  4.942   |  5.0729   |   nan    |          nan           |
|   timm_vision_transformer_large   |  32  | 465.2338 |    nan    |   nan    |          nan           |
|               moco                |  32  | 50.3693  |    nan    |   nan    |          nan           |
|        doctr_det_predictor        |  0   |   nan    |    nan    |   nan    |          nan           |
|       doctr_reco_predictor        |  0   |   nan    |    nan    |   nan    |          nan           |
|             tacotron2             |  0   |   nan    |    nan    |   nan    |          nan           |
|           torchrec_dlrm           |  0   |   nan    |    nan    |   nan    |          nan           |
+-----------------------------------+------+----------+-----------+----------+------------------------+

huggingface suite with amp precision

Performance speedup

+-----------------------------------------+-----+--------+-----------+----------+------------------------+
|                  name                   | bs  | eager  | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
|             OPTForCausalLM              |  2  | 0.993  |  0.9322   |  2.4897  |         2.4996         |
|      GPT2ForSequenceClassification      |  4  | 0.9814 |  0.9568   |  2.2984  |         2.2976         |
|             XGLMForCausalLM             |  8  | 0.9698 |  0.7496   |  2.2303  |         2.2367         |
|       ElectraForQuestionAnswering       | 64  | 0.988  |  0.9772   |  2.1243  |         2.1242         |
|       MT5ForConditionalGeneration       | 16  | 0.9921 |  0.8398   |  2.1178  |         2.134          |
|          MobileBertForMaskedLM          | 64  | 0.9448 |  0.8025   |  1.9947  |         1.7704         |
|               DistillGPT2               | 16  | 0.9903 |  0.9579   |  1.8904  |         1.8944         |
|            PLBartForCausalLM            |  8  | 0.9948 |  0.9622   |  1.8711  |         1.887          |
|            XLNetLMHeadModel             |  8  | 0.9959 |  0.9685   |  1.8479  |         1.8511         |
|           ElectraForCausalLM            | 32  | 0.9826 |  0.9361   |  1.7959  |         1.7954         |
|        BertForQuestionAnswering         | 16  | 0.9856 |  0.9702   |  1.772   |         1.7714         |
|       RobertaForQuestionAnswering       | 16  | 0.9855 |  0.9707   |  1.7664  |         1.7652         |
|          AllenaiLongformerBase          |  4  | 0.9464 |  0.6571   |  1.7653  |         1.7588         |
|     PLBartForConditionalGeneration      |  4  | 0.9923 |   0.933   |  1.7257  |         1.7043         |
|           RobertaForCausalLM            | 16  | 0.988  |  0.9628   |  1.6759  |         1.6768         |
|      MBartForConditionalGeneration      |  2  | 0.9965 |  0.9602   |  1.666   |         1.5361         |
|                 T5Small                 |  4  | 0.9822 |  0.8534   |  1.665   |         1.6628         |
|       T5ForConditionalGeneration        |  4  | 0.9834 |  0.8546   |  1.6609  |         1.6574         |
|            MBartForCausalLM             |  4  | 0.9931 |  0.9688   |  1.6459  |         1.6427         |
|             BartForCausalLM             |  4  | 0.9924 |  0.9684   |  1.6393  |         1.638          |
|    MegatronBertForQuestionAnswering     |  8  | 0.9811 |  0.9616   |  1.6253  |         1.6249         |
|       AlbertForQuestionAnswering        |  4  | 0.9998 |  0.8856   |  1.6244  |         1.6236         |
|                CamemBert                | 16  | 0.988  |  0.9641   |  1.6202  |         1.6186         |
|            YituTechConvBert             | 16  | 0.9863 |  0.9562   |  1.6156  |         1.6138         |
|            AlbertForMaskedLM            |  4  | 0.9998 |   0.885   |  1.6134  |         1.6143         |
|             BertForMaskedLM             | 16  | 0.9863 |  0.9617   |  1.597   |         1.5965         |
|     M2M100ForConditionalGeneration      | 16  | 1.0331 |   0.819   |  1.5961  |         1.7258         |
|           LayoutLMForMaskedLM           | 16  | 0.9874 |   0.963   |  1.5816  |         1.5888         |
|      BartForConditionalGeneration       |  2  | 0.9946 |  0.9555   |  1.5481  |         1.5446         |
|         MegatronBertForCausalLM         |  4  | 0.9818 |   0.903   |  1.509   |         1.5013         |
|         Speech2Text2ForCausalLM         | 256 | 0.9851 |  0.9354   |  1.4949  |         1.4798         |
| BlenderbotSmallForConditionalGeneration | 64  | 1.0006 |  0.8831   |  1.4595  |         1.4665         |
|     DistilBertForQuestionAnswering      | 256 | 0.9944 |  0.9881   |  1.4505  |         1.4518         |
|           PegasusForCausalLM            | 32  | 0.9892 |  0.8861   |  1.3919  |         1.3482         |
|            TrOCRForCausalLM             | 32  | 0.9928 |  0.9635   |  1.3812  |         1.3801         |
|       BlenderbotSmallForCausalLM        | 64  | 0.9778 |  0.8852   |  1.3759  |         1.373          |
|     PegasusForConditionalGeneration     | 32  | 0.9997 |  0.9115   |  1.3254  |         1.2809         |
|          DistilBertForMaskedLM          | 128 | 0.9925 |  0.9507   |  1.2233  |         1.2236         |
|     MobileBertForQuestionAnswering      | 128 | 0.9484 |  0.8065   |  0.7854  |         0.764          |
|    LayoutLMForSequenceClassification    | 16  | 0.9851 |  0.9718   |   0.0    |          0.0           |
|       DebertaForQuestionAnswering       |  8  | 0.9473 |  0.7895   |   0.0    |          0.0           |
|          BlenderbotForCausalLM          |  4  | 0.9715 |  0.7692   |   0.0    |          0.0           |
|           DebertaForMaskedLM            |  4  | 0.8734 |  0.6369   |   0.0    |          0.0           |
|      DebertaV2ForQuestionAnswering      |  2  | 0.8385 |  0.6083   |   0.0    |          0.0           |
|          DebertaV2ForMaskedLM           |  1  | 0.8334 |  0.6075   |   0.0    |          0.0           |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+

Accuracy

+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
|                  name                   | bs |      eager       |    aot_eager     |     inductor     | inductor_no_cudagraphs |
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
|          BlenderbotForCausalLM          | 1  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |    pass_due_to_skip    |
|          DebertaV2ForMaskedLM           | 1  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |    pass_due_to_skip    |
|       MT5ForConditionalGeneration       | 1  |       pass       |       pass       |       pass       |          pass          |
|         MegatronBertForCausalLM         | 1  |       pass       |       pass       |       pass       |          pass          |
|    MegatronBertForQuestionAnswering     | 1  |       pass       |       pass       |       pass       |          pass          |
|          MobileBertForMaskedLM          | 1  |       pass       |       pass       |       pass       |          pass          |
|     MobileBertForQuestionAnswering      | 1  |       pass       |       pass       |       pass       |          pass          |
|             OPTForCausalLM              | 1  |       pass       |       pass       |       pass       |          pass          |
|            PLBartForCausalLM            | 1  |       pass       |       pass       |       pass       |          pass          |
|     PLBartForConditionalGeneration      | 1  |       pass       |       pass       |       pass       |          pass          |
|           PegasusForCausalLM            | 1  |       pass       |       pass       |       pass       |          pass          |
|     PegasusForConditionalGeneration     | 1  |       pass       |       pass       |       pass       |          pass          |
|           RobertaForCausalLM            | 1  |       pass       |       pass       |       pass       |          pass          |
|       RobertaForQuestionAnswering       | 1  |       pass       |       pass       |       pass       |          pass          |
|         Speech2Text2ForCausalLM         | 1  |       pass       |       pass       |       pass       |          pass          |
|       T5ForConditionalGeneration        | 1  |       pass       |       pass       |       pass       |          pass          |
|                 T5Small                 | 1  |       pass       |       pass       |       pass       |          pass          |
|            TrOCRForCausalLM             | 1  |       pass       |       pass       |       pass       |          pass          |
|             XGLMForCausalLM             | 1  |       pass       |       pass       |       pass       |          pass          |
|            XLNetLMHeadModel             | 1  |       pass       |       pass       |       pass       |          pass          |
|            YituTechConvBert             | 1  |       pass       |       pass       |       pass       |          pass          |
|      MBartForConditionalGeneration      | 1  |       pass       |       pass       |       pass       |          pass          |
|            MBartForCausalLM             | 1  |       pass       |       pass       |       pass       |          pass          |
|     M2M100ForConditionalGeneration      | 1  |       pass       |       pass       |       pass       |          pass          |
|    LayoutLMForSequenceClassification    | 1  |       pass       |       pass       |       pass       |          pass          |
|            AlbertForMaskedLM            | 1  |       pass       |       pass       |       pass       |          pass          |
|          AllenaiLongformerBase          | 1  |       pass       |       pass       |       pass       |          pass          |
|             BartForCausalLM             | 1  |       pass       |       pass       |       pass       |          pass          |
|      BartForConditionalGeneration       | 1  |       pass       |       pass       |       pass       |          pass          |
|             BertForMaskedLM             | 1  |       pass       |       pass       |       pass       |          pass          |
|        BertForQuestionAnswering         | 1  |       pass       |       pass       |       pass       |          pass          |
|       BlenderbotSmallForCausalLM        | 1  |       pass       |       pass       |       pass       |          pass          |
| BlenderbotSmallForConditionalGeneration | 1  |       pass       |       pass       |       pass       |          pass          |
|                CamemBert                | 1  |       pass       |       pass       |       pass       |          pass          |
|           DebertaForMaskedLM            | 1  |       pass       |       pass       |       pass       |          pass          |
|       DebertaForQuestionAnswering       | 1  |       pass       |       pass       |       pass       |          pass          |
|          DistilBertForMaskedLM          | 1  |       pass       |       pass       |       pass       |          pass          |
|     DistilBertForQuestionAnswering      | 1  |       pass       |       pass       |       pass       |          pass          |
|               DistillGPT2               | 1  |       pass       |       pass       |       pass       |          pass          |
|           ElectraForCausalLM            | 1  |       pass       |       pass       |       pass       |          pass          |
|       ElectraForQuestionAnswering       | 1  |       pass       |       pass       |       pass       |          pass          |
|      GPT2ForSequenceClassification      | 1  |       pass       |       pass       |       pass       |          pass          |
|           LayoutLMForMaskedLM           | 1  |       pass       |       pass       |       pass       |          pass          |
|      DebertaV2ForQuestionAnswering      | 1  |       pass       |       pass       |   fail_to_run    |      fail_to_run       |
|       AlbertForQuestionAnswering        | 1  |       pass       |       pass       |  fail_accuracy   |     fail_accuracy      |
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+

Compilation latency (sec)

+-----------------------------------------+-----+---------+-----------+----------+------------------------+
|                  name                   | bs  |  eager  | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+---------+-----------+----------+------------------------+
|          MobileBertForMaskedLM          | 64  | 15.2018 |  38.6369  | 612.8043 |        113.3956        |
|     MobileBertForQuestionAnswering      | 128 | 15.3019 |  38.2697  | 601.811  |        114.6848        |
|       MT5ForConditionalGeneration       | 16  | 7.8685  |  18.4295  | 579.4308 |        55.6624         |
|           ElectraForCausalLM            | 32  | 7.4534  |  13.9227  | 409.8685 |        37.9467         |
|            AlbertForMaskedLM            |  4  | 2.3588  |  8.1529   | 321.446  |        27.6028         |
|            XLNetLMHeadModel             |  8  | 10.4839 |  27.2821  | 282.4343 |        84.5672         |
|       ElectraForQuestionAnswering       | 64  | 5.1265  |  11.261   | 273.235  |        33.3345         |
|       T5ForConditionalGeneration        |  4  | 5.5981  |  13.199   | 260.3542 |        40.7701         |
|     M2M100ForConditionalGeneration      | 16  | 11.5678 |  25.368   | 259.9593 |        88.0024         |
|          AllenaiLongformerBase          |  4  | 11.3305 |  31.1314  | 251.2462 |        125.0294        |
|      GPT2ForSequenceClassification      |  4  | 4.7907  |  9.7196   | 238.0007 |        30.1957         |
|            YituTechConvBert             | 16  | 10.3942 |  20.1558  | 236.3194 |        54.8597         |
|             XGLMForCausalLM             |  8  | 9.7456  |  20.9409  | 232.0268 |        74.9239         |
|      BartForConditionalGeneration       |  2  | 11.5589 |  25.6066  | 220.9387 |        78.6658         |
|             BertForMaskedLM             | 16  | 5.1121  |  10.6408  | 217.7401 |        33.0531         |
|          DistilBertForMaskedLM          | 128 |  2.499  |  5.7283   | 215.7511 |        18.7842         |
|            TrOCRForCausalLM             | 32  | 6.4664  |  11.8024  | 215.566  |         38.541         |
|     DistilBertForQuestionAnswering      | 256 | 2.4975  |  5.6655   | 198.5407 |        18.6101         |
|             BartForCausalLM             |  4  | 6.2474  |  11.8835  | 178.4346 |        39.8267         |
|       BlenderbotSmallForCausalLM        | 64  | 4.3127  |  8.3279   | 172.8293 |        28.0008         |
|               DistillGPT2               | 16  | 2.5314  |  4.9717   | 163.3151 |        17.6674         |
|         Speech2Text2ForCausalLM         | 256 | 3.2033  |  6.0461   | 161.3093 |        23.5722         |
|    MegatronBertForQuestionAnswering     |  8  | 10.0741 |  20.9543  | 146.4264 |        63.9292         |
|             OPTForCausalLM              |  2  | 5.3838  |  10.8433  | 116.3097 |         36.559         |
|           PegasusForCausalLM            | 32  | 5.8891  |  11.4565  | 106.0252 |         38.085         |
|      MBartForConditionalGeneration      |  2  | 11.4184 |  25.7575  | 104.9186 |         88.017         |
|         MegatronBertForCausalLM         |  4  | 9.9645  |  21.1117  | 100.4353 |         64.697         |
|     PLBartForConditionalGeneration      |  4  | 9.1138  |  17.3234  | 97.3324  |        47.6158         |
|     PegasusForConditionalGeneration     | 32  | 4.9909  |  18.9919  | 93.9846  |        75.2113         |
|            PLBartForCausalLM            |  8  | 3.5866  |  6.7189   | 91.7901  |        21.8006         |
|        BertForQuestionAnswering         | 16  | 5.0775  |  11.2128  | 88.0399  |        33.0342         |
|       AlbertForQuestionAnswering        |  4  |  2.308  |  7.9774   | 86.4534  |        27.2846         |
| BlenderbotSmallForConditionalGeneration | 64  | 7.5515  |  17.9201  | 85.1413  |        53.4714         |
|                CamemBert                | 16  | 5.1905  |  11.2798  | 63.3132  |        32.8948         |
|            MBartForCausalLM             |  4  | 6.2442  |  12.2141  | 41.3957  |        38.1136         |
|                 T5Small                 |  4  | 5.5735  |  12.5455  | 40.3395  |        40.4291         |
|           RobertaForCausalLM            | 16  | 5.1457  |  11.2786  | 40.1117  |        33.7172         |
|           LayoutLMForMaskedLM           | 16  |  5.557  |  11.7614  | 35.2617  |        34.1443         |
|       RobertaForQuestionAnswering       | 16  | 5.1141  |  11.2751  | 33.8184  |        32.8534         |
|      DebertaV2ForQuestionAnswering      |  2  | 15.1057 |  27.645   |   nan    |          nan           |
|          DebertaV2ForMaskedLM           |  1  | 15.1776 |  26.3611  |   nan    |          nan           |
|          BlenderbotForCausalLM          |  4  | 11.5496 |  22.3509  |   nan    |          nan           |
|           DebertaForMaskedLM            |  4  |  7.414  |  13.4548  |   nan    |          nan           |
|       DebertaForQuestionAnswering       |  8  | 7.1924  |  13.3661  |   nan    |          nan           |
|    LayoutLMForSequenceClassification    | 16  | 5.4712  |  11.605   |   nan    |          nan           |
+-----------------------------------------+-----+---------+-----------+----------+------------------------+

Peak Memory Compression Ratio

+-----------------------------------------+-----+--------+-----------+----------+------------------------+
|                  name                   | bs  | eager  | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
|            XLNetLMHeadModel             |  8  | 0.9843 |  0.9603   |  1.1342  |         1.1342         |
|      GPT2ForSequenceClassification      |  4  | 1.0001 |   0.906   |  1.1135  |         1.114          |
|       ElectraForQuestionAnswering       | 64  | 1.0014 |  0.9537   |  1.1114  |         1.1387         |
|        BertForQuestionAnswering         | 16  | 1.0017 |  0.9284   |  1.0868  |         1.0868         |
|       RobertaForQuestionAnswering       | 16  | 1.0012 |  0.9279   |  1.0865  |         1.0865         |
|             OPTForCausalLM              |  2  | 0.9682 |  0.9246   |  1.0617  |         1.062          |
|           RobertaForCausalLM            | 16  | 0.9999 |  0.9209   |  1.0541  |         1.0541         |
|                 T5Small                 |  4  | 0.9999 |  0.9516   |  1.0382  |         1.0382         |
|       T5ForConditionalGeneration        |  4  | 0.9999 |  0.9516   |  1.0356  |         1.0382         |
|             BertForMaskedLM             | 16  | 0.9998 |  0.9207   |   1.03   |         1.0539         |
|     DistilBertForQuestionAnswering      | 256 | 1.0114 |  0.9556   |  1.0299  |         1.057          |
|                CamemBert                | 16  |  1.0   |  0.9184   |  1.0277  |         1.0511         |
|           LayoutLMForMaskedLM           | 16  | 0.9999 |  0.9211   |  1.0078  |         1.0078         |
|            YituTechConvBert             | 16  | 0.953  |  0.8749   |  0.9793  |         0.9793         |
|       AlbertForQuestionAnswering        |  4  |  1.0   |  0.7449   |  0.9734  |         0.9734         |
|               DistillGPT2               | 16  |  1.0   |  0.8591   |  0.9682  |         0.9682         |
|            AlbertForMaskedLM            |  4  |  1.0   |  0.7338   |  0.9574  |         0.9574         |
|    MegatronBertForQuestionAnswering     |  8  |  1.0   |   0.904   |  0.953   |         0.953          |
|     PLBartForConditionalGeneration      |  4  |  0.93  |  0.8787   |  0.9215  |         0.9575         |
|     PegasusForConditionalGeneration     | 32  | 0.9439 |  0.8957   |  0.8911  |         0.8911         |
|       MT5ForConditionalGeneration       | 16  | 0.9999 |  0.8495   |  0.8906  |         0.9089         |
|           ElectraForCausalLM            | 32  | 0.9161 |  0.7864   |  0.8896  |         0.8896         |
|            PLBartForCausalLM            |  8  | 0.9237 |  0.8168   |  0.8748  |         0.8918         |
|          DistilBertForMaskedLM          | 128 |  1.0   |  0.8468   |  0.8677  |         0.8849         |
|      MBartForConditionalGeneration      |  2  |  1.0   |  0.8946   |  0.8672  |         0.8672         |
|            TrOCRForCausalLM             | 32  |  0.92  |  0.8307   |  0.8628  |         0.8628         |
|            MBartForCausalLM             |  4  | 0.951  |  0.8913   |  0.8501  |         0.8501         |
|      BartForConditionalGeneration       |  2  |  1.0   |  0.8987   |  0.8456  |         0.8456         |
|         MegatronBertForCausalLM         |  4  |  1.0   |  0.8644   |  0.845   |         0.845          |
|             BartForCausalLM             |  4  | 0.951  |  0.8911   |  0.8311  |         0.8311         |
| BlenderbotSmallForConditionalGeneration | 64  |  1.0   |  0.8895   |  0.816   |         0.8729         |
|           PegasusForCausalLM            | 32  | 0.9238 |  0.8405   |  0.7966  |         0.7966         |
|       BlenderbotSmallForCausalLM        | 64  | 0.8906 |  0.7493   |  0.787   |         0.808          |
|          MobileBertForMaskedLM          | 64  |  1.0   |  0.8769   |  0.752   |         0.7654         |
|         Speech2Text2ForCausalLM         | 256 | 0.8865 |  0.7573   |  0.7364  |         0.7566         |
|             XGLMForCausalLM             |  8  | 0.9431 |  0.8612   |  0.6744  |         0.6744         |
|     MobileBertForQuestionAnswering      | 128 | 1.0161 |  1.0064   |  0.6505  |         0.6644         |
|     M2M100ForConditionalGeneration      | 16  | 0.955  |  0.8772   |  0.6058  |         0.6058         |
|          AllenaiLongformerBase          |  4  | 0.8568 |  0.7887   |  0.4696  |         0.4697         |
|       DebertaForQuestionAnswering       |  8  | 0.9524 |  1.0537   |   nan    |          nan           |
|          BlenderbotForCausalLM          |  4  | 0.9932 |  0.9937   |   nan    |          nan           |
|      DebertaV2ForQuestionAnswering      |  2  | 0.9764 |  0.9763   |   nan    |          nan           |
|    LayoutLMForSequenceClassification    | 16  | 1.0014 |  0.9295   |   nan    |          nan           |
|           DebertaForMaskedLM            |  4  | 0.9326 |  0.9156   |   nan    |          nan           |
|          DebertaV2ForMaskedLM           |  1  | 0.977  |  0.9068   |   nan    |          nan           |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+

Absolute latency (ms)

+-----------------------------------------+-----+----------+-----------+----------+------------------------+
|                  name                   | bs  |  eager   | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
|     MobileBertForQuestionAnswering      | 128 | 173.0459 |  210.317  | 216.1644 |        216.2452        |
|            AlbertForMaskedLM            |  4  | 266.3771 |  300.754  | 165.0621 |        164.8743        |
|       AlbertForQuestionAnswering        |  4  | 264.3063 | 298.3179  | 162.6193 |        162.649         |
|            XLNetLMHeadModel             |  8  | 281.629  | 288.8707  | 151.3718 |        151.9249        |
|     PegasusForConditionalGeneration     | 32  | 139.0758 | 157.9987  | 107.5884 |        107.4553        |
|          AllenaiLongformerBase          |  4  | 192.6358 | 274.6697  | 103.0385 |        103.0073        |
|            TrOCRForCausalLM             | 32  | 139.2544 |  142.412  | 100.0796 |         99.95          |
|          MobileBertForMaskedLM          | 64  | 175.2932 | 215.3965  |  96.302  |         95.617         |
|      MBartForConditionalGeneration      |  2  | 139.8787 | 147.0231  | 89.2408  |        89.0775         |
|      BartForConditionalGeneration       |  2  | 138.7751 | 145.6953  | 88.8995  |        88.9364         |
|    MegatronBertForQuestionAnswering     |  8  | 144.4802 | 147.2926  | 87.2724  |        87.2661         |
|            YituTechConvBert             | 16  | 127.3377 | 131.3746  | 77.6685  |        77.7384         |
| BlenderbotSmallForConditionalGeneration | 64  | 113.1721 |  141.713  | 75.8347  |        75.7973         |
|                CamemBert                | 16  | 119.9261 | 123.1083  | 73.0439  |        73.1466         |
|     M2M100ForConditionalGeneration      | 16  | 106.1743 | 169.0829  | 71.3574  |         71.253         |
|           LayoutLMForMaskedLM           | 16  | 114.0876 | 116.9931  | 71.1725  |        70.8657         |
|     DistilBertForQuestionAnswering      | 256 | 104.2816 | 104.6358  | 71.1721  |        71.1248         |
|            MBartForCausalLM             |  4  | 114.3137 | 117.1971  | 69.4236  |        69.1229         |
|             BartForCausalLM             |  4  | 114.7808 | 116.9336  | 69.3089  |        69.2921         |
|          DistilBertForMaskedLM          | 128 | 85.2505  |  89.1399  | 69.2377  |        69.1994         |
|             BertForMaskedLM             | 16  | 111.4021 | 114.1966  | 68.8368  |        68.8219         |
|     PLBartForConditionalGeneration      |  4  | 118.6356 | 126.0645  |  68.807  |         68.619         |
|           RobertaForCausalLM            | 16  | 116.4052 | 119.7136  |  68.69   |        68.5675         |
|             OPTForCausalLM              |  2  | 169.854  | 183.0746  | 68.1696  |         68.278         |
|       T5ForConditionalGeneration        |  4  | 106.0459 |  122.561  | 63.0649  |        63.0061         |
|                 T5Small                 |  4  | 106.7222 | 122.3385  | 62.9937  |        62.9785         |
|            PLBartForCausalLM            |  8  | 115.8989 | 117.9432  | 62.1046  |        62.1039         |
|         MegatronBertForCausalLM         |  4  | 88.7223  |  96.5178  | 57.6411  |        57.5781         |
|               DistillGPT2               | 16  | 106.8258 | 110.3284  | 55.8867  |         55.807         |
|       RobertaForQuestionAnswering       | 16  | 96.9853  |  98.7218  | 54.1639  |        54.1492         |
|       ElectraForQuestionAnswering       | 64  | 116.1091 | 117.4931  | 53.9407  |         53.893         |
|        BertForQuestionAnswering         | 16  | 96.7365  |  98.4099  | 53.7452  |        53.7093         |
|           PegasusForCausalLM            | 32  | 69.9836  |  83.7333  | 53.2489  |        53.2104         |
|             XGLMForCausalLM             |  8  |  93.387  | 146.3344  | 52.7061  |        52.8895         |
|           ElectraForCausalLM            | 32  |  89.611  |  94.3427  | 49.0565  |        49.0309         |
|       MT5ForConditionalGeneration       | 16  | 92.3155  | 111.3906  | 43.7342  |        43.7212         |
|       BlenderbotSmallForCausalLM        | 64  | 62.8581  |  69.3089  | 42.0479  |        42.0492         |
|      GPT2ForSequenceClassification      |  4  | 93.2532  |  95.4327  | 39.8223  |        39.7509         |
|         Speech2Text2ForCausalLM         | 256 |  53.442  |  56.6301  | 35.7724  |        35.7478         |
|      DebertaV2ForQuestionAnswering      |  2  | 126.2931 | 192.3707  |   nan    |          nan           |
|          DebertaV2ForMaskedLM           |  1  | 122.3793 | 192.1189  |   nan    |          nan           |
|          BlenderbotForCausalLM          |  4  | 104.363  | 130.0217  |   nan    |          nan           |
|    LayoutLMForSequenceClassification    | 16  | 99.1831  | 100.6837  |   nan    |          nan           |
|           DebertaForMaskedLM            |  4  | 80.4079  |  98.8314  |   nan    |          nan           |
|       DebertaForQuestionAnswering       |  8  | 80.0075  |  95.9283  |   nan    |          nan           |
+-----------------------------------------+-----+----------+-----------+----------+------------------------+

timm_models suite with amp precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------+------------------------+ | tnt_s_patch16_224 | 128 | 0.9991 | 0.997 | 3.2993 | 3.3012 | | twins_pcpvt_base | 64 | 0.9981 | 0.9037 | 2.1057 | 2.0902 | | xcit_large_24_p8_224 | 5 | 0.9936 | 0.8702 | 2.0997 | 2.1006 | | coat_lite_mini | 128 | 0.9973 | 0.9957 | 2.0576 | 2.0582 | | gmixer_24_224 | 128 | 0.9949 | 0.8894 | 1.8599 | 1.8621 | | crossvit_9_240 | 128 | 0.9903 | 0.7832 | 1.7896 | 1.7833 | | ghostnet_100 | 128 | 0.9921 | 0.7612 | 1.7783 | 1.7826 | | volo_d1_224 | 64 | 0.9939 | 0.9729 | 1.7272 | 1.7248 | | gmlp_s16_224 | 128 | 0.9944 | 1.0822 | 1.7209 | 1.7202 | | swin_base_patch4_window7_224 | 64 | 0.991 | 0.9544 | 1.7083 | 1.7071 | | convit_base | 64 | 0.998 | 0.9971 | 1.6228 | 1.6226 | | pit_b_224 | 64 | 0.9949 | 0.9924 | 1.602 | 1.6041 | | lcnet_050 | 128 | 0.9413 | 0.73 | 1.5877 | 1.5826 | | jx_nest_base | 32 | 0.9867 | 0.9852 | 1.5469 | 1.5457 | | gluon_inception_v3 | 128 | 0.9963 | 0.8646 | 1.5182 | 1.518 | | adv_inception_v3 | 128 | 0.9959 | 0.8597 | 1.509 | 1.5098 | | inception_v3 | 128 | 0.9981 | 0.8632 | 1.5074 | 1.5053 | | convnext_base | 64 | 0.9836 | 0.9846 | 1.4959 | 1.4947 | | sebotnet33ts_256 | 64 | 0.9576 | 0.7547 | 1.4718 | 1.472 | | dla102 | 128 | 0.9958 | 0.8154 | 1.4684 | 1.4674 | | mobilevit_s | 64 | 0.9618 | 0.7313 | 1.4474 | 1.4471 | | beit_base_patch16_224 | 64 | 0.9969 | 0.9588 | 1.4431 | 1.4446 | | cait_m36_384 | 4 | 0.995 | 0.9924 | 1.4373 | 1.439 | | nfnet_l0 | 128 | 0.9895 | 0.8142 | 1.4362 | 1.4496 | | dm_nfnet_f0 | 128 | 0.9866 | 0.9853 | 1.4131 | 1.4138 | | eca_botnext26ts_256 | 128 | 0.9734 | 0.7194 | 1.4042 | 1.4048 | | resmlp_12_224 | 128 | 0.9927 | 0.8893 | 1.3938 | 1.3925 | | botnet26t_256 | 128 | 0.9734 | 0.851 | 1.3868 | 1.3853 | | mnasnet_100 | 128 | 0.9488 | 
0.7407 | 1.3724 | 1.3718 | | resnest101e | 64 | 0.9945 | 0.868 | 1.3632 | 1.3657 | | mixer_b16_224 | 128 | 0.9974 | 1.0178 | 1.3596 | 1.3597 | | selecsls42b | 128 | 0.9984 | 0.8114 | 1.3545 | 1.3524 | | regnety_002 | 128 | 0.9545 | 0.7143 | 1.3509 | 1.3572 | | mobilenetv2_100 | 128 | 0.9493 | 0.7379 | 1.3464 | 1.3488 | | mobilenetv3_large_100 | 128 | 0.9495 | 0.7603 | 1.3463 | 1.347 | | vit_base_patch16_224 | 64 | 0.9961 | 0.9935 | 1.3375 | 1.3342 | | res2net50_14w_8s | 128 | 0.999 | 0.7907 | 1.336 | 1.335 | | hrnet_w18 | 128 | 0.9921 | 0.6427 | 1.3248 | 1.3228 | | res2next50 | 128 | 0.9987 | 0.8247 | 1.3153 | 1.3152 | | deit_base_distilled_patch16_224 | 64 | 0.9966 | 0.9935 | 1.3144 | 1.3149 | | spnasnet_100 | 128 | 0.9412 | 0.7387 | 1.3038 | 1.3053 | | tf_efficientnet_b0 | 128 | 0.9609 | 0.6814 | 1.293 | 1.2933 | | fbnetc_100 | 128 | 0.9501 | 0.7387 | 1.2924 | 1.3164 | | poolformer_m36 | 64 | 0.9859 | 0.9831 | 1.2743 | 1.2761 | | rexnet_100 | 128 | 0.952 | 0.7028 | 1.2474 | 1.245 | | ese_vovnet19b_dw | 128 | 0.9583 | 0.8341 | 1.2474 | 1.248 | | fbnetv3_b | 128 | 0.949 | 0.7691 | 1.2178 | 1.2426 | | visformer_small | 128 | 0.9962 | 0.9446 | 1.1907 | 1.1905 | | tinynet_a | 128 | 0.9472 | 0.6783 | 1.1812 | 1.1823 | | tf_mixnet_l | 128 | 0.9766 | 0.827 | 1.1678 | 1.1665 | | mixnet_l | 128 | 0.9757 | 0.8212 | 1.1554 | 1.1558 | | cspdarknet53 | 64 | 0.9318 | 0.7859 | 1.1486 | 1.148 | | res2net101_26w_4s | 64 | 1.0004 | 0.7876 | 1.122 | 1.123 | | dpn107 | 32 | 0.9319 | 0.8071 | 1.0769 | 1.0769 | | gluon_xception65 | 32 | 0.9923 | 0.8426 | 1.0646 | 1.0652 | | swsl_resnext101_32x16d | 32 | 0.9977 | 0.8398 | 1.0434 | 1.0432 | | repvgg_a2 | 128 | 0.936 | 0.7563 | 1.0375 | 1.0353 | | gernet_l | 128 | 0.9358 | 0.7928 | 1.0066 | 1.0234 | | convmixer_768_32 | 32 | 0.9986 | 0.965 | 0.9959 | 0.9959 | | pnasnet5large | 16 | 0.9857 | 0.91 | 0.9034 | 0.901 | +---------------------------------+-----+--------+-----------+----------+------------------------+ ~~~ Accuracy ~~~ 
+---------------------------------+----+---------------+---------------+---------------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+----+---------------+---------------+---------------+------------------------+ | adv_inception_v3 | 8 | pass | pass | pass | pass | | resnest101e | 8 | pass | pass | pass | pass | | swin_base_patch4_window7_224 | 8 | pass | pass | pass | pass | | swsl_resnext101_32x16d | 8 | pass | pass | pass | pass | | tnt_s_patch16_224 | 8 | pass | pass | pass | pass | | twins_pcpvt_base | 8 | pass | pass | pass | pass | | visformer_small | 8 | pass | pass | pass | pass | | vit_base_patch16_224 | 8 | pass | pass | pass | pass | | volo_d1_224 | 8 | pass | pass | pass | pass | | botnet26t_256 | 8 | fail_accuracy | pass | pass | pass | | cspdarknet53 | 8 | fail_accuracy | pass | pass | pass | | dpn107 | 8 | fail_accuracy | pass | pass | pass | | ese_vovnet19b_dw | 8 | fail_accuracy | pass | pass | pass | | fbnetc_100 | 8 | fail_accuracy | pass | pass | pass | | mixnet_l | 8 | fail_accuracy | pass | pass | pass | | mnasnet_100 | 8 | fail_accuracy | pass | pass | pass | | mobilevit_s | 8 | fail_accuracy | pass | pass | pass | | regnety_002 | 8 | fail_accuracy | pass | pass | pass | | repvgg_a2 | 8 | fail_accuracy | pass | pass | pass | | rexnet_100 | 8 | fail_accuracy | pass | pass | pass | | spnasnet_100 | 8 | fail_accuracy | pass | pass | pass | | tf_efficientnet_b0 | 8 | fail_accuracy | pass | pass | pass | | tf_mixnet_l | 8 | fail_accuracy | pass | pass | pass | | tinynet_a | 8 | fail_accuracy | pass | pass | pass | | eca_botnext26ts_256 | 8 | fail_accuracy | fail_accuracy | pass | pass | | gernet_l | 8 | fail_accuracy | fail_accuracy | pass | pass | | mobilenetv2_100 | 8 | fail_accuracy | fail_accuracy | pass | pass | | beit_base_patch16_224 | 8 | pass | pass | pass | pass | | selecsls42b | 8 | pass | pass | pass | pass | | resmlp_12_224 | 8 | pass | pass | pass | 
pass | | gmlp_s16_224 | 8 | pass | pass | pass | pass | | cait_m36_384 | 4 | pass | pass | pass | pass | | convit_base | 8 | pass | pass | pass | pass | | convmixer_768_32 | 8 | pass | pass | pass | pass | | convnext_base | 8 | pass | pass | pass | pass | | crossvit_9_240 | 8 | pass | pass | pass | pass | | deit_base_distilled_patch16_224 | 8 | pass | pass | pass | pass | | dla102 | 8 | pass | pass | pass | pass | | dm_nfnet_f0 | 8 | pass | pass | pass | pass | | ghostnet_100 | 8 | pass | pass | pass | pass | | gluon_inception_v3 | 8 | pass | pass | pass | pass | | gluon_xception65 | 8 | pass | pass | pass | pass | | res2next50 | 8 | pass | pass | pass | pass | | gmixer_24_224 | 8 | pass | pass | pass | pass | | hrnet_w18 | 8 | pass | pass | pass | pass | | inception_v3 | 8 | pass | pass | pass | pass | | jx_nest_base | 8 | pass | pass | pass | pass | | lcnet_050 | 8 | pass | pass | pass | pass | | mixer_b16_224 | 8 | pass | pass | pass | pass | | mobilenetv3_large_100 | 8 | pass | pass | pass | pass | | nfnet_l0 | 8 | pass | pass | pass | pass | | pit_b_224 | 8 | pass | pass | pass | pass | | pnasnet5large | 8 | pass | pass | pass | pass | | poolformer_m36 | 8 | pass | pass | pass | pass | | res2net101_26w_4s | 8 | pass | pass | pass | pass | | res2net50_14w_8s | 8 | pass | pass | pass | pass | | sebotnet33ts_256 | 8 | pass | pass | fail_accuracy | fail_accuracy | | xcit_large_24_p8_224 | 8 | pass | fail_accuracy | fail_accuracy | fail_accuracy | | fbnetv3_b | 8 | fail_accuracy | fail_accuracy | fail_accuracy | fail_accuracy | | coat_lite_mini | 8 | pass | pass | 0.0000 | pass | +---------------------------------+----+---------------+---------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+---------+-----------+-----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | 
+---------------------------------+-----+---------+-----------+-----------+------------------------+
| twins_pcpvt_base                | 64  | 10.9375 | 22.9749   | 1493.7265 | 84.4076                |
| mobilevit_s                     | 64  | 5.1798  | 11.1877   | 1421.1621 | 54.0064                |
| coat_lite_mini                  | 128 | 3.2757  | 7.7508    | 1266.3653 | 39.1627                |
| crossvit_9_240                  | 128 | 5.7183  | 13.177    | 1126.9456 | 53.4964                |
| volo_d1_224                     | 64  | 4.9475  | 11.5611   | 960.0025  | 50.4496                |
| xcit_large_24_p8_224            | 5   | 12.256  | 27.6366   | 952.2158  | 94.8712                |
| swin_base_patch4_window7_224    | 64  | 8.2057  | 19.0326   | 909.9143  | 76.4838                |
| pit_b_224                       | 64  | 3.3985  | 7.8688    | 907.1938  | 34.0108                |
| cait_m36_384                    | 4   | 13.429  | 30.2438   | 904.5886  | 109.7344               |
| jx_nest_base                    | 32  | 6.4038  | 14.6293   | 898.7522  | 61.1489                |
| sebotnet33ts_256                | 64  | 4.0935  | 8.6344    | 607.0759  | 36.0296                |
| tnt_s_patch16_224               | 128 | 6.3572  | 15.7293   | 519.0725  | 61.7616                |
| botnet26t_256                   | 128 | 2.8803  | 6.2326    | 467.1591  | 27.2511                |
| convnext_base                   | 64  | 6.597   | 12.6165   | 450.4418  | 42.0147                |
| ghostnet_100                    | 128 | 7.6607  | 14.4827   | 443.5871  | 45.6448                |
| rexnet_100                      | 128 | 5.4778  | 10.994    | 409.2706  | 38.3455                |
| convit_base                     | 64  | 3.4205  | 9.0273    | 367.8235  | 37.0268                |
| pnasnet5large                   | 16  | 7.612   | 25.3454   | 350.2498  | 95.9074                |
| visformer_small                 | 128 | 2.572   | 5.989     | 348.9025  | 24.2141                |
| res2net101_26w_4s               | 64  | 10.1806 | 24.2164   | 339.5458  | 74.7437                |
| hrnet_w18                       | 128 | 8.7326  | 34.5125   | 338.9126  | 134.1646               |
| adv_inception_v3                | 128 | 5.9209  | 13.0317   | 332.1356  | 45.519                 |
| gmixer_24_224                   | 128 | 5.6274  | 12.7052   | 299.2786  | 42.3277                |
| mixnet_l                        | 128 | 8.1281  | 15.8957   | 292.1469  | 43.4132                |
| fbnetc_100                      | 128 | 4.9523  | 9.3137    | 289.0544  | 32.0899                |
| res2net50_14w_8s                | 128 | 8.7271  | 21.7623   | 281.2083  | 69.0181                |
| beit_base_patch16_224           | 64  | 4.0954  | 9.1919    | 276.1219  | 32.1217                |
| tinynet_a                       | 128 | 5.8447  | 11.9565   | 275.2935  | 36.2317                |
| eca_botnext26ts_256             | 128 | 3.0388  | 6.7046    | 272.341   | 29.1281                |
| deit_base_distilled_patch16_224 | 64  | 3.2211  | 6.9901    | 256.0044  | 29.7817                |
| dpn107                          | 32  | 9.6262  | 18.9494   | 248.3074  | 53.3347                |
| fbnetv3_b                       | 128 | 8.1861  | 16.7199   | 247.0458  | 51.1052                |
| mixer_b16_224                   | 128 | 2.6709  | 5.8103    | 229.0474  | 22.4949                |
| poolformer_m36                  | 64  | 7.7828  | 13.4975   | 206.1682  | 54.5545                |
| regnety_002                     | 128 | 4.7592  | 8.665     | 191.623   | 26.015                 |
| cspdarknet53                    | 64  | 5.6278  | 10.6572   | 185.1414  | 33.367                 |
| gmlp_s16_224                    | 128 | 5.5001  | 11.8491   | 180.5791  | 42.9278                |
| resnest101e                     | 64  | 10.7308 | 23.8881   | 180.0071  | 68.2027                |
| resmlp_12_224                   | 128 | 2.7816  | 5.76      | 161.4423  | 20.7392                |
| nfnet_l0                        | 128 | 5.2234  | 10.7758   | 160.1165  | 29.6723                |
| gernet_l                        | 128 | 4.8682  | 8.771     | 156.1277  | 26.2432                |
| dla102                          | 128 | 6.3819  | 13.8443   | 154.9948  | 43.9537                |
| gluon_xception65                | 32  | 7.5637  | 16.5978   | 153.068   | 49.0476                |
| repvgg_a2                       | 128 | 4.7286  | 8.6174    | 120.9601  | 25.3145                |
| mnasnet_100                     | 128 | 3.9199  | 7.4746    | 113.4024  | 25.4573                |
| tf_efficientnet_b0              | 128 | 5.0115  | 10.2675   | 111.0374  | 32.2744                |
| res2next50                      | 128 | 4.9405  | 11.8552   | 101.3286  | 39.4143                |
| ese_vovnet19b_dw                | 128 | 2.6367  | 4.5434    | 100.8023  | 18.15                  |
| convmixer_768_32                | 32  | 1.6536  | 6.7341    | 93.2744   | 25.0943                |
| selecsls42b                     | 128 | 2.4469  | 5.3327    | 89.5487   | 22.5624                |
| tf_mixnet_l                     | 128 | 8.8405  | 16.5804   | 80.9724   | 46.1405                |
| mobilenetv3_large_100           | 128 | 4.1222  | 8.2565    | 79.517    | 28.7507                |
| mobilenetv2_100                 | 128 | 3.9283  | 7.7782    | 68.9263   | 26.5835                |
| swsl_resnext101_32x16d          | 32  | 5.8336  | 13.197    | 66.7856   | 40.2228                |
| lcnet_050                       | 128 | 2.4885  | 5.2702    | 60.757    | 19.2951                |
| vit_base_patch16_224            | 64  | 3.0516  | 6.9148    | 46.9498   | 29.0295                |
| inception_v3                    | 128 | 5.8675  | 12.2238   | 45.4211   | 46.2169                |
| gluon_inception_v3              | 128 | 5.812   | 12.1337   | 45.3753   | 46.3915                |
| spnasnet_100                    | 128 | 4.8849  | 9.1503    | 37.0112   | 30.0364                |
| dm_nfnet_f0                     | 128 | 5.9129  | 11.2757   | 31.9991   | 32.3646                |
+---------------------------------+-----+---------+-----------+-----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------+------------------------+
| name                            | bs  | eager  | aot_eager | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------+------------------------+
| gmlp_s16_224                    | 128 | 1.0015 | 0.9787    | 1.1839   | 1.2053                 |
| pnasnet5large                   | 16  | 1.0593 | 0.9927    | 1.1539   | 1.1723                 |
| gmixer_24_224                   | 128 | 1.0014 | 0.9787    | 1.1127   | 1.1381                 |
| convit_base                     | 64  | 1.0    | 0.8505    | 1.0948   | 1.0997                 |
| mobilenetv2_100                 | 128 | 0.9996 | 0.7725    | 1.0266   | 1.0431                 |
| dm_nfnet_f0                     | 128 | 0.9808 | 0.9006    | 1.0129   | 1.0129                 |
| resmlp_12_224                   | 128 | 0.9999 | 0.9667    | 1.0097   | 1.0742                 |
| tinynet_a                       | 128 | 0.9998 | 0.7975    | 0.9985   | 1.025                  |
| resnest101e                     | 64  | 0.9998 | 1.0033    | 0.9933   | 0.9971                 |
| tf_efficientnet_b0              | 128 | 0.9992 | 0.7813    | 0.9873   | 0.9872                 |
| tnt_s_patch16_224               | 128 | 1.0    | 0.9781    | 0.9834   | 0.9981                 |
| rexnet_100                      | 128 | 1.0    | 0.7935    | 0.9746   | 0.9984                 |
| twins_pcpvt_base                | 64  | 1.0001 | 0.9273    | 0.9727   | 1.0054                 |
| convmixer_768_32                | 32  | 1.0    | 0.9812    | 0.9657   | 0.9764                 |
| dla102                          | 128 | 0.9709 | 0.9221    | 0.9535   | 0.9535                 |
| mixer_b16_224                   | 128 | 1.0    | 0.9644    | 0.9438   | 0.9522                 |
| vit_base_patch16_224            | 64  | 1.0001 | 0.936     | 0.9362   | 0.9362                 |
| tf_mixnet_l                     | 128 | 0.9995 | 0.8647    | 0.9345   | 0.9345                 |
| beit_base_patch16_224           | 64  | 0.9999 | 0.9344    | 0.9306   | 0.9306                 |
| mobilevit_s                     | 64  | 0.9998 | 0.7836    | 0.9262   | 0.9557                 |
| visformer_small                 | 128 | 1.0005 | 0.9328    | 0.9245   | 0.9347                 |
| fbnetv3_b                       | 128 | 0.9989 | 0.8019    | 0.9167   | 0.9227                 |
| nfnet_l0                        | 128 | 1.0005 | 0.8489    | 0.9101   | 0.9214                 |
| cspdarknet53                    | 64  | 0.9996 | 0.86      | 0.9098   | 0.9098                 |
| deit_base_distilled_patch16_224 | 64  | 0.9995 | 0.9358    | 0.9071   | 0.9352                 |
| volo_d1_224                     | 64  | 1.001  | 0.9514    | 0.9067   | 0.9327                 |
| ese_vovnet19b_dw                | 128 | 0.9986 | 0.9082    | 0.8975   | 0.9046                 |
| sebotnet33ts_256                | 64  | 0.9957 | 0.7151    | 0.8908   | 0.9207                 |
| adv_inception_v3                | 128 | 1.0    | 0.8752    | 0.8902   | 0.8902                 |
| gluon_inception_v3              | 128 | 1.0    | 0.8752    | 0.8902   | 0.8902                 |
| inception_v3                    | 128 | 1.0    | 0.8752    | 0.8902   | 0.8902                 |
| hrnet_w18                       | 128 | 0.9999 | 0.9269    | 0.8872   | 0.8918                 |
| gluon_xception65                | 32  | 0.9998 | 0.8877    | 0.8832   | 0.8832                 |
| spnasnet_100                    | 128 | 0.9992 | 0.8982    | 0.8787   | 0.8787                 |
| xcit_large_24_p8_224            | 5   | 0.9989 | 0.8874    | 0.8761   | 0.8964                 |
| eca_botnext26ts_256             | 128 | 0.9995 | 0.7791    | 0.8738   | 0.8738                 |
| dpn107                          | 32  | 0.9932 | 0.9066    | 0.8687   | 0.8833                 |
| mixnet_l                        | 128 | 0.9997 | 0.8539    | 0.8686   | 0.8686                 |
| mnasnet_100                     | 128 | 0.9992 | 0.8897    | 0.8683   | 0.8684                 |
| res2next50                      | 128 | 1.0003 | 0.918     | 0.866    | 0.866                  |
| mobilenetv3_large_100           | 128 | 0.9993 | 0.8597    | 0.8649   | 0.8885                 |
| cait_m36_384                    | 4   | 0.9998 | 0.913     | 0.8637   | 0.8637                 |
| poolformer_m36                  | 64  | 1.0014 | 0.9514    | 0.8598   | 0.8769                 |
| fbnetc_100                      | 128 | 0.9989 | 0.8651    | 0.8596   | 0.8963                 |
| pit_b_224                       | 64  | 1.0005 | 0.8033    | 0.8566   | 0.8744                 |
| res2net101_26w_4s               | 64  | 1.0002 | 0.9186    | 0.8505   | 0.8813                 |
| res2net50_14w_8s                | 128 | 1.0002 | 0.9151    | 0.8496   | 0.8712                 |
| gernet_l                        | 128 | 0.9989 | 0.8652    | 0.8493   | 0.8499                 |
| swsl_resnext101_32x16d          | 32  | 1.0001 | 0.8706    | 0.8477   | 0.8477                 |
| selecsls42b                     | 128 | 1.0006 | 0.8947    | 0.8472   | 0.8784                 |
| ghostnet_100                    | 128 | 0.9983 | 0.8894    | 0.8416   | 0.8972                 |
| coat_lite_mini                  | 128 | 1.0445 | 0.929     | 0.8401   | 0.8647                 |
| convnext_base                   | 64  | 1.0052 | 0.9275    | 0.832    | 0.8504                 |
| botnet26t_256                   | 128 | 0.9994 | 0.8791    | 0.824    | 0.8239                 |
| lcnet_050                       | 128 | 0.9982 | 0.8057    | 0.8172   | 0.8281                 |
| regnety_002                     | 128 | 0.9992 | 0.8629    | 0.7846   | 0.8214                 |
| repvgg_a2                       | 128 | 0.9997 | 0.7933    | 0.7738   | 0.7738                 |
| crossvit_9_240                  | 128 | 0.999  | 0.8819    | 0.7526   | 0.776                  |
| swin_base_patch4_window7_224    | 64  | 1.001  | 0.9237    | 0.7214   | 0.7384                 |
| jx_nest_base                    | 32  | 1.0006 | 0.8943    | 0.6693   | 0.6838                 |
+---------------------------------+-----+--------+-----------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+---------------------------------+-----+----------+-----------+----------+------------------------+
| name                            | bs  | eager    | aot_eager | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+----------+-----------+----------+------------------------+
| convmixer_768_32                | 32  | 301.0442 | 311.3992  | 301.7929 | 301.8158               |
| pnasnet5large                   | 16  | 199.1005 | 214.9961  | 218.5681 | 218.7462               |
| hrnet_w18                       | 128 | 281.2086 | 433.1533  | 211.6484 | 210.718                |
| tf_mixnet_l                     | 128 | 193.8799 | 229.2666  | 162.3698 | 162.474                |
| mixnet_l                        | 128 | 185.5385 | 220.6245  | 156.7373 | 156.7082               |
| resnest101e                     | 64  | 165.1075 | 188.9326  | 120.3588 | 120.599                |
| dla102                          | 128 | 172.5703 | 210.5239  | 117.0292 | 117.0832               |
| cait_m36_384                    | 4   | 167.8886 | 167.9834  | 116.1649 | 116.2469               |
| poolformer_m36                  | 64  | 146.8675 | 147.2348  | 113.6049 | 113.5832               |
| swsl_resnext101_32x16d          | 32  | 118.7959 | 141.3735  | 113.603  | 113.5603               |
| gluon_inception_v3              | 128 | 160.4058 | 184.9758  | 105.3727 | 105.4292               |
| inception_v3                    | 128 | 159.2487 | 183.7234  | 105.2961 | 105.3201               |
| adv_inception_v3                | 128 | 159.8353 | 185.0633  | 105.2829 | 105.3636               |
| res2net50_14w_8s                | 128 | 140.5677 | 177.7007  | 105.2146 | 105.3745               |
| convit_base                     | 64  | 163.1107 | 163.213   | 100.4048 | 100.3903               |
| dpn107                          | 32  | 114.1529 | 131.61    | 98.606   | 98.6911                |
| tnt_s_patch16_224               | 128 | 323.7242 | 324.0157  | 97.8962  | 97.8557                |
| res2next50                      | 128 | 125.5394 | 152.0323  | 95.4568  | 95.5517                |
| gluon_xception65                | 32  | 99.6856  | 117.3351  | 93.1259  | 92.9022                |
| fbnetv3_b                       | 128 | 115.2621 | 142.3501  | 90.0136  | 88.1486                |
| dm_nfnet_f0                     | 128 | 128.3444 | 128.9464  | 89.5194  | 89.5123                |
| res2net101_26w_4s               | 64  | 98.7865  | 125.6659  | 87.4592  | 87.4773                |
| mixer_b16_224                   | 128 | 116.7944 | 114.4476  | 85.6791  | 85.6619                |
| swin_base_patch4_window7_224    | 64  | 147.4891 | 153.1044  | 85.486   | 85.793                 |
| convnext_base                   | 64  | 124.5485 | 124.1639  | 81.795   | 81.8644                |
| gmlp_s16_224                    | 128 | 137.5815 | 126.5099  | 79.5972  | 79.6404                |
| nfnet_l0                        | 128 | 112.5881 | 136.7591  | 77.4578  | 77.4729                |
| cspdarknet53                    | 64  | 94.9862  | 112.6971  | 77.0842  | 77.1985                |
| visformer_small                 | 128 | 91.3647  | 96.3827   | 76.4604  | 76.4324                |
| eca_botnext26ts_256             | 128 | 108.873  | 147.1515  | 75.3926  | 75.4105                |
| pit_b_224                       | 64  | 118.9212 | 119.0104  | 73.7484  | 73.6664                |
| gernet_l                        | 128 | 77.8019  | 91.6916   | 72.3449  | 71.1734                |
| botnet26t_256                   | 128 | 101.8712 | 116.5567  | 71.5378  | 71.612                 |
| beit_base_patch16_224           | 64  | 101.5135 | 105.727   | 70.1323  | 70.1878                |
| repvgg_a2                       | 128 | 77.693   | 96.0585   | 70.1066  | 70.1973                |
| volo_d1_224                     | 64  | 121.0063 | 123.5599  | 69.8246  | 69.8118                |
| vit_base_patch16_224            | 64  | 87.0679  | 87.2499   | 64.9352  | 65.045                 |
| jx_nest_base                    | 32  | 101.6843 | 101.5289  | 64.9104  | 64.9633                |
| deit_base_distilled_patch16_224 | 64  | 85.0397  | 85.1759   | 64.47    | 64.4549                |
| gmixer_24_224                   | 128 | 118.2172 | 131.9519  | 63.4326  | 63.2182                |
| tf_efficientnet_b0              | 128 | 84.6866  | 119.5088  | 63.0207  | 62.9982                |
| rexnet_100                      | 128 | 80.0     | 108.5706  | 61.0416  | 61.2412                |
| fbnetc_100                      | 128 | 82.7582  | 106.4172  | 60.9432  | 59.841                 |
| xcit_large_24_p8_224            | 5   | 124.3672 | 142.2014  | 60.5284  | 60.3945                |
| tinynet_a                       | 128 | 73.6418  | 102.5825  | 58.9293  | 58.9064                |
| mobilevit_s                     | 64  | 84.5795  | 111.3608  | 56.2617  | 56.2496                |
| twins_pcpvt_base                | 64  | 131.6595 | 140.647   | 55.9878  | 55.9654                |
| coat_lite_mini                  | 128 | 113.1055 | 113.0707  | 54.7766  | 54.8024                |
| sebotnet33ts_256                | 64  | 80.5015  | 102.3617  | 52.2965  | 52.3101                |
| spnasnet_100                    | 128 | 70.4101  | 89.7348   | 50.8511  | 50.8508                |
| ghostnet_100                    | 128 | 90.6519  | 118.0659  | 50.5423  | 50.3824                |
| ese_vovnet19b_dw                | 128 | 64.5959  | 74.2357   | 49.6637  | 49.6421                |
| mobilenetv2_100                 | 128 | 65.4573  | 84.3407   | 46.1979  | 46.0899                |
| crossvit_9_240                  | 128 | 82.6129  | 104.4326  | 45.6656  | 45.9106                |
| mnasnet_100                     | 128 | 64.1768  | 82.2424   | 44.3672  | 44.4069                |
| selecsls42b                     | 128 | 60.0486  | 73.9224   | 44.2568  | 44.3691                |
| mobilenetv3_large_100           | 128 | 61.2516  | 76.5285   | 43.1577  | 43.2367                |
| resmlp_12_224                   | 128 | 53.4711  | 59.8274   | 38.1437  | 38.2947                |
| regnety_002                     | 128 | 41.3628  | 51.3802   | 27.4551  | 27.3733                |
| lcnet_050                       | 128 | 31.6182  | 40.9209   | 18.7843  | 18.8229                |
+---------------------------------+-----+----------+-----------+----------+------------------------+
~~~

williamwen42 commented 1 year ago

Performance Dashboard for amp precision (max autotune, with cold start)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
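
As a rough illustration of how a "geometric mean speedup" figure could be derived after dropping models that fail accuracy, here is a minimal sketch. It is not the dashboard's actual code: the function name `geomean_speedup` is hypothetical, and the choice to also drop 0.0 speedups (failed performance runs) is an assumption. The sample values are inductor numbers for a few torchbench models reported in this comment.

```python
import math


def geomean_speedup(speedups, accuracy):
    """Geometric mean of speedups over models whose accuracy status is 'pass'.

    Assumption (not confirmed by the dashboard): zero speedups, which signal
    a failed performance run, are dropped as well, because a single zero
    would force the geometric mean to zero.
    """
    kept = [
        s for name, s in speedups.items()
        if accuracy.get(name) == "pass" and s > 0.0
    ]
    if not kept:
        return float("nan")
    # exp(mean(log(x))) is the geometric mean and avoids overflow on long lists.
    return math.exp(sum(math.log(s) for s in kept) / len(kept))


# Inductor speedups and accuracy statuses copied from this comment's tables.
speedups = {
    "BERT_pytorch": 3.1291,
    "hf_Albert": 2.3429,
    "phlippe_resnet": 1.8075,
    "alexnet": 0.0,
}
accuracy = {
    "BERT_pytorch": "pass",
    "hf_Albert": "pass",
    "phlippe_resnet": "fail_accuracy",
    "alexnet": "pass",
}

result = geomean_speedup(speedups, accuracy)
print(f"geomean over kept models: {result:.2f}x")
```

Because only four models are used here, the printed value is not comparable to the suite-level numbers in the summary tables.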

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 90%, 53/59 | 100%, 45/45 | 68%, 41/60  |
|       aot_eager        | 88%, 52/59 | 100%, 45/45 | 92%, 55/60  |
|        inductor        | 78%, 46/59 | 84%, 38/45  | 93%, 56/60  |
| inductor_no_cudagraphs | 78%, 46/59 | 84%, 38/45  | 92%, 55/60  |
+------------------------+------------+-------------+-------------+

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.00x    |    1.00x    |
|       aot_eager        |   1.00x    |    1.00x    |    1.00x    |
|        inductor        |   1.59x    |    1.67x    |    1.38x    |
| inductor_no_cudagraphs |   1.57x    |    1.68x    |    1.39x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    4.73    |    7.46     |    5.96     |
|       aot_eager        |    9.28    |    16.12    |    12.80    |
|        inductor        |   272.07   |   338.74    |   458.29    |
| inductor_no_cudagraphs |   273.46   |   324.87    |   448.96    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    0.97x    |    1.00x    |
|       aot_eager        |   0.86x    |    0.89x    |    0.89x    |
|        inductor        |   0.75x    |    0.90x    |    0.90x    |
| inductor_no_cudagraphs |   0.75x    |    0.90x    |    0.90x    |
+------------------------+------------+-------------+-------------+

Warnings

We flag models where:
- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings
~~~
+-------------+-------------------------------+-----------------+------------------------+
| suite       | name                          | inductor        | inductor_no_cudagraphs |
+-------------+-------------------------------+-----------------+------------------------+
| torchbench  | moco                          | fail_to_run     | fail_to_run            |
| torchbench  | dlrm                          | fail_to_run     | fail_to_run            |
| torchbench  | hf_BigBird                    | fail_to_run     | fail_to_run            |
| torchbench  | phlippe_resnet                | fail_accuracy   | fail_accuracy          |
| torchbench  | Background_Matting            | eager_variation | eager_variation        |
| torchbench  | vision_maskrcnn               | eager_variation | eager_variation        |
| torchbench  | tacotron2                     | 0.0000          | 0.0000                 |
| torchbench  | doctr_det_predictor           | 0.0000          | 0.0000                 |
| torchbench  | doctr_reco_predictor          | 0.0000          | 0.0000                 |
| torchbench  | llama                         | 0.0000          | 0.0000                 |
| torchbench  | torchrec_dlrm                 | 0.0000          | 0.0000                 |
| huggingface | DebertaV2ForQuestionAnswering | fail_to_run     | fail_to_run            |
| huggingface | AlbertForQuestionAnswering    | fail_accuracy   | fail_accuracy          |
| timm_models | gluon_xception65              | pass            | fail_accuracy          |
| timm_models | sebotnet33ts_256              | pass            | fail_accuracy          |
| timm_models | twins_pcpvt_base              | pass            | 0.0000                 |
| timm_models | dla102                        | fail_accuracy   | pass                   |
| timm_models | xcit_large_24_p8_224          | fail_accuracy   | fail_accuracy          |
| timm_models | fbnetv3_b                     | fail_accuracy   | fail_accuracy          |
| timm_models | coat_lite_mini                | 0.0000          | pass                   |
+-------------+-------------------------------+-----------------+------------------------+
~~~

Performance speedup warnings
~~~
+-------------+-----------------------------------+----------+------------------------+
| suite       | name                              | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------+----------+------------------------+
| torchbench  | timm_regnet                       | 0.9445   | 
0.9507 | | torchbench | timm_vovnet | 0.9414 | 0.9241 | | torchbench | nvidia_deeprecommender | 0.9355 | 0.9348 | | torchbench | alexnet | 0.0 | 0.0 | | torchbench | hf_Reformer | 0.0 | 0.0 | | torchbench | hf_GPT2_large | 0.0 | 0.0 | | torchbench | dlrm | 0.0 | 0.0 | | torchbench | hf_BigBird | 0.0 | 0.0 | | torchbench | timm_vision_transformer_large | 0.0 | 0.0 | | torchbench | moco | 0.0 | 0.0 | | torchbench | doctr_det_predictor | 0.0 | 0.0 | | torchbench | doctr_reco_predictor | 0.0 | 0.0 | | torchbench | tacotron2 | 0.0 | 0.0 | | torchbench | torchrec_dlrm | 0.0 | 0.0 | | huggingface | MobileBertForQuestionAnswering | 0.9272 | 0.9363 | | huggingface | LayoutLMForSequenceClassification | 0.0 | 0.0 | | huggingface | DebertaForQuestionAnswering | 0.0 | 0.0 | | huggingface | BlenderbotForCausalLM | 0.0 | 0.0 | | huggingface | DebertaForMaskedLM | 0.0 | 0.0 | | huggingface | DebertaV2ForQuestionAnswering | 0.0 | 0.0 | | huggingface | DebertaV2ForMaskedLM | 0.0 | 0.0 | | timm_models | pnasnet5large | 0.9045 | 0.8892 | +-------------+-----------------------------------+----------+------------------------+ ~~~ Compilation latency (sec) warnings ~~~ +-------------+-----------------------------------------+-----------+------------------------+ | suite | name | inductor | inductor_no_cudagraphs | +-------------+-----------------------------------------+-----------+------------------------+ | torchbench | speech_transformer | 811.9404 | 801.5224 | | torchbench | attention_is_all_you_need_pytorch | 671.8298 | 713.6033 | | torchbench | hf_T5_large | 514.4005 | 510.0798 | | torchbench | hf_Longformer | 448.7565 | 444.5843 | | torchbench | timm_vision_transformer | 430.9528 | 432.7792 | | torchbench | phlippe_densenet | 424.9286 | 426.024 | | torchbench | mobilenet_v3_large | 420.8786 | 415.3274 | | torchbench | fastNLP_Bert | 408.0208 | 436.9673 | | torchbench | hf_Albert | 407.7363 | 401.7922 | | torchbench | hf_GPT2 | 377.4055 | 376.3293 | | torchbench | timm_efficientnet 
| 370.9045 | 370.5136 | | torchbench | BERT_pytorch | 348.6068 | 347.8106 | | torchbench | hf_Bert_large | 344.6791 | 343.3888 | | torchbench | hf_Bert | 343.1352 | 333.2602 | | torchbench | pytorch_struct | 343.1183 | 342.3984 | | torchbench | mobilenet_v2 | 338.0467 | 338.612 | | torchbench | densenet121 | 328.6991 | 343.919 | | torchbench | hf_Bart | 326.7358 | 322.7043 | | torchbench | mnasnet1_0 | 323.0717 | 324.1732 | | torchbench | hf_T5 | 282.9559 | 281.6635 | | torchbench | hf_DistilBert | 279.7668 | 279.3107 | | torchbench | yolov3 | 262.6355 | 259.9473 | | torchbench | resnet152 | 251.3667 | 246.7747 | | torchbench | timm_vovnet | 246.8284 | 250.8095 | | torchbench | drq | 243.14 | 263.0037 | | torchbench | nvidia_deeprecommender | 242.6257 | 251.857 | | torchbench | timm_nfnet | 229.374 | 225.6032 | | torchbench | timm_resnest | 229.2705 | 229.4557 | | torchbench | shufflenet_v2_x1_0 | 224.7522 | 228.6695 | | torchbench | resnet50 | 214.3405 | 211.5939 | | torchbench | timm_regnet | 207.459 | 202.3439 | | torchbench | resnext50_32x4d | 181.0959 | 181.4499 | | torchbench | LearningToPaint | 173.8694 | 175.1049 | | torchbench | soft_actor_critic | 172.2714 | 181.3975 | | torchbench | resnet18 | 157.2362 | 148.1282 | | torchbench | vgg16 | 156.8115 | 158.9342 | | torchbench | phlippe_resnet | 142.043 | 140.9482 | | torchbench | lennard_jones | 140.1834 | 136.1011 | | torchbench | pytorch_unet | 138.7686 | 140.8722 | | torchbench | Background_Matting | 135.1568 | 136.699 | | huggingface | PegasusForCausalLM | 814.8897 | 255.29 | | huggingface | MobileBertForMaskedLM | 621.2472 | 641.7833 | | huggingface | MobileBertForQuestionAnswering | 582.039 | 579.7287 | | huggingface | MT5ForConditionalGeneration | 551.3717 | 560.1428 | | huggingface | YituTechConvBert | 465.3614 | 466.0206 | | huggingface | ElectraForCausalLM | 437.293 | 439.8019 | | huggingface | AllenaiLongformerBase | 411.2795 | 410.4339 | | huggingface | M2M100ForConditionalGeneration | 380.7446 | 
380.955 | | huggingface | XLNetLMHeadModel | 352.8004 | 352.7516 | | huggingface | AlbertForMaskedLM | 352.0903 | 370.1729 | | huggingface | XGLMForCausalLM | 346.8905 | 344.7347 | | huggingface | MegatronBertForCausalLM | 340.8456 | 344.1151 | | huggingface | ElectraForQuestionAnswering | 340.3166 | 341.0878 | | huggingface | PegasusForConditionalGeneration | 339.9314 | 317.5693 | | huggingface | MBartForConditionalGeneration | 332.702 | 333.4432 | | huggingface | T5Small | 324.3855 | 325.5567 | | huggingface | T5ForConditionalGeneration | 323.6672 | 325.1325 | | huggingface | MegatronBertForQuestionAnswering | 323.4524 | 324.4397 | | huggingface | GPT2ForSequenceClassification | 323.1691 | 321.3846 | | huggingface | BartForConditionalGeneration | 321.1777 | 333.1578 | | huggingface | AlbertForQuestionAnswering | 314.2912 | 328.9652 | | huggingface | PLBartForConditionalGeneration | 297.6701 | 278.446 | | huggingface | BlenderbotSmallForConditionalGeneration | 289.8975 | 292.8094 | | huggingface | BlenderbotSmallForCausalLM | 274.757 | 261.6962 | | huggingface | RobertaForQuestionAnswering | 273.882 | 274.5867 | | huggingface | DistillGPT2 | 267.031 | 267.848 | | huggingface | BertForQuestionAnswering | 265.9109 | 281.6978 | | huggingface | LayoutLMForMaskedLM | 261.6479 | 262.2402 | | huggingface | CamemBert | 260.9713 | 266.0405 | | huggingface | DistilBertForMaskedLM | 256.5461 | 240.982 | | huggingface | RobertaForCausalLM | 255.0796 | 254.953 | | huggingface | BertForMaskedLM | 246.8009 | 261.1474 | | huggingface | DistilBertForQuestionAnswering | 245.7984 | 244.739 | | huggingface | BartForCausalLM | 240.4101 | 239.5187 | | huggingface | OPTForCausalLM | 237.5466 | 237.6625 | | huggingface | MBartForCausalLM | 235.4426 | 235.7554 | | huggingface | Speech2Text2ForCausalLM | 234.1054 | 234.3185 | | huggingface | TrOCRForCausalLM | 230.3112 | 228.9748 | | huggingface | PLBartForCausalLM | 212.6627 | 213.84 | | timm_models | twins_pcpvt_base | 1549.1182 | 
1549.3968 | | timm_models | mobilevit_s | 1260.1864 | 1250.1549 | | timm_models | coat_lite_mini | 1256.0368 | 1261.5438 | | timm_models | crossvit_9_240 | 1181.8398 | 1163.6234 | | timm_models | swin_base_patch4_window7_224 | 1144.8752 | 1145.9039 | | timm_models | volo_d1_224 | 959.9922 | 972.2927 | | timm_models | pit_b_224 | 939.9928 | 944.8402 | | timm_models | xcit_large_24_p8_224 | 919.4447 | 930.1613 | | timm_models | jx_nest_base | 909.285 | 909.8677 | | timm_models | cait_m36_384 | 877.7569 | 889.0475 | | timm_models | sebotnet33ts_256 | 723.3903 | 715.0022 | | timm_models | tnt_s_patch16_224 | 639.1069 | 644.4422 | | timm_models | convit_base | 606.4733 | 607.7017 | | timm_models | eca_botnext26ts_256 | 576.4692 | 577.7936 | | timm_models | ghostnet_100 | 574.4232 | 585.6942 | | timm_models | rexnet_100 | 574.0523 | 579.1604 | | timm_models | botnet26t_256 | 564.6397 | 558.1972 | | timm_models | hrnet_w18 | 517.5075 | 521.5671 | | timm_models | visformer_small | 451.9125 | 458.0089 | | timm_models | convnext_base | 441.4118 | 458.4532 | | timm_models | fbnetv3_b | 407.5326 | 401.4525 | | timm_models | res2net50_14w_8s | 405.0288 | 402.7811 | | timm_models | tinynet_a | 379.1793 | 385.4118 | | timm_models | tf_efficientnet_b0 | 373.0437 | 376.6531 | | timm_models | adv_inception_v3 | 371.087 | 371.0584 | | timm_models | gluon_inception_v3 | 367.0138 | 368.7707 | | timm_models | mobilenetv3_large_100 | 366.7194 | 368.975 | | timm_models | inception_v3 | 365.8731 | 372.7785 | | timm_models | tf_mixnet_l | 364.5554 | 355.9007 | | timm_models | pnasnet5large | 361.0583 | 357.5933 | | timm_models | mixnet_l | 357.2374 | 360.5247 | | timm_models | spnasnet_100 | 356.122 | 360.5331 | | timm_models | fbnetc_100 | 354.4488 | 358.268 | | timm_models | res2net101_26w_4s | 347.2845 | 352.7768 | | timm_models | deit_base_distilled_patch16_224 | 333.189 | 330.9312 | | timm_models | vit_base_patch16_224 | 331.7911 | 330.6702 | | timm_models | mobilenetv2_100 | 327.4934 
| 325.4134 | | timm_models | resnest101e | 322.625 | 328.1115 | | timm_models | beit_base_patch16_224 | 318.5332 | 325.0581 | | timm_models | mnasnet_100 | 316.9831 | 323.0384 | | timm_models | gmixer_24_224 | 289.7795 | 296.1159 | | timm_models | poolformer_m36 | 278.4865 | 277.6924 | | timm_models | dpn107 | 277.8568 | 280.9468 | | timm_models | res2next50 | 275.9085 | 277.7791 | | timm_models | cspdarknet53 | 272.0032 | 273.5491 | | timm_models | selecsls42b | 259.2803 | 260.0981 | | timm_models | regnety_002 | 258.7406 | 252.8447 | | timm_models | gmlp_s16_224 | 257.6034 | 258.6444 | | timm_models | resmlp_12_224 | 249.7601 | 252.793 | | timm_models | mixer_b16_224 | 249.4048 | 245.5265 | | timm_models | gluon_xception65 | 239.5066 | 234.587 | | timm_models | lcnet_050 | 229.6674 | 221.3622 | | timm_models | ese_vovnet19b_dw | 223.3347 | 223.677 | | timm_models | dm_nfnet_f0 | 219.4157 | 222.9388 | | timm_models | gernet_l | 216.739 | 216.6382 | | timm_models | dla102 | 192.1844 | 191.9956 | | timm_models | nfnet_l0 | 189.6519 | 186.6258 | | timm_models | swsl_resnext101_32x16d | 187.444 | 191.9566 | | timm_models | repvgg_a2 | 176.7575 | 178.8257 | +-------------+-----------------------------------------+-----------+------------------------+ ~~~ Peak Memory Compression Ratio warnings ~~~ +-------------+-----------------------------------------+----------+------------------------+ | suite | name | inductor | inductor_no_cudagraphs | +-------------+-----------------------------------------+----------+------------------------+ | torchbench | timm_efficientnet | 0.9293 | 0.8747 | | torchbench | hf_Bert | 0.8815 | 0.8815 | | torchbench | yolov3 | 0.87 | 0.8701 | | torchbench | shufflenet_v2_x1_0 | 0.8596 | 0.8599 | | torchbench | speech_transformer | 0.8583 | 0.8583 | | torchbench | timm_regnet | 0.8512 | 0.8498 | | torchbench | hf_DistilBert | 0.8456 | 0.8456 | | torchbench | timm_resnest | 0.8414 | 0.8304 | | torchbench | timm_vision_transformer | 0.8357 | 0.8357 
| | torchbench | Background_Matting | 0.834 | 0.834 | | torchbench | resnet152 | 0.8296 | 0.8286 | | torchbench | hf_T5_large | 0.8201 | 0.8201 | | torchbench | phlippe_densenet | 0.806 | 0.7988 | | torchbench | mobilenet_v3_large | 0.7848 | 0.7275 | | torchbench | pytorch_unet | 0.7734 | 0.7734 | | torchbench | squeezenet1_1 | 0.773 | 0.773 | | torchbench | pytorch_stargan | 0.7715 | 0.7715 | | torchbench | demucs | 0.7665 | 0.7665 | | torchbench | hf_Bart | 0.7545 | 0.7543 | | torchbench | resnet50 | 0.7515 | 0.7522 | | torchbench | timm_vovnet | 0.7427 | 0.7427 | | torchbench | mnasnet1_0 | 0.742 | 0.7486 | | torchbench | pytorch_struct | 0.7338 | 0.7274 | | torchbench | vgg16 | 0.723 | 0.723 | | torchbench | densenet121 | 0.7085 | 0.7085 | | torchbench | resnext50_32x4d | 0.6608 | 0.6608 | | torchbench | nvidia_deeprecommender | 0.6585 | 0.6585 | | torchbench | LearningToPaint | 0.6018 | 0.6018 | | torchbench | pytorch_CycleGAN_and_pix2pix | 0.5458 | 0.5458 | | torchbench | resnet18 | 0.5409 | 0.5409 | | torchbench | hf_Longformer | 0.4203 | 0.4206 | | torchbench | functorch_dp_cifar10 | 0.3991 | 0.3991 | | torchbench | phlippe_resnet | 0.3202 | 0.3202 | | torchbench | drq | 0.1818 | 0.1818 | | torchbench | dcgan | 0.1811 | 0.1811 | | torchbench | soft_actor_critic | 0.1078 | 0.1078 | | torchbench | lennard_jones | 0.0648 | 0.0648 | | huggingface | PegasusForConditionalGeneration | 0.8911 | 0.8911 | | huggingface | MT5ForConditionalGeneration | 0.8906 | 0.8906 | | huggingface | ElectraForCausalLM | 0.8896 | 0.8896 | | huggingface | PLBartForCausalLM | 0.8748 | 0.8748 | | huggingface | DistilBertForMaskedLM | 0.8677 | 0.8677 | | huggingface | MBartForConditionalGeneration | 0.8672 | 0.8672 | | huggingface | TrOCRForCausalLM | 0.8558 | 0.8558 | | huggingface | MBartForCausalLM | 0.8501 | 0.8501 | | huggingface | BartForConditionalGeneration | 0.8456 | 0.8456 | | huggingface | MegatronBertForCausalLM | 0.845 | 0.845 | | huggingface | BartForCausalLM | 0.8311 | 
0.8311 | | huggingface | BlenderbotSmallForConditionalGeneration | 0.816 | 0.816 | | huggingface | PegasusForCausalLM | 0.7966 | 0.7966 | | huggingface | BlenderbotSmallForCausalLM | 0.787 | 0.787 | | huggingface | MobileBertForMaskedLM | 0.7473 | 0.7473 | | huggingface | Speech2Text2ForCausalLM | 0.7364 | 0.7364 | | huggingface | XGLMForCausalLM | 0.6744 | 0.6744 | | huggingface | MobileBertForQuestionAnswering | 0.6505 | 0.6505 | | huggingface | M2M100ForConditionalGeneration | 0.6058 | 0.6058 | | huggingface | AllenaiLongformerBase | 0.4696 | 0.4696 | | timm_models | ese_vovnet19b_dw | 0.8975 | 0.8975 | | timm_models | sebotnet33ts_256 | 0.891 | 0.8908 | | timm_models | gluon_inception_v3 | 0.8902 | 0.8902 | | timm_models | inception_v3 | 0.8902 | 0.8902 | | timm_models | adv_inception_v3 | 0.8902 | 0.8902 | | timm_models | hrnet_w18 | 0.8872 | 0.8872 | | timm_models | gluon_xception65 | 0.8832 | 0.8832 | | timm_models | spnasnet_100 | 0.8786 | 0.8786 | | timm_models | xcit_large_24_p8_224 | 0.8761 | 0.8761 | | timm_models | eca_botnext26ts_256 | 0.8738 | 0.8738 | | timm_models | mixnet_l | 0.8686 | 0.8686 | | timm_models | dpn107 | 0.8685 | 0.8685 | | timm_models | mnasnet_100 | 0.8683 | 0.8683 | | timm_models | cait_m36_384 | 0.8637 | 0.8637 | | timm_models | poolformer_m36 | 0.8598 | 0.8598 | | timm_models | fbnetc_100 | 0.8596 | 0.8596 | | timm_models | pit_b_224 | 0.8566 | 0.8566 | | timm_models | res2net101_26w_4s | 0.8505 | 0.8505 | | timm_models | res2net50_14w_8s | 0.8497 | 0.8494 | | timm_models | gernet_l | 0.8495 | 0.8496 | | timm_models | swsl_resnext101_32x16d | 0.8477 | 0.8477 | | timm_models | selecsls42b | 0.8471 | 0.8472 | | timm_models | res2next50 | 0.8452 | 0.8452 | | timm_models | ghostnet_100 | 0.8416 | 0.8416 | | timm_models | mobilenetv3_large_100 | 0.8413 | 0.8413 | | timm_models | coat_lite_mini | 0.8401 | 0.8401 | | timm_models | convnext_base | 0.832 | 0.832 | | timm_models | botnet26t_256 | 0.824 | 0.824 | | timm_models | lcnet_050 
| 0.8048 | 0.8048 | | timm_models | repvgg_a2 | 0.7738 | 0.7738 | | timm_models | regnety_002 | 0.76 | 0.76 | | timm_models | crossvit_9_240 | 0.7525 | 0.7526 | | timm_models | swin_base_patch4_window7_224 | 0.7214 | 0.7214 | | timm_models | jx_nest_base | 0.6693 | 0.6693 | +-------------+-----------------------------------------+----------+------------------------+ ~~~
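
The four warning criteria listed at the start of the Warnings section can be expressed as a small check. This is a minimal sketch under stated assumptions, not the dashboard's implementation: the function name `warning_flags` and the flag labels are hypothetical, and only the thresholds (0.95x, 120 sec, 0.9) come from the text. The sample input is the inductor column for torchbench `timm_regnet` as reported in this comment.

```python
def warning_flags(accuracy, speedup, compile_sec, compression_ratio):
    """Return the warning categories a single (model, backend) result triggers.

    Thresholds follow the criteria stated in the Warnings section; the
    label strings are illustrative only.
    """
    flags = []
    if accuracy != "pass":
        flags.append("accuracy")
    if speedup < 0.95:
        # A speedup of 0.0 typically means the performance test itself failed.
        flags.append("speedup")
    if compile_sec > 120.0:
        flags.append("compilation_latency")
    if compression_ratio < 0.9:
        flags.append("compression_ratio")
    return flags


# torchbench timm_regnet, inductor: accuracy pass, speedup 0.9445,
# compilation latency 207.459 s, peak memory compression ratio 0.8512.
flags = warning_flags("pass", 0.9445, 207.459, 0.8512)
print(flags)  # ['speedup', 'compilation_latency', 'compression_ratio']
```

This matches the report, where `timm_regnet` appears in the speedup, compilation latency, and compression ratio warning tables but not in the accuracy one.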

torchbench suite with amp precision

Performance speedup
~~~
+-----------------------------------+------+--------+-----------+----------+------------------------+
|                name               |  bs  | eager  | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------+------------------------+
|        functorch_dp_cifar10       |  64  | 0.9746 |   0.9236  |  3.662   |         3.5979         |
|            BERT_pytorch           |  16  | 0.9898 |   0.7963  |  3.1291  |         3.3661         |
|            densenet121            |  4   | 0.9895 |   0.7021  |  2.8065  |         2.6894         |
|            hf_T5_large            |  2   | 0.9821 |   0.8081  |  2.509   |         2.3003         |
|              hf_Bart              |  4   | 1.0209 |   0.7921  |  2.4068  |         1.8198         |
|             hf_Albert             |  8   | 0.9967 |   0.9594  |  2.3429  |         2.3003         |
|          phlippe_densenet         | 128  | 0.9882 |   0.7687  |  2.0411  |         2.0366         |
|         mobilenet_v3_large        |  32  | 0.996  |   0.7824  |  1.9934  |         1.975          |
|              hf_GPT2              |  4   | 0.9934 |   0.959   |  1.9291  |         1.9049         |
|               hf_T5               |  8   | 0.9874 |   0.8549  |  1.9207  |         1.9199         |
|    pytorch_CycleGAN_and_pix2pix   |  1   | 0.979  |   0.8999  |  1.8918  |         1.9186         |
|           squeezenet1_1           |  32  | 0.9849 |   0.9328  |   1.86   |         1.8459         |
|              hf_Bert              |  4   | 0.9956 |   0.8407  |  1.8589  |         1.7836         |
|           phlippe_resnet          | 128  | 0.9899 |   0.7542  |  1.8075  |         1.8209         |
| attention_is_all_you_need_pytorch | 256  | 0.989  |   0.8368  |  1.8016  |         1.6518         |
|           hf_Longformer           |  2   | 0.9252 |   0.6047  |   1.79   |         1.7971         |
|          resnext50_32x4d          |  8   | 0.991  |   0.7309  |  1.7085  |         1.7171         |
|             mnasnet1_0            |  32  | 0.9897 |   0.7359  |  1.7049  |         1.6477         |
|         speech_transformer        |  32  | 0.9833 |   0.7947  |  1.6978  |         1.6917         |
|      timm_vision_transformer      |  32  | 0.9858 |   0.8468  |  1.6895  |         1.7383         |
|           pytorch_struct          | 200  | 0.9475 |   0.7728  |  1.6787  |         1.7968         |
|            fastNLP_Bert           |  6   | 0.9904 |   0.7991  |  1.6516  |         1.652          |
|         shufflenet_v2_x1_0        | 128  | 0.9937 |   0.7578  |  1.6243  |         1.5341         |
|           hf_Bert_large           |  4   | 0.9986 |   0.869   |  1.6226  |         1.7705         |
|                drq                |  1   | 0.9652 |   0.7589  |  1.6021  |         1.5104         |
|              resnet18             |  16  | 0.9862 |   0.7674  |  1.5523  |         1.5773         |
|               dcgan               |  32  | 0.8865 |   0.714   |  1.4782  |         1.4896         |
|            mobilenet_v2           |  96  | 0.997  |   0.7769  |  1.4765  |         1.4777         |
|           hf_DistilBert           |  8   | 1.0012 |   0.9406  |  1.4693  |         1.4257         |
|             timm_nfnet            | 128  | 0.9866 |   0.9838  |  1.468   |         1.4635         |
|           lennard_jones           | 1000 | 0.8601 |   0.7407  |  1.4464  |         1.5381         |
|            timm_resnest           |  32  | 0.9924 |   0.853   |  1.4459  |         1.4536         |
|         timm_efficientnet         |  32  | 0.9379 |   0.6245  |  1.3829  |         1.3947         |
|         soft_actor_critic         | 256  | 0.8171 |   0.6609  |  1.3122  |         1.262          |
|          LearningToPaint          |  96  | 0.9884 |   0.7668  |  1.2739  |         1.2848         |
|               vgg16               |  64  | 0.9992 |   0.998   |  1.2445  |         1.2447         |
|          pytorch_stargan          |  16  | 0.9922 |   0.807   |  1.2273  |         1.2254         |
|            Super_SloMo            |  6   | 0.9968 |   0.1781  |  1.2179  |         1.2169         |
|         Background_Matting        |  4   | 0.9991 |   0.1369  |  1.173   |         1.173          |
|            pytorch_unet           |  1   | 0.9962 |   0.2049  |  1.1719  |         1.1712         |
|             resnet152             |  32  | 0.9963 |   0.7511  |  1.1568  |         1.1564         |
|              resnet50             |  32  | 0.9972 |   0.7605  |  1.1374  |         1.1891         |
|               yolov3              |  16  | 0.996  |   0.8059  |  1.115   |         1.1151         |
|               demucs              |  4   | 1.0004 |   1.001   |  1.0261  |         1.0291         |
|            tts_angular            |  64  | 0.9587 |   0.9191  |  0.9848  |         0.9907         |
|            timm_regnet            |  32  | 0.9153 |   0.7728  |  0.9445  |         0.9507         |
|            timm_vovnet            |  32  | 0.8477 |   0.7091  |  0.9414  |         0.9241         |
|       nvidia_deeprecommender      | 256  | 0.9992 |   0.9981  |  0.9355  |         0.9348         |
|              alexnet              | 128  | 0.9988 |   0.9978  |   0.0    |          0.0           |
|            hf_Reformer            |  4   | 0.9922 |   0.9932  |   0.0    |          0.0           |
|           hf_GPT2_large           |  4   | 0.983  |   0.9716  |   0.0    |          0.0           |
|                dlrm               | 1024 | 0.9391 |   0.8483  |   0.0    |          0.0           |
|             hf_BigBird            |  2   | 0.9798 |   0.7923  |   0.0    |          0.0           |
|   timm_vision_transformer_large   |  32  | 0.9981 |    0.0    |   0.0    |          0.0           |
|                moco               |  32  | 0.9808 |    0.0    |   0.0    |          0.0           |
|        doctr_det_predictor        |  0   |  0.0   |    0.0    |   0.0    |          0.0           |
|        doctr_reco_predictor       |  0   |  0.0   |    0.0    |   0.0    |          0.0           |
|             tacotron2             |  0   |  0.0   |    0.0    |   0.0    |          0.0           |
|           torchrec_dlrm           |  0   |  0.0   |    0.0    |   0.0    |          0.0           |
+-----------------------------------+------+--------+-----------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------------+
|                name               |  bs |      eager       |    aot_eager     |     inductor     | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------------+
|           hf_GPT2_large           |  4  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |    pass_due_to_skip    |
|   timm_vision_transformer_large   |  4  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |    pass_due_to_skip    |
|            hf_T5_large            |  4  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |    pass_due_to_skip    |
|         speech_transformer        |  4  |       pass       |       pass       |       pass       |          pass          |
|    pytorch_CycleGAN_and_pix2pix   |  1  |       pass       |       pass       |       pass       |          pass          |
|          pytorch_stargan          |  16 |       pass       |       pass       |       pass       |          pass          |
|            pytorch_unet           |  2  |       pass       |       pass       |       pass       |          pass          |
|             resnet152             |  4  |       pass       |       pass       |       pass       |          pass          |
|              resnet18             |  4  |       pass       |       pass       |       pass       |          pass          |
|              resnet50             |  4  |       pass       |       pass       |       pass       |          pass          |
|          resnext50_32x4d          |  4  |       pass       |       pass       |       pass       |          pass          |
|         shufflenet_v2_x1_0        |  4  |       pass       |       pass       |       pass       |          pass          |
|         soft_actor_critic         | 256 |       pass       |       pass       |       pass       |          pass          |
|           squeezenet1_1           |  4  |       pass       |       pass       |       pass       |          pass          |
|       nvidia_deeprecommender      |  4  |       pass       |       pass       |       pass       |          pass          |
|         timm_efficientnet         |  4  |       pass       |       pass       |       pass       |          pass          |
|             timm_nfnet            |  4  |       pass       |       pass       |       pass       |          pass          |
|            timm_regnet            |  4  |       pass       |       pass       |       pass       |          pass          |
|            timm_resnest           |  4  |       pass       |       pass       |       pass       |          pass          |
|      timm_vision_transformer      |  4  |       pass       |       pass       |       pass       |          pass          |
|            timm_vovnet            |  4  |       pass       |       pass       |       pass       |          pass          |
|            tts_angular            |  4  |       pass       |       pass       |       pass       |          pass          |
|               vgg16               |  4  |       pass       |       pass       |       pass       |          pass          |
|               yolov3              |  4  |       pass       |       pass       |       pass       |          pass          |
|            BERT_pytorch           |  4  |  fail_accuracy   |       pass       |       pass       |          pass          |
|          phlippe_densenet         |  4  |       pass       |       pass       |       pass       |          pass          |
|           pytorch_struct          | 200 |       pass       |       pass       |       pass       |          pass          |
|         mobilenet_v3_large        |  4  |       pass       |       pass       |       pass       |          pass          |
|             hf_Albert             |  4  |       pass       |       pass       |       pass       |          pass          |
|          LearningToPaint          |  4  |       pass       |       pass       |       pass       |          pass          |
|            Super_SloMo            |  4  |       pass       |       pass       |       pass       |          pass          |
|              alexnet              |  4  |       pass       |       pass       |       pass       |          pass          |
| attention_is_all_you_need_pytorch |  4  |       pass       |       pass       |       pass       |          pass          |
|               dcgan               |  4  |       pass       |       pass       |       pass       |          pass          |
|               demucs              |  4  |       pass       |       pass       |       pass       |          pass          |
|            mobilenet_v2           |  4  |       pass       |       pass       |       pass       |          pass          |
|                drq                |  1  |       pass       |       pass       |       pass       |          pass          |
|            fastNLP_Bert           |  4  |       pass       |       pass       |       pass       |          pass          |
|        functorch_dp_cifar10       |  4  |       pass       |       pass       |       pass       |          pass          |
|            densenet121            |  4  |       pass       |       pass       |       pass       |          pass          |
|              hf_Bart              |  4  |       pass       |       pass       |       pass       |          pass          |
|           hf_Bert_large           |  4  |       pass       |       pass       |       pass       |          pass          |
|           hf_DistilBert           |  4  |       pass       |       pass       |       pass       |          pass          |
|              hf_GPT2              |  2  |       pass       |       pass       |       pass       |          pass          |
|           hf_Longformer           |  4  |       pass       |       pass       |       pass       |          pass          |
|            hf_Reformer            |  4  |       pass       |       pass       |       pass       |          pass          |
|               hf_T5               |  4  |       pass       |       pass       |       pass       |          pass          |
|             hf_T5_base            |  4  |       pass       |       pass       |       pass       |          pass          |
|           lennard_jones           |  4  |       pass       |       pass       |       pass       |          pass          |
|             mnasnet1_0            |  4  |       pass       |       pass       |       pass       |          pass          |
|              hf_Bert              |  4  |       pass       |       pass       |       pass       |          pass          |
|                moco               |  4  |       pass       |   fail_to_run    |   fail_to_run    |      fail_to_run       |
|                dlrm               |  4  |       pass       |       pass       |   fail_to_run    |      fail_to_run       |
|             hf_BigBird            |  4  |       pass       |       pass       |   fail_to_run    |      fail_to_run       |
|           phlippe_resnet          |  4  |       pass       |       pass       |  fail_accuracy   |     fail_accuracy      |
|         Background_Matting        |  4  | eager_variation  | eager_variation  | eager_variation  |    eager_variation     |
|          vision_maskrcnn          |  4  | eager_variation  |      0.0000      | eager_variation  |    eager_variation     |
|             tacotron2             |  4  |   fail_to_run    |   fail_to_run    |      0.0000      |         0.0000         |
|        doctr_det_predictor        |  0  |      0.0000      |      0.0000      |      0.0000      |         0.0000         |
|        doctr_reco_predictor       |  0  |      0.0000      |      0.0000      |      0.0000      |         0.0000         |
|               llama               |  0  |      0.0000      |      0.0000      |      0.0000      |         0.0000         |
|           torchrec_dlrm           |  0  |      0.0000      |      0.0000      |      0.0000      |         0.0000         |
+-----------------------------------+-----+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------+------------------------+
|                name               |  bs  |  eager  | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------+------------------------+
|         speech_transformer        |  32  |  5.8681 |  13.5367  | 811.9404 |        801.5224        |
| attention_is_all_you_need_pytorch | 256  |  4.3576 |  10.6986  | 671.8298 |        713.6033        |
|            hf_T5_large            |  2   | 26.0158 |  54.0349  | 514.4005 |        510.0798        |
|           hf_Longformer           |  2   | 11.7053 |  30.7402  | 448.7565 |        444.5843        |
|      timm_vision_transformer      |  32  |  3.2698 |   7.2605  | 430.9528 |        432.7792        |
|          phlippe_densenet         | 128  |  3.3621 |   6.9743  | 424.9286 |        426.024         |
|         mobilenet_v3_large        |  32  |  3.3586 |   7.6126  | 420.8786 |        415.3274        |
|            fastNLP_Bert           |  6   |  5.0195 |  10.9845  | 408.0208 |        436.9673        |
|             hf_Albert             |  8   |  2.4723 |   8.5721  | 407.7363 |        401.7922        |
|              hf_GPT2              |  4   |  4.5578 |   9.5606  | 377.4055 |        376.3293        |
|         timm_efficientnet         |  32  |  4.9059 |  10.1556  | 370.9045 |        370.5136        |
|            BERT_pytorch           |  16  |  4.7936 |  11.4043  | 348.6068 |        347.8106        |
|           hf_Bert_large           |  4   |  9.9442 |  20.9677  | 344.6791 |        343.3888        |
|              hf_Bert              |  4   |  4.9178 |  10.4553  | 343.1352 |        333.2602        |
|           pytorch_struct          | 200  |  0.7768 |   1.321   | 343.1183 |        342.3984        |
|            mobilenet_v2           |  96  |  3.0792 |   6.8929  | 338.0467 |        338.612         |
|            densenet121            |  4   |  7.453  |  17.5669  | 328.6991 |        343.919         |
|              hf_Bart              |  4   | 10.7711 |  17.9662  | 326.7358 |        322.7043        |
|             mnasnet1_0            |  32  |  3.0395 |   6.6307  | 323.0717 |        324.1732        |
|               hf_T5               |  8   |  5.7026 |  13.3363  | 282.9559 |        281.6635        |
|           hf_DistilBert           |  8   |  2.4985 |   5.5365  | 279.7668 |        279.3107        |
|               yolov3              |  16  |  4.7603 |  11.3064  | 262.6355 |        259.9473        |
|             resnet152             |  32  |  8.877  |  19.9992  | 251.3667 |        246.7747        |
|            timm_vovnet            |  32  |   3.53  |   6.3249  | 246.8284 |        250.8095        |
|                drq                |  1   |  0.6622 |   1.0061  |  243.14  |        263.0037        |
|       nvidia_deeprecommender      | 256  |  0.4753 |   0.7606  | 242.6257 |        251.857         |
|             timm_nfnet            | 128  |  5.6618 |  11.0017  | 229.374  |        225.6032        |
|            timm_resnest           |  32  |  1.8281 |   3.8931  | 229.2705 |        229.4557        |
|         shufflenet_v2_x1_0        | 128  |  3.4138 |   7.5888  | 224.7522 |        228.6695        |
|              resnet50             |  32  |  3.1864 |   7.3638  | 214.3405 |        211.5939        |
|            timm_regnet            |  32  |  6.717  |  12.1563  | 207.459  |        202.3439        |
|          resnext50_32x4d          |  8   |  3.1695 |   6.9125  | 181.0959 |        181.4499        |
|          LearningToPaint          |  96  |  1.4565 |   2.8318  | 173.8694 |        175.1049        |
|         soft_actor_critic         | 256  |  0.445  |   0.6082  | 172.2714 |        181.3975        |
|              resnet18             |  16  |  1.3341 |   2.7142  | 157.2362 |        148.1282        |
|               vgg16               |  64  |  0.6314 |   1.1167  | 156.8115 |        158.9342        |
|           phlippe_resnet          | 128  |  1.3364 |   2.7032  | 142.043  |        140.9482        |
|           lennard_jones           | 1000 |  0.3913 |   0.5997  | 140.1834 |        136.1011        |
|            pytorch_unet           |  1   |  1.6234 |   4.367   | 138.7686 |        140.8722        |
|         Background_Matting        |  4   |  3.0396 |   11.433  | 135.1568 |        136.699         |
|        functorch_dp_cifar10       |  64  |  1.207  |   2.3889  | 117.332  |        117.371         |
|    pytorch_CycleGAN_and_pix2pix   |  1   |  1.2559 |   2.8988  | 101.5353 |        101.5587        |
|               demucs              |  4   |  1.4184 |   2.1307  | 78.2632  |        80.3662         |
|            Super_SloMo            |  6   |  2.738  |   9.6775  | 78.1261  |        78.2861         |
|          pytorch_stargan          |  16  |  1.1868 |   3.1623  | 48.8599  |        51.7515         |
|           squeezenet1_1           |  32  |  1.0267 |   1.7364  | 47.1619  |        46.8625         |
|               dcgan               |  32  |  0.4283 |   0.7074  | 19.3249  |         17.817         |
|            tts_angular            |  64  |  0.4437 |   0.5138  |  6.1078  |         6.1647         |
|             hf_BigBird            |  2   |  12.76  |  36.8124  |   nan    |          nan           |
|           hf_GPT2_large           |  4   | 14.2063 |  29.5919  |   nan    |          nan           |
|            hf_Reformer            |  4   |  4.117  |   5.9118  |   nan    |          nan           |
|                dlrm               | 1024 |  0.3698 |   0.7866  |   nan    |          nan           |
|              alexnet              | 128  |  0.481  |   0.7624  |   nan    |          nan           |
|                moco               |  32  | 27.4118 |    nan    |   nan    |          nan           |
|   timm_vision_transformer_large   |  32  |  9.2332 |    nan    |   nan    |          nan           |
|        doctr_det_predictor        |  0   |   nan   |    nan    |   nan    |          nan           |
|        doctr_reco_predictor       |  0   |   nan   |    nan    |   nan    |          nan           |
|             tacotron2             |  0   |   nan   |    nan    |   nan    |          nan           |
|           torchrec_dlrm           |  0   |   nan   |    nan    |   nan    |          nan           |
+-----------------------------------+------+---------+-----------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------+------------------------+
|                name               |  bs  | eager  | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------+------------------------+
|            Super_SloMo            |  6   | 1.0014 |   0.822   |  1.1588  |         1.1588         |
|             hf_Albert             |  8   | 0.9599 |   0.9008  |  1.0399  |         1.0399         |
|            mobilenet_v2           |  96  | 0.9866 |   0.7652  |  1.0106  |         1.0111         |
|               hf_T5               |  8   | 0.9507 |   0.8891  |  0.9988  |         0.9988         |
|            fastNLP_Bert           |  6   | 1.0003 |   0.8878  |  0.9953  |         0.9953         |
|            tts_angular            |  64  | 0.9957 |   0.9957  |  0.9852  |         0.9852         |
| attention_is_all_you_need_pytorch | 256  | 0.9648 |   0.9066  |  0.9693  |         0.9693         |
|             timm_nfnet            | 128  | 0.9071 |   0.8746  |  0.9614  |         0.9612         |
|            BERT_pytorch           |  16  | 1.0003 |   0.8671  |  0.9428  |         0.9428         |
|              hf_GPT2              |  4   | 0.9357 |   0.8198  |  0.9317  |         0.9317         |
|         timm_efficientnet         |  32  | 0.9874 |   0.7663  |  0.9293  |         0.8747         |
|           hf_Bert_large           |  4   | 0.9845 |   0.8521  |  0.9138  |         0.9138         |
|              hf_Bert              |  4   | 0.9645 |   0.8338  |  0.8815  |         0.8815         |
|               yolov3              |  16  | 0.9877 |   0.8253  |   0.87   |         0.8701         |
|         shufflenet_v2_x1_0        | 128  | 0.954  |   0.8383  |  0.8596  |         0.8599         |
|         speech_transformer        |  32  | 0.9914 |   0.901   |  0.8583  |         0.8583         |
|            timm_regnet            |  32  | 0.9908 |   0.8499  |  0.8512  |         0.8498         |
|           hf_DistilBert           |  8   | 0.9262 |   0.8146  |  0.8456  |         0.8456         |
|            timm_resnest           |  32  | 0.9888 |   0.8817  |  0.8414  |         0.8304         |
|      timm_vision_transformer      |  32  | 0.9907 |   0.9299  |  0.8357  |         0.8357         |
|         Background_Matting        |  4   | 1.0125 |   0.6486  |  0.834   |         0.834          |
|             resnet152             |  32  | 0.996  |   0.8915  |  0.8296  |         0.8286         |
|            hf_T5_large            |  2   | 0.9831 |   0.8302  |  0.8201  |         0.8201         |
|          phlippe_densenet         | 128  | 0.9983 |   0.9982  |  0.806   |         0.7988         |
|         mobilenet_v3_large        |  32  | 0.9793 |   0.8396  |  0.7848  |         0.7275         |
|            pytorch_unet           |  1   | 0.9953 |   0.7154  |  0.7734  |         0.7734         |
|           squeezenet1_1           |  32  | 0.9674 |   0.9353  |  0.773   |         0.773          |
|          pytorch_stargan          |  16  | 0.9914 |   0.969   |  0.7715  |         0.7715         |
|               demucs              |  4   | 0.9663 |   0.9664  |  0.7665  |         0.7665         |
|              hf_Bart              |  4   | 0.9084 |   0.843   |  0.7545  |         0.7543         |
|              resnet50             |  32  | 0.9909 |   0.8638  |  0.7515  |         0.7522         |
|            timm_vovnet            |  32  | 0.9892 |   0.8166  |  0.7427  |         0.7427         |
|             mnasnet1_0            |  32  | 0.9801 |   0.8686  |  0.742   |         0.7486         |
|           pytorch_struct          | 200  | 0.9992 |   0.5168  |  0.7338  |         0.7274         |
|               vgg16               |  64  | 0.9922 |   0.7246  |  0.723   |         0.723          |
|            densenet121            |  4   | 0.9944 |   0.9823  |  0.7085  |         0.7085         |
|          resnext50_32x4d          |  8   | 0.9947 |   0.8434  |  0.6608  |         0.6608         |
|       nvidia_deeprecommender      | 256  | 0.9176 |   0.8055  |  0.6585  |         0.6585         |
|          LearningToPaint          |  96  | 0.9202 |   0.7116  |  0.6018  |         0.6018         |
|    pytorch_CycleGAN_and_pix2pix   |  1   | 0.9965 |   0.8594  |  0.5458  |         0.5458         |
|              resnet18             |  16  | 0.983  |   0.8055  |  0.5409  |         0.5409         |
|           hf_Longformer           |  2   | 0.8565 |   0.8296  |  0.4203  |         0.4206         |
|        functorch_dp_cifar10       |  64  | 0.9953 |   0.8396  |  0.3991  |         0.3991         |
|           phlippe_resnet          | 128  | 0.9881 |   0.864   |  0.3202  |         0.3202         |
|                drq                |  1   | 0.9877 |   0.8852  |  0.1818  |         0.1818         |
|               dcgan               |  32  | 0.9647 |   0.7957  |  0.1811  |         0.1811         |
|         soft_actor_critic         | 256  | 0.9995 |   0.9255  |  0.1078  |         0.1078         |
|           lennard_jones           | 1000 | 0.9996 |   0.9997  |  0.0648  |         0.0648         |
|                dlrm               | 1024 | 0.9995 |   0.9944  |   nan    |          nan           |
|             hf_BigBird            |  2   | 0.9493 |   0.9268  |   nan    |          nan           |
|           hf_GPT2_large           |  4   | 0.9663 |   0.8303  |   nan    |          nan           |
|            hf_Reformer            |  4   | 0.8004 |   0.8004  |   nan    |          nan           |
|              alexnet              | 128  | 0.9452 |   0.7919  |   nan    |          nan           |
|   timm_vision_transformer_large   |  32  | 0.9992 |    nan    |   nan    |          nan           |
|                moco               |  32  |  0.99  |    nan    |   nan    |          nan           |
|        doctr_det_predictor        |  0   |  nan   |    nan    |   nan    |          nan           |
|        doctr_reco_predictor       |  0   |  nan   |    nan    |   nan    |          nan           |
|             tacotron2             |  0   |  nan   |    nan    |   nan    |          nan           |
|           torchrec_dlrm           |  0   |  nan   |    nan    |   nan    |          nan           |
+-----------------------------------+------+--------+-----------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------+------+----------+-----------+----------+------------------------+
|                name               |  bs  |  eager   | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+----------+-----------+----------+------------------------+
|         Background_Matting        |  4   | 126.188  |  920.9148 | 107.3641 |        107.3898        |
|            hf_T5_large            |  2   | 226.1605 |  273.5668 | 97.5123  |         97.364         |
|               hf_T5               |  8   | 183.5685 |  212.1523 | 93.4684  |         93.424         |
|             timm_nfnet            | 128  | 120.3776 |  120.3382 | 80.7816  |        80.7568         |
|            Super_SloMo            |  6   |  79.768  |  446.9936 | 65.2653  |        65.3729         |
|           hf_Longformer           |  2   | 134.7514 |  185.8467 | 63.0971  |        62.7711         |
|               yolov3              |  16  | 68.7545  |  85.0545  | 61.4474  |        61.6063         |
|            timm_regnet            |  32  | 61.6607  |   72.541  | 59.1829  |        59.1793         |
|             resnet152             |  32  | 66.9857  |  88.1438  | 54.3423  |         54.388         |
|               vgg16               |  64  | 66.3144  |  66.3727  | 53.3586  |        53.3111         |
|               demucs              |  4   | 53.7002  |  53.4274  | 51.9174  |         52.292         |
|           hf_Bert_large           |  4   | 82.7755  |  95.0971  | 50.9809  |        51.0579         |
|            pytorch_unet           |  1   | 40.0974  |  194.485  | 34.0614  |        34.0626         |
|         speech_transformer        |  32  | 67.6605  |  82.6321  | 33.5419  |        34.1472         |
| attention_is_all_you_need_pytorch | 256  | 58.0688  |  67.5439  | 32.9818  |        32.9453         |
|              hf_Bart              |  4   | 72.3349  |  86.4992  | 32.4278  |        32.1897         |
|            mobilenet_v2           |  96  | 47.0772  |  60.4758  | 31.8545  |        31.8096         |
|            fastNLP_Bert           |  6   | 57.1553  |   70.013  | 31.4759  |        31.4673         |
|             hf_Albert             |  8   | 68.5193  |   72.423  | 29.6847  |        29.6681         |
|            timm_vovnet            |  32  |  28.972  |  35.2093  | 26.8181  |        26.8979         |
|              hf_GPT2              |  4   |  49.085  |  50.7364  |  25.272  |        25.2969         |
|         timm_efficientnet         |  32  |  34.504  |  52.1242  | 23.3891  |        23.4013         |
|              resnet50             |  32  | 27.0455  |  36.8316  | 22.9154  |        22.8704         |
|              hf_Bert              |  4   | 45.3523  |  48.2591  | 22.4861  |        22.6212         |
|           hf_DistilBert           |  8   | 33.4487  |  34.5952  | 22.0925  |        22.0207         |
|         shufflenet_v2_x1_0        | 128  | 32.1145  |  39.8152  | 19.6246  |        19.6546         |
|            densenet121            |  4   | 61.1776  |  84.6034  | 18.6762  |        20.0414         |
|            BERT_pytorch           |  16  |  53.911  |  65.6561  | 17.2022  |        17.2928         |
|      timm_vision_transformer      |  32  | 28.9208  |  33.1916  | 16.6903  |        16.7272         |
|            timm_resnest           |  32  | 24.2712  |  28.3163  | 16.6836  |        16.6077         |
|             mnasnet1_0            |  32  | 21.8716  |  31.7637  | 13.9869  |        13.3091         |
|         mobilenet_v3_large        |  32  | 28.6989  |  34.3167  | 13.2958  |        13.2472         |
|          pytorch_stargan          |  16  | 14.7159  |  17.9245  | 11.8429  |        11.8773         |
|          resnext50_32x4d          |  8   | 20.0615  |  27.0873  | 11.6723  |        11.7853         |
|          phlippe_densenet         | 128  | 25.8489  |  30.5896  | 11.6326  |        11.6586         |
|       nvidia_deeprecommender      | 256  | 10.2264  |  10.2346  | 10.9126  |        10.9174         |
|          LearningToPaint          |  96  | 12.0674  |  15.0326  |  8.7502  |         8.7575         |
|    pytorch_CycleGAN_and_pix2pix   |  1   | 15.1254  |  14.8061  |  7.1496  |         7.1753         |
|            tts_angular            |  64  |  6.623   |   6.9183  |  6.3719  |         7.0997         |
|              resnet18             |  16  |  9.0644  |  11.6187  |  5.769   |         6.2752         |
|           squeezenet1_1           |  32  | 10.2129  |  11.6679  |  5.449   |         5.4429         |
|           phlippe_resnet          | 128  |  9.1012  |  11.8759  |  5.0423  |         5.1005         |
|           pytorch_struct          | 200  |  5.7718  |   6.0466  |  3.2621  |         3.0943         |
|        functorch_dp_cifar10       |  64  | 10.6986  |  11.1796  |  2.8798  |         2.8826         |
|                drq                |  1   |  3.4406  |   4.3278  |  2.2122  |         2.2131         |
|               dcgan               |  32  |  2.3487  |   3.2693  |  1.4337  |         1.4339         |
|         soft_actor_critic         | 256  |  2.6852  |   2.4219  |  1.394   |         1.2928         |
|           lennard_jones           | 1000 |  1.7279  |   2.1056  |  1.0799  |         1.1465         |
|             hf_BigBird            |  2   | 193.8281 |  274.9347 |   nan    |          nan           |
|           hf_GPT2_large           |  4   | 212.6495 |  215.2812 |   nan    |          nan           |
|            hf_Reformer            |  4   | 81.6056  |  81.5701  |   nan    |          nan           |
|              alexnet              | 128  |  9.8388  |   9.8515  |   nan    |          nan           |
|                dlrm               | 1024 |  4.4203  |   4.9248  |   nan    |          nan           |
|   timm_vision_transformer_large   |  32  | 465.1919 |    nan    |   nan    |          nan           |
|                moco               |  32  | 52.5112  |    nan    |   nan    |          nan           |
|        doctr_det_predictor        |  0   |   nan    |    nan    |   nan    |          nan           |
|        doctr_reco_predictor       |  0   |   nan    |    nan    |   nan    |          nan           |
|             tacotron2             |  0   |   nan    |    nan    |   nan    |          nan           |
|           torchrec_dlrm           |  0   |   nan    |    nan    |   nan    |          nan           |
+-----------------------------------+------+----------+-----------+----------+------------------------+
~~~

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
|                   name                  |  bs | eager  | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
|              OPTForCausalLM             |  2  | 0.9942 |   0.9359  |  2.4989  |         2.505          |
|      GPT2ForSequenceClassification      |  4  | 0.9809 |   0.9575  |  2.3042  |         2.3047         |
|       ElectraForQuestionAnswering       |  64 | 0.9879 |   0.9777  |  2.1261  |         2.1251         |
|       MT5ForConditionalGeneration       |  16 | 0.994  |   0.8331  |  2.1198  |         2.3462         |
|             XGLMForCausalLM             |  8  | 0.9689 |   0.756   |  2.0803  |         2.2268         |
|      M2M100ForConditionalGeneration     |  16 | 1.0386 |   0.8242  |  2.0196  |         2.2301         |
|      MBartForConditionalGeneration      |  2  | 0.9944 |   0.9533  |  1.8923  |         1.536          |
|               DistillGPT2               |  16 | 0.9906 |   0.9577  |  1.8889  |         1.8886         |
|            PLBartForCausalLM            |  8  | 0.9921 |   0.9626  |  1.8785  |         1.8758         |
|             XLNetLMHeadModel            |  8  | 0.9966 |   0.9671  |  1.8342  |         1.8388         |
|          MobileBertForMaskedLM          |  64 | 0.9474 |   0.8102  |  1.8218  |         1.8041         |
|            ElectraForCausalLM           |  32 | 0.9833 |   0.9369  |  1.7939  |         1.7971         |
|       RobertaForQuestionAnswering       |  16 | 0.9855 |   0.9698  |  1.7829  |         1.7818         |
|         BertForQuestionAnswering        |  16 | 0.9854 |   0.971   |  1.7696  |         1.7711         |
|          AllenaiLongformerBase          |  4  | 0.9455 |   0.6584  |  1.766   |         1.7542         |
|      PLBartForConditionalGeneration     |  4  | 0.9926 |   0.9393  |  1.7384  |         1.7431         |
|            RobertaForCausalLM           |  16 | 0.9881 |   0.9635  |  1.6676  |         1.6706         |
|        T5ForConditionalGeneration       |  4  | 0.9836 |   0.855   |  1.6653  |         1.6562         |
|                 T5Small                 |  4  | 0.9834 |   0.8595  |  1.6621  |         1.6593         |
|             MBartForCausalLM            |  4  | 0.9933 |   0.9655  |  1.645   |         1.6471         |
|             BartForCausalLM             |  4  | 0.9942 |   0.9666  |  1.638   |         1.6438         |
|     MegatronBertForQuestionAnswering    |  8  | 0.9813 |   0.9614  |   1.62   |         1.6249         |
|                CamemBert                |  16 | 0.988  |   0.9633  |  1.619   |         1.6201         |
|        AlbertForQuestionAnswering       |  4  | 0.9999 |   0.8856  |  1.6163  |         1.6168         |
|             YituTechConvBert            |  16 | 0.9863 |   0.9494  |  1.613   |         1.6131         |
|            AlbertForMaskedLM            |  4  | 0.9997 |   0.8849  |  1.6071  |         1.6083         |
|             BertForMaskedLM             |  16 | 0.9863 |   0.9606  |  1.5977  |         1.5978         |
| BlenderbotSmallForConditionalGeneration |  64 | 0.999  |   0.9031  |  1.5926  |         1.5913         |
|           LayoutLMForMaskedLM           |  16 | 0.9861 |   0.9631  |  1.5826  |         1.5936         |
|       BartForConditionalGeneration      |  2  | 0.9959 |   0.9619  |  1.5433  |         1.5814         |
|         MegatronBertForCausalLM         |  4  | 0.9928 |   0.9097  |  1.5092  |         1.4978         |
|         Speech2Text2ForCausalLM         | 256 | 0.9886 |   0.9262  |  1.498   |         1.5172         |
|      DistilBertForQuestionAnswering     | 256 | 0.9946 |   0.9873  |  1.4495  |         1.4508         |
|            PegasusForCausalLM           |  32 | 0.9718 |   0.8882  |  1.3946  |         1.3935         |
|        BlenderbotSmallForCausalLM       |  64 | 0.9844 |   0.8981  |  1.3803  |         1.4522         |
|             TrOCRForCausalLM            |  32 | 0.9934 |   0.9634  |  1.3757  |         1.3755         |
|     PegasusForConditionalGeneration     |  32 | 0.9862 |   0.8712  |  1.3015  |         1.4156         |
|          DistilBertForMaskedLM          | 128 | 0.9935 |   0.9525  |  1.2233  |         1.2222         |
|      MobileBertForQuestionAnswering     | 128 | 0.946  |   0.8143  |  0.9272  |         0.9363         |
|    LayoutLMForSequenceClassification    |  16 | 0.9854 |   0.9721  |   0.0    |          0.0           |
|       DebertaForQuestionAnswering       |  8  | 0.946  |   0.7679  |   0.0    |          0.0           |
|          BlenderbotForCausalLM          |  4  | 0.9651 |   0.7567  |   0.0    |          0.0           |
|            DebertaForMaskedLM           |  4  | 0.8554 |   0.6488  |   0.0    |          0.0           |
|      DebertaV2ForQuestionAnswering      |  2  | 0.837  |   0.6083  |   0.0    |          0.0           |
|           DebertaV2ForMaskedLM          |  1  | 0.8338 |   0.6069  |   0.0    |          0.0           |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
|                   name                  | bs |      eager       |    aot_eager     |     inductor     | inductor_no_cudagraphs |
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
|          BlenderbotForCausalLM          | 1  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |    pass_due_to_skip    |
|           DebertaV2ForMaskedLM          | 1  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |    pass_due_to_skip    |
|       MT5ForConditionalGeneration       | 1  |       pass       |       pass       |       pass       |          pass          |
|         MegatronBertForCausalLM         | 1  |       pass       |       pass       |       pass       |          pass          |
|     MegatronBertForQuestionAnswering    | 1  |       pass       |       pass       |       pass       |          pass          |
|          MobileBertForMaskedLM          | 1  |       pass       |       pass       |       pass       |          pass          |
|      MobileBertForQuestionAnswering     | 1  |       pass       |       pass       |       pass       |          pass          |
|              OPTForCausalLM             | 1  |       pass       |       pass       |       pass       |          pass          |
|            PLBartForCausalLM            | 1  |       pass       |       pass       |       pass       |          pass          |
|      PLBartForConditionalGeneration     | 1  |       pass       |       pass       |       pass       |          pass          |
|            PegasusForCausalLM           | 1  |       pass       |       pass       |       pass       |          pass          |
|     PegasusForConditionalGeneration     | 1  |       pass       |       pass       |       pass       |          pass          |
|            RobertaForCausalLM           | 1  |       pass       |       pass       |       pass       |          pass          |
|       RobertaForQuestionAnswering       | 1  |       pass       |       pass       |       pass       |          pass          |
|         Speech2Text2ForCausalLM         | 1  |       pass       |       pass       |       pass       |          pass          |
|        T5ForConditionalGeneration       | 1  |       pass       |       pass       |       pass       |          pass          |
|                 T5Small                 | 1  |       pass       |       pass       |       pass       |          pass          |
|             TrOCRForCausalLM            | 1  |       pass       |       pass       |       pass       |          pass          |
|             XGLMForCausalLM             | 1  |       pass       |       pass       |       pass       |          pass          |
|             XLNetLMHeadModel            | 1  |       pass       |       pass       |       pass       |          pass          |
|             YituTechConvBert            | 1  |       pass       |       pass       |       pass       |          pass          |
|      MBartForConditionalGeneration      | 1  |       pass       |       pass       |       pass       |          pass          |
|             MBartForCausalLM            | 1  |       pass       |       pass       |       pass       |          pass          |
|      M2M100ForConditionalGeneration     | 1  |       pass       |       pass       |       pass       |          pass          |
|    LayoutLMForSequenceClassification    | 1  |       pass       |       pass       |       pass       |          pass          |
|            AlbertForMaskedLM            | 1  |       pass       |       pass       |       pass       |          pass          |
|          AllenaiLongformerBase          | 1  |       pass       |       pass       |       pass       |          pass          |
|             BartForCausalLM             | 1  |       pass       |       pass       |       pass       |          pass          |
|       BartForConditionalGeneration      | 1  |       pass       |       pass       |       pass       |          pass          |
|             BertForMaskedLM             | 1  |       pass       |       pass       |       pass       |          pass          |
|         BertForQuestionAnswering        | 1  |       pass       |       pass       |       pass       |          pass          |
|        BlenderbotSmallForCausalLM       | 1  |       pass       |       pass       |       pass       |          pass          |
| BlenderbotSmallForConditionalGeneration | 1  |       pass       |       pass       |       pass       |          pass          |
|                CamemBert                | 1  |       pass       |       pass       |       pass       |          pass          |
|            DebertaForMaskedLM           | 1  |       pass       |       pass       |       pass       |          pass          |
|       DebertaForQuestionAnswering       | 1  |       pass       |       pass       |       pass       |          pass          |
|          DistilBertForMaskedLM          | 1  |       pass       |       pass       |       pass       |          pass          |
|      DistilBertForQuestionAnswering     | 1  |       pass       |       pass       |       pass       |          pass          |
|               DistillGPT2               | 1  |       pass       |       pass       |       pass       |          pass          |
|            ElectraForCausalLM           | 1  |       pass       |       pass       |       pass       |          pass          |
|       ElectraForQuestionAnswering       | 1  |       pass       |       pass       |       pass       |          pass          |
|      GPT2ForSequenceClassification      | 1  |       pass       |       pass       |       pass       |          pass          |
|           LayoutLMForMaskedLM           | 1  |       pass       |       pass       |       pass       |          pass          |
|      DebertaV2ForQuestionAnswering      | 1  |       pass       |       pass       |   fail_to_run    |      fail_to_run       |
|        AlbertForQuestionAnswering       | 1  |       pass       |       pass       |  fail_accuracy   |     fail_accuracy      |
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+---------+-----------+----------+------------------------+
|                   name                  |  bs |  eager  | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+---------+-----------+----------+------------------------+
|            PegasusForCausalLM           |  32 |  5.9802 |  11.4431  | 814.8897 |         255.29         |
|          MobileBertForMaskedLM          |  64 |  15.699 |  40.5806  | 621.2472 |        641.7833        |
|      MobileBertForQuestionAnswering     | 128 | 15.7757 |  40.3503  | 582.039  |        579.7287        |
|       MT5ForConditionalGeneration       |  16 |  8.135  |  18.6691  | 551.3717 |        560.1428        |
|             YituTechConvBert            |  16 | 10.6034 |  20.2192  | 465.3614 |        466.0206        |
|            ElectraForCausalLM           |  32 |  7.5894 |  13.9728  | 437.293  |        439.8019        |
|          AllenaiLongformerBase          |  4  | 11.4177 |  30.8158  | 411.2795 |        410.4339        |
|      M2M100ForConditionalGeneration     |  16 | 11.7046 |  25.3976  | 380.7446 |        380.955         |
|             XLNetLMHeadModel            |  8  | 10.2437 |  27.9182  | 352.8004 |        352.7516        |
|            AlbertForMaskedLM            |  4  |  2.3681 |   8.1171  | 352.0903 |        370.1729        |
|             XGLMForCausalLM             |  8  |  9.6187 |  20.6035  | 346.8905 |        344.7347        |
|         MegatronBertForCausalLM         |  4  | 10.3539 |  21.2987  | 340.8456 |        344.1151        |
|       ElectraForQuestionAnswering       |  64 |  5.2707 |  10.5109  | 340.3166 |        341.0878        |
|     PegasusForConditionalGeneration     |  32 |  5.1667 |   19.225  | 339.9314 |        317.5693        |
|      MBartForConditionalGeneration      |  2  | 11.7647 |  25.5629  | 332.702  |        333.4432        |
|                 T5Small                 |  4  |  5.5074 |  13.2374  | 324.3855 |        325.5567        |
|        T5ForConditionalGeneration       |  4  |  5.5495 |  13.1651  | 323.6672 |        325.1325        |
|     MegatronBertForQuestionAnswering    |  8  | 10.3207 |  21.2929  | 323.4524 |        324.4397        |
|      GPT2ForSequenceClassification      |  4  |   4.77  |   9.7326  | 323.1691 |        321.3846        |
|       BartForConditionalGeneration      |  2  |  11.427 |  25.7008  | 321.1777 |        333.1578        |
|        AlbertForQuestionAnswering       |  4  |  2.3495 |   8.0052  | 314.2912 |        328.9652        |
|      PLBartForConditionalGeneration     |  4  |  8.9844 |  16.5692  | 297.6701 |        278.446         |
| BlenderbotSmallForConditionalGeneration |  64 |  7.5996 |  17.1367  | 289.8975 |        292.8094        |
|        BlenderbotSmallForCausalLM       |  64 |  4.3221 |   8.2531  | 274.757  |        261.6962        |
|       RobertaForQuestionAnswering       |  16 |  5.1173 |  11.1714  | 273.882  |        274.5867        |
|               DistillGPT2               |  16 |  2.5543 |   5.0368  | 267.031  |        267.848         |
|         BertForQuestionAnswering        |  16 |  5.1209 |  10.5719  | 265.9109 |        281.6978        |
|           LayoutLMForMaskedLM           |  16 |  5.6772 |  11.1826  | 261.6479 |        262.2402        |
|                CamemBert                |  16 |  5.2619 |  11.2078  | 260.9713 |        266.0405        |
|          DistilBertForMaskedLM          | 128 |  2.5551 |   5.7136  | 256.5461 |        240.982         |
|            RobertaForCausalLM           |  16 |  5.1179 |  11.2959  | 255.0796 |        254.953         |
|             BertForMaskedLM             |  16 |  5.1511 |  10.6419  | 246.8009 |        261.1474        |
|      DistilBertForQuestionAnswering     | 256 |  2.5246 |   5.4178  | 245.7984 |        244.739         |
|             BartForCausalLM             |  4  |  6.2038 |  11.7528  | 240.4101 |        239.5187        |
|              OPTForCausalLM             |  2  |  5.6827 |  11.3146  | 237.5466 |        237.6625        |
|             MBartForCausalLM            |  4  |  6.5467 |  12.1503  | 235.4426 |        235.7554        |
|         Speech2Text2ForCausalLM         | 256 |  3.3142 |   6.2447  | 234.1054 |        234.3185        |
|             TrOCRForCausalLM            |  32 |  6.1579 |  11.9922  | 230.3112 |        228.9748        |
|            PLBartForCausalLM            |  8  |  3.7152 |   6.7484  | 212.6627 |         213.84         |
|      DebertaV2ForQuestionAnswering      |  2  | 15.2019 |  27.8129  |   nan    |          nan           |
|           DebertaV2ForMaskedLM          |  1  | 15.3387 |  26.5443  |   nan    |          nan           |
|          BlenderbotForCausalLM          |  4  | 11.6699 |  22.2352  |   nan    |          nan           |
|       DebertaForQuestionAnswering       |  8  |  7.2183 |  13.7242  |   nan    |          nan           |
|            DebertaForMaskedLM           |  4  |  7.2914 |  13.3464  |   nan    |          nan           |
|    LayoutLMForSequenceClassification    |  16 |  5.553  |  11.6407  |   nan    |          nan           |
+-----------------------------------------+-----+---------+-----------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
|                   name                  |  bs | eager  | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
|             XLNetLMHeadModel            |  8  | 0.9843 |   0.9603  |  1.1342  |         1.1342         |
|      GPT2ForSequenceClassification      |  4  | 1.0001 |   0.906   |  1.1135  |         1.1135         |
|       ElectraForQuestionAnswering       |  64 | 1.0014 |   0.9537  |  1.1114  |         1.1114         |
|       RobertaForQuestionAnswering       |  16 | 1.0012 |   0.9279  |  1.0816  |         1.0816         |
|         BertForQuestionAnswering        |  16 | 1.0017 |   0.9284  |  1.0789  |         1.0789         |
|              OPTForCausalLM             |  2  | 0.9682 |   0.9246  |  1.0615  |         1.0615         |
|            RobertaForCausalLM           |  16 | 0.9999 |   0.9209  |  1.0541  |         1.0541         |
|        T5ForConditionalGeneration       |  4  | 0.9999 |   0.9516  |  1.0356  |         1.0356         |
|                 T5Small                 |  4  | 0.9999 |   0.9516  |  1.0356  |         1.0356         |
|             BertForMaskedLM             |  16 | 0.9998 |   0.9207  |   1.03   |          1.03          |
|      DistilBertForQuestionAnswering     | 256 | 1.0114 |   0.9556  |  1.0299  |         1.0299         |
|                CamemBert                |  16 |  1.0   |   0.9184  |  1.0277  |         1.0277         |
|           LayoutLMForMaskedLM           |  16 | 0.9999 |   0.9211  |  0.9867  |         0.9867         |
|        AlbertForQuestionAnswering       |  4  |  1.0   |   0.7449  |  0.9734  |         0.9734         |
|               DistillGPT2               |  16 |  1.0   |   0.8591  |  0.9682  |         0.9682         |
|             YituTechConvBert            |  16 | 0.953  |   0.8749  |  0.9575  |         0.9575         |
|            AlbertForMaskedLM            |  4  |  1.0   |   0.7338  |  0.9574  |         0.9574         |
|     MegatronBertForQuestionAnswering    |  8  |  1.0   |   0.904   |  0.953   |         0.953          |
|      PLBartForConditionalGeneration     |  4  |  0.93  |   0.8787  |  0.9215  |         0.9215         |
|     PegasusForConditionalGeneration     |  32 | 0.9439 |   0.8957  |  0.8911  |         0.8911         |
|       MT5ForConditionalGeneration       |  16 | 0.9999 |   0.8495  |  0.8906  |         0.8906         |
|            ElectraForCausalLM           |  32 | 0.9161 |   0.7864  |  0.8896  |         0.8896         |
|            PLBartForCausalLM            |  8  | 0.9237 |   0.8168  |  0.8748  |         0.8748         |
|          DistilBertForMaskedLM          | 128 |  1.0   |   0.8468  |  0.8677  |         0.8677         |
|      MBartForConditionalGeneration      |  2  |  1.0   |   0.8946  |  0.8672  |         0.8672         |
|             TrOCRForCausalLM            |  32 |  0.92  |   0.8307  |  0.8558  |         0.8558         |
|             MBartForCausalLM            |  4  | 0.951  |   0.8913  |  0.8501  |         0.8501         |
|       BartForConditionalGeneration      |  2  |  1.0   |   0.8987  |  0.8456  |         0.8456         |
|         MegatronBertForCausalLM         |  4  |  1.0   |   0.8644  |  0.845   |         0.845          |
|             BartForCausalLM             |  4  | 0.951  |   0.8911  |  0.8311  |         0.8311         |
| BlenderbotSmallForConditionalGeneration |  64 |  1.0   |   0.8895  |  0.816   |         0.816          |
|            PegasusForCausalLM           |  32 | 0.9238 |   0.8405  |  0.7966  |         0.7966         |
|        BlenderbotSmallForCausalLM       |  64 | 0.8906 |   0.7493  |  0.787   |         0.787          |
|          MobileBertForMaskedLM          |  64 |  1.0   |   0.8769  |  0.7473  |         0.7473         |
|         Speech2Text2ForCausalLM         | 256 | 0.8865 |   0.7573  |  0.7364  |         0.7364         |
|             XGLMForCausalLM             |  8  | 0.9431 |   0.8612  |  0.6744  |         0.6744         |
|      MobileBertForQuestionAnswering     | 128 | 1.0161 |   1.0064  |  0.6505  |         0.6505         |
|      M2M100ForConditionalGeneration     |  16 | 0.955  |   0.8772  |  0.6058  |         0.6058         |
|          AllenaiLongformerBase          |  4  | 0.8568 |   0.7887  |  0.4696  |         0.4696         |
|       DebertaForQuestionAnswering       |  8  | 0.9524 |   1.0537  |   nan    |          nan           |
|          BlenderbotForCausalLM          |  4  | 0.9932 |   0.9937  |   nan    |          nan           |
|      DebertaV2ForQuestionAnswering      |  2  | 0.9762 |   0.9764  |   nan    |          nan           |
|    LayoutLMForSequenceClassification    |  16 | 1.0014 |   0.9295  |   nan    |          nan           |
|            DebertaForMaskedLM           |  4  | 0.9326 |   0.9156  |   nan    |          nan           |
|           DebertaV2ForMaskedLM          |  1  | 0.977  |   0.9068  |   nan    |          nan           |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
|                   name                  |  bs |  eager   | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
|      MobileBertForQuestionAnswering     | 128 | 182.5226 |  235.974  | 215.1751 |        216.4484        |
|            AlbertForMaskedLM            |  4  | 266.3614 |  300.7619 | 165.8751 |        165.7583        |
|        AlbertForQuestionAnswering       |  4  | 264.1258 |  298.0116 | 163.5808 |        163.5286        |
|             XLNetLMHeadModel            |  8  | 281.0762 |  287.9997 | 151.6205 |        151.4111        |
|     PegasusForConditionalGeneration     |  32 | 147.1145 |  181.5002 | 107.5555 |        108.9964        |
|          AllenaiLongformerBase          |  4  | 192.8537 |  273.1966 | 103.1999 |        102.9686        |
|             TrOCRForCausalLM            |  32 | 139.0841 |  143.3156 |  99.98   |        100.4592        |
|          MobileBertForMaskedLM          |  64 | 186.7725 |  241.7768 | 95.3639  |        95.5129         |
|      MBartForConditionalGeneration      |  2  | 145.0382 |  157.2691 | 90.9443  |        89.4301         |
|       BartForConditionalGeneration      |  2  | 139.8443 |  144.7792 | 89.1163  |        91.0751         |
|     MegatronBertForQuestionAnswering    |  8  | 144.707  |  147.5001 | 87.5361  |        87.4944         |
|             YituTechConvBert            |  16 | 127.1205 |  132.4614 | 77.8216  |        77.8727         |
| BlenderbotSmallForConditionalGeneration |  64 | 113.4108 |  133.5939 | 76.6448  |        76.8754         |
|                CamemBert                |  16 | 119.8878 |  123.1316 | 73.1863  |        73.2015         |
|      M2M100ForConditionalGeneration     |  16 | 116.6824 |  180.4988 |   72.1   |        71.9561         |
|      DistilBertForQuestionAnswering     | 256 | 103.9112 |  104.6181 | 71.5622  |        71.5815         |
|           LayoutLMForMaskedLM           |  16 | 114.3058 |  116.9869 | 71.1742  |        70.8273         |
|             MBartForCausalLM            |  4  | 114.2678 |  118.2349 | 69.3464  |        69.4083         |
|          DistilBertForMaskedLM          | 128 | 85.2726  |  89.0205  | 69.3028  |        69.3028         |
|             BartForCausalLM             |  4  | 114.0892 |  117.8833 | 69.2815  |        69.5855         |
|      PLBartForConditionalGeneration     |  4  | 117.413  |  126.8599 | 69.2785  |         69.06          |
|            RobertaForCausalLM           |  16 | 116.517  |  119.6429 | 69.1365  |        68.8249         |
|             BertForMaskedLM             |  16 | 111.6765 |  114.3151 | 68.8883  |        68.9191         |
|              OPTForCausalLM             |  2  | 172.5161 |  180.0393 | 68.2571  |        68.1263         |
|                 T5Small                 |  4  | 106.202  |  124.1665 | 63.0574  |        63.1405         |
|        T5ForConditionalGeneration       |  4  | 106.3016 |  123.0624 | 62.9952  |        63.0701         |
|            PLBartForCausalLM            |  8  | 115.1702 |  117.927  | 62.1232  |        62.1436         |
|         MegatronBertForCausalLM         |  4  | 88.8001  |  94.8498  | 57.5259  |        57.7304         |
|               DistillGPT2               |  16 | 106.873  |  110.3903 | 55.9704  |         55.96          |
|       ElectraForQuestionAnswering       |  64 | 116.1725 |  117.0499 |  53.847  |        53.8888         |
|         BertForQuestionAnswering        |  16 |  96.648  |  97.9249  | 53.7727  |        53.7566         |
|       RobertaForQuestionAnswering       |  16 | 96.9622  |  99.8688  |  53.749  |         53.702         |
|            PegasusForCausalLM           |  32 | 74.0405  |  84.9498  | 53.2941  |        53.2093         |
|             XGLMForCausalLM             |  8  | 94.4726  |  143.9937 | 52.5737  |        52.7642         |
|            ElectraForCausalLM           |  32 | 89.7358  |  94.1721  | 49.1765  |        49.2004         |
|       MT5ForConditionalGeneration       |  16 | 94.1218  |  111.5718 | 43.6003  |        43.7569         |
|        BlenderbotSmallForCausalLM       |  64 | 58.6737  |  65.1917  | 42.0874  |        42.3345         |
|      GPT2ForSequenceClassification      |  4  | 93.2112  |  95.5255  | 39.7861  |        39.8909         |
|         Speech2Text2ForCausalLM         | 256 | 53.5922  |  58.1395  | 35.8121  |        35.9591         |
|           DebertaV2ForMaskedLM          |  1  | 123.9398 |  192.0318 |   nan    |          nan           |
|      DebertaV2ForQuestionAnswering      |  2  | 127.6671 |  191.7707 |   nan    |          nan           |
|          BlenderbotForCausalLM          |  4  | 104.6523 |  154.9237 |   nan    |          nan           |
|            DebertaForMaskedLM           |  4  | 70.4682  |  105.8811 |   nan    |          nan           |
|    LayoutLMForSequenceClassification    |  16 | 99.2054  |  100.6683 |   nan    |          nan           |
|       DebertaForQuestionAnswering       |  8  | 80.1068  |  98.6121  |   nan    |          nan           |
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
~~~

timm_models suite with amp precision

Performance speedup

~~~
+---------------------------------+-----+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------+------------------------+
| tnt_s_patch16_224 | 128 | 0.9976 | 0.9965 | 3.3016 | 3.3034 |
| xcit_large_24_p8_224 | 5 | 0.9917 | 0.8684 | 2.4064 | 2.0561 |
| twins_pcpvt_base | 64 | 0.9973 | 0.9068 | 2.0944 | 2.0897 |
| coat_lite_mini | 128 | 0.9973 | 0.9954 | 2.0584 | 2.0591 |
| gmixer_24_224 | 128 | 0.9953 | 0.8894 | 1.8678 | 1.8633 |
| crossvit_9_240 | 128 | 0.9932 | 0.7829 | 1.7906 | 1.7845 |
| ghostnet_100 | 128 | 0.9922 | 0.7644 | 1.7815 | 1.7813 |
| volo_d1_224 | 64 | 0.9944 | 0.9734 | 1.7241 | 1.7279 |
| gmlp_s16_224 | 128 | 0.9942 | 1.0826 | 1.7207 | 1.7198 |
| swin_base_patch4_window7_224 | 64 | 0.9913 | 0.9525 | 1.7048 | 1.7109 |
| convit_base | 64 | 0.9981 | 0.9968 | 1.6208 | 1.6213 |
| pit_b_224 | 64 | 0.995 | 0.9924 | 1.6034 | 1.6029 |
| lcnet_050 | 128 | 0.9417 | 0.7353 | 1.5874 | 1.5867 |
| jx_nest_base | 32 | 0.987 | 0.9858 | 1.5438 | 1.5463 |
| gluon_inception_v3 | 128 | 0.9962 | 0.8648 | 1.5201 | 1.5191 |
| adv_inception_v3 | 128 | 0.9969 | 0.8593 | 1.5082 | 1.5107 |
| inception_v3 | 128 | 0.9961 | 0.8633 | 1.5067 | 1.5071 |
| convnext_base | 64 | 0.9837 | 0.984 | 1.4956 | 1.4988 |
| sebotnet33ts_256 | 64 | 0.9576 | 0.7538 | 1.473 | 1.4567 |
| dla102 | 128 | 0.9956 | 0.8149 | 1.4697 | 1.469 |
| nfnet_l0 | 128 | 0.9897 | 0.8134 | 1.4515 | 1.4418 |
| mobilevit_s | 64 | 0.9623 | 0.7315 | 1.4473 | 1.4482 |
| beit_base_patch16_224 | 64 | 0.997 | 0.9664 | 1.4441 | 1.4451 |
| cait_m36_384 | 4 | 0.9948 | 0.9459 | 1.4405 | 1.4398 |
| dm_nfnet_f0 | 128 | 0.9865 | 0.9845 | 1.413 | 1.4148 |
| resmlp_12_224 | 128 | 0.9927 | 0.889 | 1.3963 | 1.3978 |
| eca_botnext26ts_256 | 128 | 0.9725 | 0.7194 | 1.3919 | 1.4062 |
| botnet26t_256 | 128 | 0.9739 | 0.8514 | 1.3841 | 1.3863 |
| mnasnet_100 | 128 | 0.9488 | 0.7405 | 1.3742 | 1.3719 |
| resnest101e | 64 | 0.9944 | 0.8672 | 1.3659 | 1.3658 |
| mixer_b16_224 | 128 | 0.9974 | 1.0175 | 1.3612 | 1.3615 |
| selecsls42b | 128 | 0.9992 | 0.8114 | 1.355 | 1.3549 |
| mobilenetv3_large_100 | 128 | 0.95 | 0.7594 | 1.3501 | 1.3488 |
| mobilenetv2_100 | 128 | 0.9494 | 0.7369 | 1.3493 | 1.3468 |
| regnety_002 | 128 | 0.9535 | 0.7166 | 1.3462 | 1.4305 |
| vit_base_patch16_224 | 64 | 0.9961 | 0.9933 | 1.3365 | 1.3361 |
| res2net50_14w_8s | 128 | 0.999 | 0.7891 | 1.336 | 1.3354 |
| hrnet_w18 | 128 | 0.9921 | 0.6354 | 1.3291 | 1.3244 |
| fbnetc_100 | 128 | 0.9499 | 0.7389 | 1.3239 | 1.3223 |
| res2next50 | 128 | 0.9989 | 0.8238 | 1.3155 | 1.3161 |
| deit_base_distilled_patch16_224 | 64 | 0.9963 | 0.9935 | 1.3137 | 1.3151 |
| spnasnet_100 | 128 | 0.9424 | 0.7387 | 1.3006 | 1.3052 |
| tf_efficientnet_b0 | 128 | 0.9604 | 0.6813 | 1.293 | 1.2928 |
| poolformer_m36 | 64 | 0.9865 | 0.9826 | 1.2762 | 1.276 |
| rexnet_100 | 128 | 0.9522 | 0.7028 | 1.2496 | 1.2498 |
| ese_vovnet19b_dw | 128 | 0.9579 | 0.8336 | 1.2493 | 1.2474 |
| fbnetv3_b | 128 | 0.9498 | 0.7687 | 1.2442 | 1.2436 |
| visformer_small | 128 | 0.9963 | 0.9448 | 1.1907 | 1.1918 |
| tinynet_a | 128 | 0.9466 | 0.6782 | 1.1811 | 1.1585 |
| tf_mixnet_l | 128 | 0.9764 | 0.8271 | 1.1666 | 1.1664 |
| mixnet_l | 128 | 0.9762 | 0.8209 | 1.1561 | 1.1552 |
| cspdarknet53 | 64 | 0.9329 | 0.7856 | 1.151 | 1.1503 |
| res2net101_26w_4s | 64 | 1.0005 | 0.7889 | 1.1295 | 1.1286 |
| dpn107 | 32 | 0.9307 | 0.8074 | 1.0747 | 1.0746 |
| gluon_xception65 | 32 | 0.9924 | 0.8435 | 1.0652 | 1.064 |
| swsl_resnext101_32x16d | 32 | 0.9977 | 0.8395 | 1.0436 | 1.0433 |
| gernet_l | 128 | 0.9361 | 0.7923 | 1.0214 | 1.0219 |
| repvgg_a2 | 128 | 0.9362 | 0.7555 | 1.0167 | 1.037 |
| convmixer_768_32 | 32 | 0.9986 | 0.9635 | 0.996 | 0.9959 |
| pnasnet5large | 16 | 0.9858 | 0.912 | 0.9045 | 0.8892 |
+---------------------------------+-----+--------+-----------+----------+------------------------+
~~~

Accuracy

~~~
+---------------------------------+----+---------------+---------------+---------------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+---------------------------------+----+---------------+---------------+---------------+------------------------+
| adv_inception_v3 | 8 | pass | pass | pass | pass |
| selecsls42b | 8 | pass | pass | pass | pass |
| beit_base_patch16_224 | 8 | pass | pass | pass | pass |
| tnt_s_patch16_224 | 8 | pass | pass | pass | pass |
| visformer_small | 8 | pass | pass | pass | pass |
| vit_base_patch16_224 | 8 | pass | pass | pass | pass |
| volo_d1_224 | 8 | pass | pass | pass | pass |
| botnet26t_256 | 8 | fail_accuracy | pass | pass | pass |
| cspdarknet53 | 8 | fail_accuracy | pass | pass | pass |
| dpn107 | 8 | fail_accuracy | pass | pass | pass |
| ese_vovnet19b_dw | 8 | fail_accuracy | pass | pass | pass |
| fbnetc_100 | 8 | fail_accuracy | pass | pass | pass |
| mixnet_l | 8 | fail_accuracy | pass | pass | pass |
| mnasnet_100 | 8 | fail_accuracy | pass | pass | pass |
| mobilevit_s | 8 | fail_accuracy | pass | pass | pass |
| regnety_002 | 8 | fail_accuracy | pass | pass | pass |
| repvgg_a2 | 8 | fail_accuracy | pass | pass | pass |
| rexnet_100 | 8 | fail_accuracy | pass | pass | pass |
| spnasnet_100 | 8 | fail_accuracy | pass | pass | pass |
| tf_efficientnet_b0 | 8 | fail_accuracy | pass | pass | pass |
| tf_mixnet_l | 8 | fail_accuracy | pass | pass | pass |
| tinynet_a | 8 | fail_accuracy | pass | pass | pass |
| eca_botnext26ts_256 | 8 | fail_accuracy | fail_accuracy | pass | pass |
| gernet_l | 8 | fail_accuracy | fail_accuracy | pass | pass |
| mobilenetv2_100 | 8 | fail_accuracy | fail_accuracy | pass | pass |
| gluon_xception65 | 8 | pass | pass | pass | fail_accuracy |
| sebotnet33ts_256 | 8 | pass | pass | pass | fail_accuracy |
| swin_base_patch4_window7_224 | 8 | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 8 | pass | pass | pass | pass |
| resnest101e | 8 | pass | pass | pass | pass |
| hrnet_w18 | 8 | pass | pass | pass | pass |
| cait_m36_384 | 4 | pass | pass | pass | pass |
| convit_base | 8 | pass | pass | pass | pass |
| convmixer_768_32 | 8 | pass | pass | pass | pass |
| convnext_base | 8 | pass | pass | pass | pass |
| crossvit_9_240 | 8 | pass | pass | pass | pass |
| deit_base_distilled_patch16_224 | 8 | pass | pass | pass | pass |
| dm_nfnet_f0 | 8 | pass | pass | pass | pass |
| ghostnet_100 | 8 | pass | pass | pass | pass |
| gluon_inception_v3 | 8 | pass | pass | pass | pass |
| gmixer_24_224 | 8 | pass | pass | pass | pass |
| resmlp_12_224 | 8 | pass | pass | pass | pass |
| gmlp_s16_224 | 8 | pass | pass | pass | pass |
| inception_v3 | 8 | pass | pass | pass | pass |
| pnasnet5large | 8 | pass | pass | pass | pass |
| res2next50 | 8 | pass | pass | pass | pass |
| res2net50_14w_8s | 8 | pass | pass | pass | pass |
| res2net101_26w_4s | 8 | pass | pass | pass | pass |
| jx_nest_base | 8 | pass | pass | pass | pass |
| poolformer_m36 | 8 | pass | pass | pass | pass |
| pit_b_224 | 8 | pass | pass | pass | pass |
| nfnet_l0 | 8 | pass | pass | pass | pass |
| mobilenetv3_large_100 | 8 | pass | pass | pass | pass |
| mixer_b16_224 | 8 | pass | pass | pass | pass |
| lcnet_050 | 8 | pass | pass | pass | pass |
| twins_pcpvt_base | 8 | pass | pass | pass | 0.0000 |
| dla102 | 8 | pass | pass | fail_accuracy | pass |
| xcit_large_24_p8_224 | 8 | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| fbnetv3_b | 8 | fail_accuracy | fail_accuracy | fail_accuracy | fail_accuracy |
| coat_lite_mini | 8 | pass | pass | 0.0000 | pass |
+---------------------------------+----+---------------+---------------+---------------+------------------------+
~~~

Compilation latency (sec)

~~~
+---------------------------------+-----+---------+-----------+-----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+---------+-----------+-----------+------------------------+
| twins_pcpvt_base | 64 | 10.9058 | 24.2534 | 1549.1182 | 1549.3968 |
| mobilevit_s | 64 | 5.4637 | 11.1892 | 1260.1864 | 1250.1549 |
| coat_lite_mini | 128 | 3.2474 | 7.9128 | 1256.0368 | 1261.5438 |
| crossvit_9_240 | 128 | 5.9921 | 13.1413 | 1181.8398 | 1163.6234 |
| swin_base_patch4_window7_224 | 64 | 8.6916 | 20.0535 | 1144.8752 | 1145.9039 |
| volo_d1_224 | 64 | 4.9355 | 12.331 | 959.9922 | 972.2927 |
| pit_b_224 | 64 | 3.3643 | 8.2352 | 939.9928 | 944.8402 |
| xcit_large_24_p8_224 | 5 | 12.3087 | 27.5101 | 919.4447 | 930.1613 |
| jx_nest_base | 32 | 6.8481 | 14.5014 | 909.285 | 909.8677 |
| cait_m36_384 | 4 | 13.3383 | 32.1735 | 877.7569 | 889.0475 |
| sebotnet33ts_256 | 64 | 4.2668 | 8.7328 | 723.3903 | 715.0022 |
| tnt_s_patch16_224 | 128 | 6.3291 | 15.7476 | 639.1069 | 644.4422 |
| convit_base | 64 | 3.6151 | 9.0698 | 606.4733 | 607.7017 |
| eca_botnext26ts_256 | 128 | 3.0188 | 6.7253 | 576.4692 | 577.7936 |
| ghostnet_100 | 128 | 7.6126 | 14.4555 | 574.4232 | 585.6942 |
| rexnet_100 | 128 | 5.4521 | 10.9926 | 574.0523 | 579.1604 |
| botnet26t_256 | 128 | 2.8889 | 5.9092 | 564.6397 | 558.1972 |
| hrnet_w18 | 128 | 8.7491 | 34.9528 | 517.5075 | 521.5671 |
| visformer_small | 128 | 2.5587 | 5.9233 | 451.9125 | 458.0089 |
| convnext_base | 64 | 6.9519 | 12.5288 | 441.4118 | 458.4532 |
| fbnetv3_b | 128 | 8.1352 | 17.7019 | 407.5326 | 401.4525 |
| res2net50_14w_8s | 128 | 8.7918 | 21.7777 | 405.0288 | 402.7811 |
| tinynet_a | 128 | 5.8002 | 12.6407 | 379.1793 | 385.4118 |
| tf_efficientnet_b0 | 128 | 4.9927 | 10.2662 | 373.0437 | 376.6531 |
| adv_inception_v3 | 128 | 5.5206 | 12.2555 | 371.087 | 371.0584 |
| gluon_inception_v3 | 128 | 5.5473 | 12.2703 | 367.0138 | 368.7707 |
| mobilenetv3_large_100 | 128 | 4.1129 | 8.3298 | 366.7194 | 368.975 |
| inception_v3 | 128 | 5.7569 | 13.1324 | 365.8731 | 372.7785 |
| tf_mixnet_l | 128 | 8.7849 | 16.5201 | 364.5554 | 355.9007 |
| pnasnet5large | 16 | 7.7177 | 25.259 | 361.0583 | 357.5933 |
| mixnet_l | 128 | 8.1606 | 16.0154 | 357.2374 | 360.5247 |
| spnasnet_100 | 128 | 4.8912 | 9.0884 | 356.122 | 360.5331 |
| fbnetc_100 | 128 | 4.9076 | 9.2235 | 354.4488 | 358.268 |
| res2net101_26w_4s | 64 | 10.3673 | 24.4714 | 347.2845 | 352.7768 |
| deit_base_distilled_patch16_224 | 64 | 3.2818 | 6.96 | 333.189 | 330.9312 |
| vit_base_patch16_224 | 64 | 3.0129 | 6.9671 | 331.7911 | 330.6702 |
| mobilenetv2_100 | 128 | 4.1163 | 7.7334 | 327.4934 | 325.4134 |
| resnest101e | 64 | 10.6575 | 23.9364 | 322.625 | 328.1115 |
| beit_base_patch16_224 | 64 | 4.0731 | 9.155 | 318.5332 | 325.0581 |
| mnasnet_100 | 128 | 3.9122 | 7.4883 | 316.9831 | 323.0384 |
| gmixer_24_224 | 128 | 5.5918 | 12.7248 | 289.7795 | 296.1159 |
| poolformer_m36 | 64 | 7.4032 | 13.5467 | 278.4865 | 277.6924 |
| dpn107 | 32 | 9.3986 | 19.044 | 277.8568 | 280.9468 |
| res2next50 | 128 | 4.9122 | 11.8557 | 275.9085 | 277.7791 |
| cspdarknet53 | 64 | 5.6079 | 10.569 | 272.0032 | 273.5491 |
| selecsls42b | 128 | 2.4384 | 5.358 | 259.2803 | 260.0981 |
| regnety_002 | 128 | 4.7881 | 9.2158 | 258.7406 | 252.8447 |
| gmlp_s16_224 | 128 | 5.4311 | 11.8688 | 257.6034 | 258.6444 |
| resmlp_12_224 | 128 | 2.7509 | 5.3595 | 249.7601 | 252.793 |
| mixer_b16_224 | 128 | 2.6356 | 5.8652 | 249.4048 | 245.5265 |
| gluon_xception65 | 32 | 7.5463 | 16.5946 | 239.5066 | 234.587 |
| lcnet_050 | 128 | 2.4614 | 4.9731 | 229.6674 | 221.3622 |
| ese_vovnet19b_dw | 128 | 2.4879 | 4.4711 | 223.3347 | 223.677 |
| dm_nfnet_f0 | 128 | 5.8579 | 11.242 | 219.4157 | 222.9388 |
| gernet_l | 128 | 4.8269 | 8.7057 | 216.739 | 216.6382 |
| dla102 | 128 | 5.9674 | 13.9042 | 192.1844 | 191.9956 |
| nfnet_l0 | 128 | 5.2141 | 10.8368 | 189.6519 | 186.6258 |
| swsl_resnext101_32x16d | 32 | 5.9419 | 13.3517 | 187.444 | 191.9566 |
| repvgg_a2 | 128 | 4.6791 | 8.5646 | 176.7575 | 178.8257 |
| convmixer_768_32 | 32 | 1.6543 | 6.9368 | 101.2559 | 99.4474 |
+---------------------------------+-----+---------+-----------+-----------+------------------------+
~~~

Peak Memory Compression Ratio

~~~
+---------------------------------+-----+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------+------------------------+
| gmlp_s16_224 | 128 | 1.0015 | 0.9787 | 1.1839 | 1.1839 |
| pnasnet5large | 16 | 1.0593 | 0.9927 | 1.1539 | 1.1525 |
| gmixer_24_224 | 128 | 1.0014 | 0.9787 | 1.1127 | 1.1127 |
| convit_base | 64 | 1.0 | 0.8505 | 1.0948 | 1.0948 |
| mobilenetv2_100 | 128 | 0.9996 | 0.7725 | 1.0266 | 1.0266 |
| dm_nfnet_f0 | 128 | 0.9808 | 0.9006 | 1.0129 | 1.0129 |
| resmlp_12_224 | 128 | 0.9999 | 0.9667 | 1.0097 | 1.0097 |
| tinynet_a | 128 | 0.9998 | 0.7975 | 0.9985 | 0.9985 |
| resnest101e | 64 | 0.9998 | 1.0033 | 0.9933 | 0.9933 |
| tf_efficientnet_b0 | 128 | 0.9992 | 0.7813 | 0.9873 | 0.9873 |
| tnt_s_patch16_224 | 128 | 1.0 | 0.9781 | 0.9834 | 0.9834 |
| rexnet_100 | 128 | 1.0 | 0.7935 | 0.9745 | 0.9745 |
| twins_pcpvt_base | 64 | 0.9995 | 0.9273 | 0.9727 | 0.9727 |
| convmixer_768_32 | 32 | 1.0 | 0.9812 | 0.9657 | 0.9657 |
| dla102 | 128 | 0.9708 | 0.9218 | 0.9535 | 0.9535 |
| mixer_b16_224 | 128 | 1.0 | 0.9644 | 0.9438 | 0.9438 |
| tf_mixnet_l | 128 | 0.9995 | 0.8647 | 0.9345 | 0.9345 |
| beit_base_patch16_224 | 64 | 0.9999 | 0.9344 | 0.9306 | 0.9306 |
| mobilevit_s | 64 | 0.9998 | 0.7836 | 0.9262 | 0.9262 |
| visformer_small | 128 | 1.0005 | 0.9328 | 0.9245 | 0.9245 |
| fbnetv3_b | 128 | 0.9989 | 0.8019 | 0.9167 | 0.9167 |
| nfnet_l0 | 128 | 1.0005 | 0.8489 | 0.9101 | 0.9101 |
| cspdarknet53 | 64 | 0.9996 | 0.86 | 0.9098 | 0.9098 |
| vit_base_patch16_224 | 64 | 1.0001 | 0.936 | 0.9078 | 0.9078 |
| deit_base_distilled_patch16_224 | 64 | 0.9995 | 0.9358 | 0.9071 | 0.9071 |
| volo_d1_224 | 64 | 1.001 | 0.9514 | 0.9067 | 0.9067 |
| ese_vovnet19b_dw | 128 | 0.9986 | 0.9082 | 0.8975 | 0.8975 |
| sebotnet33ts_256 | 64 | 0.9957 | 0.7151 | 0.891 | 0.8908 |
| gluon_inception_v3 | 128 | 1.0 | 0.8752 | 0.8902 | 0.8902 |
| inception_v3 | 128 | 1.0 | 0.8752 | 0.8902 | 0.8902 |
| adv_inception_v3 | 128 | 1.0 | 0.8752 | 0.8902 | 0.8902 |
| hrnet_w18 | 128 | 0.9999 | 0.9269 | 0.8872 | 0.8872 |
| gluon_xception65 | 32 | 0.9998 | 0.8877 | 0.8832 | 0.8832 |
| spnasnet_100 | 128 | 0.9992 | 0.8982 | 0.8786 | 0.8786 |
| xcit_large_24_p8_224 | 5 | 0.9989 | 0.8874 | 0.8761 | 0.8761 |
| eca_botnext26ts_256 | 128 | 0.9995 | 0.7791 | 0.8738 | 0.8738 |
| mixnet_l | 128 | 0.9997 | 0.8539 | 0.8686 | 0.8686 |
| dpn107 | 32 | 0.9932 | 0.9066 | 0.8685 | 0.8685 |
| mnasnet_100 | 128 | 0.9992 | 0.8897 | 0.8683 | 0.8683 |
| cait_m36_384 | 4 | 0.9998 | 0.913 | 0.8637 | 0.8637 |
| poolformer_m36 | 64 | 1.0014 | 0.9514 | 0.8598 | 0.8598 |
| fbnetc_100 | 128 | 0.9989 | 0.8651 | 0.8596 | 0.8596 |
| pit_b_224 | 64 | 1.0005 | 0.8033 | 0.8566 | 0.8566 |
| res2net101_26w_4s | 64 | 1.0002 | 0.9186 | 0.8505 | 0.8505 |
| res2net50_14w_8s | 128 | 1.0002 | 0.9151 | 0.8497 | 0.8494 |
| gernet_l | 128 | 0.9989 | 0.8652 | 0.8495 | 0.8496 |
| swsl_resnext101_32x16d | 32 | 1.0002 | 0.8706 | 0.8477 | 0.8477 |
| selecsls42b | 128 | 1.0006 | 0.8947 | 0.8471 | 0.8472 |
| res2next50 | 128 | 1.0003 | 0.918 | 0.8452 | 0.8452 |
| ghostnet_100 | 128 | 0.9983 | 0.8894 | 0.8416 | 0.8416 |
| mobilenetv3_large_100 | 128 | 0.9993 | 0.8597 | 0.8413 | 0.8413 |
| coat_lite_mini | 128 | 1.0445 | 0.929 | 0.8401 | 0.8401 |
| convnext_base | 64 | 1.0052 | 0.9275 | 0.832 | 0.832 |
| botnet26t_256 | 128 | 0.9994 | 0.8791 | 0.824 | 0.824 |
| lcnet_050 | 128 | 0.9982 | 0.8057 | 0.8048 | 0.8048 |
| repvgg_a2 | 128 | 0.9997 | 0.7933 | 0.7738 | 0.7738 |
| regnety_002 | 128 | 0.9992 | 0.8629 | 0.76 | 0.76 |
| crossvit_9_240 | 128 | 0.999 | 0.8819 | 0.7525 | 0.7526 |
| swin_base_patch4_window7_224 | 64 | 1.001 | 0.9237 | 0.7214 | 0.7214 |
| jx_nest_base | 32 | 1.0006 | 0.8943 | 0.6693 | 0.6693 |
+---------------------------------+-----+--------+-----------+----------+------------------------+
~~~

Absolute latency (ms)

~~~
+---------------------------------+-----+----------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+----------+-----------+----------+------------------------+
| convmixer_768_32 | 32 | 301.0462 | 312.2161 | 302.1458 | 301.8995 |
| pnasnet5large | 16 | 198.8496 | 214.7347 | 218.3926 | 221.4154 |
| hrnet_w18 | 128 | 281.0378 | 439.5517 | 210.4836 | 210.9151 |
| tf_mixnet_l | 128 | 194.1237 | 229.2241 | 162.5183 | 162.555 |
| mixnet_l | 128 | 185.6321 | 220.698 | 156.7757 | 156.7845 |
| resnest101e | 64 | 165.7725 | 189.7388 | 120.6507 | 120.6243 |
| dla102 | 128 | 172.5634 | 210.823 | 116.9649 | 116.9901 |
| cait_m36_384 | 4 | 168.4862 | 181.6364 | 116.1641 | 115.9536 |
| poolformer_m36 | 64 | 146.9774 | 147.4164 | 113.6654 | 113.617 |
| swsl_resnext101_32x16d | 32 | 118.7176 | 141.2897 | 113.5715 | 113.5552 |
| adv_inception_v3 | 128 | 159.508 | 185.1819 | 105.6749 | 105.1826 |
| inception_v3 | 128 | 159.1964 | 183.9483 | 105.4387 | 105.2299 |
| res2net50_14w_8s | 128 | 140.8834 | 178.1801 | 105.2017 | 105.3541 |
| gluon_inception_v3 | 128 | 160.4461 | 184.7298 | 105.1827 | 105.2334 |
| convit_base | 64 | 163.2989 | 163.4412 | 100.5729 | 100.5841 |
| dpn107 | 32 | 114.1878 | 131.4156 | 98.7964 | 98.7699 |
| tnt_s_patch16_224 | 128 | 323.9905 | 324.3737 | 98.0196 | 97.8647 |
| res2next50 | 128 | 125.7301 | 152.3633 | 95.4602 | 95.5511 |
| gluon_xception65 | 32 | 99.7088 | 117.1326 | 92.8894 | 93.1743 |
| dm_nfnet_f0 | 128 | 128.8182 | 128.9241 | 89.5698 | 89.5179 |
| fbnetv3_b | 128 | 115.195 | 142.6675 | 87.9638 | 88.1022 |
| res2net101_26w_4s | 64 | 99.1765 | 125.6375 | 87.4515 | 87.4971 |
| mixer_b16_224 | 128 | 116.789 | 114.4068 | 86.0718 | 86.0705 |
| swin_base_patch4_window7_224 | 64 | 147.7465 | 153.5656 | 85.6142 | 85.4956 |
| convnext_base | 64 | 124.5935 | 124.2454 | 81.9299 | 81.6359 |
| gmlp_s16_224 | 128 | 137.7481 | 126.3972 | 79.6968 | 79.8003 |
| nfnet_l0 | 128 | 112.5273 | 136.7954 | 77.4258 | 77.4401 |
| cspdarknet53 | 64 | 94.8595 | 112.7862 | 77.0728 | 77.0337 |
| visformer_small | 128 | 91.3297 | 96.3689 | 76.4668 | 76.4226 |
| eca_botnext26ts_256 | 128 | 108.8211 | 147.228 | 76.2004 | 75.4198 |
| pit_b_224 | 64 | 118.8002 | 119.1475 | 73.7353 | 73.683 |
| botnet26t_256 | 128 | 101.8523 | 116.4763 | 71.6663 | 71.6238 |
| repvgg_a2 | 128 | 77.6633 | 96.2935 | 71.6151 | 70.1499 |
| gernet_l | 128 | 77.6879 | 91.8388 | 71.1741 | 71.2745 |
| beit_base_patch16_224 | 64 | 101.6785 | 104.701 | 70.0992 | 70.0879 |
| volo_d1_224 | 64 | 120.95 | 123.7946 | 69.8023 | 69.816 |
| vit_base_patch16_224 | 64 | 87.0411 | 87.2011 | 64.934 | 64.922 |
| jx_nest_base | 32 | 101.7517 | 101.6071 | 64.8841 | 64.9252 |
| deit_base_distilled_patch16_224 | 64 | 85.2421 | 85.144 | 64.4973 | 64.4607 |
| gmixer_24_224 | 128 | 117.897 | 132.1535 | 63.091 | 63.2013 |
| tf_efficientnet_b0 | 128 | 84.7414 | 119.5686 | 63.0366 | 63.0261 |
| rexnet_100 | 128 | 79.9465 | 108.5101 | 60.996 | 60.9486 |
| xcit_large_24_p8_224 | 5 | 124.4075 | 141.7627 | 60.3619 | 60.1882 |
| fbnetc_100 | 128 | 82.8684 | 106.4635 | 59.4945 | 59.4972 |
| tinynet_a | 128 | 73.5495 | 102.8014 | 59.0413 | 60.2127 |
| mobilevit_s | 64 | 84.9274 | 111.1643 | 56.2777 | 56.1736 |
| twins_pcpvt_base | 64 | 131.6094 | 140.7932 | 56.0408 | 56.1467 |
| coat_lite_mini | 128 | 113.1211 | 113.2933 | 54.7608 | 54.7565 |
| sebotnet33ts_256 | 64 | 80.684 | 102.4806 | 52.3276 | 53.1544 |
| spnasnet_100 | 128 | 70.3295 | 89.8098 | 51.0077 | 50.8003 |
| ghostnet_100 | 128 | 90.9596 | 117.569 | 50.5172 | 50.5421 |
| ese_vovnet19b_dw | 128 | 64.5898 | 74.2655 | 49.5358 | 49.6475 |
| mobilenetv2_100 | 128 | 65.4813 | 84.4073 | 46.111 | 46.2088 |
| crossvit_9_240 | 128 | 82.8826 | 104.4348 | 45.6622 | 45.9007 |
| mnasnet_100 | 128 | 64.1628 | 82.2534 | 44.3909 | 44.4532 |
| selecsls42b | 128 | 59.9773 | 73.8027 | 44.2711 | 44.2379 |
| mobilenetv3_large_100 | 128 | 61.3151 | 76.591 | 43.1009 | 43.1787 |
| resmlp_12_224 | 128 | 53.4639 | 59.7417 | 38.1176 | 38.0715 |
| regnety_002 | 128 | 41.3086 | 57.0772 | 27.4955 | 27.4312 |
| lcnet_050 | 128 | 31.6613 | 40.5824 | 18.7968 | 18.8276 |
+---------------------------------+-----+----------+-----------+----------+------------------------+
~~~
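The speedup and absolute latency tables for the timm_models suite look related: dividing a model's eager latency by a backend's latency lands close to the reported speedup. The sketch below checks that for three rows copied from the tables above. The helper name `speedup_from_latency` is hypothetical, and the formula is an assumption about how the two tables relate, not something the report states. The derived and reported values differ slightly, possibly because the two metrics come from separate measurements.

```python
def speedup_from_latency(eager_ms: float, backend_ms: float) -> float:
    """Assumed relation: speedup = eager latency / backend latency."""
    return eager_ms / backend_ms


# name: (eager ms, inductor ms, reported inductor speedup), copied from the
# timm_models "Absolute latency (ms)" and "Performance speedup" tables.
rows = {
    "tnt_s_patch16_224": (323.9905, 98.0196, 3.3016),
    "hrnet_w18": (281.0378, 210.4836, 1.3291),
    "convmixer_768_32": (301.0462, 302.1458, 0.996),
}

derived = {
    name: speedup_from_latency(eager_ms, inductor_ms)
    for name, (eager_ms, inductor_ms, _reported) in rows.items()
}

for name, (_eager_ms, _inductor_ms, reported) in rows.items():
    print(f"{name}: derived {derived[name]:.4f}x, reported {reported:.4f}x")
```

For these three rows the derived ratio stays within about 0.01 of the reported speedup, which is consistent with the assumed relation but does not prove it.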

Build Summary

### Run name ###
day_079_20_03_23_performance_amp_778

### Commit hashes ###
pytorch commit: 9423b863f800c6d20b9b3de4422558cbb338fb83
pytorch commit date: 2023-03-23 00:32:51+00:00
torchbench commit: d618fa8e06c13bbe441cc929c5d3bf498d0f369c
torchbench commit date: 2023-03-22 15:27:07-07:00

### TorchDynamo config flags ###

### Torch version ###
torch: 2.1.0a0+git9423b86

### Environment variables ###
TORCH_CUDA_ARCH_LIST = 8.0
CUDA_HOME = /usr/local/cuda-11.6
USE_LLVM = /usr/lib/llvm-10

### GPU details ###
CUDNN VERSION: 8401
Number CUDA Devices: 2
Device Name: NVIDIA A100-SXM4-40GB
Device Memory [GB]: 42.481549312

williamwen42 commented 1 year ago

Performance Dashboard for amp precision (inference, no max-autotune)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and the peak memory footprint reduction ratio.

Caveats
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) Experimental setup does not have an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 94%, 59/63 | 100%, 46/46 | 100%, 60/60 |
|       aot_eager        | 90%, 57/63 | 100%, 46/46 | 100%, 60/60 |
|        inductor        | 84%, 53/63 | 100%, 46/46 | 98%, 59/60  |
| inductor_no_cudagraphs | 86%, 54/63 | 100%, 46/46 | 98%, 59/60  |
+------------------------+------------+-------------+-------------+
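Each pass-rate cell pairs a rounded percentage with the raw counts. As a small illustration, the sketch below formats a cell from the number of passing models and the suite size. The function name `format_passrate` is hypothetical, and rounding to the nearest whole percent is an assumption that happens to reproduce the cells in the table above.

```python
def format_passrate(passed: int, total: int) -> str:
    """Format a pass-rate cell such as '84%, 53/63'.

    Assumption: the percentage is rounded to the nearest whole percent.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{passed / total:.0%}, {passed}/{total}"


# Counts taken from the torchbench column of the pass-rate table.
for compiler, passed in [("eager", 59), ("aot_eager", 57),
                         ("inductor", 53), ("inductor_no_cudagraphs", 54)]:
    print(f"{compiler}: {format_passrate(passed, 63)}")
```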

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.00x    |    1.00x    |
|       aot_eager        |   1.00x    |    1.00x    |    1.00x    |
|        inductor        |   1.49x    |    1.37x    |    1.33x    |
| inductor_no_cudagraphs |   1.39x    |    1.33x    |    1.32x    |
+------------------------+------------+-------------+-------------+
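For readers who want to recompute a geometric mean speedup from the per-model tables, here is a minimal sketch. The summary says failing models are removed before aggregating. Treating a speedup of 0.0 or NaN as "failed" is an assumption made here for illustration, and `geomean_speedup` is a hypothetical helper, not part of the benchmark scripts.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of per-model speedups.

    Assumption: entries that are 0.0 or NaN mark models that failed and are
    dropped before averaging.
    """
    valid = [s for s in speedups if not math.isnan(s) and s > 0]
    if not valid:
        return float("nan")
    # exp(mean(log(x))) avoids overflow from multiplying many ratios together.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# A small subset of inductor speedups from the torchbench table in this
# report, so the result will not match the suite-wide figure.
sample = {"drq": 3.5568, "hf_Albert": 2.0078, "resnet50": 1.5348, "tacotron2": 0.0}
print(f"{geomean_speedup(sample.values()):.2f}x")
```

The geometric mean is the usual choice for averaging ratios because a 2x speedup and a 0.5x slowdown cancel out to 1.0x, which an arithmetic mean would not give.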

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   16.76    |    3.72     |    2.65     |
|       aot_eager        |   26.52    |    6.18     |    5.35     |
|        inductor        |   16.49    |    21.29    |    21.32    |
| inductor_no_cudagraphs |   15.67    |    19.13    |    20.99    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.04x    |    1.01x    |    1.17x    |
|       aot_eager        |   1.01x    |    1.01x    |    1.18x    |
|        inductor        |   0.92x    |    1.16x    |    1.09x    |
| inductor_no_cudagraphs |   0.99x    |    1.25x    |    1.16x    |
+------------------------+------------+-------------+-------------+

Warnings

torchbench suite with amp precision

Performance speedup

~~~
+-----------------------------------+------+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------+------------------------+
| drq | 1 | 0.9944 | 0.9762 | 3.5568 | 1.5196 |
| soft_actor_critic | 256 | 0.9433 | 0.9402 | 2.7425 | 1.3304 |
| shufflenet_v2_x1_0 | 128 | 0.9897 | 1.0318 | 2.4279 | 2.3955 |
| lennard_jones | 1000 | 0.8317 | 0.8248 | 2.1716 | 0.8541 |
| hf_Albert | 16 | 0.9984 | 0.9968 | 2.0078 | 1.9641 |
| dlrm | 1 | 0.9848 | 1.0698 | 1.9589 | 1.1354 |
| phlippe_densenet | 128 | 0.9604 | 0.7585 | 1.9379 | 1.6092 |
| hf_Reformer | 8 | 0.983 | 0.9829 | 1.8413 | 2.03 |
| hf_BigBird | 4 | 0.9836 | 0.9587 | 1.7951 | 1.555 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9795 | 1.0056 | 1.7835 | 1.6679 |
| timm_nfnet | 128 | 0.9628 | 0.9636 | 1.7602 | 1.7364 |
| hf_T5_large | 1 | 0.7306 | 0.6522 | 1.7409 | 1.1671 |
| hf_GPT2 | 16 | 0.9871 | 0.9864 | 1.738 | 1.7322 |
| timm_resnest | 256 | 0.9966 | 0.9959 | 1.7028 | 1.6947 |
| hf_GPT2_large | 1 | 0.9657 | 0.9247 | 1.6772 | 1.6616 |
| squeezenet1_1 | 256 | 0.9941 | 0.9902 | 1.6761 | 1.6757 |
| densenet121 | 64 | 0.9829 | 0.9633 | 1.6597 | 1.6069 |
| speech_transformer | 1 | 0.9528 | 0.8302 | 1.6203 | 1.5788 |
| hf_T5 | 4 | 0.9576 | 0.941 | 1.5934 | 1.695 |
| resnet50 | 64 | 0.9901 | 0.979 | 1.5348 | 1.5185 |
| mobilenet_v2 | 128 | 0.9894 | 0.9765 | 1.5342 | 1.5344 |
| Background_Matting | 1 | 0.9959 | 0.6317 | 1.5107 | 1.4961 |
| resnext50_32x4d | 64 | 0.9929 | 0.9841 | 1.5065 | 1.4864 |
| mnasnet1_0 | 128 | 0.9847 | 0.9722 | 1.496 | 1.4936 |
| resnet152 | 64 | 0.9893 | 0.9748 | 1.4779 | 1.4556 |
| hf_T5_base | 1 | 0.9066 | 0.876 | 1.445 | 1.3874 |
| pytorch_unet | 4 | 0.9979 | 0.6971 | 1.4357 | 1.4338 |
| mobilenet_v3_large | 128 | 0.9793 | 0.9678 | 1.4198 | 1.4143 |
| fastNLP_Bert | 16 | 0.9824 | 0.9785 | 1.3965 | 1.392 |
| hf_Bert_large | 4 | 1.0037 | 0.8763 | 1.3776 | 1.3774 |
| doctr_det_predictor | 4 | 0.9943 | 0.748 | 1.3662 | 1.3625 |
| Super_SloMo | 8 | 0.9978 | 0.7862 | 1.3618 | 1.3627 |
| hf_DistilBert | 16 | 0.9743 | 0.9703 | 1.3531 | 1.3411 |
| resnet18 | 256 | 0.9945 | 0.9918 | 1.3134 | 1.3143 |
| hf_Longformer | 4 | 0.9983 | 0.4373 | 1.3074 | 1.3465 |
| timm_regnet | 32 | 0.9156 | 0.9025 | 1.3059 | 1.2223 |
| timm_efficientnet | 128 | 0.9487 | 0.9415 | 1.3038 | 1.2898 |
| vgg16 | 8 | 0.9908 | 0.9809 | 1.3015 | 1.2668 |
| LearningToPaint | 256 | 0.9908 | 1.0023 | 1.2919 | 1.3194 |
| hf_Bart | 8 | 0.9257 | 0.8594 | 1.2873 | 1.2325 |
| BERT_pytorch | 32 | 0.9499 | 0.9402 | 1.2855 | 1.2555 |
| yolov3 | 8 | 0.9829 | 0.9118 | 1.2847 | 1.2705 |
| hf_Bert | 8 | 0.9095 | 0.9045 | 1.2752 | 1.2506 |
| phlippe_resnet | 256 | 0.9632 | 0.7686 | 1.27 | 1.2245 |
| timm_vovnet | 128 | 0.9398 | 0.9371 | 1.2699 | 1.2619 |
| alexnet | 1024 | 0.9986 | 0.9983 | 1.2516 | 1.2808 |
| functorch_dp_cifar10 | 512 | 0.9744 | 0.9747 | 1.2247 | 1.1604 |
| pytorch_stargan | 16 | 0.9891 | 0.8871 | 1.2033 | 1.2068 |
| doctr_reco_predictor | 64 | 0.9933 | 0.9797 | 1.2022 | 1.1924 |
| timm_vision_transformer | 128 | 0.9862 | 0.9859 | 1.1951 | 1.1848 |
| dcgan | 1024 | 0.994 | 1.034 | 1.1875 | 1.1965 |
| demucs | 32 | 0.9994 | 0.9993 | 1.1593 | 1.2062 |
| attention_is_all_you_need_pytorch | 256 | 0.969 | 0.9596 | 1.1067 | 1.0889 |
| timm_vision_transformer_large | 32 | 0.994 | 0.9935 | 1.0854 | 1.0777 |
| vision_maskrcnn | 4 | 0.8912 | 0.852 | 1.0774 | 1.084 |
| tts_angular | 512 | 0.9891 | 0.9884 | 0.989 | 0.9905 |
| nvidia_deeprecommender | 512 | 0.9943 | 0.9936 | 0.8654 | 0.9916 |
| tacotron2 | 128 | 0.992 | 0.9919 | 0.0 | 0.0 |
| llama | 1024 | 0.9819 | 0.6014 | 0.0 | 0.0 |
| moco | 64 | 0.988 | 0.0 | 0.0 | 0.0 |
| detectron2_fcos_r_50_fpn | 4 | 0.802 | 0.0 | 0.0 | 0.0 |
| DALLE2_pytorch | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| torchrec_dlrm | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------+------------------------+
~~~

Accuracy

~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------------+
| hf_GPT2_large | 4 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 4 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 4 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| soft_actor_critic | 256 | pass | pass | pass | pass |
| nvidia_deeprecommender | 4 | pass | pass | pass | pass |
| phlippe_resnet | 4 | pass | pass | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | pass |
| pytorch_unet | 2 | pass | pass | pass | pass |
| resnet152 | 4 | pass | pass | pass | pass |
| resnet18 | 4 | pass | pass | pass | pass |
| resnet50 | 4 | pass | pass | pass | pass |
| resnext50_32x4d | 4 | pass | pass | pass | pass |
| shufflenet_v2_x1_0 | 4 | pass | pass | pass | pass |
| speech_transformer | 4 | pass | pass | pass | pass |
| mobilenet_v2 | 4 | pass | pass | pass | pass |
| squeezenet1_1 | 4 | pass | pass | pass | pass |
| timm_efficientnet | 4 | pass | pass | pass | pass |
| timm_nfnet | 4 | pass | pass | pass | pass |
| timm_regnet | 4 | pass | pass | pass | pass |
| timm_resnest | 4 | pass | pass | pass | pass |
| timm_vision_transformer | 4 | pass | pass | pass | pass |
| timm_vovnet | 4 | pass | pass | pass | pass |
| tts_angular | 4 | pass | pass | pass | pass |
| vgg16 | 4 | pass | pass | pass | pass |
| yolov3 | 4 | pass | pass | pass | pass |
| mobilenet_v3_large | 4 | pass | pass | pass | pass |
| phlippe_densenet | 4 | pass | pass | pass | pass |
| mnasnet1_0 | 4 | pass | pass | pass | pass |
| drq | 1 | pass | pass | pass | pass |
| BERT_pytorch | 4 | pass | pass | pass | pass |
| Background_Matting | 1 | pass | pass | pass | pass |
| LearningToPaint | 4 | pass | pass | pass | pass |
| Super_SloMo | 4 | pass | pass | pass | pass |
| alexnet | 4 | pass | pass | pass | pass |
| attention_is_all_you_need_pytorch | 4 | pass | pass | pass | pass |
| dcgan | 4 | pass | pass | pass | pass |
| lennard_jones | 4 | pass | pass | pass | pass |
| dlrm | 4 | pass | pass | pass | pass |
| doctr_det_predictor | 4 | pass | pass | pass | pass |
| doctr_reco_predictor | 4 | pass | pass | pass | pass |
| densenet121 | 4 | pass | pass | pass | pass |
| fastNLP_Bert | 4 | pass | pass | pass | pass |
| hf_DistilBert | 4 | pass | pass | pass | pass |
| functorch_dp_cifar10 | 4 | pass | pass | pass | pass |
| hf_Reformer | 4 | pass | pass | pass | pass |
| hf_Longformer | 4 | pass | pass | pass | pass |
| hf_GPT2 | 2 | pass | pass | pass | pass |
| hf_T5 | 4 | pass | pass | pass | pass |
| hf_Bert_large | 4 | pass | pass | pass | pass |
| hf_Bert | 4 | pass | pass | pass | pass |
| hf_Bart | 4 | pass | pass | pass | pass |
| hf_Albert | 4 | pass | pass | pass | pass |
| llama | 4 | pass | pass | fail_to_run | fail_to_run |
| detectron2_fcos_r_50_fpn | 4 | pass | fail_to_run | fail_to_run | fail_to_run |
| moco | 4 | pass | fail_to_run | fail_to_run | fail_to_run |
| DALLE2_pytorch | 4 | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| hf_BigBird | 4 | pass | pass | 0.0000 | pass |
| tacotron2 | 4 | pass | pass | 0.0000 | 0.0000 |
| vision_maskrcnn | 4 | pass | pass | 0.0000 | 0.0000 |
| demucs | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| torchrec_dlrm | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
+-----------------------------------+-----+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)

~~~
+-----------------------------------+------+----------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +-----------------------------------+------+----------+-----------+----------+------------------------+ | vision_maskrcnn | 4 | 12.6624 | 24.6886 | 143.7892 | 112.1183 | | hf_Longformer | 4 | 7.4484 | 18.8382 | 74.5018 | 60.8011 | | hf_T5_large | 1 | 14.9758 | 23.4288 | 67.1839 | 43.3799 | | hf_BigBird | 4 | 9.7464 | 19.8755 | 60.6145 | 49.509 | | hf_T5_base | 1 | 7.7194 | 12.5269 | 38.6659 | 26.6758 | | hf_Reformer | 8 | 2.8316 | 4.4848 | 32.6493 | 28.2142 | | hf_GPT2_large | 1 | 8.4584 | 14.2657 | 28.1275 | 27.3305 | | hf_T5 | 4 | 4.4212 | 7.0701 | 27.1365 | 20.0244 | | speech_transformer | 1 | 2.1767 | 4.7553 | 25.9839 | 25.7781 | | densenet121 | 64 | 2.7661 | 6.5193 | 25.7481 | 24.4043 | | timm_vision_transformer_large | 32 | 4.1539 | 9.1503 | 25.2112 | 24.7234 | | hf_Bart | 8 | 3.8656 | 6.5933 | 24.9794 | 22.3447 | | yolov3 | 8 | 2.0772 | 4.6528 | 23.9742 | 23.3447 | | timm_nfnet | 128 | 3.2002 | 5.4945 | 22.976 | 22.9633 | | attention_is_all_you_need_pytorch | 256 | 1.7707 | 3.7662 | 22.3304 | 21.2 | | resnet152 | 64 | 2.9275 | 7.644 | 20.461 | 19.791 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.5017 | 1.1717 | 19.304 | 18.6227 | | hf_Bert_large | 4 | 5.166 | 9.2175 | 17.8246 | 17.7783 | | Super_SloMo | 8 | 1.5558 | 4.06 | 17.2359 | 17.1916 | | functorch_dp_cifar10 | 512 | 0.3612 | 0.6794 | 16.7978 | 16.0351 | | pytorch_stargan | 16 | 0.5123 | 1.4316 | 16.5372 | 16.2262 | | timm_regnet | 32 | 3.2648 | 5.3385 | 16.3202 | 15.7472 | | fastNLP_Bert | 16 | 2.0923 | 3.9621 | 16.2619 | 14.0235 | | hf_Albert | 16 | 1.9754 | 3.9679 | 15.9788 | 14.7029 | | timm_efficientnet | 128 | 2.0938 | 3.9804 | 15.7054 | 15.4181 | | hf_GPT2 | 16 | 2.4689 | 4.3735 | 15.2184 | 15.058 | | mobilenet_v3_large | 128 | 1.1834 | 2.8617 | 14.8561 | 15.1658 | | phlippe_densenet | 128 | 1.1248 | 2.7419 | 14.5965 | 14.4419 | 
| doctr_reco_predictor | 64 | 0.5259 | 1.0603 | 13.5329 | 12.2191 | | timm_vision_transformer | 128 | 1.3039 | 2.9015 | 13.4639 | 13.8721 | | BERT_pytorch | 32 | 2.0095 | 3.7978 | 13.1224 | 13.3125 | | shufflenet_v2_x1_0 | 128 | 1.1867 | 2.9829 | 13.0808 | 12.9476 | | doctr_det_predictor | 4 | 1.5603 | 4.1548 | 12.8966 | 12.8151 | | demucs | 32 | 0.3578 | 0.567 | 12.8307 | 10.9186 | | timm_resnest | 256 | 0.7442 | 1.5706 | 12.6194 | 12.5018 | | mobilenet_v2 | 128 | 1.0405 | 2.6644 | 12.2616 | 10.9306 | | hf_DistilBert | 16 | 1.0569 | 2.1336 | 12.0845 | 11.7387 | | resnext50_32x4d | 64 | 1.074 | 2.6712 | 11.9345 | 10.7818 | | resnet50 | 64 | 1.0672 | 2.6556 | 11.8492 | 10.8521 | | mnasnet1_0 | 128 | 0.9902 | 2.5045 | 11.743 | 11.3677 | | timm_vovnet | 128 | 1.872 | 2.9855 | 11.6538 | 11.3552 | | hf_Bert | 8 | 2.4012 | 4.4845 | 11.4191 | 10.95 | | Background_Matting | 1 | 1.0652 | 2.933 | 10.1229 | 10.3911 | | phlippe_resnet | 256 | 0.4847 | 1.1107 | 9.0674 | 8.7675 | | resnet18 | 256 | 0.5115 | 1.128 | 9.0495 | 7.9961 | | LearningToPaint | 256 | 0.5141 | 1.1709 | 8.166 | 6.7555 | | pytorch_unet | 4 | 0.6357 | 1.6981 | 7.7293 | 7.6636 | | squeezenet1_1 | 256 | 0.3256 | 0.536 | 6.4693 | 5.4472 | | drq | 1 | 0.3346 | 0.4368 | 5.7531 | 5.4296 | | alexnet | 1024 | 0.1951 | 0.3048 | 5.7208 | 4.5737 | | soft_actor_critic | 256 | 0.2428 | 0.3001 | 5.1839 | 3.9874 | | vgg16 | 8 | 0.2344 | 0.401 | 5.092 | 4.9237 | | dcgan | 1024 | 0.1918 | 0.3153 | 4.7729 | 4.4336 | | dlrm | 1 | 0.2857 | 0.4284 | 4.5918 | 4.1552 | | nvidia_deeprecommender | 512 | 0.2301 | 0.3088 | 4.5063 | 4.2944 | | tts_angular | 512 | 0.1595 | 0.189 | 4.0973 | 3.9736 | | lennard_jones | 1000 | 0.1663 | 0.2289 | 4.0944 | 3.7544 | | tacotron2 | 128 | 827.0071 | 1255.5524 | nan | nan | | llama | 1024 | 1.3762 | 2.8133 | nan | nan | | moco | 64 | 24.2323 | nan | nan | nan | | detectron2_fcos_r_50_fpn | 4 | 8.2336 | nan | nan | nan | | DALLE2_pytorch | 0 | nan | nan | nan | nan | | torchrec_dlrm | 0 | nan | nan 
| nan | nan | +-----------------------------------+------+----------+-----------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------+------------------------+ | pytorch_unet | 4 | 1.5928 | 1.3424 | 1.5789 | 1.5928 | | timm_vovnet | 128 | 1.2915 | 1.2915 | 1.4547 | 1.5115 | | mobilenet_v2 | 128 | 1.0717 | 1.0717 | 1.4418 | 1.5814 | | squeezenet1_1 | 256 | 1.0 | 1.0 | 1.3548 | 1.5039 | | mobilenet_v3_large | 128 | 1.0989 | 1.0989 | 1.2775 | 1.1405 | | yolov3 | 8 | 1.2988 | 1.2422 | 1.276 | 1.2988 | | timm_nfnet | 128 | 1.1078 | 1.5535 | 1.2336 | 1.2775 | | attention_is_all_you_need_pytorch | 256 | 1.0311 | 1.0311 | 1.2219 | 1.2333 | | Background_Matting | 1 | 1.311 | 0.6879 | 1.2134 | 1.2299 | | timm_efficientnet | 128 | 1.2128 | 1.2128 | 1.1388 | 1.2128 | | doctr_det_predictor | 4 | 1.0601 | 0.7397 | 1.0501 | 1.0501 | | hf_Bart | 8 | 0.909 | 0.9092 | 1.0336 | 1.1927 | | hf_Albert | 16 | 1.0229 | 1.0229 | 1.0072 | 1.0229 | | hf_DistilBert | 16 | 1.0156 | 1.0156 | 1.0049 | 1.0156 | | dlrm | 1 | 1.0 | 1.0 | 0.9979 | 1.0 | | nvidia_deeprecommender | 512 | 1.001 | 1.001 | 0.9977 | 1.142 | | hf_GPT2_large | 1 | 1.0 | 1.0 | 0.9973 | 0.9996 | | hf_GPT2 | 16 | 1.0 | 1.0 | 0.9972 | 1.0 | | hf_Bert | 8 | 1.0087 | 1.0087 | 0.9969 | 1.0087 | | hf_Bert_large | 4 | 1.0033 | 1.0033 | 0.9966 | 1.0033 | | resnext50_32x4d | 64 | 1.0093 | 1.0477 | 0.9963 | 1.0093 | | vgg16 | 8 | 1.0 | 1.0 | 0.9827 | 1.0 | | timm_regnet | 32 | 1.0 | 1.0 | 0.9753 | 1.0 | | functorch_dp_cifar10 | 512 | 1.0 | 1.0 | 0.9681 | 1.0 | | timm_vision_transformer_large | 32 | 1.0155 | 1.0155 | 0.9629 | 0.9667 | | resnet152 | 64 | 1.0389 | 1.0388 | 0.9594 | 1.0 | | tts_angular | 512 | 1.001 | 1.001 | 0.9583 | 1.001 | | fastNLP_Bert | 16 | 1.0617 | 
1.0616 | 0.9518 | 0.9574 | | resnet50 | 64 | 1.0 | 1.0 | 0.9488 | 1.0494 | | mnasnet1_0 | 128 | 1.124 | 1.0471 | 0.947 | 1.0471 | | phlippe_densenet | 128 | 1.0594 | 1.0594 | 0.9418 | 0.9727 | | dcgan | 1024 | 1.0 | 1.0 | 0.9404 | 1.0 | | resnet18 | 256 | 1.0 | 1.0 | 0.9296 | 1.0 | | timm_resnest | 256 | 1.0 | 1.0 | 0.9074 | 0.9474 | | alexnet | 1024 | 1.0 | 1.0 | 0.8928 | 1.0662 | | LearningToPaint | 256 | 1.0 | 1.0 | 0.8634 | 1.0 | | doctr_reco_predictor | 64 | 1.0 | 1.0 | 0.8433 | 0.852 | | vision_maskrcnn | 4 | 1.3759 | 1.3753 | 0.8204 | 1.3758 | | BERT_pytorch | 32 | 1.0264 | 1.0264 | 0.8033 | 0.809 | | drq | 1 | 0.9613 | 0.9613 | 0.7762 | 0.9613 | | hf_T5_large | 1 | 0.9541 | 0.9528 | 0.7346 | 0.9584 | | shufflenet_v2_x1_0 | 128 | 1.0511 | 0.9994 | 0.7328 | 0.8127 | | pytorch_stargan | 16 | 1.0494 | 1.0492 | 0.7292 | 0.7292 | | demucs | 32 | 0.8934 | 0.8934 | 0.7055 | 0.8934 | | soft_actor_critic | 256 | 1.0 | 1.0 | 0.7024 | 1.0 | | timm_vision_transformer | 128 | 1.1031 | 1.1031 | 0.6935 | 0.7525 | | speech_transformer | 1 | 1.0551 | 1.0551 | 0.6699 | 0.6723 | | Super_SloMo | 8 | 1.0852 | 0.7407 | 0.6691 | 0.6691 | | densenet121 | 64 | 1.196 | 1.1984 | 0.6671 | 0.6076 | | lennard_jones | 1000 | 1.0 | 1.0 | 0.5327 | 1.0 | | pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.9982 | 0.3981 | 0.3989 | | hf_T5 | 4 | 0.6729 | 0.7461 | 0.359 | 0.8834 | | phlippe_resnet | 256 | 1.4892 | 1.4892 | 0.3583 | 0.3611 | | hf_BigBird | 4 | 0.8571 | 0.8571 | 0.3278 | 0.8572 | | hf_T5_base | 1 | 0.7645 | 0.8122 | 0.3239 | 0.8896 | | hf_Reformer | 8 | 0.7321 | 0.7393 | 0.2509 | 0.7401 | | hf_Longformer | 4 | 0.3904 | 0.3904 | 0.2374 | 0.3905 | | tacotron2 | 128 | 0.9209 | 0.9209 | nan | nan | | llama | 1024 | 1.0 | 0.6756 | nan | nan | | moco | 64 | 1.0354 | nan | nan | nan | | detectron2_fcos_r_50_fpn | 4 | 0.8694 | nan | nan | nan | | DALLE2_pytorch | 0 | nan | nan | nan | nan | | torchrec_dlrm | 0 | nan | nan | nan | nan | 
+-----------------------------------+------+--------+-----------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------+------+----------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +-----------------------------------+------+----------+-----------+----------+------------------------+ | vision_maskrcnn | 4 | 193.6408 | 200.8595 | 156.7076 | 154.8324 | | timm_vision_transformer_large | 32 | 142.0358 | 142.1905 | 130.5995 | 131.2456 | | hf_Longformer | 4 | 141.3068 | 322.8847 | 107.9811 | 104.8337 | | demucs | 32 | 94.0761 | 94.1388 | 81.1125 | 77.9673 | | hf_T5_base | 1 | 110.9219 | 115.4631 | 70.6839 | 73.2236 | | hf_BigBird | 4 | 123.8805 | 127.3245 | 68.1053 | 78.4072 | | hf_T5 | 4 | 103.3766 | 105.4867 | 62.5763 | 58.6273 | | hf_GPT2 | 16 | 94.483 | 94.5901 | 53.683 | 53.8919 | | pytorch_unet | 4 | 68.2269 | 97.6348 | 47.4514 | 47.4754 | | Super_SloMo | 8 | 63.9705 | 81.1255 | 46.8564 | 46.875 | | hf_T5_large | 1 | 103.1815 | 112.413 | 43.8136 | 65.332 | | fastNLP_Bert | 16 | 53.979 | 54.4642 | 38.1553 | 38.1644 | | doctr_det_predictor | 4 | 48.9311 | 64.2433 | 35.9617 | 36.6363 | | timm_nfnet | 128 | 43.5001 | 43.458 | 23.6251 | 24.0825 | | timm_resnest | 256 | 39.6985 | 39.7262 | 23.2228 | 23.3216 | | hf_GPT2_large | 1 | 37.3139 | 39.7391 | 22.0567 | 22.4471 | | resnet152 | 64 | 31.8951 | 32.3969 | 21.3476 | 21.6854 | | attention_is_all_you_need_pytorch | 256 | 24.2616 | 24.4727 | 21.2205 | 21.5668 | | timm_vovnet | 128 | 24.9054 | 24.947 | 18.4175 | 18.5532 | | timm_vision_transformer | 128 | 21.4662 | 21.4878 | 17.7461 | 17.9044 | | alexnet | 1024 | 22.1587 | 22.1406 | 17.6596 | 17.2577 | | hf_Bart | 8 | 22.396 | 24.1673 | 16.3471 | 18.0951 | | Background_Matting | 1 | 22.5735 | 35.5875 | 14.897 | 15.0132 | | hf_Bert_large | 4 | 20.3634 | 23.1206 | 14.8437 | 14.9634 | | hf_Reformer | 8 | 27.3388 | 27.3432 | 14.587 | 13.2207 | | 
hf_Albert | 16 | 28.8533 | 28.8725 | 14.3412 | 14.6687 | | timm_regnet | 32 | 19.5412 | 19.9795 | 13.6135 | 14.5742 | | resnet18 | 256 | 16.4811 | 16.5351 | 12.4661 | 12.4688 | | timm_efficientnet | 128 | 17.0848 | 17.2579 | 12.4632 | 12.5746 | | densenet121 | 64 | 20.3402 | 20.5007 | 11.9025 | 12.2509 | | resnext50_32x4d | 64 | 17.9332 | 18.0899 | 11.8072 | 11.973 | | yolov3 | 8 | 15.035 | 16.2467 | 11.5051 | 11.6233 | | BERT_pytorch | 32 | 15.348 | 15.5259 | 11.2865 | 11.5267 | | speech_transformer | 1 | 19.0673 | 21.7882 | 11.0612 | 13.7705 | | hf_DistilBert | 16 | 13.965 | 14.0145 | 10.0689 | 10.1606 | | hf_Bert | 8 | 13.866 | 13.9758 | 9.8862 | 10.1199 | | tts_angular | 512 | 9.1869 | 9.1719 | 9.2492 | 9.231 | | resnet50 | 64 | 14.0973 | 14.2829 | 9.0935 | 9.1947 | | squeezenet1_1 | 256 | 14.1416 | 14.207 | 8.3985 | 8.3959 | | mobilenet_v2 | 128 | 12.0409 | 12.2002 | 7.7608 | 7.7607 | | mnasnet1_0 | 128 | 11.5663 | 11.7183 | 7.6011 | 7.6078 | | LearningToPaint | 256 | 9.2729 | 9.1613 | 7.1069 | 6.9662 | | mobilenet_v3_large | 128 | 10.2227 | 10.3381 | 7.0488 | 7.1027 | | doctr_reco_predictor | 64 | 6.9806 | 7.069 | 5.8021 | 5.838 | | pytorch_stargan | 16 | 7.0147 | 7.8547 | 5.7597 | 5.7583 | | nvidia_deeprecommender | 512 | 3.8833 | 3.8903 | 4.4712 | 3.896 | | dcgan | 1024 | 4.6848 | 4.5082 | 3.9245 | 3.8996 | | shufflenet_v2_x1_0 | 128 | 8.2596 | 8.5497 | 3.6303 | 3.6414 | | phlippe_densenet | 128 | 6.5669 | 8.271 | 3.2364 | 3.8701 | | functorch_dp_cifar10 | 512 | 3.5803 | 3.567 | 2.8423 | 2.99 | | vgg16 | 8 | 3.3561 | 3.3948 | 2.558 | 2.6189 | | pytorch_CycleGAN_and_pix2pix | 1 | 4.5094 | 4.4086 | 2.4604 | 2.6617 | | phlippe_resnet | 256 | 2.1931 | 2.7663 | 1.6765 | 1.7453 | | dlrm | 1 | 0.8356 | 0.7833 | 0.428 | 0.9191 | | drq | 1 | 0.7457 | 0.7562 | 0.2118 | 0.5984 | | soft_actor_critic | 256 | 0.3534 | 0.3602 | 0.1372 | 0.2567 | | lennard_jones | 1000 | 0.2871 | 0.2963 | 0.117 | 0.2923 | | tacotron2 | 128 | 780.5868 | 770.4857 | nan | nan | | llama | 1024 
| 10.3325 | 16.8184 | nan | nan | | detectron2_fcos_r_50_fpn | 4 | 82.3764 | nan | nan | nan | | moco | 64 | 47.5212 | nan | nan | nan | | DALLE2_pytorch | 0 | nan | nan | nan | nan | | torchrec_dlrm | 0 | nan | nan | nan | nan | +-----------------------------------+------+----------+-----------+----------+------------------------+ ~~~

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
| MT5ForConditionalGeneration | 16 | 0.9882 | 0.8915 | 2.7012 | 1.9115 |
| XLNetLMHeadModel | 8 | 0.9935 | 0.9935 | 2.2589 | 2.2467 |
| XGLMForCausalLM | 8 | 0.8538 | 0.7931 | 1.886 | 1.2059 |
| GPT2ForSequenceClassification | 4 | 0.9619 | 0.962 | 1.7996 | 1.7778 |
| OPTForCausalLM | 2 | 0.9987 | 1.0155 | 1.7991 | 1.8335 |
| MobileBertForMaskedLM | 64 | 0.8518 | 0.7485 | 1.7358 | 1.2738 |
| T5Small | 4 | 0.9573 | 0.9497 | 1.7247 | 1.6953 |
| T5ForConditionalGeneration | 4 | 0.9577 | 0.9484 | 1.7221 | 1.6901 |
| GoogleFnet | 16 | 0.9977 | 0.9978 | 1.6889 | 1.9025 |
| DistillGPT2 | 16 | 0.9847 | 0.984 | 1.6874 | 1.6799 |
| ElectraForCausalLM | 32 | 0.9633 | 0.9618 | 1.5904 | 1.5705 |
| PLBartForCausalLM | 8 | 0.9932 | 0.9921 | 1.5248 | 1.5877 |
| ElectraForQuestionAnswering | 64 | 0.968 | 0.9674 | 1.4399 | 1.4238 |
| M2M100ForConditionalGeneration | 16 | 0.9398 | 0.8831 | 1.4347 | 1.3186 |
| Speech2Text2ForCausalLM | 256 | 0.9815 | 0.9867 | 1.402 | 1.4371 |
| BartForCausalLM | 4 | 0.9967 | 0.9931 | 1.3549 | 1.3676 |
| YituTechConvBert | 16 | 0.9741 | 0.9846 | 1.3498 | 1.3435 |
| RobertaForCausalLM | 16 | 0.9712 | 0.9719 | 1.3493 | 1.3389 |
| MBartForCausalLM | 4 | 0.9953 | 0.9922 | 1.3382 | 1.3483 |
| LayoutLMForSequenceClassification | 16 | 0.9661 | 0.965 | 1.3362 | 1.3247 |
| LayoutLMForMaskedLM | 16 | 0.9708 | 0.9707 | 1.3228 | 1.314 |
| BlenderbotSmallForCausalLM | 64 | 0.99 | 0.9867 | 1.3132 | 1.3359 |
| RobertaForQuestionAnswering | 16 | 0.9654 | 0.9643 | 1.3122 | 1.3024 |
| BertForQuestionAnswering | 16 | 0.9648 | 0.9643 | 1.3091 | 1.3005 |
| AlbertForMaskedLM | 4 | 0.9958 | 0.9978 | 1.3078 | 1.3081 |
| CamemBert | 16 | 0.9719 | 0.9711 | 1.3065 | 1.2949 |
| AlbertForQuestionAnswering | 4 | 0.9954 | 0.9959 | 1.3051 | 1.3023 |
| BertForMaskedLM | 16 | 0.9703 | 0.9712 | 1.2971 | 1.2926 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9684 | 0.9764 | 1.2914 | 1.1867 |
| PLBartForConditionalGeneration | 4 | 0.9841 | 0.9782 | 1.2877 | 1.285 |
| DebertaForMaskedLM | 4 | 0.6941 | 0.6132 | 1.2769 | 1.0259 |
| DistilBertForMaskedLM | 128 | 0.9903 | 0.99 | 1.2263 | 1.2214 |
| MegatronBertForQuestionAnswering | 8 | 0.9545 | 0.9544 | 1.1994 | 1.1898 |
| TrOCRForCausalLM | 32 | 0.9971 | 0.9953 | 1.1984 | 1.2153 |
| DistilBertForQuestionAnswering | 256 | 0.9913 | 0.9919 | 1.1975 | 1.1942 |
| MegatronBertForCausalLM | 4 | 0.9272 | 0.9274 | 1.1712 | 1.155 |
| MobileBertForQuestionAnswering | 128 | 0.8314 | 0.7729 | 1.1533 | 1.1302 |
| PegasusForCausalLM | 32 | 0.9908 | 0.9885 | 1.1487 | 1.1549 |
| DebertaV2ForMaskedLM | 1 | 0.5762 | 0.512 | 1.1412 | 0.7069 |
| BartForConditionalGeneration | 2 | 0.9774 | 0.9697 | 1.1389 | 1.1217 |
| AllenaiLongformerBase | 4 | 0.8311 | 0.3603 | 1.1281 | 1.1602 |
| MBartForConditionalGeneration | 2 | 0.9621 | 0.9693 | 1.1183 | 1.1078 |
| PegasusForConditionalGeneration | 32 | 0.9729 | 0.9797 | 1.0869 | 1.0789 |
| BlenderbotForCausalLM | 4 | 0.8219 | 0.7984 | 1.0512 | 1.044 |
| DebertaForQuestionAnswering | 8 | 0.9692 | 0.8471 | 1.047 | 1.3318 |
| DebertaV2ForQuestionAnswering | 2 | 0.586 | 0.5991 | 0.921 | 0.8083 |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
~~~
Accuracy
~~~
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
| BlenderbotForCausalLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| DebertaV2ForMaskedLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | pass |
| GoogleFnet | 1 | pass | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass |
| CamemBert | 1 | pass | pass | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | pass |
| DebertaV2ForQuestionAnswering | 1 | pass | pass | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | pass |
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
~~~
Compilation latency (sec)
~~~
+-----------------------------------------+-----+---------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+---------+-----------+----------+------------------------+
| AllenaiLongformerBase | 4 | 7.3464 | 17.8156 | 74.4527 | 62.3997 |
| DebertaV2ForQuestionAnswering | 2 | 7.7715 | 10.8712 | 39.7915 | 26.2237 |
| DebertaV2ForMaskedLM | 1 | 7.566 | 11.012 | 39.4228 | 25.9096 |
| MobileBertForQuestionAnswering | 128 | 15.4822 | 21.443 | 37.5372 | 37.4897 |
| MobileBertForMaskedLM | 64 | 15.4361 | 21.5753 | 36.2554 | 35.6657 |
| M2M100ForConditionalGeneration | 16 | 4.4089 | 8.4552 | 29.306 | 26.7293 |
| XLNetLMHeadModel | 8 | 5.0533 | 10.4946 | 28.1601 | 28.809 |
| MT5ForConditionalGeneration | 16 | 5.0652 | 7.6617 | 27.8499 | 27.7992 |
| DebertaForMaskedLM | 4 | 4.0626 | 6.2606 | 26.9032 | 19.8899 |
| DebertaForQuestionAnswering | 8 | 4.1454 | 6.1749 | 26.4084 | 19.572 |
| PegasusForConditionalGeneration | 32 | 4.2528 | 8.2133 | 26.3983 | 25.8311 |
| XGLMForCausalLM | 8 | 3.5673 | 6.8418 | 26.1937 | 22.839 |
| BlenderbotForCausalLM | 4 | 3.5992 | 6.631 | 23.6705 | 20.5129 |
| YituTechConvBert | 16 | 3.3766 | 5.7813 | 22.9145 | 22.2111 |
| BartForConditionalGeneration | 2 | 4.4986 | 8.5543 | 22.3036 | 21.9549 |
| ElectraForCausalLM | 32 | 2.4641 | 4.0428 | 22.2271 | 19.2678 |
| PLBartForConditionalGeneration | 4 | 3.6159 | 5.6838 | 21.7781 | 20.3014 |
| MBartForConditionalGeneration | 2 | 4.4836 | 8.6447 | 21.5754 | 21.3252 |
| MBartForCausalLM | 4 | 2.0317 | 3.6361 | 20.1181 | 16.1675 |
| TrOCRForCausalLM | 32 | 2.2027 | 3.6454 | 19.0339 | 17.7479 |
| BlenderbotSmallForCausalLM | 64 | 1.7049 | 2.7038 | 18.3077 | 15.3715 |
| T5ForConditionalGeneration | 4 | 3.3898 | 5.3293 | 18.2346 | 18.1498 |
| T5Small | 4 | 3.3883 | 5.3677 | 18.1819 | 17.9587 |
| GoogleFnet | 16 | 1.5578 | 2.3435 | 18.161 | 13.1706 |
| BartForCausalLM | 4 | 2.1693 | 3.6504 | 17.8162 | 16.0648 |
| BlenderbotSmallForConditionalGeneration | 64 | 3.0134 | 5.6103 | 17.7408 | 17.1638 |
| PegasusForCausalLM | 32 | 1.9809 | 3.5391 | 17.4076 | 15.9636 |
| MegatronBertForCausalLM | 4 | 4.8882 | 7.8416 | 17.1618 | 17.0175 |
| OPTForCausalLM | 2 | 2.159 | 3.6535 | 17.0997 | 14.9461 |
| MegatronBertForQuestionAnswering | 8 | 4.841 | 7.7894 | 17.0335 | 17.1013 |
| PLBartForCausalLM | 8 | 1.3005 | 2.1753 | 15.9801 | 14.8611 |
| Speech2Text2ForCausalLM | 256 | 1.1794 | 2.036 | 15.9672 | 14.5855 |
| GPT2ForSequenceClassification | 4 | 2.5227 | 4.0595 | 15.1507 | 14.5196 |
| LayoutLMForMaskedLM | 16 | 2.6751 | 4.258 | 15.1189 | 13.5883 |
| LayoutLMForSequenceClassification | 16 | 2.6604 | 4.2231 | 15.1042 | 13.4161 |
| DistilBertForQuestionAnswering | 256 | 0.9645 | 1.852 | 14.313 | 14.3004 |
| AlbertForMaskedLM | 4 | 1.823 | 3.4349 | 14.204 | 13.4766 |
| ElectraForQuestionAnswering | 64 | 2.4232 | 3.9472 | 13.947 | 12.373 |
| RobertaForCausalLM | 16 | 2.4405 | 3.935 | 12.6494 | 12.4924 |
| AlbertForQuestionAnswering | 4 | 1.818 | 3.4268 | 12.5125 | 12.191 |
| DistillGPT2 | 16 | 1.257 | 2.2129 | 11.9388 | 11.7975 |
| BertForQuestionAnswering | 16 | 2.4166 | 3.9506 | 11.4339 | 10.1714 |
| RobertaForQuestionAnswering | 16 | 2.435 | 3.9326 | 10.593 | 10.3068 |
| BertForMaskedLM | 16 | 2.4361 | 3.9584 | 10.3961 | 10.1533 |
| DistilBertForMaskedLM | 128 | 0.9903 | 1.8112 | 10.2963 | 10.0923 |
| CamemBert | 16 | 2.4651 | 3.9064 | 10.2845 | 10.0996 |
+-----------------------------------------+-----+---------+-----------+----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
| ElectraForCausalLM | 32 | 0.9946 | 0.9946 | 2.4838 | 2.554 |
| DistillGPT2 | 16 | 1.0041 | 1.0041 | 2.0018 | 2.0075 |
| RobertaForCausalLM | 16 | 1.0065 | 1.0065 | 1.8161 | 1.8234 |
| DistilBertForMaskedLM | 128 | 1.0111 | 1.0111 | 1.7631 | 1.7691 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0042 | 1.0042 | 1.6885 | 1.6947 |
| CamemBert | 16 | 1.0084 | 1.0084 | 1.5248 | 1.5315 |
| BertForMaskedLM | 16 | 1.0087 | 1.0087 | 1.5151 | 1.522 |
| LayoutLMForMaskedLM | 16 | 1.0086 | 1.0086 | 1.5076 | 1.5143 |
| MT5ForConditionalGeneration | 16 | 1.0 | 1.0 | 1.4483 | 1.451 |
| BlenderbotSmallForCausalLM | 64 | 0.9344 | 0.9344 | 1.4426 | 1.5037 |
| T5ForConditionalGeneration | 4 | 1.0096 | 1.0096 | 1.415 | 1.4248 |
| T5Small | 4 | 1.0096 | 1.0096 | 1.415 | 1.4248 |
| YituTechConvBert | 16 | 0.9748 | 0.9748 | 1.4091 | 1.4628 |
| PLBartForCausalLM | 8 | 0.9305 | 0.9305 | 1.3022 | 1.4272 |
| OPTForCausalLM | 2 | 0.9236 | 0.9236 | 1.2556 | 1.5083 |
| Speech2Text2ForCausalLM | 256 | 0.8748 | 0.8748 | 1.2159 | 1.3714 |
| PegasusForConditionalGeneration | 32 | 0.9933 | 0.9933 | 1.1627 | 1.1905 |
| MegatronBertForCausalLM | 4 | 1.0025 | 1.0025 | 1.1586 | 1.1619 |
| PLBartForConditionalGeneration | 4 | 0.9045 | 0.9045 | 1.1461 | 1.189 |
| MBartForConditionalGeneration | 2 | 1.0021 | 1.0021 | 1.1091 | 1.1117 |
| TrOCRForCausalLM | 32 | 0.8803 | 0.8803 | 1.1054 | 1.1507 |
| XGLMForCausalLM | 8 | 0.9702 | 0.9702 | 1.0954 | 1.1396 |
| M2M100ForConditionalGeneration | 16 | 0.9362 | 0.9362 | 1.0902 | 1.1055 |
| BartForConditionalGeneration | 2 | 1.0021 | 1.0021 | 1.0599 | 1.0623 |
| PegasusForCausalLM | 32 | 0.907 | 0.907 | 1.0553 | 1.1086 |
| BartForCausalLM | 4 | 0.9074 | 0.9074 | 1.0188 | 1.1093 |
| MBartForCausalLM | 4 | 0.9074 | 0.9074 | 1.0146 | 1.1093 |
| MobileBertForQuestionAnswering | 128 | 1.9097 | 1.9097 | 1.0066 | 1.021 |
| XLNetLMHeadModel | 8 | 1.0 | 1.0 | 1.0 | 1.0 |
| AlbertForQuestionAnswering | 4 | 1.0896 | 1.0896 | 0.9832 | 0.9866 |
| AlbertForMaskedLM | 4 | 1.0894 | 1.0894 | 0.9828 | 0.9862 |
| MegatronBertForQuestionAnswering | 8 | 1.0339 | 1.0339 | 0.9809 | 0.9836 |
| BlenderbotForCausalLM | 4 | 0.9883 | 0.9883 | 0.9796 | 0.9889 |
| GPT2ForSequenceClassification | 4 | 1.0149 | 1.0149 | 0.9653 | 0.9695 |
| LayoutLMForSequenceClassification | 16 | 1.0924 | 1.0924 | 0.9616 | 0.9669 |
| BertForQuestionAnswering | 16 | 1.0943 | 1.0943 | 0.9606 | 0.966 |
| RobertaForQuestionAnswering | 16 | 1.0943 | 1.0943 | 0.9606 | 0.966 |
| ElectraForQuestionAnswering | 64 | 1.2329 | 1.2329 | 0.9205 | 0.9289 |
| MobileBertForMaskedLM | 64 | 1.0073 | 1.0073 | 0.9034 | 0.9065 |
| DistilBertForQuestionAnswering | 256 | 1.1397 | 1.1397 | 0.8871 | 0.8914 |
| DebertaV2ForMaskedLM | 1 | 0.9996 | 0.9996 | 0.793 | 1.0348 |
| GoogleFnet | 16 | 0.9911 | 0.9911 | 0.7194 | 1.5527 |
| DebertaV2ForQuestionAnswering | 2 | 1.0008 | 0.994 | 0.6539 | 1.0008 |
| DebertaForMaskedLM | 4 | 0.9569 | 0.9569 | 0.4906 | 1.212 |
| AllenaiLongformerBase | 4 | 0.6108 | 0.6108 | 0.4883 | 0.7383 |
| DebertaForQuestionAnswering | 8 | 0.93 | 0.8765 | 0.2841 | 0.93 |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
~~~
Absolute latency (ms)
~~~
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
| AlbertForMaskedLM | 4 | 126.2542 | 126.5573 | 97.5407 | 97.4501 |
| AlbertForQuestionAnswering | 4 | 125.5669 | 125.6202 | 96.9402 | 97.0922 |
| XLNetLMHeadModel | 8 | 158.2154 | 158.1649 | 69.6399 | 70.2706 |
| PegasusForConditionalGeneration | 32 | 56.2907 | 55.6185 | 50.3291 | 50.5872 |
| TrOCRForCausalLM | 32 | 57.022 | 57.141 | 47.4576 | 46.9352 |
| AllenaiLongformerBase | 4 | 60.7885 | 140.3403 | 44.852 | 43.4621 |
| MegatronBertForQuestionAnswering | 8 | 55.7552 | 55.734 | 44.35 | 44.7266 |
| MBartForConditionalGeneration | 2 | 50.4745 | 49.271 | 43.3659 | 43.8122 |
| BartForConditionalGeneration | 2 | 48.7139 | 49.0767 | 42.2649 | 43.1124 |
| YituTechConvBert | 16 | 57.1477 | 56.4361 | 41.1798 | 41.394 |
| MobileBertForQuestionAnswering | 128 | 55.161 | 50.43 | 39.5618 | 40.2196 |
| DistilBertForQuestionAnswering | 256 | 46.7692 | 46.7596 | 38.9442 | 39.0187 |
| BlenderbotSmallForConditionalGeneration | 64 | 42.0612 | 41.3286 | 34.098 | 34.3559 |
| CamemBert | 16 | 45.3732 | 45.3945 | 33.7687 | 34.0647 |
| LayoutLMForMaskedLM | 16 | 45.6835 | 45.6522 | 33.5465 | 33.7145 |
| BertForMaskedLM | 16 | 44.8864 | 44.8019 | 33.5365 | 33.6575 |
| RobertaForCausalLM | 16 | 46.5962 | 46.4922 | 33.4839 | 33.7512 |
| DistilBertForMaskedLM | 128 | 41.04 | 41.0428 | 33.1664 | 33.2767 |
| MBartForCausalLM | 4 | 43.867 | 44.4155 | 33.1073 | 32.6457 |
| BartForCausalLM | 4 | 44.2067 | 44.404 | 32.6872 | 32.2174 |
| OPTForCausalLM | 2 | 62.434 | 61.0617 | 32.3361 | 31.721 |
| MobileBertForMaskedLM | 64 | 63.7915 | 58.382 | 31.1003 | 42.51 |
| DebertaV2ForQuestionAnswering | 2 | 48.5257 | 46.8956 | 30.5598 | 34.7174 |
| M2M100ForConditionalGeneration | 16 | 46.0436 | 38.9137 | 29.7022 | 32.725 |
| PLBartForConditionalGeneration | 4 | 38.3815 | 37.3951 | 29.1801 | 29.0664 |
| PLBartForCausalLM | 8 | 43.8298 | 46.9022 | 27.9291 | 27.5476 |
| MegatronBertForCausalLM | 4 | 34.2912 | 34.246 | 27.1647 | 27.5346 |
| LayoutLMForSequenceClassification | 16 | 37.3113 | 37.3188 | 26.9969 | 27.1816 |
| RobertaForQuestionAnswering | 16 | 36.7772 | 36.7082 | 26.9932 | 27.1737 |
| BertForQuestionAnswering | 16 | 36.572 | 36.5435 | 26.974 | 27.084 |
| ElectraForQuestionAnswering | 64 | 39.8295 | 39.7537 | 26.8014 | 27.0277 |
| DistillGPT2 | 16 | 41.1372 | 41.1132 | 23.955 | 24.0677 |
| PegasusForCausalLM | 32 | 27.5298 | 27.6807 | 23.8989 | 23.676 |
| BlenderbotForCausalLM | 4 | 32.5449 | 31.4046 | 23.3216 | 25.7966 |
| DebertaV2ForMaskedLM | 1 | 44.9702 | 50.1849 | 22.5054 | 37.1475 |
| DebertaForQuestionAnswering | 8 | 24.0186 | 27.4533 | 22.296 | 17.4813 |
| GoogleFnet | 16 | 37.0892 | 37.0956 | 21.9284 | 19.4609 |
| ElectraForCausalLM | 32 | 34.3517 | 34.2947 | 20.8451 | 21.0074 |
| T5ForConditionalGeneration | 4 | 36.2782 | 36.2645 | 19.9811 | 20.3573 |
| T5Small | 4 | 36.2273 | 36.237 | 19.9487 | 20.314 |
| GPT2ForSequenceClassification | 4 | 34.3034 | 34.1441 | 18.3303 | 18.4762 |
| BlenderbotSmallForCausalLM | 64 | 22.9798 | 23.4065 | 17.3623 | 17.3722 |
| Speech2Text2ForCausalLM | 256 | 24.6211 | 23.4238 | 16.5433 | 16.1413 |
| XGLMForCausalLM | 8 | 41.2122 | 34.9513 | 16.3257 | 23.1257 |
| MT5ForConditionalGeneration | 16 | 40.5845 | 37.7231 | 14.9378 | 20.9932 |
| DebertaForMaskedLM | 4 | 27.2988 | 31.156 | 14.8458 | 18.2129 |
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
~~~
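
The summary tables at the top of this thread report a geometric mean speedup per suite. As a rough sketch of how a per-model table like the ones above could be reduced to such a figure, the Python below parses a pipe-delimited table and takes the geometric mean of one column. The helper names (`parse_table`, `numeric_column`, `geomean`) are hypothetical, and skipping `nan` or non-numeric cells is an assumption; this is not necessarily how the dashboard script treats failed models. The sample rows are the first three rows of the huggingface speedup table.

```python
import math


def parse_table(text):
    """Parse an ASCII table ('|' cells, '+---+' borders) into a list of dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, fences and headings
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first '|' row is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def numeric_column(rows, name):
    """Return the positive, finite floats in a column (assumption: skip the rest)."""
    values = []
    for row in rows:
        try:
            value = float(row[name])
        except ValueError:
            continue  # e.g. 'pass' or 'fail_to_run'
        if math.isfinite(value) and value > 0:
            values.append(value)
    return values


def geomean(values):
    """Geometric mean computed in log space."""
    if not values:
        raise ValueError("no values to average")
    return math.exp(sum(math.log(v) for v in values) / len(values))


SAMPLE = """
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+------+----+-------+-----------+----------+------------------------+
| MT5ForConditionalGeneration | 16 | 0.9882 | 0.8915 | 2.7012 | 1.9115 |
| XLNetLMHeadModel | 8 | 0.9935 | 0.9935 | 2.2589 | 2.2467 |
| XGLMForCausalLM | 8 | 0.8538 | 0.7931 | 1.886 | 1.2059 |
"""

if __name__ == "__main__":
    rows = parse_table(SAMPLE)
    speedups = numeric_column(rows, "inductor")
    print(f"{len(speedups)} models, geomean speedup {geomean(speedups):.2f}x")
```

Working in log space keeps one very large or very small ratio from dominating the average, which is the usual reason for reporting speedups as a geometric rather than arithmetic mean.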

timm_models suite with amp precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------+------------------------+
| gmlp_s16_224 | 128 | 0.9862 | 1.1076 | 1.8839 | 1.8791 |
| dm_nfnet_f0 | 128 | 0.9858 | 0.9857 | 1.8184 | 1.7779 |
| sebotnet33ts_256 | 64 | 0.9664 | 0.9608 | 1.6653 | 1.6589 |
| nfnet_l0 | 128 | 0.9879 | 0.9881 | 1.639 | 1.6231 |
| poolformer_m36 | 64 | 0.9701 | 0.9717 | 1.5818 | 1.5613 |
| cait_m36_384 | 4 | 0.9889 | 0.9983 | 1.5802 | 1.5522 |
| xcit_large_24_p8_224 | 5 | 0.981 | 0.9948 | 1.5605 | 1.5203 |
| resnest101e | 64 | 0.9865 | 0.9699 | 1.5588 | 1.507 |
| volo_d1_224 | 64 | 0.9869 | 0.9797 | 1.5525 | 1.5345 |
| eca_botnext26ts_256 | 128 | 0.9769 | 0.9731 | 1.4959 | 1.4849 |
| botnet26t_256 | 128 | 0.9795 | 0.9777 | 1.488 | 1.4893 |
| coat_lite_mini | 128 | 0.9923 | 0.993 | 1.4759 | 1.4632 |
| dla102 | 128 | 0.9901 | 0.9885 | 1.453 | 1.45 |
| gmixer_24_224 | 128 | 0.9883 | 1.0315 | 1.4524 | 1.4471 |
| res2net50_14w_8s | 128 | 0.9978 | 0.9854 | 1.4394 | 1.4235 |
| tnt_s_patch16_224 | 128 | 0.9963 | 0.9963 | 1.4195 | 1.4088 |
| res2net101_26w_4s | 64 | 0.9959 | 0.9767 | 1.4145 | 1.3916 |
| hrnet_w18 | 128 | 0.9806 | 0.9272 | 1.411 | 1.3319 |
| res2next50 | 128 | 0.9986 | 0.991 | 1.3912 | 1.3735 |
| jx_nest_base | 32 | 0.9683 | 0.9622 | 1.3839 | 1.3622 |
| convit_base | 64 | 0.9949 | 0.9946 | 1.3723 | 1.3634 |
| repvgg_a2 | 128 | 0.9602 | 0.9537 | 1.3686 | 1.3759 |
| inception_v3 | 128 | 0.99 | 0.9768 | 1.3636 | 1.3638 |
| gluon_inception_v3 | 128 | 0.99 | 0.9768 | 1.3627 | 1.3636 |
| adv_inception_v3 | 128 | 0.99 | 0.9766 | 1.3624 | 1.364 |
| mobilenetv2_100 | 128 | 0.9626 | 0.949 | 1.3539 | 1.3557 |
| ghostnet_100 | 128 | 0.962 | 0.8149 | 1.3447 | 1.345 |
| convnext_base | 64 | 0.9781 | 0.9775 | 1.3407 | 1.3343 |
| tf_efficientnet_b0 | 128 | 0.9702 | 0.9637 | 1.3282 | 1.3199 |
| rexnet_100 | 128 | 0.9517 | 0.9432 | 1.3161 | 1.3125 |
| gernet_l | 128 | 0.9573 | 0.9498 | 1.3121 | 1.3165 |
| tinynet_a | 128 | 0.9588 | 0.9425 | 1.309 | 1.2996 |
| ese_vovnet19b_dw | 128 | 0.9633 | 0.9601 | 1.2965 | 1.2982 |
| mobilenetv3_large_100 | 128 | 0.9532 | 0.9372 | 1.2962 | 1.2963 |
| tf_mixnet_l | 128 | 0.9745 | 0.9721 | 1.2883 | 1.2792 |
| spnasnet_100 | 128 | 0.9593 | 0.9422 | 1.2823 | 1.2839 |
| cspdarknet53 | 64 | 0.9422 | 0.9313 | 1.2817 | 1.2766 |
| mnasnet_100 | 128 | 0.9638 | 0.9514 | 1.2774 | 1.2799 |
| resmlp_12_224 | 128 | 0.982 | 0.9774 | 1.2698 | 1.272 |
| dpn107 | 32 | 0.959 | 0.9508 | 1.2669 | 1.2508 |
| swsl_resnext101_32x16d | 32 | 0.9954 | 0.9838 | 1.2661 | 1.2378 |
| fbnetc_100 | 128 | 0.9656 | 0.9522 | 1.265 | 1.267 |
| selecsls42b | 128 | 0.9965 | 0.9924 | 1.2605 | 1.2597 |
| mixnet_l | 128 | 0.9765 | 0.9747 | 1.2598 | 1.2488 |
| convmixer_768_32 | 32 | 0.9968 | 0.9961 | 1.2449 | 1.2444 |
| crossvit_9_240 | 128 | 0.973 | 0.5755 | 1.2436 | 1.2291 |
| fbnetv3_b | 128 | 0.9651 | 0.9507 | 1.2414 | 1.236 |
| regnety_002 | 128 | 0.8913 | 0.8702 | 1.2373 | 1.2011 |
| pnasnet5large | 16 | 0.9743 | 0.9625 | 1.2296 | 1.2237 |
| mobilevit_s | 64 | 0.955 | 0.951 | 1.2271 | 1.2189 |
| gluon_xception65 | 32 | 0.9838 | 0.9758 | 1.1692 | 1.1764 |
| twins_pcpvt_base | 64 | 0.9743 | 0.974 | 1.1605 | 1.1378 |
| lcnet_050 | 128 | 0.9075 | 0.8804 | 1.1556 | 1.1762 |
| pit_b_224 | 64 | 0.9888 | 0.9884 | 1.1544 | 1.145 |
| swin_base_patch4_window7_224 | 64 | 0.9827 | 0.9826 | 1.1388 | 1.1315 |
| mixer_b16_224 | 128 | 0.9948 | 1.0318 | 1.1372 | 1.1394 |
| beit_base_patch16_224 | 64 | 0.9926 | 0.9923 | 1.1076 | 1.1025 |
| deit_base_distilled_patch16_224 | 64 | 0.9933 | 0.9928 | 1.0912 | 1.0904 |
| vit_base_patch16_224 | 64 | 0.993 | 0.9925 | 1.088 | 1.0849 |
| visformer_small | 128 | 0.9915 | 0.9875 | 1.0797 | 1.0692 |
+---------------------------------+-----+--------+-----------+----------+------------------------+
~~~
Accuracy
~~~
+---------------------------------+----+-------+-----------+---------------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------+-----------+---------------+------------------------+ | adv_inception_v3 | 8 | pass | pass | pass | pass | | beit_base_patch16_224 | 8 | pass | pass | pass | pass | | mobilevit_s | 8 | pass | pass | pass | pass | | nfnet_l0 | 8 | pass | pass | pass | pass | | pit_b_224 | 8 | pass | pass | pass | pass | | pnasnet5large | 8 | pass | pass | pass | pass | | poolformer_m36 | 8 | pass | pass | pass | pass | | regnety_002 | 8 | pass | pass | pass | pass | | repvgg_a2 | 8 | pass | pass | pass | pass | | res2net101_26w_4s | 8 | pass | pass | pass | pass | | res2net50_14w_8s | 8 | pass | pass | pass | pass | | res2next50 | 8 | pass | pass | pass | pass | | resmlp_12_224 | 8 | pass | pass | pass | pass | | resnest101e | 8 | pass | pass | pass | pass | | rexnet_100 | 8 | pass | pass | pass | pass | | sebotnet33ts_256 | 8 | pass | pass | pass | pass | | selecsls42b | 8 | pass | pass | pass | pass | | spnasnet_100 | 8 | pass | pass | pass | pass | | swin_base_patch4_window7_224 | 8 | pass | pass | pass | pass | | swsl_resnext101_32x16d | 8 | pass | pass | pass | pass | | tf_efficientnet_b0 | 8 | pass | pass | pass | pass | | tf_mixnet_l | 8 | pass | pass | pass | pass | | tinynet_a | 8 | pass | pass | pass | pass | | tnt_s_patch16_224 | 8 | pass | pass | pass | pass | | twins_pcpvt_base | 8 | pass | pass | pass | pass | | visformer_small | 8 | pass | pass | pass | pass | | vit_base_patch16_224 | 8 | pass | pass | pass | pass | | volo_d1_224 | 8 | pass | pass | pass | pass | | xcit_large_24_p8_224 | 8 | pass | pass | pass | pass | | mobilenetv3_large_100 | 8 | pass | pass | pass | pass | | mobilenetv2_100 | 8 | pass | pass | pass | pass | | mnasnet_100 | 8 | pass | pass | pass | pass | | ese_vovnet19b_dw | 8 | pass | pass | pass | pass | | botnet26t_256 | 8 | pass 
| pass | pass | pass | | coat_lite_mini | 8 | pass | pass | pass | pass | | convit_base | 8 | pass | pass | pass | pass | | convmixer_768_32 | 8 | pass | pass | pass | pass | | convnext_base | 8 | pass | pass | pass | pass | | crossvit_9_240 | 8 | pass | pass | pass | pass | | cspdarknet53 | 8 | pass | pass | pass | pass | | deit_base_distilled_patch16_224 | 8 | pass | pass | pass | pass | | dla102 | 8 | pass | pass | pass | pass | | dm_nfnet_f0 | 8 | pass | pass | pass | pass | | dpn107 | 8 | pass | pass | pass | pass | | eca_botnext26ts_256 | 8 | pass | pass | pass | pass | | fbnetc_100 | 8 | pass | pass | pass | pass | | mixnet_l | 8 | pass | pass | pass | pass | | fbnetv3_b | 8 | pass | pass | pass | pass | | gernet_l | 8 | pass | pass | pass | pass | | ghostnet_100 | 8 | pass | pass | pass | pass | | gluon_inception_v3 | 8 | pass | pass | pass | pass | | gluon_xception65 | 8 | pass | pass | pass | pass | | gmixer_24_224 | 8 | pass | pass | pass | pass | | gmlp_s16_224 | 8 | pass | pass | pass | pass | | hrnet_w18 | 8 | pass | pass | pass | pass | | inception_v3 | 8 | pass | pass | pass | pass | | jx_nest_base | 8 | pass | pass | pass | pass | | lcnet_050 | 8 | pass | pass | pass | pass | | mixer_b16_224 | 8 | pass | pass | pass | pass | | cait_m36_384 | 4 | pass | pass | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+-----------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------+------------------------+ | mobilevit_s | 64 | 2.5109 | 4.5861 | 63.6771 | 62.5295 | | twins_pcpvt_base | 64 | 3.3927 | 7.1876 | 62.207 | 62.4913 | | coat_lite_mini | 128 | 1.4326 | 2.8992 | 47.3055 | 46.9281 | | poolformer_m36 | 64 | 3.1267 | 4.7846 | 39.7162 | 38.66 | | hrnet_w18 | 128 | 
8.4835 | 18.9589 | 37.9987 | 36.7699 | | jx_nest_base | 32 | 2.6134 | 4.9714 | 37.6429 | 36.5513 | | swin_base_patch4_window7_224 | 64 | 3.8164 | 7.3584 | 37.6034 | 37.5829 | | pnasnet5large | 16 | 7.5197 | 14.7443 | 34.6454 | 33.7544 | | cait_m36_384 | 4 | 4.2816 | 9.6582 | 34.3131 | 33.9495 | | resnest101e | 64 | 4.4083 | 9.7529 | 30.3535 | 30.0594 | | xcit_large_24_p8_224 | 5 | 4.0315 | 9.0001 | 29.0763 | 28.5659 | | crossvit_9_240 | 128 | 2.106 | 4.9942 | 27.8044 | 27.6619 | | res2net101_26w_4s | 64 | 3.9006 | 10.5464 | 24.7029 | 24.4074 | | nfnet_l0 | 128 | 2.6694 | 4.555 | 24.5339 | 23.5969 | | tnt_s_patch16_224 | 128 | 2.4935 | 5.5044 | 23.5368 | 23.3995 | | res2net50_14w_8s | 128 | 3.3069 | 10.0388 | 23.5204 | 23.4001 | | dpn107 | 32 | 5.3803 | 9.4092 | 22.7362 | 22.8578 | | sebotnet33ts_256 | 64 | 2.3773 | 4.1817 | 21.8676 | 21.9724 | | dm_nfnet_f0 | 128 | 3.3494 | 5.2263 | 21.4196 | 21.7254 | | botnet26t_256 | 128 | 1.6571 | 3.0716 | 21.3122 | 21.2952 | | fbnetv3_b | 128 | 4.0129 | 7.5427 | 21.2972 | 20.9957 | | tf_mixnet_l | 128 | 4.9118 | 7.8595 | 20.8744 | 21.6546 | | volo_d1_224 | 64 | 1.9156 | 4.3133 | 20.8301 | 21.6174 | | rexnet_100 | 128 | 2.7202 | 4.9431 | 19.9971 | 18.8192 | | mixnet_l | 128 | 4.4366 | 7.4657 | 19.7177 | 19.5087 | | gmlp_s16_224 | 128 | 1.8286 | 3.8192 | 19.6017 | 18.215 | | ghostnet_100 | 128 | 2.1601 | 5.095 | 18.6402 | 18.5284 | | convnext_base | 64 | 2.6738 | 4.2466 | 18.557 | 17.5518 | | eca_botnext26ts_256 | 128 | 1.7939 | 3.3672 | 17.878 | 17.7734 | | gluon_xception65 | 32 | 2.7732 | 6.728 | 17.6377 | 17.4105 | | gmixer_24_224 | 128 | 2.0412 | 4.0211 | 17.4138 | 16.3532 | | tinynet_a | 128 | 2.6767 | 4.8185 | 17.001 | 17.0008 | | dla102 | 128 | 2.5362 | 5.9113 | 16.8211 | 16.7233 | | convit_base | 64 | 1.4174 | 3.2202 | 16.6146 | 16.3608 | | adv_inception_v3 | 128 | 2.4349 | 5.3608 | 16.4565 | 16.311 | | cspdarknet53 | 64 | 3.2084 | 5.455 | 16.4465 | 16.2417 | | gluon_inception_v3 | 128 | 2.4051 | 5.3807 | 16.4334 | 
15.8552 | | res2next50 | 128 | 1.8606 | 5.3242 | 16.0582 | 15.8784 | | tf_efficientnet_b0 | 128 | 2.4044 | 4.3198 | 16.0181 | 15.09 | | swsl_resnext101_32x16d | 32 | 2.1616 | 5.6607 | 15.8774 | 15.358 | | resmlp_12_224 | 128 | 0.9143 | 1.6517 | 15.7966 | 15.438 | | beit_base_patch16_224 | 64 | 1.4667 | 3.1358 | 15.4823 | 15.3814 | | mobilenetv3_large_100 | 128 | 1.9087 | 3.713 | 15.4547 | 14.4825 | | inception_v3 | 128 | 2.4 | 5.3349 | 15.445 | 15.3646 | | regnety_002 | 128 | 2.3567 | 3.9035 | 14.5992 | 13.608 | | pit_b_224 | 64 | 1.3228 | 2.9014 | 14.5198 | 14.6241 | | lcnet_050 | 128 | 1.157 | 2.2195 | 14.5036 | 14.1976 | | deit_base_distilled_patch16_224 | 64 | 1.1333 | 2.4567 | 13.7162 | 12.8836 | | spnasnet_100 | 128 | 2.4494 | 4.3582 | 13.4498 | 13.2197 | | fbnetc_100 | 128 | 2.513 | 4.4267 | 13.3719 | 13.1345 | | mixer_b16_224 | 128 | 0.9072 | 1.7908 | 13.333 | 12.1382 | | repvgg_a2 | 128 | 2.6836 | 4.4134 | 12.983 | 12.6781 | | gernet_l | 128 | 2.7131 | 4.4845 | 12.823 | 12.2412 | | mobilenetv2_100 | 128 | 1.9222 | 3.7363 | 12.2066 | 11.1786 | | mnasnet_100 | 128 | 1.8568 | 3.5354 | 12.0597 | 11.0723 | | vit_base_patch16_224 | 64 | 1.1511 | 2.3926 | 11.8892 | 13.1103 | | visformer_small | 128 | 1.2108 | 2.5719 | 11.7276 | 12.5315 | | selecsls42b | 128 | 0.8573 | 2.1482 | 11.3474 | 12.0893 | | ese_vovnet19b_dw | 128 | 1.2276 | 2.0608 | 11.0977 | 11.0487 | | convmixer_768_32 | 32 | 1.4111 | 3.7471 | 10.0531 | 10.7476 | +---------------------------------+-----+--------+-----------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------+------------------------+ | mobilenetv2_100 | 128 | 1.2161 | 1.2161 | 1.6528 | 1.7996 | | rexnet_100 | 128 | 1.2146 | 1.2146 | 1.6467 | 1.7914 | | tinynet_a | 128 | 1.2101 
| 1.2101 | 1.6235 | 1.7668 | | fbnetc_100 | 128 | 1.1425 | 1.1425 | 1.5449 | 1.6797 | | dm_nfnet_f0 | 128 | 1.1745 | 1.7697 | 1.5321 | 1.5939 | | fbnetv3_b | 128 | 1.1984 | 1.1984 | 1.5302 | 1.706 | | mobilenetv3_large_100 | 128 | 1.1993 | 1.1993 | 1.5295 | 1.7105 | | ese_vovnet19b_dw | 128 | 1.4966 | 1.4966 | 1.4838 | 1.564 | | selecsls42b | 128 | 1.5903 | 1.5903 | 1.4697 | 1.5901 | | pnasnet5large | 16 | 1.5067 | 1.5067 | 1.3889 | 1.4158 | | sebotnet33ts_256 | 64 | 1.1865 | 1.1865 | 1.374 | 1.4004 | | gluon_xception65 | 32 | 1.405 | 1.405 | 1.3623 | 1.405 | | mnasnet_100 | 128 | 1.3777 | 1.3777 | 1.354 | 1.5095 | | spnasnet_100 | 128 | 1.3775 | 1.3775 | 1.3538 | 1.5093 | | convmixer_768_32 | 32 | 1.1936 | 1.1936 | 1.3296 | 1.4137 | | nfnet_l0 | 128 | 1.3942 | 1.3942 | 1.3246 | 1.3942 | | poolformer_m36 | 64 | 1.1888 | 1.1888 | 1.3016 | 1.3592 | | convnext_base | 64 | 1.1431 | 1.1431 | 1.2874 | 1.3331 | | tf_efficientnet_b0 | 128 | 1.3185 | 1.3185 | 1.2451 | 1.3185 | | hrnet_w18 | 128 | 1.0654 | 1.0654 | 1.2414 | 1.3255 | | cspdarknet53 | 64 | 1.6402 | 1.6402 | 1.2103 | 1.2425 | | res2net50_14w_8s | 128 | 1.2884 | 1.2884 | 1.1395 | 1.1961 | | res2next50 | 128 | 1.3217 | 1.3217 | 1.1334 | 1.1879 | | mixnet_l | 128 | 1.1528 | 1.1528 | 1.1172 | 1.1528 | | tf_mixnet_l | 128 | 1.1528 | 1.1528 | 1.1172 | 1.1528 | | res2net101_26w_4s | 64 | 1.2034 | 1.2034 | 1.082 | 1.1263 | | eca_botnext26ts_256 | 128 | 1.1405 | 1.1405 | 1.0787 | 1.1404 | | botnet26t_256 | 128 | 1.1393 | 1.1393 | 1.0781 | 1.1393 | | coat_lite_mini | 128 | 1.1027 | 1.1027 | 1.0754 | 1.123 | | mobilevit_s | 64 | 1.164 | 1.164 | 1.0235 | 1.0685 | | ghostnet_100 | 128 | 1.1107 | 1.1107 | 1.0169 | 1.1107 | | repvgg_a2 | 128 | 1.0 | 1.0 | 1.0105 | 1.0674 | | swsl_resnext101_32x16d | 32 | 1.0101 | 1.0101 | 0.9992 | 1.0101 | | dla102 | 128 | 1.0 | 1.0 | 0.9641 | 1.0 | | adv_inception_v3 | 128 | 1.0001 | 1.0001 | 0.9469 | 1.0 | | gluon_inception_v3 | 128 | 1.0001 | 1.0001 | 0.9469 | 1.0 | | inception_v3 | 128 | 
1.0001 | 1.0001 | 0.9469 | 1.0 | | convit_base | 64 | 1.1577 | 1.1577 | 0.9464 | 0.9715 | | cait_m36_384 | 4 | 1.0086 | 1.0086 | 0.934 | 0.9395 | | gernet_l | 128 | 1.0 | 1.0 | 0.9336 | 1.0 | | dpn107 | 32 | 1.2334 | 1.2334 | 0.929 | 0.9398 | | lcnet_050 | 128 | 1.267 | 1.267 | 0.9279 | 1.0873 | | resnest101e | 64 | 1.0 | 1.0 | 0.926 | 0.9591 | | volo_d1_224 | 64 | 1.0 | 1.0 | 0.9076 | 0.9519 | | regnety_002 | 128 | 1.0 | 0.9997 | 0.9011 | 0.9997 | | twins_pcpvt_base | 64 | 1.0799 | 1.0799 | 0.8882 | 0.9152 | | swin_base_patch4_window7_224 | 64 | 1.3573 | 1.3573 | 0.883 | 0.899 | | xcit_large_24_p8_224 | 5 | 1.0001 | 1.0001 | 0.8765 | 0.8818 | | pit_b_224 | 64 | 1.0667 | 1.0667 | 0.8608 | 0.8725 | | mixer_b16_224 | 128 | 1.1733 | 1.1733 | 0.8569 | 0.8992 | | visformer_small | 128 | 1.1302 | 1.1302 | 0.8474 | 0.8967 | | beit_base_patch16_224 | 64 | 1.0655 | 1.0655 | 0.8072 | 0.8323 | | deit_base_distilled_patch16_224 | 64 | 1.0673 | 1.0673 | 0.7967 | 0.8224 | | vit_base_patch16_224 | 64 | 1.066 | 1.066 | 0.7965 | 0.8211 | | jx_nest_base | 32 | 1.1099 | 1.1099 | 0.7852 | 0.7961 | | resmlp_12_224 | 128 | 1.1807 | 1.1807 | 0.771 | 0.8453 | | gmlp_s16_224 | 128 | 1.0706 | 1.1961 | 0.7311 | 0.7947 | | gmixer_24_224 | 128 | 1.1616 | 1.1616 | 0.667 | 0.7162 | | crossvit_9_240 | 128 | 1.0501 | 0.7597 | 0.586 | 0.618 | | tnt_s_patch16_224 | 128 | 1.2109 | 1.2143 | 0.4363 | 0.4552 | +---------------------------------+-----+--------+-----------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------+------------------------+ | tnt_s_patch16_224 | 128 | 149.1923 | 149.2683 | 104.7925 | 105.4986 | | convmixer_768_32 | 32 | 103.4503 | 103.554 | 82.8589 | 82.8607 | | pnasnet5large | 16 | 87.1141 | 88.0131 | 69.1091 | 69.3241 | | 
convnext_base | 64 | 93.7803 | 93.8366 | 68.4079 | 68.7651 | | swin_base_patch4_window7_224 | 64 | 75.6175 | 75.6251 | 65.2854 | 65.7053 | | dm_nfnet_f0 | 128 | 118.7708 | 118.7609 | 64.3937 | 65.807 | | hrnet_w18 | 128 | 88.8038 | 94.1599 | 61.907 | 65.6613 | | nfnet_l0 | 128 | 93.333 | 93.2669 | 56.2601 | 56.8116 | | swsl_resnext101_32x16d | 32 | 69.0705 | 70.0095 | 54.3424 | 55.6951 | | cait_m36_384 | 4 | 84.1241 | 83.3399 | 52.6155 | 53.6378 | | res2next50 | 128 | 72.4039 | 72.965 | 51.9822 | 52.6281 | | mixer_b16_224 | 128 | 57.1711 | 55.0987 | 50.2668 | 50.2266 | | tf_mixnet_l | 128 | 63.4185 | 63.6169 | 47.9635 | 48.3355 | | convit_base | 64 | 66.0091 | 66.0994 | 47.8865 | 48.2165 | | mixnet_l | 128 | 60.9605 | 61.0818 | 47.2602 | 47.6719 | | dla102 | 128 | 69.2312 | 69.3597 | 47.1591 | 47.2852 | | pit_b_224 | 64 | 53.5091 | 53.5016 | 45.8215 | 46.1857 | | resnest101e | 64 | 68.2328 | 69.4816 | 43.1241 | 44.6106 | | adv_inception_v3 | 128 | 59.2757 | 60.1412 | 43.0922 | 43.0549 | | gluon_inception_v3 | 128 | 59.2919 | 60.0902 | 43.0874 | 43.0732 | | inception_v3 | 128 | 59.2951 | 60.0798 | 43.0461 | 43.0286 | | dpn107 | 32 | 56.3283 | 56.7856 | 42.6949 | 43.2053 | | res2net50_14w_8s | 128 | 61.4325 | 62.238 | 42.5792 | 43.1201 | | poolformer_m36 | 64 | 68.8329 | 68.7182 | 42.2239 | 42.7849 | | gluon_xception65 | 32 | 47.9668 | 48.3785 | 40.3682 | 40.1352 | | beit_base_patch16_224 | 64 | 43.6054 | 43.6353 | 39.0751 | 39.2595 | | vit_base_patch16_224 | 64 | 41.1438 | 41.1629 | 37.7777 | 37.737 | | deit_base_distilled_patch16_224 | 64 | 41.3421 | 41.3454 | 37.6964 | 37.8401 | | visformer_small | 128 | 40.3744 | 40.576 | 37.1107 | 37.4877 | | twins_pcpvt_base | 64 | 40.3829 | 40.4092 | 33.8905 | 34.5775 | | res2net101_26w_4s | 64 | 47.3095 | 48.1181 | 33.2628 | 33.7877 | | gmixer_24_224 | 128 | 48.003 | 46.0203 | 32.6857 | 32.828 | | volo_d1_224 | 64 | 51.1698 | 51.5128 | 32.5393 | 32.8632 | | fbnetv3_b | 128 | 41.0289 | 41.6895 | 31.9579 | 32.0791 | | 
jx_nest_base | 32 | 42.8787 | 43.1312 | 30.0473 | 30.4682 | | gmlp_s16_224 | 128 | 55.8714 | 49.766 | 29.2488 | 29.3498 | | botnet26t_256 | 128 | 42.8027 | 42.8724 | 28.1824 | 28.1392 | | eca_botnext26ts_256 | 128 | 42.9088 | 43.0961 | 28.0556 | 28.2598 | | coat_lite_mini | 128 | 41.4042 | 41.4011 | 27.8322 | 28.0837 | | gernet_l | 128 | 37.4633 | 37.7791 | 27.3376 | 27.2569 | | cspdarknet53 | 64 | 34.8843 | 35.2782 | 25.6231 | 25.7544 | | repvgg_a2 | 128 | 35.6206 | 35.8679 | 25.012 | 24.8768 | | crossvit_9_240 | 128 | 31.1044 | 52.5942 | 24.3748 | 24.6695 | | xcit_large_24_p8_224 | 5 | 36.3044 | 35.8093 | 22.8528 | 23.4736 | | tf_efficientnet_b0 | 128 | 30.8646 | 31.1052 | 22.5788 | 22.6771 | | mobilevit_s | 64 | 28.9446 | 29.0415 | 22.5057 | 22.6671 | | sebotnet33ts_256 | 64 | 35.6274 | 35.8246 | 20.6718 | 20.7656 | | fbnetc_100 | 128 | 25.9533 | 26.3184 | 19.805 | 19.7679 | | rexnet_100 | 128 | 27.3365 | 27.6136 | 19.78 | 19.8209 | | selecsls42b | 128 | 24.6295 | 24.72 | 19.4539 | 19.4744 | | ese_vovnet19b_dw | 128 | 25.7312 | 25.8037 | 19.1276 | 19.0831 | | tinynet_a | 128 | 24.6358 | 25.101 | 18.0523 | 18.1949 | | resmlp_12_224 | 128 | 21.852 | 21.9576 | 16.8879 | 16.8753 | | spnasnet_100 | 128 | 22.1372 | 22.5349 | 16.5651 | 16.5328 | | mnasnet_100 | 128 | 20.6708 | 20.9537 | 15.6149 | 15.5647 | | mobilenetv2_100 | 128 | 19.9811 | 20.2817 | 14.205 | 14.1859 | | mobilenetv3_large_100 | 128 | 16.7199 | 17.0148 | 12.3002 | 12.3156 | | ghostnet_100 | 128 | 16.847 | 19.9368 | 12.0683 | 12.0477 | | regnety_002 | 128 | 11.1693 | 11.4997 | 8.095 | 8.3068 | | lcnet_050 | 128 | 5.1087 | 5.2683 | 4.0118 | 3.942 | +---------------------------------+-----+----------+-----------+----------+------------------------+ ~~~

Performance graphs

/data/home/williamwen/cluster/oneoff_cron_logs/day_087_28_03_23_performance_amp_838/torchbench_amp.png : ![](https://i.imgur.com/Iv5zRtQ.png)

/data/home/williamwen/cluster/oneoff_cron_logs/day_087_28_03_23_performance_amp_838/huggingface_amp.png : ![](https://i.imgur.com/owM82mb.png)

/data/home/williamwen/cluster/oneoff_cron_logs/day_087_28_03_23_performance_amp_838/timm_models_amp.png : ![](https://i.imgur.com/MkLP7Cb.png)

Build Summary

### Run name ###
day_087_28_03_23_performance_amp_838

### Commit hashes ###
pytorch commit: 0c78456e24eab0e175cec7567d2dfa45ecff58dc
pytorch commit date: 2023-03-28 22:46:34+00:00
torchbench commit: d618fa8e06c13bbe441cc929c5d3bf498d0f369c
torchbench commit date: 2023-03-22 15:27:07-07:00

### TorchDynamo config flags ###

### Torch version ###
torch: 2.1.0a0+gita7c8d25

### Environment variables ###
TORCH_CUDA_ARCH_LIST = 8.0
CUDA_HOME = /usr/local/cuda-11.6
USE_LLVM = /usr/lib/llvm-10

### GPU details ###
CUDNN VERSION: 8401
Number CUDA Devices: 2
Device Name: NVIDIA A100-SXM4-40GB
Device Memory [GB]: 42.481549312

williamwen42 commented 1 year ago

Performance Dashboard for amp precision (inductor max-autotune with cudagraphs)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+-----------------------+------------+-------------+-------------+
|       Compiler        | torchbench | huggingface | timm_models |
+-----------------------+------------+-------------+-------------+
| inductor_max_autotune | 78%, 47/60 | 91%, 41/45  | 95%, 57/60  |
+-----------------------+------------+-------------+-------------+

Geometric mean speedup

+-----------------------+------------+-------------+-------------+
|       Compiler        | torchbench | huggingface | timm_models |
+-----------------------+------------+-------------+-------------+
| inductor_max_autotune |   1.61x    |    1.62x    |    1.42x    |
+-----------------------+------------+-------------+-------------+
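
For readers who want to reproduce the summary numbers from the per-model tables, here is a minimal sketch of a geometric mean over per-model speedups. It is not the dashboard's actual aggregation code. The helper name `geometric_mean` is hypothetical, and skipping `0.0`/`nan` entries is an assumption (the summary only states that models failing accuracy checks are removed).

```python
import math


def geometric_mean(speedups):
    """Geometric mean of per-model speedups.

    Assumption: entries that are 0.0 or NaN (failed perf runs) are skipped,
    since log(0) is undefined. The real dashboard may filter differently.
    """
    # NaN > 0 evaluates to False, so NaN entries are dropped here as well.
    valid = [s for s in speedups if s > 0]
    if not valid:
        return float("nan")
    # Average in log space, then exponentiate.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


if __name__ == "__main__":
    # A few per-model speedups copied from the torchbench table in this thread;
    # this is only a subset, so the result will not match the suite-wide figure.
    sample = [3.7416, 3.2846, 2.7968, 0.9487, 0.0]
    print(f"{geometric_mean(sample):.2f}x")
```

Averaging in log space keeps one very large or very small speedup from dominating the way it would in an arithmetic mean.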

Mean compilation time (seconds)

+-----------------------+------------+-------------+-------------+
|       Compiler        | torchbench | huggingface | timm_models |
+-----------------------+------------+-------------+-------------+
| inductor_max_autotune |   348.81   |   210.04    |   497.89    |
+-----------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+-----------------------+------------+-------------+-------------+
|       Compiler        | torchbench | huggingface | timm_models |
+-----------------------+------------+-------------+-------------+
| inductor_max_autotune |   0.77x    |    0.90x    |    0.91x    |
+-----------------------+------------+-------------+-------------+
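
As a rough guide to reading the speedup and compression columns, the sketch below shows how such ratios are typically formed. The definitions are assumptions inferred from "normalizing against native pytorch" and "higher is better", not taken from the benchmark harness. The function names and the example numbers are hypothetical.

```python
def speedup(eager_latency_ms, compiled_latency_ms):
    """Assumed definition: eager latency divided by compiled latency.

    A value above 1.0 means the compiled model is faster than native PyTorch.
    """
    return eager_latency_ms / compiled_latency_ms


def compression_ratio(eager_peak_mem, compiled_peak_mem):
    """Assumed definition: eager peak memory divided by compiled peak memory.

    A value above 1.0 means the compiled model uses less peak memory,
    which matches the "higher is better" note on the table.
    """
    return eager_peak_mem / compiled_peak_mem


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    print(speedup(100.0, 62.5))          # compiled is faster
    print(compression_ratio(8.0, 10.0))  # compiled uses more memory
```

Under these assumed definitions, a compression ratio below 1.0 (as in the max-autotune row above) means the compiled run peaked at more memory than eager.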

Warnings

We flag models where:

- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec
- compression ratio < 0.9
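
The flagging rules above can be sketched as a small filter. This is an illustration, not the dashboard's code. The function name `flag_model` and the flag labels are hypothetical, and treating any status that starts with `pass` (including `pass_due_to_skip`) as passing is an assumption. The thresholds are the ones quoted in the text.

```python
# Thresholds quoted from the Warnings text.
SPEEDUP_MIN = 0.95
COMPILE_LATENCY_MAX_SEC = 120.0
COMPRESSION_RATIO_MIN = 0.9


def flag_model(accuracy, speedup, compile_sec, compression):
    """Return the list of warning flags for one model.

    Assumption: statuses starting with "pass" count as passing accuracy.
    NaN metrics are never flagged by the numeric checks, because every
    comparison against NaN evaluates to False.
    """
    flags = []
    if not accuracy.startswith("pass"):
        flags.append("accuracy")
    if speedup < SPEEDUP_MIN:
        flags.append("speedup")
    if compile_sec > COMPILE_LATENCY_MAX_SEC:
        flags.append("compile_latency")
    if compression < COMPRESSION_RATIO_MIN:
        flags.append("compression_ratio")
    return flags


if __name__ == "__main__":
    # Rows copied from the torchbench max-autotune tables in this thread.
    print(flag_model("pass", 0.9487, 380.7795, 0.7457))  # timm_vovnet
    print(flag_model("pass", 2.355, 442.971, 1.0399))    # hf_Albert
    print(flag_model("pass", 0.9484, 5.2083, 0.9895))    # tts_angular
```

A model with a 0.0 speedup trips the speedup rule even when its other metrics are `nan`, which is consistent with the note that 0.0 typically signifies a failed performance run.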

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+-----------------------+ | name | bs | inductor_max_autotune | +-----------------------------------+------+-----------------------+ | functorch_dp_cifar10 | 64 | 3.7416 | | BERT_pytorch | 16 | 3.2846 | | densenet121 | 4 | 2.7968 | | hf_BigBird | 2 | 2.6482 | | pytorch_CycleGAN_and_pix2pix | 1 | 2.4523 | | hf_Albert | 8 | 2.355 | | hf_T5_large | 2 | 2.3266 | | mobilenet_v3_large | 32 | 2.0984 | | hf_Bart | 4 | 2.0956 | | phlippe_densenet | 128 | 2.0789 | | dlrm | 1024 | 2.0617 | | squeezenet1_1 | 32 | 2.0064 | | hf_GPT2 | 4 | 1.9726 | | hf_T5 | 8 | 1.9579 | | hf_Bert | 4 | 1.8905 | | phlippe_resnet | 128 | 1.8371 | | pytorch_struct | 200 | 1.8044 | | timm_vision_transformer | 32 | 1.7806 | | resnext50_32x4d | 8 | 1.7364 | | speech_transformer | 32 | 1.7304 | | mnasnet1_0 | 32 | 1.7063 | | attention_is_all_you_need_pytorch | 256 | 1.6814 | | shufflenet_v2_x1_0 | 128 | 1.6711 | | fastNLP_Bert | 6 | 1.6644 | | hf_Bert_large | 4 | 1.6419 | | resnet18 | 16 | 1.6279 | | timm_resnest | 32 | 1.5636 | | timm_nfnet | 128 | 1.5355 | | drq | 1 | 1.5315 | | mobilenet_v2 | 96 | 1.5227 | | hf_DistilBert | 8 | 1.4591 | | timm_efficientnet | 32 | 1.4566 | | dcgan | 32 | 1.4469 | | lennard_jones | 1000 | 1.4177 | | LearningToPaint | 96 | 1.3766 | | pytorch_unet | 1 | 1.3561 | | pytorch_stargan | 16 | 1.2663 | | vgg16 | 64 | 1.2499 | | Super_SloMo | 6 | 1.2349 | | Background_Matting | 4 | 1.2171 | | yolov3 | 16 | 1.2042 | | resnet152 | 32 | 1.1934 | | resnet50 | 32 | 1.1808 | | soft_actor_critic | 256 | 1.1772 | | hf_Reformer | 4 | 1.1453 | | alexnet | 128 | 1.13 | | demucs | 4 | 1.0397 | | timm_regnet | 32 | 1.0208 | | timm_vovnet | 32 | 0.9487 | | tts_angular | 64 | 0.9484 | | nvidia_deeprecommender | 256 | 0.9337 | | moco | 0 | 0.0 | | tacotron2 | 0 | 0.0 | | hf_Longformer | 0 | 0.0 | | sage | 0 | 0.0 | | gcn | 0 | 0.0 | | timm_vision_transformer_large | 0 | 0.0 | | torchrec_dlrm | 0 | 0.0 | | gat | 0 | 0.0 | | 
hf_GPT2_large | 0 | 0.0 | +-----------------------------------+------+-----------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+-----------------------+ | name | bs | inductor_max_autotune | +-----------------------------------+-----+-----------------------+ | hf_T5_large | 4 | pass_due_to_skip | | timm_vision_transformer_large | 4 | pass_due_to_skip | | hf_GPT2_large | 4 | pass_due_to_skip | | resnet50 | 4 | pass | | mnasnet1_0 | 4 | pass | | mobilenet_v3_large | 4 | pass | | nvidia_deeprecommender | 4 | pass | | phlippe_densenet | 4 | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | | pytorch_stargan | 16 | pass | | pytorch_struct | 200 | pass | | pytorch_unet | 2 | pass | | resnet152 | 4 | pass | | resnet18 | 4 | pass | | BERT_pytorch | 4 | pass | | lennard_jones | 4 | pass | | shufflenet_v2_x1_0 | 4 | pass | | soft_actor_critic | 256 | pass | | speech_transformer | 4 | pass | | timm_efficientnet | 4 | pass | | timm_nfnet | 4 | pass | | timm_regnet | 4 | pass | | timm_resnest | 4 | pass | | timm_vision_transformer | 4 | pass | | timm_vovnet | 4 | pass | | tts_angular | 4 | pass | | vgg16 | 4 | pass | | resnext50_32x4d | 4 | pass | | mobilenet_v2 | 4 | pass | | yolov3 | 4 | pass | | hf_Bart | 4 | pass | | attention_is_all_you_need_pytorch | 4 | pass | | dcgan | 4 | pass | | demucs | 4 | pass | | densenet121 | 4 | pass | | dlrm | 4 | pass | | Super_SloMo | 4 | pass | | fastNLP_Bert | 4 | pass | | functorch_dp_cifar10 | 4 | pass | | hf_T5_base | 4 | pass | | LearningToPaint | 4 | pass | | hf_Albert | 4 | pass | | hf_Bert | 4 | pass | | hf_Bert_large | 4 | pass | | hf_BigBird | 4 | pass | | hf_DistilBert | 4 | pass | | hf_GPT2 | 2 | pass | | hf_Reformer | 4 | pass | | hf_T5 | 4 | pass | | alexnet | 4 | pass | | hf_Longformer | 4 | fail_to_run | | moco | 4 | fail_to_run | | squeezenet1_1 | 4 | fail_accuracy | | phlippe_resnet | 4 | fail_accuracy | | drq | 1 | fail_accuracy | | vision_maskrcnn | 4 | eager_variation | | Background_Matting | 
4 | eager_variation | | torchrec_dlrm | 0 | 0.0000 | | llama | 0 | 0.0000 | | tacotron2 | 0 | 0.0000 | | sage | 0 | 0.0000 | | gcn | 0 | 0.0000 | | gat | 0 | 0.0000 | +-----------------------------------+-----+-----------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+-----------------------+ | name | bs | inductor_max_autotune | +-----------------------------------+------+-----------------------+ | densenet121 | 4 | 1210.9224 | | speech_transformer | 32 | 866.9875 | | phlippe_densenet | 128 | 851.8078 | | attention_is_all_you_need_pytorch | 256 | 727.7114 | | mnasnet1_0 | 32 | 609.9616 | | mobilenet_v3_large | 32 | 563.3833 | | mobilenet_v2 | 96 | 550.3897 | | hf_BigBird | 2 | 533.2365 | | hf_T5_large | 2 | 494.556 | | timm_vision_transformer | 32 | 477.4533 | | yolov3 | 16 | 473.3135 | | timm_regnet | 32 | 464.3725 | | timm_nfnet | 128 | 446.6786 | | hf_Albert | 8 | 442.971 | | timm_efficientnet | 32 | 423.847 | | fastNLP_Bert | 6 | 420.551 | | pytorch_struct | 200 | 393.8293 | | dlrm | 1024 | 383.5522 | | BERT_pytorch | 16 | 381.3677 | | timm_vovnet | 32 | 380.7795 | | resnext50_32x4d | 8 | 375.2733 | | drq | 1 | 331.8744 | | shufflenet_v2_x1_0 | 128 | 331.8666 | | hf_Bert_large | 4 | 330.7902 | | LearningToPaint | 96 | 327.6767 | | Super_SloMo | 6 | 305.3558 | | hf_T5 | 8 | 276.8124 | | squeezenet1_1 | 32 | 260.9367 | | resnet18 | 16 | 259.5868 | | nvidia_deeprecommender | 256 | 253.237 | | functorch_dp_cifar10 | 64 | 251.3416 | | pytorch_unet | 1 | 250.9091 | | Background_Matting | 4 | 248.3087 | | vgg16 | 64 | 234.7503 | | alexnet | 128 | 218.6421 | | hf_Reformer | 4 | 215.0925 | | timm_resnest | 32 | 210.3936 | | hf_GPT2 | 4 | 203.1842 | | phlippe_resnet | 128 | 200.4492 | | soft_actor_critic | 256 | 184.3741 | | resnet152 | 32 | 182.207 | | hf_Bart | 4 | 179.6105 | | lennard_jones | 1000 | 155.9248 | | pytorch_CycleGAN_and_pix2pix | 1 | 134.3312 | | hf_Bert | 4 | 104.8345 | | pytorch_stargan | 16 | 87.0111 | | 
demucs | 4 | 72.8124 | | dcgan | 32 | 61.981 | | hf_DistilBert | 8 | 54.9767 | | resnet50 | 32 | 28.2371 | | tts_angular | 64 | 5.2083 | | gat | 0 | nan | | gcn | 0 | nan | | hf_GPT2_large | 0 | nan | | hf_Longformer | 0 | nan | | moco | 0 | nan | | sage | 0 | nan | | tacotron2 | 0 | nan | | timm_vision_transformer_large | 0 | nan | | torchrec_dlrm | 0 | nan | +-----------------------------------+------+-----------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+-----------------------+ | name | bs | inductor_max_autotune | +-----------------------------------+------+-----------------------+ | Super_SloMo | 6 | 1.1595 | | hf_Albert | 8 | 1.0399 | | mobilenet_v2 | 96 | 1.0102 | | hf_T5 | 8 | 0.9988 | | fastNLP_Bert | 6 | 0.9953 | | tts_angular | 64 | 0.9895 | | attention_is_all_you_need_pytorch | 256 | 0.9693 | | timm_nfnet | 128 | 0.9617 | | dlrm | 1024 | 0.9466 | | BERT_pytorch | 16 | 0.9428 | | hf_Bert | 4 | 0.9421 | | hf_GPT2 | 4 | 0.9319 | | timm_efficientnet | 32 | 0.9282 | | hf_Bert_large | 4 | 0.9138 | | yolov3 | 16 | 0.8685 | | shufflenet_v2_x1_0 | 128 | 0.865 | | speech_transformer | 32 | 0.8588 | | timm_regnet | 32 | 0.8479 | | hf_DistilBert | 8 | 0.8456 | | timm_vision_transformer | 32 | 0.8357 | | resnet50 | 32 | 0.8346 | | Background_Matting | 4 | 0.8333 | | resnet152 | 32 | 0.8323 | | timm_resnest | 32 | 0.8293 | | hf_T5_large | 2 | 0.8201 | | phlippe_densenet | 128 | 0.7988 | | mobilenet_v3_large | 32 | 0.785 | | pytorch_stargan | 16 | 0.7724 | | pytorch_unet | 1 | 0.7708 | | demucs | 4 | 0.7661 | | hf_Bart | 4 | 0.7626 | | squeezenet1_1 | 32 | 0.7625 | | timm_vovnet | 32 | 0.7457 | | mnasnet1_0 | 32 | 0.7428 | | pytorch_struct | 200 | 0.7341 | | vgg16 | 64 | 0.7228 | | alexnet | 128 | 0.7091 | | densenet121 | 4 | 0.7088 | | hf_BigBird | 2 | 0.696 | | nvidia_deeprecommender | 256 | 0.6585 | | resnext50_32x4d | 8 | 0.6558 | | LearningToPaint | 96 | 0.6006 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.5607 | | 
resnet18 | 16 | 0.5357 | | hf_Reformer | 4 | 0.4622 | | functorch_dp_cifar10 | 64 | 0.4063 | | phlippe_resnet | 128 | 0.3272 | | drq | 1 | 0.1818 | | dcgan | 32 | 0.1811 | | soft_actor_critic | 256 | 0.1108 | | lennard_jones | 1000 | 0.0648 | | gat | 0 | nan | | gcn | 0 | nan | | hf_GPT2_large | 0 | nan | | hf_Longformer | 0 | nan | | moco | 0 | nan | | sage | 0 | nan | | tacotron2 | 0 | nan | | timm_vision_transformer_large | 0 | nan | | torchrec_dlrm | 0 | nan | +-----------------------------------+------+-----------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------+------+-----------------------+ | name | bs | inductor_max_autotune | +-----------------------------------+------+-----------------------+ | Background_Matting | 4 | 103.4522 | | hf_T5_large | 2 | 98.2323 | | hf_T5 | 8 | 91.6259 | | timm_nfnet | 128 | 76.4085 | | hf_BigBird | 2 | 73.4201 | | hf_Reformer | 4 | 70.6724 | | Super_SloMo | 6 | 64.3118 | | yolov3 | 16 | 56.8469 | | timm_regnet | 32 | 54.7105 | | vgg16 | 64 | 52.9478 | | resnet152 | 32 | 52.7409 | | demucs | 4 | 51.9206 | | hf_Bert_large | 4 | 50.6931 | | speech_transformer | 32 | 34.397 | | hf_Bart | 4 | 33.2546 | | attention_is_all_you_need_pytorch | 256 | 32.5595 | | fastNLP_Bert | 6 | 31.487 | | mobilenet_v2 | 96 | 30.8072 | | pytorch_unet | 1 | 29.3277 | | hf_Albert | 8 | 29.0455 | | timm_vovnet | 32 | 26.2372 | | hf_GPT2 | 4 | 24.6515 | | timm_efficientnet | 32 | 22.29 | | hf_Bert | 4 | 22.1109 | | resnet50 | 32 | 21.9717 | | hf_DistilBert | 8 | 21.4556 | | densenet121 | 4 | 21.3784 | | shufflenet_v2_x1_0 | 128 | 18.6762 | | BERT_pytorch | 16 | 17.1107 | | timm_vision_transformer | 32 | 16.7119 | | timm_resnest | 32 | 15.3226 | | mnasnet1_0 | 32 | 13.26 | | mobilenet_v3_large | 32 | 12.9303 | | resnext50_32x4d | 8 | 12.117 | | pytorch_stargan | 16 | 11.6841 | | phlippe_densenet | 128 | 11.4957 | | nvidia_deeprecommender | 256 | 10.9424 | | alexnet | 128 | 8.679 | | LearningToPaint | 96 | 8.3947 | | 
tts_angular | 64 | 6.6692 | | resnet18 | 16 | 5.8876 | | pytorch_CycleGAN_and_pix2pix | 1 | 5.8258 | | squeezenet1_1 | 32 | 5.321 | | phlippe_resnet | 128 | 4.9519 | | functorch_dp_cifar10 | 64 | 2.8959 | | pytorch_struct | 200 | 2.7246 | | drq | 1 | 2.1961 | | dlrm | 1024 | 2.1 | | dcgan | 32 | 1.5496 | | soft_actor_critic | 256 | 1.3602 | | lennard_jones | 1000 | 1.1525 | | gat | 0 | nan | | gcn | 0 | nan | | hf_GPT2_large | 0 | nan | | hf_Longformer | 0 | nan | | moco | 0 | nan | | sage | 0 | nan | | tacotron2 | 0 | nan | | timm_vision_transformer_large | 0 | nan | | torchrec_dlrm | 0 | nan | +-----------------------------------+------+-----------------------+ ~~~

huggingface suite with amp precision

Performance speedup ~~~ +-----------------------------------------+-----+-----------------------+ | name | bs | inductor_max_autotune | +-----------------------------------------+-----+-----------------------+ | OPTForCausalLM | 2 | 2.5219 | | GPT2ForSequenceClassification | 4 | 2.3089 | | MobileBertForMaskedLM | 64 | 2.1917 | | MT5ForConditionalGeneration | 16 | 2.1437 | | ElectraForQuestionAnswering | 64 | 2.1284 | | M2M100ForConditionalGeneration | 16 | 1.9772 | | DistillGPT2 | 16 | 1.9024 | | PLBartForCausalLM | 8 | 1.8722 | | ElectraForCausalLM | 32 | 1.8501 | | XLNetLMHeadModel | 8 | 1.8247 | | LayoutLMForSequenceClassification | 16 | 1.7923 | | RobertaForQuestionAnswering | 16 | 1.7901 | | BertForQuestionAnswering | 16 | 1.7802 | | PLBartForConditionalGeneration | 4 | 1.735 | | XGLMForCausalLM | 8 | 1.7268 | | RobertaForCausalLM | 16 | 1.6711 | | T5ForConditionalGeneration | 4 | 1.6644 | | T5Small | 4 | 1.6595 | | BartForCausalLM | 4 | 1.6579 | | MBartForCausalLM | 4 | 1.6514 | | AlbertForQuestionAnswering | 4 | 1.6512 | | YituTechConvBert | 16 | 1.6384 | | AlbertForMaskedLM | 4 | 1.6346 | | MegatronBertForQuestionAnswering | 8 | 1.6248 | | CamemBert | 16 | 1.6231 | | BartForConditionalGeneration | 2 | 1.6161 | | BertForMaskedLM | 16 | 1.5983 | | LayoutLMForMaskedLM | 16 | 1.5828 | | MBartForConditionalGeneration | 2 | 1.5361 | | Speech2Text2ForCausalLM | 256 | 1.5178 | | MegatronBertForCausalLM | 4 | 1.4962 | | DistilBertForQuestionAnswering | 256 | 1.4593 | | BlenderbotSmallForConditionalGeneration | 64 | 1.4572 | | PegasusForConditionalGeneration | 32 | 1.4289 | | MobileBertForQuestionAnswering | 128 | 1.3941 | | TrOCRForCausalLM | 32 | 1.3932 | | BlenderbotSmallForCausalLM | 64 | 1.3854 | | PegasusForCausalLM | 32 | 1.3282 | | DistilBertForMaskedLM | 128 | 1.2159 | | DebertaForQuestionAnswering | 8 | 1.0628 | | DebertaForMaskedLM | 4 | 0.9939 | | DebertaV2ForMaskedLM | 1 | 0.8898 | | DebertaV2ForQuestionAnswering | 2 | 0.8359 | | BlenderbotForCausalLM | 
0 | 0.0 | | AllenaiLongformerBase | 0 | 0.0 | +-----------------------------------------+-----+-----------------------+ ~~~ Accuracy ~~~ +-----------------------------------------+----+-----------------------+ | name | bs | inductor_max_autotune | +-----------------------------------------+----+-----------------------+ | BlenderbotForCausalLM | 1 | pass_due_to_skip | | DebertaV2ForMaskedLM | 1 | pass_due_to_skip | | AlbertForMaskedLM | 1 | pass | | PegasusForCausalLM | 1 | pass | | MT5ForConditionalGeneration | 1 | pass | | MegatronBertForCausalLM | 1 | pass | | MegatronBertForQuestionAnswering | 1 | pass | | MobileBertForMaskedLM | 1 | pass | | MobileBertForQuestionAnswering | 1 | pass | | OPTForCausalLM | 1 | pass | | PLBartForCausalLM | 1 | pass | | PLBartForConditionalGeneration | 1 | pass | | PegasusForConditionalGeneration | 1 | pass | | MBartForCausalLM | 1 | pass | | RobertaForCausalLM | 1 | pass | | RobertaForQuestionAnswering | 1 | pass | | Speech2Text2ForCausalLM | 1 | pass | | T5ForConditionalGeneration | 1 | pass | | T5Small | 1 | pass | | TrOCRForCausalLM | 1 | pass | | XGLMForCausalLM | 1 | pass | | XLNetLMHeadModel | 1 | pass | | MBartForConditionalGeneration | 1 | pass | | LayoutLMForSequenceClassification | 1 | pass | | M2M100ForConditionalGeneration | 1 | pass | | DebertaForMaskedLM | 1 | pass | | AllenaiLongformerBase | 1 | pass | | BartForCausalLM | 1 | pass | | BartForConditionalGeneration | 1 | pass | | BertForMaskedLM | 1 | pass | | BertForQuestionAnswering | 1 | pass | | BlenderbotSmallForCausalLM | 1 | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | | CamemBert | 1 | pass | | DebertaForQuestionAnswering | 1 | pass | | DistilBertForMaskedLM | 1 | pass | | DistilBertForQuestionAnswering | 1 | pass | | DistillGPT2 | 1 | pass | | ElectraForCausalLM | 1 | pass | | ElectraForQuestionAnswering | 1 | pass | | GPT2ForSequenceClassification | 1 | pass | | LayoutLMForMaskedLM | 1 | pass | | YituTechConvBert | 1 | pass | | 
DebertaV2ForQuestionAnswering | 1 | fail_to_run | | AlbertForQuestionAnswering | 1 | fail_accuracy | +-----------------------------------------+----+-----------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+-----------------------+ | name | bs | inductor_max_autotune | +-----------------------------------------+-----+-----------------------+ | MobileBertForMaskedLM | 64 | 641.83 | | MobileBertForQuestionAnswering | 128 | 618.0127 | | MT5ForConditionalGeneration | 16 | 598.155 | | DebertaV2ForMaskedLM | 1 | 482.9045 | | ElectraForCausalLM | 32 | 385.9535 | | DebertaV2ForQuestionAnswering | 2 | 357.7603 | | AlbertForMaskedLM | 4 | 346.3587 | | XLNetLMHeadModel | 8 | 305.4182 | | XGLMForCausalLM | 8 | 298.9284 | | M2M100ForConditionalGeneration | 16 | 292.4152 | | T5ForConditionalGeneration | 4 | 272.078 | | ElectraForQuestionAnswering | 64 | 255.4572 | | YituTechConvBert | 16 | 244.8943 | | TrOCRForCausalLM | 32 | 244.1556 | | BartForConditionalGeneration | 2 | 241.7899 | | BertForMaskedLM | 16 | 233.1334 | | BlenderbotSmallForCausalLM | 64 | 207.1343 | | DistilBertForMaskedLM | 128 | 204.2688 | | DebertaForQuestionAnswering | 8 | 203.9153 | | BartForCausalLM | 4 | 194.9556 | | GPT2ForSequenceClassification | 4 | 194.0422 | | DistilBertForQuestionAnswering | 256 | 189.3398 | | LayoutLMForSequenceClassification | 16 | 167.9029 | | DebertaForMaskedLM | 4 | 160.1921 | | Speech2Text2ForCausalLM | 256 | 153.7871 | | MegatronBertForQuestionAnswering | 8 | 145.4149 | | DistillGPT2 | 16 | 134.5174 | | OPTForCausalLM | 2 | 121.5937 | | MBartForConditionalGeneration | 2 | 112.0044 | | PegasusForCausalLM | 32 | 106.749 | | PegasusForConditionalGeneration | 32 | 103.4952 | | MegatronBertForCausalLM | 4 | 103.1821 | | PLBartForConditionalGeneration | 4 | 97.4698 | | BertForQuestionAnswering | 16 | 95.2555 | | AlbertForQuestionAnswering | 4 | 89.417 | | BlenderbotSmallForConditionalGeneration | 64 | 79.5304 | | PLBartForCausalLM | 
8 | 76.8499 | | CamemBert | 16 | 73.4491 | | MBartForCausalLM | 4 | 53.9876 | | T5Small | 4 | 45.9055 | | RobertaForCausalLM | 16 | 45.5067 | | LayoutLMForMaskedLM | 16 | 41.4 | | RobertaForQuestionAnswering | 16 | 38.3782 | | AllenaiLongformerBase | 0 | nan | | BlenderbotForCausalLM | 0 | nan | +-----------------------------------------+-----+-----------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+-----------------------+ | name | bs | inductor_max_autotune | +-----------------------------------------+-----+-----------------------+ | XLNetLMHeadModel | 8 | 1.1342 | | GPT2ForSequenceClassification | 4 | 1.1135 | | ElectraForQuestionAnswering | 64 | 1.1114 | | OPTForCausalLM | 2 | 1.094 | | BertForQuestionAnswering | 16 | 1.0868 | | RobertaForQuestionAnswering | 16 | 1.0865 | | LayoutLMForSequenceClassification | 16 | 1.0583 | | RobertaForCausalLM | 16 | 1.0541 | | YituTechConvBert | 16 | 1.0402 | | T5Small | 4 | 1.0382 | | T5ForConditionalGeneration | 4 | 1.0356 | | DistilBertForQuestionAnswering | 256 | 1.0299 | | LayoutLMForMaskedLM | 16 | 1.0078 | | BertForMaskedLM | 16 | 0.9864 | | CamemBert | 16 | 0.9828 | | AlbertForQuestionAnswering | 4 | 0.9734 | | ElectraForCausalLM | 32 | 0.9731 | | DistillGPT2 | 16 | 0.9682 | | AlbertForMaskedLM | 4 | 0.9574 | | MegatronBertForQuestionAnswering | 8 | 0.953 | | PLBartForConditionalGeneration | 4 | 0.9294 | | MBartForCausalLM | 4 | 0.9281 | | PegasusForCausalLM | 32 | 0.893 | | TrOCRForCausalLM | 32 | 0.8836 | | BartForCausalLM | 4 | 0.8818 | | PegasusForConditionalGeneration | 32 | 0.8687 | | MBartForConditionalGeneration | 2 | 0.8672 | | BartForConditionalGeneration | 2 | 0.8456 | | MegatronBertForCausalLM | 4 | 0.845 | | PLBartForCausalLM | 8 | 0.8437 | | MT5ForConditionalGeneration | 16 | 0.8222 | | BlenderbotSmallForConditionalGeneration | 64 | 0.816 | | DistilBertForMaskedLM | 128 | 0.8045 | | M2M100ForConditionalGeneration | 16 | 0.7651 | | 
MobileBertForMaskedLM | 64 | 0.752 | | BlenderbotSmallForCausalLM | 64 | 0.7355 | | Speech2Text2ForCausalLM | 256 | 0.7143 | | XGLMForCausalLM | 8 | 0.7117 | | MobileBertForQuestionAnswering | 128 | 0.6505 | | DebertaForMaskedLM | 4 | 0.5504 | | DebertaV2ForMaskedLM | 1 | 0.5138 | | DebertaV2ForQuestionAnswering | 2 | 0.4821 | | DebertaForQuestionAnswering | 8 | 0.4604 | | AllenaiLongformerBase | 0 | nan | | BlenderbotForCausalLM | 0 | nan | +-----------------------------------------+-----+-----------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------------+-----+-----------------------+ | name | bs | inductor_max_autotune | +-----------------------------------------+-----+-----------------------+ | AlbertForMaskedLM | 4 | 162.9778 | | AlbertForQuestionAnswering | 4 | 160.0323 | | XLNetLMHeadModel | 8 | 152.8289 | | DebertaV2ForQuestionAnswering | 2 | 129.8578 | | MobileBertForQuestionAnswering | 128 | 124.2626 | | DebertaV2ForMaskedLM | 1 | 117.7692 | | PegasusForConditionalGeneration | 32 | 105.6076 | | TrOCRForCausalLM | 32 | 100.0077 | | BartForConditionalGeneration | 2 | 90.9734 | | MBartForConditionalGeneration | 2 | 89.4901 | | MegatronBertForQuestionAnswering | 8 | 87.4238 | | MobileBertForMaskedLM | 64 | 80.8102 | | YituTechConvBert | 16 | 76.6409 | | BlenderbotSmallForConditionalGeneration | 64 | 75.9591 | | M2M100ForConditionalGeneration | 16 | 74.0544 | | CamemBert | 16 | 72.8774 | | DebertaForQuestionAnswering | 8 | 72.3622 | | DistilBertForQuestionAnswering | 256 | 71.0903 | | LayoutLMForMaskedLM | 16 | 71.0194 | | DistilBertForMaskedLM | 128 | 69.6847 | | MBartForCausalLM | 4 | 69.1411 | | BertForMaskedLM | 16 | 68.8905 | | RobertaForCausalLM | 16 | 68.7737 | | PLBartForConditionalGeneration | 4 | 68.4457 | | BartForCausalLM | 4 | 68.2489 | | OPTForCausalLM | 2 | 67.7134 | | DebertaForMaskedLM | 4 | 64.0079 | | T5ForConditionalGeneration | 4 | 62.8488 | | T5Small | 4 | 62.8476 | | PLBartForCausalLM | 8 | 61.7269 | | 
MegatronBertForCausalLM | 4 | 57.6586 | | DistillGPT2 | 16 | 55.5112 | | LayoutLMForSequenceClassification | 16 | 54.4499 | | ElectraForQuestionAnswering | 64 | 53.7627 | | BertForQuestionAnswering | 16 | 53.4591 | | RobertaForQuestionAnswering | 16 | 53.3137 | | PegasusForCausalLM | 32 | 53.0044 | | XGLMForCausalLM | 8 | 52.2883 | | ElectraForCausalLM | 32 | 47.5662 | | MT5ForConditionalGeneration | 16 | 42.9764 | | BlenderbotSmallForCausalLM | 64 | 41.8483 | | GPT2ForSequenceClassification | 4 | 39.5726 | | Speech2Text2ForCausalLM | 256 | 34.7189 | | AllenaiLongformerBase | 0 | nan | | BlenderbotForCausalLM | 0 | nan | +-----------------------------------------+-----+-----------------------+ ~~~

timm_models suite with amp precision

Performance speedup ~~~ +---------------------------------+-----+-----------------------+ | name | bs | inductor_max_autotune | +---------------------------------+-----+-----------------------+ | tnt_s_patch16_224 | 128 | 3.3194 | | twins_pcpvt_base | 64 | 2.1355 | | xcit_large_24_p8_224 | 5 | 2.11 | | coat_lite_mini | 128 | 2.0666 | | gmixer_24_224 | 128 | 1.8929 | | gmlp_s16_224 | 128 | 1.858 | | ghostnet_100 | 128 | 1.8472 | | crossvit_9_240 | 128 | 1.8215 | | volo_d1_224 | 64 | 1.7402 | | convit_base | 64 | 1.7127 | | swin_base_patch4_window7_224 | 64 | 1.7101 | | lcnet_050 | 128 | 1.6936 | | pit_b_224 | 64 | 1.6036 | | gluon_inception_v3 | 128 | 1.5392 | | inception_v3 | 128 | 1.539 | | adv_inception_v3 | 128 | 1.5353 | | jx_nest_base | 32 | 1.5332 | | dla102 | 128 | 1.5258 | | sebotnet33ts_256 | 64 | 1.515 | | convnext_base | 64 | 1.4921 | | dm_nfnet_f0 | 128 | 1.4853 | | nfnet_l0 | 128 | 1.4848 | | mobilevit_s | 64 | 1.4742 | | beit_base_patch16_224 | 64 | 1.4573 | | eca_botnext26ts_256 | 128 | 1.4484 | | cait_m36_384 | 4 | 1.4476 | | regnety_002 | 128 | 1.4428 | | mobilenetv3_large_100 | 128 | 1.4334 | | mnasnet_100 | 128 | 1.4275 | | resnest101e | 64 | 1.4233 | | selecsls42b | 128 | 1.4112 | | botnet26t_256 | 128 | 1.4094 | | res2net50_14w_8s | 128 | 1.3973 | | mixer_b16_224 | 128 | 1.3956 | | resmlp_12_224 | 128 | 1.3944 | | mobilenetv2_100 | 128 | 1.3898 | | hrnet_w18 | 128 | 1.3838 | | res2next50 | 128 | 1.3707 | | ese_vovnet19b_dw | 128 | 1.364 | | spnasnet_100 | 128 | 1.3517 | | tf_efficientnet_b0 | 128 | 1.3513 | | fbnetc_100 | 128 | 1.3493 | | vit_base_patch16_224 | 64 | 1.346 | | poolformer_m36 | 64 | 1.3271 | | fbnetv3_b | 128 | 1.3191 | | deit_base_distilled_patch16_224 | 64 | 1.3168 | | rexnet_100 | 128 | 1.2987 | | cspdarknet53 | 64 | 1.2264 | | tinynet_a | 128 | 1.2254 | | visformer_small | 128 | 1.2055 | | tf_mixnet_l | 128 | 1.19 | | mixnet_l | 128 | 1.1786 | | res2net101_26w_4s | 64 | 1.1749 | | pnasnet5large | 16 | 1.1241 | | dpn107 | 32 | 
1.0918 | | gluon_xception65 | 32 | 1.0849 | | repvgg_a2 | 128 | 1.0847 | | swsl_resnext101_32x16d | 32 | 1.0601 | | gernet_l | 128 | 1.0416 | | convmixer_768_32 | 32 | 1.0078 | +---------------------------------+-----+-----------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-----------------------+ | name | bs | inductor_max_autotune | +---------------------------------+----+-----------------------+ | adv_inception_v3 | 8 | pass | | resmlp_12_224 | 8 | pass | | mobilenetv2_100 | 8 | pass | | mobilenetv3_large_100 | 8 | pass | | mobilevit_s | 8 | pass | | nfnet_l0 | 8 | pass | | pit_b_224 | 8 | pass | | pnasnet5large | 8 | pass | | poolformer_m36 | 8 | pass | | regnety_002 | 8 | pass | | repvgg_a2 | 8 | pass | | res2net101_26w_4s | 8 | pass | | res2net50_14w_8s | 8 | pass | | res2next50 | 8 | pass | | resnest101e | 8 | pass | | mixnet_l | 8 | pass | | rexnet_100 | 8 | pass | | sebotnet33ts_256 | 8 | pass | | selecsls42b | 8 | pass | | spnasnet_100 | 8 | pass | | swsl_resnext101_32x16d | 8 | pass | | tf_efficientnet_b0 | 8 | pass | | tf_mixnet_l | 8 | pass | | tinynet_a | 8 | pass | | tnt_s_patch16_224 | 8 | pass | | visformer_small | 8 | pass | | vit_base_patch16_224 | 8 | pass | | volo_d1_224 | 8 | pass | | beit_base_patch16_224 | 8 | pass | | mnasnet_100 | 8 | pass | | mixer_b16_224 | 8 | pass | | eca_botnext26ts_256 | 8 | pass | | botnet26t_256 | 8 | pass | | cait_m36_384 | 4 | pass | | convit_base | 8 | pass | | convmixer_768_32 | 8 | pass | | convnext_base | 8 | pass | | crossvit_9_240 | 8 | pass | | cspdarknet53 | 8 | pass | | deit_base_distilled_patch16_224 | 8 | pass | | dla102 | 8 | pass | | dm_nfnet_f0 | 8 | pass | | lcnet_050 | 8 | pass | | dpn107 | 8 | pass | | ese_vovnet19b_dw | 8 | pass | | fbnetc_100 | 8 | pass | | fbnetv3_b | 8 | pass | | gernet_l | 8 | pass | | ghostnet_100 | 8 | pass | | gluon_inception_v3 | 8 | pass | | gluon_xception65 | 8 | pass | | gmixer_24_224 | 8 | pass | | gmlp_s16_224 | 8 | pass | | hrnet_w18 | 8 | 
pass | | inception_v3 | 8 | pass | | jx_nest_base | 8 | pass | | xcit_large_24_p8_224 | 8 | pass | | swin_base_patch4_window7_224 | 8 | fail_accuracy | | twins_pcpvt_base | 0 | 0.0000 | | coat_lite_mini | 0 | 0.0000 | +---------------------------------+----+-----------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+-----------------------+ | name | bs | inductor_max_autotune | +---------------------------------+-----+-----------------------+ | twins_pcpvt_base | 64 | 1688.7575 | | mobilevit_s | 64 | 1572.9991 | | coat_lite_mini | 128 | 1364.6126 | | crossvit_9_240 | 128 | 1264.9017 | | rexnet_100 | 128 | 1099.5804 | | xcit_large_24_p8_224 | 5 | 1088.1898 | | volo_d1_224 | 64 | 1084.1596 | | swin_base_patch4_window7_224 | 64 | 985.0894 | | pit_b_224 | 64 | 983.8723 | | jx_nest_base | 32 | 975.7591 | | cait_m36_384 | 4 | 953.7136 | | ghostnet_100 | 128 | 915.6073 | | hrnet_w18 | 128 | 852.9444 | | mixnet_l | 128 | 827.5129 | | sebotnet33ts_256 | 64 | 814.3166 | | adv_inception_v3 | 128 | 788.1443 | | res2net50_14w_8s | 128 | 785.2897 | | botnet26t_256 | 128 | 758.1937 | | res2net101_26w_4s | 64 | 727.2155 | | dpn107 | 32 | 702.1314 | | fbnetv3_b | 128 | 674.6594 | | pnasnet5large | 16 | 628.957 | | fbnetc_100 | 128 | 584.9966 | | tnt_s_patch16_224 | 128 | 579.6948 | | convnext_base | 64 | 526.0179 | | tinynet_a | 128 | 510.8161 | | regnety_002 | 128 | 477.458 | | dla102 | 128 | 459.0892 | | visformer_small | 128 | 444.6314 | | convit_base | 64 | 435.4578 | | resnest101e | 64 | 435.2065 | | cspdarknet53 | 64 | 395.4526 | | gluon_xception65 | 32 | 355.2383 | | nfnet_l0 | 128 | 336.8407 | | beit_base_patch16_224 | 64 | 336.6734 | | poolformer_m36 | 64 | 333.2378 | | gmixer_24_224 | 128 | 330.7234 | | eca_botnext26ts_256 | 128 | 326.0591 | | gernet_l | 128 | 320.009 | | tf_efficientnet_b0 | 128 | 316.5582 | | selecsls42b | 128 | 289.8516 | | mnasnet_100 | 128 | 285.7957 | | ese_vovnet19b_dw | 128 | 285.7598 | | 
deit_base_distilled_patch16_224 | 64 | 270.0214 | | repvgg_a2 | 128 | 269.9819 | | mixer_b16_224 | 128 | 259.1909 | | lcnet_050 | 128 | 225.9875 | | gmlp_s16_224 | 128 | 196.2178 | | mobilenetv3_large_100 | 128 | 189.084 | | swsl_resnext101_32x16d | 32 | 179.7337 | | resmlp_12_224 | 128 | 173.4721 | | res2next50 | 128 | 145.9784 | | mobilenetv2_100 | 128 | 126.8471 | | convmixer_768_32 | 32 | 109.2042 | | tf_mixnet_l | 128 | 92.7005 | | spnasnet_100 | 128 | 76.5348 | | gluon_inception_v3 | 128 | 56.6582 | | inception_v3 | 128 | 56.136 | | vit_base_patch16_224 | 64 | 48.2327 | | dm_nfnet_f0 | 128 | 39.9079 | +---------------------------------+-----+-----------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+-----------------------+ | name | bs | inductor_max_autotune | +---------------------------------+-----+-----------------------+ | gmlp_s16_224 | 128 | 1.1841 | | pnasnet5large | 16 | 1.1522 | | gmixer_24_224 | 128 | 1.1129 | | convit_base | 64 | 1.0948 | | mobilenetv2_100 | 128 | 1.0267 | | dm_nfnet_f0 | 128 | 1.013 | | resmlp_12_224 | 128 | 1.01 | | tinynet_a | 128 | 0.9985 | | resnest101e | 64 | 0.9933 | | tf_efficientnet_b0 | 128 | 0.9875 | | tnt_s_patch16_224 | 128 | 0.9834 | | rexnet_100 | 128 | 0.9744 | | twins_pcpvt_base | 64 | 0.9729 | | convmixer_768_32 | 32 | 0.967 | | dla102 | 128 | 0.9528 | | mixer_b16_224 | 128 | 0.9439 | | vit_base_patch16_224 | 64 | 0.9362 | | tf_mixnet_l | 128 | 0.9344 | | beit_base_patch16_224 | 64 | 0.9284 | | mobilevit_s | 64 | 0.9263 | | visformer_small | 128 | 0.9245 | | fbnetv3_b | 128 | 0.917 | | nfnet_l0 | 128 | 0.9101 | | cspdarknet53 | 64 | 0.9098 | | deit_base_distilled_patch16_224 | 64 | 0.9072 | | volo_d1_224 | 64 | 0.9068 | | ese_vovnet19b_dw | 128 | 0.8976 | | sebotnet33ts_256 | 64 | 0.8908 | | gluon_inception_v3 | 128 | 0.8902 | | inception_v3 | 128 | 0.8902 | | adv_inception_v3 | 128 | 0.8902 | | hrnet_w18 | 128 | 0.8889 | | gluon_xception65 | 32 | 0.8833 | | 
spnasnet_100 | 128 | 0.8788 | | xcit_large_24_p8_224 | 5 | 0.8761 | | eca_botnext26ts_256 | 128 | 0.8738 | | mixnet_l | 128 | 0.8685 | | mnasnet_100 | 128 | 0.8684 | | dpn107 | 32 | 0.8676 | | res2next50 | 128 | 0.8659 | | mobilenetv3_large_100 | 128 | 0.865 | | cait_m36_384 | 4 | 0.8633 | | poolformer_m36 | 64 | 0.8599 | | fbnetc_100 | 128 | 0.8597 | | pit_b_224 | 64 | 0.8566 | | res2net101_26w_4s | 64 | 0.8506 | | res2net50_14w_8s | 128 | 0.8501 | | gernet_l | 128 | 0.8494 | | selecsls42b | 128 | 0.8473 | | swsl_resnext101_32x16d | 32 | 0.8461 | | ghostnet_100 | 128 | 0.8408 | | coat_lite_mini | 128 | 0.8402 | | convnext_base | 64 | 0.832 | | botnet26t_256 | 128 | 0.8241 | | lcnet_050 | 128 | 0.8174 | | regnety_002 | 128 | 0.7846 | | repvgg_a2 | 128 | 0.7738 | | crossvit_9_240 | 128 | 0.7525 | | swin_base_patch4_window7_224 | 64 | 0.7214 | | jx_nest_base | 32 | 0.6693 | +---------------------------------+-----+-----------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+-----------------------+ | name | bs | inductor_max_autotune | +---------------------------------+-----+-----------------------+ | convmixer_768_32 | 32 | 297.715 | | hrnet_w18 | 128 | 201.8791 | | pnasnet5large | 16 | 174.487 | | tf_mixnet_l | 128 | 158.9862 | | mixnet_l | 128 | 153.5053 | | cait_m36_384 | 4 | 115.5031 | | resnest101e | 64 | 114.6772 | | dla102 | 128 | 112.6617 | | swsl_resnext101_32x16d | 32 | 111.8307 | | poolformer_m36 | 64 | 109.0074 | | adv_inception_v3 | 128 | 104.2656 | | gluon_inception_v3 | 128 | 104.0856 | | inception_v3 | 128 | 103.9529 | | res2net50_14w_8s | 128 | 100.7721 | | dpn107 | 32 | 97.2715 | | tnt_s_patch16_224 | 128 | 97.2103 | | convit_base | 64 | 95.154 | | res2next50 | 128 | 91.7484 | | gluon_xception65 | 32 | 91.2787 | | swin_base_patch4_window7_224 | 64 | 85.4506 | | dm_nfnet_f0 | 128 | 85.0143 | | res2net101_26w_4s | 64 | 84.7107 | | mixer_b16_224 | 128 | 83.7046 | | fbnetv3_b | 128 | 82.9483 | | convnext_base | 64 | 
81.8352 | | visformer_small | 128 | 75.483 | | nfnet_l0 | 128 | 75.172 | | gmlp_s16_224 | 128 | 73.8028 | | pit_b_224 | 64 | 73.6116 | | eca_botnext26ts_256 | 128 | 73.1492 | | cspdarknet53 | 64 | 72.1408 | | botnet26t_256 | 128 | 70.3273 | | beit_base_patch16_224 | 64 | 69.9807 | | gernet_l | 128 | 69.8907 | | volo_d1_224 | 64 | 69.0967 | | repvgg_a2 | 128 | 66.9311 | | jx_nest_base | 32 | 65.2306 | | deit_base_distilled_patch16_224 | 64 | 64.6438 | | vit_base_patch16_224 | 64 | 64.3562 | | gmixer_24_224 | 128 | 62.1661 | | tf_efficientnet_b0 | 128 | 60.2175 | | xcit_large_24_p8_224 | 5 | 59.6563 | | rexnet_100 | 128 | 58.6476 | | fbnetc_100 | 128 | 58.2957 | | tinynet_a | 128 | 56.7487 | | twins_pcpvt_base | 64 | 55.8003 | | mobilevit_s | 64 | 55.2001 | | coat_lite_mini | 128 | 54.5083 | | sebotnet33ts_256 | 64 | 50.7858 | | spnasnet_100 | 128 | 49.0136 | | ghostnet_100 | 128 | 48.6445 | | ese_vovnet19b_dw | 128 | 45.4094 | | crossvit_9_240 | 128 | 44.9644 | | mobilenetv2_100 | 128 | 44.7157 | | mnasnet_100 | 128 | 42.6209 | | selecsls42b | 128 | 42.4078 | | mobilenetv3_large_100 | 128 | 40.5618 | | resmlp_12_224 | 128 | 38.0244 | | regnety_002 | 128 | 26.5598 | | lcnet_050 | 128 | 17.6225 | +---------------------------------+-----+-----------------------+ ~~~

Performance graphs

/data/home/williamwen/cluster/oneoff_cron_logs/day_095_05_04_23_performance_amp_283/torchbench_amp.png : ![](https://i.imgur.com/dkcvKaz.png)

/data/home/williamwen/cluster/oneoff_cron_logs/day_095_05_04_23_performance_amp_283/huggingface_amp.png : ![](https://i.imgur.com/Cf0eSr3.png)

/data/home/williamwen/cluster/oneoff_cron_logs/day_095_05_04_23_performance_amp_283/timm_models_amp.png : ![](https://i.imgur.com/rgSqkGh.png)

Build Summary

### Run name ###
day_095_05_04_23_performance_amp_283

### Commit hashes ###
pytorch commit: f55e72c0f6bd6da016aaa51de379e6ba6d7891cc
pytorch commit date: 2023-04-07 17:30:27+00:00
torchbench commit: 735f1927996c8d9ab81f0b0c05dd1ebdb26a6250
torchbench commit date: 2023-04-05 09:43:21-07:00

### TorchDynamo config flags ###

### Torch version ###
torch: 2.1.0a0+gitf55e72c

### Environment variables ###
TORCH_CUDA_ARCH_LIST = 8.0
CUDA_HOME = /usr/local/cuda-11.6
USE_LLVM = /usr/lib/llvm-10

### GPU details ###
CUDNN VERSION: 8401
Number CUDA Devices: 2
Device Name: NVIDIA A100-SXM4-40GB
Device Memory [GB]: 42.481549312

williamwen42 commented 1 year ago

Performance Dashboard for amp precision (inductor max-autotune without cudagraphs)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and peak memory footprint reduction ratio.

Caveats
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) Experimental setup does not have an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+-------------------------------------+------------+-------------+-------------+
|              Compiler               | torchbench | huggingface | timm_models |
+-------------------------------------+------------+-------------+-------------+
| inductor_max_autotune_no_cudagraphs | 80%, 48/60 | 96%, 43/45  | 95%, 57/60  |
+-------------------------------------+------------+-------------+-------------+
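
For readers who want to reproduce the `80%, 48/60` style of cell, here is a minimal sketch. The helper name `format_passrate` and the round-half-up rule are assumptions for illustration, not the dashboard's actual code; the counts are the ones shown in the Passrate table.

```python
def format_passrate(passed: int, total: int) -> str:
    """Format a pass rate as 'NN%, passed/total'.

    Assumption: the percentage is rounded half-up to a whole number.
    Integer arithmetic is used so the result does not depend on
    floating point rounding.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    percent = (200 * passed + total) // (2 * total)
    return f"{percent}%, {passed}/{total}"


# Counts taken from the Passrate table for inductor_max_autotune_no_cudagraphs.
for suite, (passed, total) in {
    "torchbench": (48, 60),
    "huggingface": (43, 45),
    "timm_models": (57, 60),
}.items():
    print(suite, format_passrate(passed, total))
```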

Geometric mean speedup

+-------------------------------------+------------+-------------+-------------+
|              Compiler               | torchbench | huggingface | timm_models |
+-------------------------------------+------------+-------------+-------------+
| inductor_max_autotune_no_cudagraphs |   1.32x    |    1.57x    |    1.40x    |
+-------------------------------------+------------+-------------+-------------+
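
As a rough illustration of how a geometric mean speedup can be computed after dropping models that fail accuracy (as described in the summary), here is a minimal sketch. The function name, the rule that any status starting with `pass` counts as passing, and the exclusion of `0.0` speedups are all assumptions, not the dashboard's actual implementation. The sample values are a handful of rows from the torchbench tables in this comment, so the result is not the suite-level number.

```python
import math


def geometric_mean_speedup(speedups: dict, accuracy: dict) -> float:
    """Geometric mean of speedups over models that pass accuracy.

    Assumptions: a status starting with 'pass' counts as passing, and
    0.0 speedups (failed performance runs) are excluded.
    """
    kept = [
        s
        for name, s in speedups.items()
        if accuracy.get(name, "").startswith("pass") and s > 0
    ]
    if not kept:
        return float("nan")
    # Averaging in log space avoids overflow on long products.
    return math.exp(sum(math.log(s) for s in kept) / len(kept))


# A few rows from the torchbench tables, for illustration only.
speedups = {
    "hf_Albert": 2.3013,
    "resnet50": 1.0633,
    "dcgan": 0.8277,
    "squeezenet1_1": 1.3735,  # dropped: fail_accuracy
    "moco": 0.0,              # dropped: fail_to_run
}
accuracy = {
    "hf_Albert": "pass",
    "resnet50": "pass",
    "dcgan": "pass",
    "squeezenet1_1": "fail_accuracy",
    "moco": "fail_to_run",
}
print(f"{geometric_mean_speedup(speedups, accuracy):.2f}x")
```

The geometric mean is the usual choice for averaging ratios, since a 2x speedup and a 0.5x slowdown then cancel to 1.0x.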

Mean compilation time (seconds)

+-------------------------------------+------------+-------------+-------------+
|              Compiler               | torchbench | huggingface | timm_models |
+-------------------------------------+------------+-------------+-------------+
| inductor_max_autotune_no_cudagraphs |   360.44   |   222.95    |   497.02    |
+-------------------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+-------------------------------------+------------+-------------+-------------+
|              Compiler               | torchbench | huggingface | timm_models |
+-------------------------------------+------------+-------------+-------------+
| inductor_max_autotune_no_cudagraphs |   0.88x    |    1.02x    |    1.01x    |
+-------------------------------------+------------+-------------+-------------+

Warnings

We flag models where:
- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9
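
The flagging rules above could be applied to a per-model record roughly as follows. This is a sketch under stated assumptions: the `ModelResult` type and `warnings_for` function are hypothetical names, and treating every status that does not start with `pass` (including `eager_variation`) as an accuracy failure is a guess. The thresholds are the ones listed above, and the sample rows are taken from the torchbench tables in this comment.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str
    accuracy: str        # e.g. "pass", "fail_accuracy", "fail_to_run"
    speedup: float       # relative to native pytorch
    compile_sec: float   # compilation latency in seconds
    compression: float   # peak memory compression ratio


def warnings_for(r: ModelResult) -> list:
    """Return the names of the warning rules that a result trips."""
    flags = []
    # Assumption: anything not starting with "pass" is an accuracy failure.
    if not r.accuracy.startswith("pass"):
        flags.append("accuracy")
    if r.speedup < 0.95:
        flags.append("speedup")
    if r.compile_sec > 120:
        flags.append("compilation latency")
    if r.compression < 0.9:
        flags.append("compression ratio")
    return flags


rows = [
    ModelResult("resnet50", "pass", 1.0633, 28.1121, 0.9921),
    ModelResult("dcgan", "pass", 0.8277, 39.0268, 0.1873),
    ModelResult("squeezenet1_1", "fail_accuracy", 1.3735, 248.7594, 0.8434),
]
for r in rows:
    print(r.name, warnings_for(r))
```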

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+-------------------------------------+ | name | bs | inductor_max_autotune_no_cudagraphs | +-----------------------------------+------+-------------------------------------+ | hf_Albert | 8 | 2.3013 | | BERT_pytorch | 16 | 2.1213 | | hf_T5 | 8 | 2.007 | | hf_T5_large | 2 | 1.9716 | | hf_GPT2 | 4 | 1.9456 | | hf_GPT2_large | 4 | 1.8553 | | speech_transformer | 32 | 1.7828 | | pytorch_CycleGAN_and_pix2pix | 1 | 1.7737 | | hf_BigBird | 2 | 1.7036 | | attention_is_all_you_need_pytorch | 256 | 1.6701 | | hf_Bert | 4 | 1.6423 | | hf_Bert_large | 4 | 1.6183 | | fastNLP_Bert | 6 | 1.6095 | | hf_Bart | 4 | 1.5487 | | timm_vision_transformer | 32 | 1.5202 | | timm_nfnet | 128 | 1.4864 | | timm_resnest | 32 | 1.4809 | | hf_DistilBert | 8 | 1.472 | | mobilenet_v2 | 96 | 1.4695 | | functorch_dp_cifar10 | 64 | 1.378 | | squeezenet1_1 | 32 | 1.3735 | | pytorch_unet | 1 | 1.3534 | | pytorch_struct | 200 | 1.3482 | | vgg16 | 64 | 1.2635 | | dlrm | 1024 | 1.2449 | | pytorch_stargan | 16 | 1.2381 | | Super_SloMo | 6 | 1.2342 | | Background_Matting | 4 | 1.2138 | | yolov3 | 16 | 1.2071 | | shufflenet_v2_x1_0 | 128 | 1.2004 | | alexnet | 128 | 1.1836 | | mobilenet_v3_large | 32 | 1.1808 | | timm_vision_transformer_large | 32 | 1.1654 | | drq | 1 | 1.1516 | | nvidia_deeprecommender | 256 | 1.1048 | | mnasnet1_0 | 32 | 1.0859 | | LearningToPaint | 96 | 1.0833 | | hf_Reformer | 4 | 1.0677 | | lennard_jones | 1000 | 1.0671 | | resnet50 | 32 | 1.0633 | | phlippe_resnet | 128 | 1.0544 | | timm_efficientnet | 32 | 1.0499 | | phlippe_densenet | 128 | 1.0464 | | densenet121 | 4 | 1.0444 | | demucs | 4 | 1.038 | | resnet152 | 32 | 1.0293 | | timm_regnet | 32 | 0.9828 | | resnext50_32x4d | 8 | 0.9655 | | tts_angular | 64 | 0.9547 | | resnet18 | 16 | 0.9463 | | timm_vovnet | 32 | 0.9212 | | soft_actor_critic | 256 | 0.8621 | | dcgan | 32 | 0.8277 | | sage | 0 | 0.0 | | hf_Longformer | 0 | 0.0 | | tacotron2 | 0 | 0.0 | | moco | 0 | 
0.0 | | torchrec_dlrm | 0 | 0.0 | | gcn | 0 | 0.0 | | gat | 0 | 0.0 | +-----------------------------------+------+-------------------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+-------------------------------------+ | name | bs | inductor_max_autotune_no_cudagraphs | +-----------------------------------+-----+-------------------------------------+ | hf_T5_large | 4 | pass_due_to_skip | | timm_vision_transformer_large | 4 | pass_due_to_skip | | hf_GPT2_large | 4 | pass_due_to_skip | | resnet50 | 4 | pass | | mnasnet1_0 | 4 | pass | | mobilenet_v3_large | 4 | pass | | nvidia_deeprecommender | 4 | pass | | phlippe_densenet | 4 | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | | pytorch_stargan | 16 | pass | | pytorch_struct | 200 | pass | | pytorch_unet | 2 | pass | | resnet152 | 4 | pass | | resnet18 | 4 | pass | | BERT_pytorch | 4 | pass | | lennard_jones | 4 | pass | | shufflenet_v2_x1_0 | 4 | pass | | soft_actor_critic | 256 | pass | | speech_transformer | 4 | pass | | timm_efficientnet | 4 | pass | | timm_nfnet | 4 | pass | | timm_regnet | 4 | pass | | timm_resnest | 4 | pass | | timm_vision_transformer | 4 | pass | | timm_vovnet | 4 | pass | | tts_angular | 4 | pass | | vgg16 | 4 | pass | | resnext50_32x4d | 4 | pass | | mobilenet_v2 | 4 | pass | | yolov3 | 4 | pass | | hf_Bart | 4 | pass | | attention_is_all_you_need_pytorch | 4 | pass | | dcgan | 4 | pass | | demucs | 4 | pass | | densenet121 | 4 | pass | | dlrm | 4 | pass | | Super_SloMo | 4 | pass | | fastNLP_Bert | 4 | pass | | functorch_dp_cifar10 | 4 | pass | | hf_T5_base | 4 | pass | | LearningToPaint | 4 | pass | | hf_Albert | 4 | pass | | hf_Bert | 4 | pass | | hf_BigBird | 4 | pass | | hf_DistilBert | 4 | pass | | hf_GPT2 | 2 | pass | | hf_Reformer | 4 | pass | | hf_T5 | 4 | pass | | alexnet | 4 | pass | | hf_Bert_large | 4 | fail_to_run | | hf_Longformer | 4 | fail_to_run | | moco | 4 | fail_to_run | | squeezenet1_1 | 4 | fail_accuracy | | phlippe_resnet | 4 | 
fail_accuracy | | drq | 1 | fail_accuracy | | vision_maskrcnn | 4 | eager_variation | | Background_Matting | 4 | eager_variation | | torchrec_dlrm | 0 | 0.0000 | | llama | 0 | 0.0000 | | tacotron2 | 0 | 0.0000 | | sage | 0 | 0.0000 | | gcn | 0 | 0.0000 | | gat | 0 | 0.0000 | +-----------------------------------+-----+-------------------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+-------------------------------------+ | name | bs | inductor_max_autotune_no_cudagraphs | +-----------------------------------+------+-------------------------------------+ | densenet121 | 4 | 1220.2865 | | speech_transformer | 32 | 872.3027 | | hf_GPT2_large | 4 | 862.7958 | | phlippe_densenet | 128 | 843.1009 | | attention_is_all_you_need_pytorch | 256 | 742.8618 | | mnasnet1_0 | 32 | 603.51 | | mobilenet_v3_large | 32 | 561.2653 | | mobilenet_v2 | 96 | 548.8599 | | hf_BigBird | 2 | 504.069 | | hf_T5_large | 2 | 494.9345 | | yolov3 | 16 | 472.1285 | | timm_vision_transformer_large | 32 | 469.5215 | | timm_vision_transformer | 32 | 469.4433 | | timm_regnet | 32 | 458.2416 | | timm_nfnet | 128 | 452.6057 | | hf_Albert | 8 | 443.6373 | | fastNLP_Bert | 6 | 432.484 | | timm_efficientnet | 32 | 416.368 | | pytorch_struct | 200 | 380.1856 | | resnext50_32x4d | 8 | 378.2311 | | dlrm | 1024 | 367.2021 | | BERT_pytorch | 16 | 356.5031 | | hf_Bert_large | 4 | 328.6867 | | shufflenet_v2_x1_0 | 128 | 327.7534 | | drq | 1 | 325.9405 | | timm_vovnet | 32 | 321.087 | | Super_SloMo | 6 | 301.9892 | | hf_T5 | 8 | 298.1788 | | LearningToPaint | 96 | 295.7467 | | nvidia_deeprecommender | 256 | 269.3865 | | pytorch_unet | 1 | 261.0985 | | resnet18 | 16 | 256.4528 | | squeezenet1_1 | 32 | 248.7594 | | vgg16 | 64 | 242.8924 | | functorch_dp_cifar10 | 64 | 241.1431 | | alexnet | 128 | 228.0223 | | hf_GPT2 | 4 | 226.4292 | | timm_resnest | 32 | 218.0259 | | phlippe_resnet | 128 | 212.9454 | | hf_Reformer | 4 | 206.2104 | | soft_actor_critic | 256 | 194.8455 
| | resnet152 | 32 | 183.7911 | | hf_Bart | 4 | 182.812 | | Background_Matting | 4 | 178.2901 | | lennard_jones | 1000 | 158.1878 | | pytorch_CycleGAN_and_pix2pix | 1 | 135.3382 | | hf_Bert | 4 | 103.9135 | | pytorch_stargan | 16 | 86.6805 | | demucs | 4 | 76.9034 | | hf_DistilBert | 8 | 61.5585 | | dcgan | 32 | 39.0268 | | resnet50 | 32 | 28.1121 | | tts_angular | 64 | 4.8102 | | gat | 0 | nan | | gcn | 0 | nan | | hf_Longformer | 0 | nan | | moco | 0 | nan | | sage | 0 | nan | | tacotron2 | 0 | nan | | torchrec_dlrm | 0 | nan | +-----------------------------------+------+-------------------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+-------------------------------------+ | name | bs | inductor_max_autotune_no_cudagraphs | +-----------------------------------+------+-------------------------------------+ | hf_Albert | 8 | 1.1991 | | hf_T5 | 8 | 1.1719 | | BERT_pytorch | 16 | 1.1689 | | hf_T5_large | 2 | 1.1595 | | Super_SloMo | 6 | 1.1595 | | fastNLP_Bert | 6 | 1.147 | | hf_GPT2_large | 4 | 1.134 | | mobilenet_v2 | 96 | 1.1007 | | attention_is_all_you_need_pytorch | 256 | 1.0885 | | hf_BigBird | 2 | 1.0756 | | timm_nfnet | 128 | 1.072 | | hf_GPT2 | 4 | 1.0707 | | hf_Bert_large | 4 | 1.0453 | | Background_Matting | 4 | 1.0399 | | yolov3 | 16 | 1.0062 | | tts_angular | 64 | 0.9983 | | vgg16 | 64 | 0.9938 | | resnet50 | 32 | 0.9921 | | hf_Bert | 4 | 0.974 | | timm_vision_transformer_large | 32 | 0.9725 | | demucs | 4 | 0.9657 | | timm_resnest | 32 | 0.9652 | | shufflenet_v2_x1_0 | 128 | 0.9628 | | dlrm | 1024 | 0.9565 | | timm_regnet | 32 | 0.9521 | | timm_efficientnet | 32 | 0.94 | | resnet152 | 32 | 0.9392 | | hf_DistilBert | 8 | 0.932 | | hf_Bart | 4 | 0.9175 | | nvidia_deeprecommender | 256 | 0.9175 | | pytorch_unet | 1 | 0.8949 | | alexnet | 128 | 0.8908 | | timm_vision_transformer | 32 | 0.8835 | | timm_vovnet | 32 | 0.882 | | mobilenet_v3_large | 32 | 0.8702 | | phlippe_densenet | 128 | 0.8648 | | 
speech_transformer | 32 | 0.8606 | | squeezenet1_1 | 32 | 0.8434 | | hf_Reformer | 4 | 0.8029 | | densenet121 | 4 | 0.7981 | | pytorch_stargan | 16 | 0.783 | | mnasnet1_0 | 32 | 0.7752 | | resnext50_32x4d | 8 | 0.7558 | | pytorch_struct | 200 | 0.7362 | | LearningToPaint | 96 | 0.7295 | | resnet18 | 16 | 0.6019 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.5911 | | functorch_dp_cifar10 | 64 | 0.4424 | | phlippe_resnet | 128 | 0.3394 | | drq | 1 | 0.1965 | | dcgan | 32 | 0.1873 | | soft_actor_critic | 256 | 0.1141 | | lennard_jones | 1000 | 0.0666 | | gat | 0 | nan | | gcn | 0 | nan | | hf_Longformer | 0 | nan | | moco | 0 | nan | | sage | 0 | nan | | tacotron2 | 0 | nan | | torchrec_dlrm | 0 | nan | +-----------------------------------+------+-------------------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------+------+-------------------------------------+ | name | bs | inductor_max_autotune_no_cudagraphs | +-----------------------------------+------+-------------------------------------+ | timm_vision_transformer_large | 32 | 398.4664 | | hf_BigBird | 2 | 115.0004 | | hf_T5_large | 2 | 114.9077 | | hf_GPT2_large | 4 | 112.83 | | Background_Matting | 4 | 103.7789 | | hf_T5 | 8 | 89.5281 | | timm_nfnet | 128 | 79.536 | | hf_Reformer | 4 | 75.8413 | | Super_SloMo | 6 | 64.3573 | | resnet152 | 32 | 61.98 | | yolov3 | 16 | 56.9666 | | timm_regnet | 32 | 56.7352 | | densenet121 | 4 | 54.8286 | | vgg16 | 64 | 52.3855 | | hf_Bert_large | 4 | 51.707 | | demucs | 4 | 51.628 | | hf_Bart | 4 | 49.4292 | | fastNLP_Bert | 6 | 33.3076 | | attention_is_all_you_need_pytorch | 256 | 32.9483 | | speech_transformer | 32 | 32.5738 | | mobilenet_v2 | 96 | 32.09 | | timm_efficientnet | 32 | 31.1444 | | hf_Albert | 8 | 29.7185 | | pytorch_unet | 1 | 29.433 | | timm_vovnet | 32 | 27.2264 | | shufflenet_v2_x1_0 | 128 | 26.0945 | | BERT_pytorch | 16 | 26.0669 | | hf_Bert | 4 | 25.2858 | | hf_GPT2 | 4 | 25.2109 | | resnet50 | 32 | 25.1884 | | mobilenet_v3_large 
| 32 | 23.766 | | phlippe_densenet | 128 | 22.884 | | timm_vision_transformer | 32 | 21.5993 | | resnext50_32x4d | 8 | 21.5213 | | hf_DistilBert | 8 | 21.4008 | | mnasnet1_0 | 32 | 21.3485 | | timm_resnest | 32 | 16.3099 | | pytorch_stargan | 16 | 11.8717 | | LearningToPaint | 96 | 10.7695 | | resnet18 | 16 | 10.0338 | | nvidia_deeprecommender | 256 | 9.254 | | phlippe_resnet | 128 | 8.8261 | | alexnet | 128 | 8.2875 | | functorch_dp_cifar10 | 64 | 7.7689 | | pytorch_CycleGAN_and_pix2pix | 1 | 7.6824 | | squeezenet1_1 | 32 | 7.5651 | | tts_angular | 64 | 6.5613 | | pytorch_struct | 200 | 3.6031 | | dlrm | 1024 | 3.5055 | | drq | 1 | 3.0625 | | dcgan | 32 | 2.6613 | | soft_actor_critic | 256 | 1.9039 | | lennard_jones | 1000 | 1.539 | | gat | 0 | nan | | gcn | 0 | nan | | hf_Longformer | 0 | nan | | moco | 0 | nan | | sage | 0 | nan | | tacotron2 | 0 | nan | | torchrec_dlrm | 0 | nan | +-----------------------------------+------+-------------------------------------+ ~~~

huggingface suite with amp precision

Performance speedup ~~~ +-----------------------------------------+-----+-------------------------------------+ | name | bs | inductor_max_autotune_no_cudagraphs | +-----------------------------------------+-----+-------------------------------------+ | OPTForCausalLM | 2 | 2.5239 | | GPT2ForSequenceClassification | 4 | 2.3477 | | ElectraForQuestionAnswering | 64 | 2.0998 | | DistillGPT2 | 16 | 1.958 | | MT5ForConditionalGeneration | 16 | 1.9534 | | PLBartForCausalLM | 8 | 1.9276 | | ElectraForCausalLM | 32 | 1.833 | | XLNetLMHeadModel | 8 | 1.8255 | | LayoutLMForSequenceClassification | 16 | 1.7915 | | T5Small | 4 | 1.7771 | | RobertaForQuestionAnswering | 16 | 1.7735 | | T5ForConditionalGeneration | 4 | 1.7718 | | BertForQuestionAnswering | 16 | 1.7677 | | PLBartForConditionalGeneration | 4 | 1.7639 | | BartForConditionalGeneration | 2 | 1.7328 | | BartForCausalLM | 4 | 1.6963 | | MBartForCausalLM | 4 | 1.6848 | | RobertaForCausalLM | 16 | 1.67 | | XGLMForCausalLM | 8 | 1.6486 | | AlbertForQuestionAnswering | 4 | 1.644 | | MegatronBertForQuestionAnswering | 8 | 1.6421 | | AlbertForMaskedLM | 4 | 1.625 | | YituTechConvBert | 16 | 1.6188 | | CamemBert | 16 | 1.6128 | | LayoutLMForMaskedLM | 16 | 1.609 | | M2M100ForConditionalGeneration | 16 | 1.6015 | | BertForMaskedLM | 16 | 1.5876 | | MBartForConditionalGeneration | 2 | 1.5766 | | Speech2Text2ForCausalLM | 256 | 1.5756 | | MegatronBertForCausalLM | 4 | 1.5563 | | BlenderbotSmallForConditionalGeneration | 64 | 1.4753 | | DistilBertForQuestionAnswering | 256 | 1.4538 | | PegasusForCausalLM | 32 | 1.435 | | PegasusForConditionalGeneration | 32 | 1.4324 | | TrOCRForCausalLM | 32 | 1.4192 | | BlenderbotSmallForCausalLM | 64 | 1.4111 | | BlenderbotForCausalLM | 4 | 1.2585 | | DistilBertForMaskedLM | 128 | 1.2406 | | MobileBertForMaskedLM | 64 | 1.2012 | | MobileBertForQuestionAnswering | 128 | 1.1555 | | DebertaForQuestionAnswering | 8 | 0.9571 | | DebertaForMaskedLM | 4 | 0.8277 | | DebertaV2ForQuestionAnswering | 2 | 
0.7041 | | DebertaV2ForMaskedLM | 1 | 0.6608 | | AllenaiLongformerBase | 0 | 0.0 | +-----------------------------------------+-----+-------------------------------------+ ~~~ Accuracy ~~~ +-----------------------------------------+----+-------------------------------------+ | name | bs | inductor_max_autotune_no_cudagraphs | +-----------------------------------------+----+-------------------------------------+ | BlenderbotForCausalLM | 1 | pass_due_to_skip | | DebertaV2ForMaskedLM | 1 | pass_due_to_skip | | AlbertForMaskedLM | 1 | pass | | PegasusForCausalLM | 1 | pass | | MT5ForConditionalGeneration | 1 | pass | | MegatronBertForCausalLM | 1 | pass | | MegatronBertForQuestionAnswering | 1 | pass | | MobileBertForMaskedLM | 1 | pass | | MobileBertForQuestionAnswering | 1 | pass | | OPTForCausalLM | 1 | pass | | PLBartForCausalLM | 1 | pass | | PLBartForConditionalGeneration | 1 | pass | | PegasusForConditionalGeneration | 1 | pass | | MBartForCausalLM | 1 | pass | | RobertaForCausalLM | 1 | pass | | RobertaForQuestionAnswering | 1 | pass | | Speech2Text2ForCausalLM | 1 | pass | | T5ForConditionalGeneration | 1 | pass | | T5Small | 1 | pass | | TrOCRForCausalLM | 1 | pass | | XGLMForCausalLM | 1 | pass | | XLNetLMHeadModel | 1 | pass | | MBartForConditionalGeneration | 1 | pass | | LayoutLMForSequenceClassification | 1 | pass | | M2M100ForConditionalGeneration | 1 | pass | | DebertaForMaskedLM | 1 | pass | | AllenaiLongformerBase | 1 | pass | | BartForCausalLM | 1 | pass | | BartForConditionalGeneration | 1 | pass | | BertForMaskedLM | 1 | pass | | BertForQuestionAnswering | 1 | pass | | BlenderbotSmallForCausalLM | 1 | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | | CamemBert | 1 | pass | | DebertaForQuestionAnswering | 1 | pass | | DebertaV2ForQuestionAnswering | 1 | pass | | DistilBertForMaskedLM | 1 | pass | | DistilBertForQuestionAnswering | 1 | pass | | DistillGPT2 | 1 | pass | | ElectraForCausalLM | 1 | pass | | ElectraForQuestionAnswering | 1 
| pass | | GPT2ForSequenceClassification | 1 | pass | | LayoutLMForMaskedLM | 1 | pass | | YituTechConvBert | 1 | pass | | AlbertForQuestionAnswering | 1 | fail_accuracy | +-----------------------------------------+----+-------------------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+-------------------------------------+ | name | bs | inductor_max_autotune_no_cudagraphs | +-----------------------------------------+-----+-------------------------------------+ | BlenderbotForCausalLM | 4 | 649.4784 | | MobileBertForMaskedLM | 64 | 639.3457 | | MobileBertForQuestionAnswering | 128 | 616.8377 | | MT5ForConditionalGeneration | 16 | 612.7483 | | ElectraForCausalLM | 32 | 421.2196 | | DebertaV2ForMaskedLM | 1 | 399.9344 | | AlbertForMaskedLM | 4 | 342.793 | | XGLMForCausalLM | 8 | 314.3379 | | XLNetLMHeadModel | 8 | 299.0966 | | M2M100ForConditionalGeneration | 16 | 298.5031 | | DebertaV2ForQuestionAnswering | 2 | 288.9307 | | ElectraForQuestionAnswering | 64 | 281.8455 | | T5ForConditionalGeneration | 4 | 271.9564 | | BertForMaskedLM | 16 | 244.5709 | | YituTechConvBert | 16 | 242.3426 | | BartForConditionalGeneration | 2 | 233.1992 | | DistilBertForMaskedLM | 128 | 230.7913 | | TrOCRForCausalLM | 32 | 230.0665 | | GPT2ForSequenceClassification | 4 | 221.7695 | | DistilBertForQuestionAnswering | 256 | 213.5598 | | BartForCausalLM | 4 | 202.7918 | | BlenderbotSmallForCausalLM | 64 | 190.1801 | | LayoutLMForSequenceClassification | 16 | 176.0581 | | DebertaForQuestionAnswering | 8 | 171.8915 | | DistillGPT2 | 16 | 168.6432 | | Speech2Text2ForCausalLM | 256 | 155.1008 | | MegatronBertForQuestionAnswering | 8 | 147.9811 | | DebertaForMaskedLM | 4 | 126.1295 | | OPTForCausalLM | 2 | 122.2415 | | MegatronBertForCausalLM | 4 | 111.1831 | | MBartForConditionalGeneration | 2 | 110.6487 | | PegasusForCausalLM | 32 | 107.636 | | PegasusForConditionalGeneration | 32 | 102.5265 | | PLBartForConditionalGeneration | 4 | 97.2393 
| | AlbertForQuestionAnswering | 4 | 92.8096 | | BertForQuestionAnswering | 16 | 92.5737 | | PLBartForCausalLM | 8 | 76.7421 | | CamemBert | 16 | 76.3326 | | BlenderbotSmallForConditionalGeneration | 64 | 75.592 | | MBartForCausalLM | 4 | 53.1392 | | T5Small | 4 | 45.5744 | | RobertaForCausalLM | 16 | 44.4327 | | LayoutLMForMaskedLM | 16 | 40.6821 | | RobertaForQuestionAnswering | 16 | 38.2552 | | AllenaiLongformerBase | 0 | nan | +-----------------------------------------+-----+-------------------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+-------------------------------------+ | name | bs | inductor_max_autotune_no_cudagraphs | +-----------------------------------------+-----+-------------------------------------+ | AlbertForQuestionAnswering | 4 | 1.3106 | | AlbertForMaskedLM | 4 | 1.2322 | | GPT2ForSequenceClassification | 4 | 1.2114 | | T5Small | 4 | 1.1813 | | RobertaForQuestionAnswering | 16 | 1.1724 | | ElectraForQuestionAnswering | 64 | 1.1609 | | DebertaForQuestionAnswering | 8 | 1.1528 | | BertForQuestionAnswering | 16 | 1.1418 | | LayoutLMForSequenceClassification | 16 | 1.137 | | OPTForCausalLM | 2 | 1.1345 | | XLNetLMHeadModel | 8 | 1.1342 | | MegatronBertForQuestionAnswering | 8 | 1.1152 | | DistilBertForQuestionAnswering | 256 | 1.1135 | | T5ForConditionalGeneration | 4 | 1.1019 | | MegatronBertForCausalLM | 4 | 1.0784 | | DistillGPT2 | 16 | 1.0642 | | RobertaForCausalLM | 16 | 1.052 | | LayoutLMForMaskedLM | 16 | 1.0517 | | YituTechConvBert | 16 | 1.0411 | | MBartForConditionalGeneration | 2 | 1.0307 | | PegasusForConditionalGeneration | 32 | 1.0185 | | BlenderbotForCausalLM | 4 | 0.9995 | | PLBartForConditionalGeneration | 4 | 0.9987 | | MBartForCausalLM | 4 | 0.9912 | | PegasusForCausalLM | 32 | 0.9864 | | BertForMaskedLM | 16 | 0.9848 | | BartForConditionalGeneration | 2 | 0.9844 | | CamemBert | 16 | 0.9812 | | MobileBertForMaskedLM | 64 | 0.9802 | | DebertaV2ForQuestionAnswering | 2 
| 0.98 | | DebertaForMaskedLM | 4 | 0.9759 | | ElectraForCausalLM | 32 | 0.9739 | | TrOCRForCausalLM | 32 | 0.9583 | | M2M100ForConditionalGeneration | 16 | 0.9273 | | BartForCausalLM | 4 | 0.9243 | | DebertaV2ForMaskedLM | 1 | 0.9165 | | XGLMForCausalLM | 8 | 0.9124 | | BlenderbotSmallForConditionalGeneration | 64 | 0.9085 | | PLBartForCausalLM | 8 | 0.9066 | | MT5ForConditionalGeneration | 16 | 0.8968 | | DistilBertForMaskedLM | 128 | 0.8675 | | MobileBertForQuestionAnswering | 128 | 0.837 | | BlenderbotSmallForCausalLM | 64 | 0.8095 | | Speech2Text2ForCausalLM | 256 | 0.7856 | | AllenaiLongformerBase | 0 | nan | +-----------------------------------------+-----+-------------------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------------+-----+-------------------------------------+ | name | bs | inductor_max_autotune_no_cudagraphs | +-----------------------------------------+-----+-------------------------------------+ | AlbertForMaskedLM | 4 | 164.5389 | | AlbertForQuestionAnswering | 4 | 161.0295 | | DebertaV2ForMaskedLM | 1 | 158.0596 | | DebertaV2ForQuestionAnswering | 2 | 152.7889 | | XLNetLMHeadModel | 8 | 152.6646 | | MobileBertForQuestionAnswering | 128 | 148.1638 | | MobileBertForMaskedLM | 64 | 145.3818 | | PegasusForConditionalGeneration | 32 | 101.4923 | | TrOCRForCausalLM | 32 | 97.2026 | | BlenderbotForCausalLM | 4 | 95.5 | | M2M100ForConditionalGeneration | 16 | 90.5091 | | MBartForConditionalGeneration | 2 | 88.1672 | | BartForConditionalGeneration | 2 | 88.0842 | | MegatronBertForQuestionAnswering | 8 | 86.5595 | | BlenderbotSmallForConditionalGeneration | 64 | 82.6847 | | DebertaForQuestionAnswering | 8 | 79.3787 | | YituTechConvBert | 16 | 77.3507 | | DebertaForMaskedLM | 4 | 75.6133 | | CamemBert | 16 | 73.4118 | | DistilBertForQuestionAnswering | 256 | 71.4565 | | LayoutLMForMaskedLM | 16 | 69.969 | | BertForMaskedLM | 16 | 69.4337 | | RobertaForCausalLM | 16 | 68.8753 | | DistilBertForMaskedLM | 128 | 
68.2423 | | XGLMForCausalLM | 8 | 67.992 | | MBartForCausalLM | 4 | 67.8393 | | PLBartForConditionalGeneration | 4 | 67.3449 | | OPTForCausalLM | 2 | 67.1718 | | BartForCausalLM | 4 | 66.9884 | | PLBartForCausalLM | 8 | 60.7361 | | T5ForConditionalGeneration | 4 | 59.0007 | | T5Small | 4 | 58.8911 | | MegatronBertForCausalLM | 4 | 56.8237 | | ElectraForQuestionAnswering | 64 | 54.6674 | | LayoutLMForSequenceClassification | 16 | 54.6074 | | DistillGPT2 | 16 | 54.0521 | | RobertaForQuestionAnswering | 16 | 53.9412 | | BertForQuestionAnswering | 16 | 53.9321 | | PegasusForCausalLM | 32 | 52.0473 | | MT5ForConditionalGeneration | 16 | 48.7685 | | ElectraForCausalLM | 32 | 48.194 | | BlenderbotSmallForCausalLM | 64 | 43.7949 | | GPT2ForSequenceClassification | 4 | 38.967 | | Speech2Text2ForCausalLM | 256 | 34.1697 | | AllenaiLongformerBase | 0 | nan | +-----------------------------------------+-----+-------------------------------------+ ~~~

timm_models suite with amp precision

Performance speedup ~~~ +---------------------------------+-----+-------------------------------------+ | name | bs | inductor_max_autotune_no_cudagraphs | +---------------------------------+-----+-------------------------------------+ | tnt_s_patch16_224 | 128 | 3.2808 | | coat_lite_mini | 128 | 2.043 | | gmixer_24_224 | 128 | 1.8775 | | gmlp_s16_224 | 128 | 1.848 | | twins_pcpvt_base | 64 | 1.8212 | | crossvit_9_240 | 128 | 1.7885 | | volo_d1_224 | 64 | 1.7163 | | convit_base | 64 | 1.7118 | | swin_base_patch4_window7_224 | 64 | 1.7051 | | pit_b_224 | 64 | 1.5949 | | ghostnet_100 | 128 | 1.5827 | | xcit_large_24_p8_224 | 5 | 1.577 | | sebotnet33ts_256 | 64 | 1.5422 | | gluon_inception_v3 | 128 | 1.5297 | | inception_v3 | 128 | 1.5277 | | dla102 | 128 | 1.5251 | | jx_nest_base | 32 | 1.5248 | | adv_inception_v3 | 128 | 1.5244 | | mnasnet_100 | 128 | 1.4935 | | mobilevit_s | 64 | 1.486 | | convnext_base | 64 | 1.4763 | | beit_base_patch16_224 | 64 | 1.4595 | | mobilenetv2_100 | 128 | 1.4441 | | lcnet_050 | 128 | 1.4427 | | cait_m36_384 | 4 | 1.4424 | | dm_nfnet_f0 | 128 | 1.4387 | | nfnet_l0 | 128 | 1.436 | | eca_botnext26ts_256 | 128 | 1.4247 | | botnet26t_256 | 128 | 1.4204 | | selecsls42b | 128 | 1.4156 | | spnasnet_100 | 128 | 1.4026 | | fbnetc_100 | 128 | 1.3996 | | resmlp_12_224 | 128 | 1.3908 | | mobilenetv3_large_100 | 128 | 1.39 | | mixer_b16_224 | 128 | 1.3896 | | ese_vovnet19b_dw | 128 | 1.3818 | | tf_efficientnet_b0 | 128 | 1.3813 | | res2net50_14w_8s | 128 | 1.3811 | | hrnet_w18 | 128 | 1.378 | | res2next50 | 128 | 1.3608 | | resnest101e | 64 | 1.355 | | vit_base_patch16_224 | 64 | 1.3449 | | rexnet_100 | 128 | 1.3386 | | poolformer_m36 | 64 | 1.3193 | | deit_base_distilled_patch16_224 | 64 | 1.3188 | | fbnetv3_b | 128 | 1.3007 | | cspdarknet53 | 64 | 1.2718 | | tinynet_a | 128 | 1.2295 | | regnety_002 | 128 | 1.2245 | | visformer_small | 128 | 1.1971 | | tf_mixnet_l | 128 | 1.1961 | | mixnet_l | 128 | 1.187 | | pnasnet5large | 16 | 1.1411 | | dpn107 | 
32 | 1.1358 | | repvgg_a2 | 128 | 1.1229 | | res2net101_26w_4s | 64 | 1.0952 | | gluon_xception65 | 32 | 1.0893 | | gernet_l | 128 | 1.0709 | | swsl_resnext101_32x16d | 32 | 1.0223 | | convmixer_768_32 | 32 | 1.0087 | +---------------------------------+-----+-------------------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------------------------------------+ | name | bs | inductor_max_autotune_no_cudagraphs | +---------------------------------+----+-------------------------------------+ | adv_inception_v3 | 8 | pass | | resmlp_12_224 | 8 | pass | | mobilenetv2_100 | 8 | pass | | mobilenetv3_large_100 | 8 | pass | | mobilevit_s | 8 | pass | | nfnet_l0 | 8 | pass | | pit_b_224 | 8 | pass | | pnasnet5large | 8 | pass | | poolformer_m36 | 8 | pass | | regnety_002 | 8 | pass | | repvgg_a2 | 8 | pass | | res2net101_26w_4s | 8 | pass | | res2net50_14w_8s | 8 | pass | | res2next50 | 8 | pass | | resnest101e | 8 | pass | | mixnet_l | 8 | pass | | rexnet_100 | 8 | pass | | sebotnet33ts_256 | 8 | pass | | selecsls42b | 8 | pass | | spnasnet_100 | 8 | pass | | swsl_resnext101_32x16d | 8 | pass | | tf_efficientnet_b0 | 8 | pass | | tf_mixnet_l | 8 | pass | | tinynet_a | 8 | pass | | tnt_s_patch16_224 | 8 | pass | | visformer_small | 8 | pass | | vit_base_patch16_224 | 8 | pass | | volo_d1_224 | 8 | pass | | beit_base_patch16_224 | 8 | pass | | mnasnet_100 | 8 | pass | | mixer_b16_224 | 8 | pass | | eca_botnext26ts_256 | 8 | pass | | botnet26t_256 | 8 | pass | | cait_m36_384 | 4 | pass | | convit_base | 8 | pass | | convmixer_768_32 | 8 | pass | | convnext_base | 8 | pass | | crossvit_9_240 | 8 | pass | | cspdarknet53 | 8 | pass | | deit_base_distilled_patch16_224 | 8 | pass | | dla102 | 8 | pass | | dm_nfnet_f0 | 8 | pass | | lcnet_050 | 8 | pass | | dpn107 | 8 | pass | | ese_vovnet19b_dw | 8 | pass | | fbnetc_100 | 8 | pass | | fbnetv3_b | 8 | pass | | gernet_l | 8 | pass | | ghostnet_100 | 8 | pass | | gluon_inception_v3 | 8 | pass | | 
gluon_xception65 | 8 | pass | | gmixer_24_224 | 8 | pass | | gmlp_s16_224 | 8 | pass | | hrnet_w18 | 8 | pass | | inception_v3 | 8 | pass | | jx_nest_base | 8 | pass | | xcit_large_24_p8_224 | 8 | pass | | swin_base_patch4_window7_224 | 8 | fail_accuracy | | twins_pcpvt_base | 0 | 0.0000 | | coat_lite_mini | 0 | 0.0000 | +---------------------------------+----+-------------------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+-------------------------------------+ | name | bs | inductor_max_autotune_no_cudagraphs | +---------------------------------+-----+-------------------------------------+ | twins_pcpvt_base | 64 | 1661.7656 | | mobilevit_s | 64 | 1550.9785 | | coat_lite_mini | 128 | 1446.3549 | | crossvit_9_240 | 128 | 1274.4466 | | xcit_large_24_p8_224 | 5 | 1090.0301 | | rexnet_100 | 128 | 1086.8005 | | volo_d1_224 | 64 | 1077.1989 | | cait_m36_384 | 4 | 985.835 | | pit_b_224 | 64 | 983.0208 | | swin_base_patch4_window7_224 | 64 | 979.3625 | | jx_nest_base | 32 | 973.6643 | | ghostnet_100 | 128 | 917.2281 | | hrnet_w18 | 128 | 848.4131 | | mixnet_l | 128 | 828.4946 | | sebotnet33ts_256 | 64 | 815.841 | | adv_inception_v3 | 128 | 791.4986 | | res2net50_14w_8s | 128 | 782.9523 | | botnet26t_256 | 128 | 748.2407 | | res2net101_26w_4s | 64 | 721.0519 | | dpn107 | 32 | 699.6225 | | fbnetv3_b | 128 | 671.9632 | | pnasnet5large | 16 | 625.272 | | fbnetc_100 | 128 | 582.0276 | | tnt_s_patch16_224 | 128 | 573.5951 | | convnext_base | 64 | 520.1623 | | tinynet_a | 128 | 513.939 | | regnety_002 | 128 | 483.9504 | | visformer_small | 128 | 476.2705 | | dla102 | 128 | 461.3868 | | resnest101e | 64 | 438.5754 | | cspdarknet53 | 64 | 394.8797 | | nfnet_l0 | 128 | 357.7415 | | gluon_xception65 | 32 | 350.2381 | | beit_base_patch16_224 | 64 | 341.4709 | | poolformer_m36 | 64 | 335.8947 | | gmixer_24_224 | 128 | 333.6782 | | gernet_l | 128 | 330.3406 | | eca_botnext26ts_256 | 128 | 330.1705 | | tf_efficientnet_b0 | 128 | 314.6216 
| | convit_base | 64 | 307.2133 | | selecsls42b | 128 | 298.4681 | | ese_vovnet19b_dw | 128 | 284.912 | | mnasnet_100 | 128 | 283.627 | | repvgg_a2 | 128 | 266.93 | | deit_base_distilled_patch16_224 | 64 | 266.4633 | | mixer_b16_224 | 128 | 258.0345 | | lcnet_050 | 128 | 237.0143 | | gmlp_s16_224 | 128 | 192.8936 | | mobilenetv3_large_100 | 128 | 188.8074 | | swsl_resnext101_32x16d | 32 | 184.8311 | | resmlp_12_224 | 128 | 172.1008 | | res2next50 | 128 | 152.5042 | | mobilenetv2_100 | 128 | 128.714 | | convmixer_768_32 | 32 | 127.8047 | | tf_mixnet_l | 128 | 92.5167 | | spnasnet_100 | 128 | 78.3946 | | gluon_inception_v3 | 128 | 55.8254 | | inception_v3 | 128 | 54.8863 | | vit_base_patch16_224 | 64 | 47.5897 | | dm_nfnet_f0 | 128 | 38.9676 | +---------------------------------+-----+-------------------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+-------------------------------------+ | name | bs | inductor_max_autotune_no_cudagraphs | +---------------------------------+-----+-------------------------------------+ | pnasnet5large | 16 | 1.2842 | | gmlp_s16_224 | 128 | 1.2049 | | poolformer_m36 | 64 | 1.1877 | | convit_base | 64 | 1.157 | | gmixer_24_224 | 128 | 1.1482 | | mobilenetv2_100 | 128 | 1.118 | | resnest101e | 64 | 1.0869 | | dm_nfnet_f0 | 128 | 1.0845 | | tf_efficientnet_b0 | 128 | 1.0724 | | tinynet_a | 128 | 1.0712 | | tf_mixnet_l | 128 | 1.0681 | | tnt_s_patch16_224 | 128 | 1.0504 | | rexnet_100 | 128 | 1.0454 | | resmlp_12_224 | 128 | 1.0349 | | cspdarknet53 | 64 | 1.0319 | | dla102 | 128 | 1.0312 | | inception_v3 | 128 | 1.0265 | | gluon_inception_v3 | 128 | 1.0265 | | twins_pcpvt_base | 64 | 1.0223 | | visformer_small | 128 | 1.0194 | | sebotnet33ts_256 | 64 | 1.0191 | | adv_inception_v3 | 128 | 1.0174 | | convnext_base | 64 | 1.0165 | | eca_botnext26ts_256 | 128 | 0.9979 | | nfnet_l0 | 128 | 0.9952 | | hrnet_w18 | 128 | 0.9915 | | crossvit_9_240 | 128 | 0.9898 | | ese_vovnet19b_dw | 128 | 0.9897 | 
| mixnet_l | 128 | 0.9893 | | spnasnet_100 | 128 | 0.9863 | | convmixer_768_32 | 32 | 0.9852 | | cait_m36_384 | 4 | 0.9845 | | mobilevit_s | 64 | 0.9818 | | beit_base_patch16_224 | 64 | 0.9812 | | mixer_b16_224 | 128 | 0.9788 | | pit_b_224 | 64 | 0.9773 | | fbnetv3_b | 128 | 0.9772 | | ghostnet_100 | 128 | 0.9765 | | swsl_resnext101_32x16d | 32 | 0.9747 | | xcit_large_24_p8_224 | 5 | 0.9737 | | gluon_xception65 | 32 | 0.97 | | volo_d1_224 | 64 | 0.9673 | | coat_lite_mini | 128 | 0.9634 | | gernet_l | 128 | 0.9634 | | dpn107 | 32 | 0.9609 | | jx_nest_base | 32 | 0.9605 | | botnet26t_256 | 128 | 0.9593 | | selecsls42b | 128 | 0.959 | | res2net50_14w_8s | 128 | 0.959 | | vit_base_patch16_224 | 64 | 0.955 | | deit_base_distilled_patch16_224 | 64 | 0.9536 | | fbnetc_100 | 128 | 0.9536 | | res2next50 | 128 | 0.9531 | | repvgg_a2 | 128 | 0.9518 | | res2net101_26w_4s | 64 | 0.9459 | | mnasnet_100 | 128 | 0.9395 | | mobilenetv3_large_100 | 128 | 0.9352 | | swin_base_patch4_window7_224 | 64 | 0.9044 | | regnety_002 | 128 | 0.8964 | | lcnet_050 | 128 | 0.8843 | +---------------------------------+-----+-------------------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+-------------------------------------+ | name | bs | inductor_max_autotune_no_cudagraphs | +---------------------------------+-----+-------------------------------------+ | convmixer_768_32 | 32 | 297.5561 | | hrnet_w18 | 128 | 203.6333 | | pnasnet5large | 16 | 172.5757 | | tf_mixnet_l | 128 | 158.779 | | mixnet_l | 128 | 152.6326 | | resnest101e | 64 | 121.1486 | | swsl_resnext101_32x16d | 32 | 116.2847 | | cait_m36_384 | 4 | 115.9344 | | dla102 | 128 | 112.9899 | | poolformer_m36 | 64 | 109.8308 | | adv_inception_v3 | 128 | 105.1958 | | inception_v3 | 128 | 105.0804 | | gluon_inception_v3 | 128 | 104.8359 | | res2net50_14w_8s | 128 | 101.7858 | | tnt_s_patch16_224 | 128 | 98.5791 | | convit_base | 64 | 95.1954 | | dpn107 | 32 | 93.5896 | | res2next50 | 128 | 92.5685 | 
| gluon_xception65 | 32 | 91.1349 | | res2net101_26w_4s | 64 | 90.9843 | | dm_nfnet_f0 | 128 | 87.9436 | | swin_base_patch4_window7_224 | 64 | 85.6668 | | mixer_b16_224 | 128 | 84.4027 | | fbnetv3_b | 128 | 84.3302 | | convnext_base | 64 | 82.8924 | | xcit_large_24_p8_224 | 5 | 81.6976 | | nfnet_l0 | 128 | 78.0815 | | visformer_small | 128 | 76.0427 | | eca_botnext26ts_256 | 128 | 74.3494 | | pit_b_224 | 64 | 74.1421 | | gmlp_s16_224 | 128 | 74.107 | | volo_d1_224 | 64 | 70.2582 | | botnet26t_256 | 128 | 69.8646 | | cspdarknet53 | 64 | 69.729 | | beit_base_patch16_224 | 64 | 69.6892 | | gernet_l | 128 | 68.0302 | | jx_nest_base | 32 | 65.7983 | | twins_pcpvt_base | 64 | 65.1973 | | repvgg_a2 | 128 | 64.726 | | vit_base_patch16_224 | 64 | 64.3227 | | deit_base_distilled_patch16_224 | 64 | 64.2817 | | gmixer_24_224 | 128 | 62.5841 | | tf_efficientnet_b0 | 128 | 59.1001 | | rexnet_100 | 128 | 57.0838 | | ghostnet_100 | 128 | 57.0195 | | tinynet_a | 128 | 56.9044 | | fbnetc_100 | 128 | 56.3749 | | coat_lite_mini | 128 | 55.196 | | mobilevit_s | 64 | 54.9463 | | sebotnet33ts_256 | 64 | 49.9624 | | spnasnet_100 | 128 | 47.2813 | | crossvit_9_240 | 128 | 45.7538 | | ese_vovnet19b_dw | 128 | 44.8411 | | mobilenetv2_100 | 128 | 43.0641 | | selecsls42b | 128 | 42.4034 | | mobilenetv3_large_100 | 128 | 41.9541 | | mnasnet_100 | 128 | 40.8903 | | resmlp_12_224 | 128 | 38.1688 | | regnety_002 | 128 | 30.7911 | | lcnet_050 | 128 | 20.7195 | +---------------------------------+-----+-------------------------------------+ ~~~

Performance graphs

/data/home/williamwen/cluster/oneoff_cron_logs/day_095_05_04_23_performance_amp_233/timm_models_amp.png : ![](https://i.imgur.com/QFPgOfR.png)

/data/home/williamwen/cluster/oneoff_cron_logs/day_095_05_04_23_performance_amp_233/huggingface_amp.png : ![](https://i.imgur.com/So1mQ9D.png)

/data/home/williamwen/cluster/oneoff_cron_logs/day_095_05_04_23_performance_amp_233/torchbench_amp.png : ![](https://i.imgur.com/44x07PW.png)

Build Summary

### Run name ###
day_095_05_04_23_performance_amp_233

### Commit hashes ###
pytorch commit: f55e72c0f6bd6da016aaa51de379e6ba6d7891cc
pytorch commit date: 2023-04-07 17:30:27+00:00
torchbench commit: 735f1927996c8d9ab81f0b0c05dd1ebdb26a6250
torchbench commit date: 2023-04-05 09:43:21-07:00

### TorchDynamo config flags ###

### Torch version ###
torch: 2.1.0a0+gitf55e72c

### Environment variables ###
TORCH_CUDA_ARCH_LIST = 8.0
CUDA_HOME = /usr/local/cuda-11.6
USE_LLVM = /usr/lib/llvm-10

### GPU details ###
CUDNN VERSION: 8401
Number CUDA Devices: 2
Device Name: NVIDIA A100-SXM4-40GB
Device Memory [GB]: 42.481549312

williamwen42 commented 1 year ago

Performance Dashboard for amp precision (inductor max-autotune comparison on timm models)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) Experimental setup does not have optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
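As a rough illustration of the aggregation described above, the sketch below drops models that failed the accuracy check and then takes the geometric mean of the remaining per-model speedups. The function name `geomean_speedup`, the dictionary layout, and the sample values are hypothetical; this is not the dashboard's actual code.

```python
import math


def geomean_speedup(speedups, accuracy):
    """Geometric mean of per-model speedups, skipping accuracy failures.

    speedups: model name -> speedup relative to eager (hypothetical layout)
    accuracy: model name -> status string such as "pass" or "fail_accuracy"
    """
    kept = [
        s
        for name, s in speedups.items()
        # Keep "pass" and "pass_due_to_skip"; drop failures and 0.0 speedups.
        if accuracy.get(name, "").startswith("pass") and s > 0
    ]
    if not kept:
        return float("nan")
    return math.exp(sum(math.log(s) for s in kept) / len(kept))


# Made-up values for illustration only.
speedups = {"model_a": 2.0, "model_b": 0.5, "model_c": 4.0}
accuracy = {"model_a": "pass", "model_b": "pass", "model_c": "fail_accuracy"}
result = geomean_speedup(speedups, accuracy)
print(f"{result:.2f}x")
```

A geometric mean is the usual choice for averaging ratios, since a 2x speedup and a 0.5x slowdown cancel out to 1.00x rather than averaging to 1.25x.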

Passrate

+-------------------------------------+-------------+
|              Compiler               | timm_models |
+-------------------------------------+-------------+
|              inductor               | 100%, 60/60 |
|       inductor_no_cudagraphs        | 100%, 60/60 |
|        inductor_max_autotune        | 100%, 60/60 |
| inductor_max_autotune_no_cudagraphs | 100%, 60/60 |
+-------------------------------------+-------------+

Geometric mean speedup

+-------------------------------------+-------------+
|              Compiler               | timm_models |
+-------------------------------------+-------------+
|              inductor               |    1.42x    |
|       inductor_no_cudagraphs        |    1.40x    |
|        inductor_max_autotune        |    1.47x    |
| inductor_max_autotune_no_cudagraphs |    1.44x    |
+-------------------------------------+-------------+

Mean compilation time (seconds)

+-------------------------------------+-------------+
|              Compiler               | timm_models |
+-------------------------------------+-------------+
|              inductor               |    80.25    |
|       inductor_no_cudagraphs        |    44.69    |
|        inductor_max_autotune        |   372.93    |
| inductor_max_autotune_no_cudagraphs |    52.43    |
+-------------------------------------+-------------+

Peak memory footprint compression ratio (higher is better)

+-------------------------------------+-------------+
|              Compiler               | timm_models |
+-------------------------------------+-------------+
|              inductor               |    0.91x    |
|       inductor_no_cudagraphs        |    1.03x    |
|        inductor_max_autotune        |    0.90x    |
| inductor_max_autotune_no_cudagraphs |    1.03x    |
+-------------------------------------+-------------+
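One way to read the compression ratio in the table above, assuming it is the eager peak memory divided by the compiled peak memory (an assumption consistent with "higher is better", not something the summary states), is sketched below. The function name and the byte counts are hypothetical.

```python
def compression_ratio(eager_peak_bytes, compiled_peak_bytes):
    """Assumed definition: eager peak memory / compiled peak memory.

    A value above 1.0 means the compiled run used less peak memory than eager.
    """
    if compiled_peak_bytes <= 0:
        return float("nan")
    return eager_peak_bytes / compiled_peak_bytes


# Made-up measurements: eager peaks at 8 GB, the compiled run at 10 GB.
ratio = compression_ratio(8.0e9, 10.0e9)
print(f"{ratio:.2f}x")
```

Under this reading, the 0.91x reported for `inductor` would mean the compiled run peaked at roughly 10% more memory than eager.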

Warnings

We flag models where:

- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Compilation latency (sec) warnings

~~~
+-------------+----------------------+----------+------------------------+
| suite       | name                 | inductor | inductor_no_cudagraphs |
+-------------+----------------------+----------+------------------------+
| timm_models | rexnet_100           | 227.9512 | 41.7002                |
| timm_models | hrnet_w18            | 208.9208 | 150.3511               |
| timm_models | ghostnet_100         | 184.1414 | 51.6369                |
| timm_models | pnasnet5large        | 160.7814 | 104.1362               |
| timm_models | adv_inception_v3     | 153.9169 | 50.6449                |
| timm_models | res2net101_26w_4s    | 147.3733 | 85.8358                |
| timm_models | twins_pcpvt_base     | 144.0143 | 67.4679                |
| timm_models | fbnetv3_b            | 140.7021 | 56.8816                |
| timm_models | fbnetc_100           | 125.6072 | 33.4715                |
| timm_models | xcit_large_24_p8_224 | 124.9615 | 86.6285                |
| timm_models | tinynet_a            | 123.9613 | 40.715                 |
| timm_models | resnest101e          | 123.647  | 77.6249                |
+-------------+----------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings

~~~
+-------------+------------------------------+----------+------------------------+
| suite       | name                         | inductor | inductor_no_cudagraphs |
+-------------+------------------------------+----------+------------------------+
| timm_models | ghostnet_100                 | 0.8976   | 1.0223                 |
| timm_models | hrnet_w18                    | 0.8918   | 1.0029                 |
| timm_models | sebotnet33ts_256             | 0.891    | 1.1308                 |
| timm_models | adv_inception_v3             | 0.8904   | 1.0264                 |
| timm_models | inception_v3                 | 0.8904   | 1.0264                 |
| timm_models | gluon_inception_v3           | 0.8904   | 1.0264                 |
| timm_models | mobilenetv3_large_100        | 0.8881   | 0.9808                 |
| timm_models | dpn107                       | 0.8833   | 0.995                  |
| timm_models | gluon_xception65             | 0.8832   | 0.9952                 |
| timm_models | spnasnet_100                 | 0.8786   | 0.9858                 |
| timm_models | selecsls42b                  | 0.8785   | 0.9929                 |
| timm_models | poolformer_m36               | 0.8768   | 1.1865                 |
| timm_models | eca_botnext26ts_256          | 0.8738   | 1.0136                 |
| timm_models | res2net50_14w_8s             | 0.8712   | 0.9743                 |
| timm_models | res2net101_26w_4s            | 0.871    | 0.9759                 |
| timm_models | mixnet_l                     | 0.8687   | 1.0035                 |
| timm_models | mnasnet_100                  | 0.8683   | 0.9844                 |
| timm_models | res2next50                   | 0.866    | 0.9673                 |
| timm_models | cait_m36_384                 | 0.8632   | 1.0068                 |
| timm_models | fbnetc_100                   | 0.8596   | 0.991                  |
| timm_models | pit_b_224                    | 0.8578   | 1.0345                 |
| timm_models | convnext_base                | 0.8505   | 1.033                  |
| timm_models | gernet_l                     | 0.8499   | 0.9793                 |
| timm_models | swsl_resnext101_32x16d       | 0.8461   | 0.9986                 |
| timm_models | coat_lite_mini               | 0.8402   | 1.033                  |
| timm_models | lcnet_050                    | 0.8273   | 0.9465                 |
| timm_models | botnet26t_256                | 0.8239   | 0.9848                 |
| timm_models | xcit_large_24_p8_224         | 0.8225   | 1.0063                 |
| timm_models | regnety_002                  | 0.8164   | 0.9526                 |
| timm_models | repvgg_a2                    | 0.7738   | 0.9882                 |
| timm_models | crossvit_9_240               | 0.7526   | 0.9882                 |
| timm_models | swin_base_patch4_window7_224 | 0.7214   | 0.9272                 |
| timm_models | jx_nest_base                 | 0.6693   | 0.9883                 |
+-------------+------------------------------+----------+------------------------+
~~~
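The flagging criteria listed at the top of this Warnings section can be written as a small check. This is a minimal sketch using the thresholds stated there; the function name `flag_model` and its argument layout are hypothetical and do not come from the benchmark scripts.

```python
def flag_model(accuracy, speedup, compile_sec, compression):
    """Return the list of warning categories a single model result triggers."""
    warnings = []
    if not accuracy.startswith("pass"):
        warnings.append("accuracy")
    if speedup < 0.95:  # a 0.0 speedup typically means the perf test failed
        warnings.append("speedup")
    if compile_sec > 120:
        warnings.append("compilation_latency")
    if compression < 0.9:
        warnings.append("compression_ratio")
    return warnings


# ghostnet_100 under `inductor`, using the values reported in this comment.
flags = flag_model("pass", 1.8722, 184.1414, 0.8976)
print(flags)
```

For that row the check reports a compilation latency warning and a compression ratio warning, which matches ghostnet_100 appearing in both warning tables.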

timm_models suite with amp precision

Performance speedup ~~~ +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ | name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs | +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ | tnt_s_patch16_224 | 128 | 3.0329 | 2.9895 | 3.3512 | 3.3078 | | twins_pcpvt_base | 64 | 2.1343 | 1.7009 | 2.3417 | 1.7722 | | xcit_large_24_p8_224 | 5 | 2.0741 | 1.5882 | 2.4879 | 1.5875 | | coat_lite_mini | 128 | 1.954 | 1.9295 | 2.0832 | 2.061 | | ghostnet_100 | 128 | 1.8722 | 1.5676 | 1.8762 | 1.6385 | | gmlp_s16_224 | 128 | 1.8698 | 1.8518 | 1.8886 | 1.8702 | | gmixer_24_224 | 128 | 1.7823 | 1.7637 | 1.9194 | 1.8962 | | volo_d1_224 | 64 | 1.7067 | 1.6817 | 1.7614 | 1.7384 | | lcnet_050 | 128 | 1.7011 | 1.4729 | 1.7062 | 1.4845 | | crossvit_9_240 | 128 | 1.6702 | 1.6427 | 1.8597 | 1.8271 | | swin_base_patch4_window7_224 | 64 | 1.6415 | 1.6303 | 1.7455 | 1.738 | | convit_base | 64 | 1.6178 | 1.6169 | 1.7202 | 1.719 | | inception_v3 | 128 | 1.5391 | 1.5278 | 1.5463 | 1.5348 | | adv_inception_v3 | 128 | 1.5387 | 1.5304 | 1.5454 | 1.5371 | | dla102 | 128 | 1.5377 | 1.534 | 1.5413 | 1.5351 | | gluon_inception_v3 | 128 | 1.5375 | 1.5296 | 1.5476 | 1.539 | | convnext_base | 64 | 1.5255 | 1.5075 | 1.5322 | 1.5137 | | sebotnet33ts_256 | 64 | 1.525 | 1.5548 | 1.5361 | 1.5669 | | nfnet_l0 | 128 | 1.5105 | 1.4579 | 1.5098 | 1.4525 | | dm_nfnet_f0 | 128 | 1.5081 | 1.4559 | 1.518 | 1.4655 | | eca_botnext26ts_256 | 128 | 1.4564 | 1.4336 | 1.4599 | 1.4387 | | mobilevit_s | 64 | 1.4494 | 1.4644 | 1.4949 | 1.5133 | | mobilenetv3_large_100 | 128 | 1.4447 | 1.4345 | 1.4434 | 1.4414 | | pit_b_224 | 64 | 1.4446 | 1.4389 | 1.6209 | 1.6134 | | mnasnet_100 | 128 | 1.4424 | 1.4995 | 1.4387 | 1.4986 | | resnest101e | 64 | 1.4417 | 1.3672 | 1.4409 | 1.37 | | regnety_002 | 128 | 
1.4378 | 1.253 | 1.5412 | 1.2333 | | botnet26t_256 | 128 | 1.417 | 1.4347 | 1.4224 | 1.4427 | | selecsls42b | 128 | 1.4152 | 1.4148 | 1.4193 | 1.4181 | | mobilenetv2_100 | 128 | 1.3979 | 1.4541 | 1.3959 | 1.4508 | | jx_nest_base | 32 | 1.3891 | 1.3804 | 1.5752 | 1.5631 | | res2net50_14w_8s | 128 | 1.3834 | 1.3591 | 1.4046 | 1.3859 | | res2next50 | 128 | 1.3732 | 1.3664 | 1.3731 | 1.3673 | | ese_vovnet19b_dw | 128 | 1.3689 | 1.3881 | 1.3818 | 1.4021 | | spnasnet_100 | 128 | 1.3653 | 1.4233 | 1.3628 | 1.4256 | | mixer_b16_224 | 128 | 1.3653 | 1.366 | 1.3999 | 1.3996 | | hrnet_w18 | 128 | 1.3631 | 1.3628 | 1.3972 | 1.3667 | | tf_efficientnet_b0 | 128 | 1.3619 | 1.3927 | 1.3604 | 1.3935 | | fbnetc_100 | 128 | 1.3577 | 1.4092 | 1.355 | 1.3943 | | beit_base_patch16_224 | 64 | 1.3577 | 1.3574 | 1.468 | 1.4675 | | cait_m36_384 | 4 | 1.3565 | 1.3576 | 1.456 | 1.4451 | | poolformer_m36 | 64 | 1.3506 | 1.3425 | 1.352 | 1.3436 | | fbnetv3_b | 128 | 1.322 | 1.3385 | 1.3214 | 1.344 | | rexnet_100 | 128 | 1.3169 | 1.3524 | 1.3222 | 1.3589 | | resmlp_12_224 | 128 | 1.2766 | 1.2699 | 1.4146 | 1.4091 | | deit_base_distilled_patch16_224 | 64 | 1.2621 | 1.2615 | 1.3254 | 1.3254 | | cspdarknet53 | 64 | 1.2459 | 1.2821 | 1.2541 | 1.2908 | | vit_base_patch16_224 | 64 | 1.2419 | 1.2409 | 1.3522 | 1.3512 | | tinynet_a | 128 | 1.2355 | 1.2629 | 1.2386 | 1.2597 | | tf_mixnet_l | 128 | 1.1935 | 1.1991 | 1.1977 | 1.2051 | | mixnet_l | 128 | 1.1819 | 1.188 | 1.1866 | 1.1933 | | visformer_small | 128 | 1.1782 | 1.1703 | 1.2101 | 1.2029 | | res2net101_26w_4s | 64 | 1.1561 | 1.0921 | 1.168 | 1.0969 | | pnasnet5large | 16 | 1.1282 | 1.1439 | 1.1205 | 1.1624 | | dpn107 | 32 | 1.1035 | 1.1502 | 1.1052 | 1.1487 | | repvgg_a2 | 128 | 1.0975 | 1.1313 | 1.1043 | 1.1354 | | gluon_xception65 | 32 | 1.0841 | 1.0881 | 1.0961 | 1.0994 | | swsl_resnext101_32x16d | 32 | 1.0634 | 1.0258 | 1.0626 | 1.0227 | | gernet_l | 128 | 1.0495 | 1.0804 | 1.0546 | 1.0876 | | convmixer_768_32 | 32 | 1.0032 | 1.004 | 1.0091 | 
1.01 | +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+----------+------------------------+-----------------------+-------------------------------------+ | name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs | +---------------------------------+----+----------+------------------------+-----------------------+-------------------------------------+ | adv_inception_v3 | 8 | pass | pass | pass | pass | | beit_base_patch16_224 | 8 | pass | pass | pass | pass | | mobilenetv3_large_100 | 8 | pass | pass | pass | pass | | mobilevit_s | 8 | pass | pass | pass | pass | | nfnet_l0 | 8 | pass | pass | pass | pass | | pit_b_224 | 8 | pass | pass | pass | pass | | pnasnet5large | 8 | pass | pass | pass | pass | | poolformer_m36 | 8 | pass | pass | pass | pass | | regnety_002 | 8 | pass | pass | pass | pass | | repvgg_a2 | 8 | pass | pass | pass | pass | | res2net101_26w_4s | 8 | pass | pass | pass | pass | | res2net50_14w_8s | 8 | pass | pass | pass | pass | | res2next50 | 8 | pass | pass | pass | pass | | resmlp_12_224 | 8 | pass | pass | pass | pass | | resnest101e | 8 | pass | pass | pass | pass | | rexnet_100 | 8 | pass | pass | pass | pass | | sebotnet33ts_256 | 8 | pass | pass | pass | pass | | selecsls42b | 8 | pass | pass | pass | pass | | spnasnet_100 | 8 | pass | pass | pass | pass | | swin_base_patch4_window7_224 | 8 | pass | pass | pass | pass | | swsl_resnext101_32x16d | 8 | pass | pass | pass | pass | | tf_efficientnet_b0 | 8 | pass | pass | pass | pass | | tf_mixnet_l | 8 | pass | pass | pass | pass | | tinynet_a | 8 | pass | pass | pass | pass | | tnt_s_patch16_224 | 8 | pass | pass | pass | pass | | twins_pcpvt_base | 8 | pass | pass | pass | pass | | visformer_small | 8 | pass | pass | pass | pass | | vit_base_patch16_224 | 8 | pass | pass | pass | pass | | 
volo_d1_224 | 8 | pass | pass | pass | pass | | mobilenetv2_100 | 8 | pass | pass | pass | pass | | mnasnet_100 | 8 | pass | pass | pass | pass | | mixnet_l | 8 | pass | pass | pass | pass | | eca_botnext26ts_256 | 8 | pass | pass | pass | pass | | botnet26t_256 | 8 | pass | pass | pass | pass | | cait_m36_384 | 4 | pass | pass | pass | pass | | coat_lite_mini | 8 | pass | pass | pass | pass | | convit_base | 8 | pass | pass | pass | pass | | convmixer_768_32 | 8 | pass | pass | pass | pass | | convnext_base | 8 | pass | pass | pass | pass | | crossvit_9_240 | 8 | pass | pass | pass | pass | | cspdarknet53 | 8 | pass | pass | pass | pass | | deit_base_distilled_patch16_224 | 8 | pass | pass | pass | pass | | dla102 | 8 | pass | pass | pass | pass | | dm_nfnet_f0 | 8 | pass | pass | pass | pass | | dpn107 | 8 | pass | pass | pass | pass | | ese_vovnet19b_dw | 8 | pass | pass | pass | pass | | mixer_b16_224 | 8 | pass | pass | pass | pass | | fbnetc_100 | 8 | pass | pass | pass | pass | | fbnetv3_b | 8 | pass | pass | pass | pass | | gernet_l | 8 | pass | pass | pass | pass | | ghostnet_100 | 8 | pass | pass | pass | pass | | gluon_inception_v3 | 8 | pass | pass | pass | pass | | gluon_xception65 | 8 | pass | pass | pass | pass | | gmixer_24_224 | 8 | pass | pass | pass | pass | | gmlp_s16_224 | 8 | pass | pass | pass | pass | | hrnet_w18 | 8 | pass | pass | pass | pass | | inception_v3 | 8 | pass | pass | pass | pass | | jx_nest_base | 8 | pass | pass | pass | pass | | lcnet_050 | 8 | pass | pass | pass | pass | | xcit_large_24_p8_224 | 8 | pass | pass | pass | pass | +---------------------------------+----+----------+------------------------+-----------------------+-------------------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ | name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | 
inductor_max_autotune_no_cudagraphs | +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ | rexnet_100 | 128 | 227.9512 | 41.7002 | 528.9244 | 45.2923 | | hrnet_w18 | 128 | 208.9208 | 150.3511 | 527.2794 | 165.8731 | | ghostnet_100 | 128 | 184.1414 | 51.6369 | 566.6493 | 55.3906 | | pnasnet5large | 16 | 160.7814 | 104.1362 | 398.3731 | 112.1264 | | adv_inception_v3 | 128 | 153.9169 | 50.6449 | 370.1581 | 56.4033 | | res2net101_26w_4s | 64 | 147.3733 | 85.8358 | 413.4458 | 96.2796 | | twins_pcpvt_base | 64 | 144.0143 | 67.4679 | 1334.9068 | 95.1846 | | fbnetv3_b | 128 | 140.7021 | 56.8816 | 392.2788 | 62.289 | | fbnetc_100 | 128 | 125.6072 | 33.4715 | 350.3351 | 37.5104 | | xcit_large_24_p8_224 | 5 | 124.9615 | 86.6285 | 864.4854 | 105.9428 | | tinynet_a | 128 | 123.9613 | 40.715 | 302.3102 | 45.6944 | | resnest101e | 64 | 123.647 | 77.6249 | 245.0998 | 83.6248 | | cait_m36_384 | 4 | 119.756 | 85.9746 | 803.6332 | 118.6149 | | mixnet_l | 128 | 116.8338 | 47.9094 | 473.2307 | 53.6196 | | mobilevit_s | 64 | 110.6918 | 50.1101 | 1135.7656 | 60.5821 | | swin_base_patch4_window7_224 | 64 | 104.5658 | 61.5276 | 719.3702 | 79.4015 | | res2net50_14w_8s | 128 | 101.3692 | 78.6431 | 462.8679 | 87.5904 | | poolformer_m36 | 64 | 93.8884 | 60.7278 | 211.4394 | 64.262 | | coat_lite_mini | 128 | 87.0573 | 31.2822 | 1045.1848 | 42.2426 | | dpn107 | 32 | 85.8901 | 59.0832 | 360.542 | 63.0853 | | crossvit_9_240 | 128 | 83.8572 | 40.5448 | 1021.9742 | 57.0584 | | botnet26t_256 | 128 | 83.6736 | 29.015 | 445.744 | 31.4552 | | dla102 | 128 | 83.4309 | 49.1961 | 230.1203 | 54.2374 | | cspdarknet53 | 64 | 81.6337 | 36.7646 | 208.8842 | 40.8153 | | gluon_xception65 | 32 | 80.7204 | 57.3019 | 251.8274 | 59.1264 | | jx_nest_base | 32 | 79.6116 | 52.7253 | 771.5004 | 66.4315 | | tf_mixnet_l | 128 | 69.528 | 49.7329 | 71.0052 | 54.565 | | regnety_002 | 128 | 69.0197 | 28.8112 | 288.4092 | 33.9246 | | 
dm_nfnet_f0 | 128 | 66.6549 | 35.6506 | 259.0451 | 39.0645 | | tnt_s_patch16_224 | 128 | 65.4538 | 46.4816 | 424.9861 | 68.0578 | | sebotnet33ts_256 | 64 | 63.3396 | 35.9376 | 606.9687 | 42.0218 | | gmlp_s16_224 | 128 | 59.8873 | 35.7854 | 145.5686 | 47.7077 | | volo_d1_224 | 64 | 59.6683 | 38.8359 | 856.2288 | 55.4987 | | nfnet_l0 | 128 | 58.3811 | 32.8717 | 200.0886 | 37.2039 | | gernet_l | 128 | 58.2368 | 28.3871 | 190.5211 | 30.981 | | tf_efficientnet_b0 | 128 | 54.7643 | 35.5072 | 198.6235 | 39.886 | | convnext_base | 64 | 54.6605 | 37.2604 | 386.5703 | 45.9042 | | gluon_inception_v3 | 128 | 53.5437 | 50.8934 | 55.4889 | 55.9809 | | inception_v3 | 128 | 53.2153 | 50.8377 | 56.3443 | 57.0138 | | gmixer_24_224 | 128 | 51.4464 | 37.915 | 254.9779 | 47.3883 | | mnasnet_100 | 128 | 50.0611 | 28.208 | 170.2907 | 30.6131 | | swsl_resnext101_32x16d | 32 | 48.2503 | 45.5488 | 164.7072 | 48.7519 | | ese_vovnet19b_dw | 128 | 47.7346 | 19.3878 | 206.4203 | 21.5838 | | mobilenetv3_large_100 | 128 | 47.047 | 31.3849 | 119.2528 | 34.6582 | | eca_botnext26ts_256 | 128 | 46.9274 | 28.7514 | 248.7129 | 32.9131 | | res2next50 | 128 | 45.7447 | 44.2491 | 114.0371 | 47.763 | | convit_base | 64 | 44.8977 | 28.75 | 303.0676 | 40.7247 | | mobilenetv2_100 | 128 | 44.3548 | 28.9598 | 86.289 | 32.0914 | | visformer_small | 128 | 43.9894 | 23.5964 | 336.3081 | 28.9532 | | pit_b_224 | 64 | 42.9319 | 25.4213 | 749.448 | 37.6907 | | deit_base_distilled_patch16_224 | 64 | 37.9943 | 22.833 | 191.8346 | 34.2248 | | resmlp_12_224 | 128 | 37.5703 | 16.7311 | 121.0222 | 21.0698 | | lcnet_050 | 128 | 36.3735 | 20.661 | 135.5437 | 23.8247 | | convmixer_768_32 | 32 | 34.6792 | 27.5799 | 96.8781 | 29.449 | | spnasnet_100 | 128 | 33.7377 | 33.3421 | 59.2232 | 36.1375 | | beit_base_patch16_224 | 64 | 33.2787 | 27.1632 | 305.5229 | 33.5373 | | vit_base_patch16_224 | 64 | 32.5272 | 22.1508 | 36.6643 | 31.6409 | | repvgg_a2 | 128 | 32.0223 | 28.4511 | 154.9555 | 32.5881 | | mixer_b16_224 | 128 | 29.1258 | 
20.426 | 192.0016 | 24.9519 | | selecsls42b | 128 | 28.7757 | 24.8736 | 152.2227 | 27.3081 | +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ | name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs | +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ | gmlp_s16_224 | 128 | 1.1848 | 1.2263 | 1.1831 | 1.2263 | | pnasnet5large | 16 | 1.1712 | 1.3174 | 1.1522 | 1.3167 | | gmixer_24_224 | 128 | 1.1117 | 1.1802 | 1.1144 | 1.1802 | | convit_base | 64 | 1.0948 | 1.1825 | 1.098 | 1.1825 | | mobilenetv2_100 | 128 | 1.0431 | 1.155 | 1.0267 | 1.155 | | dm_nfnet_f0 | 128 | 1.013 | 1.0845 | 1.0129 | 1.0845 | | resmlp_12_224 | 128 | 1.0079 | 1.0838 | 1.0093 | 1.0838 | | tinynet_a | 128 | 0.9984 | 1.0981 | 0.9986 | 1.0981 | | rexnet_100 | 128 | 0.9977 | 1.0733 | 0.9744 | 1.0734 | | resnest101e | 64 | 0.9972 | 1.0989 | 0.9933 | 1.0989 | | tf_efficientnet_b0 | 128 | 0.9871 | 1.0917 | 0.9873 | 1.0915 | | tnt_s_patch16_224 | 128 | 0.9834 | 1.0597 | 0.986 | 1.0597 | | convmixer_768_32 | 32 | 0.9762 | 0.9981 | 0.9657 | 0.9981 | | twins_pcpvt_base | 64 | 0.9729 | 1.085 | 0.9763 | 1.085 | | mobilevit_s | 64 | 0.9557 | 1.0163 | 0.9262 | 1.0164 | | dla102 | 128 | 0.9536 | 1.0351 | 0.9528 | 1.0349 | | mixer_b16_224 | 128 | 0.9501 | 1.0049 | 0.9466 | 1.0049 | | vit_base_patch16_224 | 64 | 0.9362 | 0.9818 | 0.9362 | 0.9818 | | deit_base_distilled_patch16_224 | 64 | 0.9353 | 0.9815 | 0.9072 | 0.9815 | | visformer_small | 128 | 0.9348 | 1.029 | 0.9245 | 1.029 | | tf_mixnet_l | 128 | 0.9346 | 1.0819 | 0.9344 | 1.0817 | | beit_base_patch16_224 | 64 | 0.9285 | 1.0106 | 0.9284 | 1.0106 | | 
fbnetv3_b | 128 | 0.9228 | 0.9876 | 0.917 | 0.9939 | | nfnet_l0 | 128 | 0.9215 | 0.9953 | 0.9101 | 0.9953 | | volo_d1_224 | 64 | 0.9131 | 1.0027 | 0.9089 | 1.0028 | | cspdarknet53 | 64 | 0.9097 | 1.0473 | 0.9098 | 1.0473 | | ese_vovnet19b_dw | 128 | 0.9047 | 0.9907 | 0.8976 | 0.9907 | | ghostnet_100 | 128 | 0.8976 | 1.0223 | 0.8408 | 1.0213 | | hrnet_w18 | 128 | 0.8918 | 1.0029 | 0.8898 | 1.0063 | | sebotnet33ts_256 | 64 | 0.891 | 1.1308 | 0.9207 | 1.1308 | | adv_inception_v3 | 128 | 0.8904 | 1.0264 | 0.8902 | 1.0265 | | inception_v3 | 128 | 0.8904 | 1.0264 | 0.8902 | 1.0265 | | gluon_inception_v3 | 128 | 0.8904 | 1.0264 | 0.8902 | 1.0265 | | mobilenetv3_large_100 | 128 | 0.8881 | 0.9808 | 0.865 | 0.9808 | | dpn107 | 32 | 0.8833 | 0.995 | 0.8676 | 0.995 | | gluon_xception65 | 32 | 0.8832 | 0.9952 | 0.8833 | 0.9952 | | spnasnet_100 | 128 | 0.8786 | 0.9858 | 0.8788 | 0.9858 | | selecsls42b | 128 | 0.8785 | 0.9929 | 0.8473 | 0.9931 | | poolformer_m36 | 64 | 0.8768 | 1.1865 | 0.8592 | 1.1865 | | eca_botnext26ts_256 | 128 | 0.8738 | 1.0136 | 0.8738 | 1.0136 | | res2net50_14w_8s | 128 | 0.8712 | 0.9743 | 0.8501 | 0.9745 | | res2net101_26w_4s | 64 | 0.871 | 0.9759 | 0.8506 | 0.9759 | | mixnet_l | 128 | 0.8687 | 1.0035 | 0.8686 | 1.0031 | | mnasnet_100 | 128 | 0.8683 | 0.9844 | 0.8684 | 0.9844 | | res2next50 | 128 | 0.866 | 0.9673 | 0.8659 | 0.9673 | | cait_m36_384 | 4 | 0.8632 | 1.0068 | 0.8633 | 1.0073 | | fbnetc_100 | 128 | 0.8596 | 0.991 | 0.8597 | 0.991 | | pit_b_224 | 64 | 0.8578 | 1.0345 | 0.8566 | 1.0345 | | convnext_base | 64 | 0.8505 | 1.033 | 0.8317 | 1.033 | | gernet_l | 128 | 0.8499 | 0.9793 | 0.8496 | 0.9793 | | swsl_resnext101_32x16d | 32 | 0.8461 | 0.9986 | 0.8461 | 0.9986 | | coat_lite_mini | 128 | 0.8402 | 1.033 | 0.8501 | 1.033 | | lcnet_050 | 128 | 0.8273 | 0.9465 | 0.8174 | 0.9465 | | botnet26t_256 | 128 | 0.8239 | 0.9848 | 0.8241 | 0.9848 | | xcit_large_24_p8_224 | 5 | 0.8225 | 1.0063 | 0.826 | 1.0104 | | regnety_002 | 128 | 0.8164 | 0.9526 | 0.7697 | 
0.9526 | | repvgg_a2 | 128 | 0.7738 | 0.9882 | 0.7738 | 0.9882 | | crossvit_9_240 | 128 | 0.7526 | 0.9882 | 0.7524 | 0.9882 | | swin_base_patch4_window7_224 | 64 | 0.7214 | 0.9272 | 0.7297 | 0.9272 | | jx_nest_base | 32 | 0.6693 | 0.9883 | 0.6705 | 0.9883 | +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ | name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs | +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ | convmixer_768_32 | 32 | 298.9236 | 298.6576 | 297.1984 | 296.9094 | | hrnet_w18 | 128 | 204.8608 | 205.117 | 199.9165 | 204.6855 | | pnasnet5large | 16 | 173.7291 | 171.2209 | 174.8788 | 168.5491 | | tf_mixnet_l | 128 | 158.652 | 157.671 | 158.1293 | 156.9048 | | mixnet_l | 128 | 152.9535 | 152.1897 | 152.5661 | 151.6937 | | cait_m36_384 | 4 | 122.7744 | 122.9833 | 114.7995 | 117.1221 | | resnest101e | 64 | 113.7942 | 119.6901 | 113.8629 | 119.4312 | | dla102 | 128 | 111.7066 | 111.8767 | 111.516 | 111.9509 | | swsl_resnext101_32x16d | 32 | 111.5857 | 115.3483 | 111.3781 | 115.949 | | poolformer_m36 | 64 | 107.0517 | 107.7065 | 107.0818 | 107.8243 | | tnt_s_patch16_224 | 128 | 106.6173 | 108.1408 | 96.4473 | 97.7809 | | gluon_inception_v3 | 128 | 104.292 | 104.7367 | 103.5322 | 104.0155 | | adv_inception_v3 | 128 | 104.0804 | 104.5105 | 103.5948 | 104.08 | | inception_v3 | 128 | 103.9287 | 104.8307 | 103.5586 | 104.3301 | | res2net50_14w_8s | 128 | 101.6507 | 103.5652 | 100.1172 | 101.3419 | | convit_base | 64 | 100.5689 | 100.6643 | 94.7881 | 94.7251 | | dpn107 | 32 | 96.1954 | 92.0629 | 95.9879 | 92.4044 | | res2next50 | 128 | 91.5825 | 92.1187 | 91.7129 | 
91.9829 | | gluon_xception65 | 32 | 91.1507 | 90.9928 | 90.225 | 90.0748 | | swin_base_patch4_window7_224 | 64 | 88.9285 | 89.5009 | 83.7126 | 83.9664 | | mixer_b16_224 | 128 | 85.2227 | 85.0455 | 83.3849 | 82.9833 | | res2net101_26w_4s | 64 | 84.9355 | 91.2943 | 84.5028 | 93.0162 | | dm_nfnet_f0 | 128 | 84.0395 | 86.8564 | 83.2917 | 86.2557 | | fbnetv3_b | 128 | 82.681 | 81.766 | 82.8601 | 81.3775 | | pit_b_224 | 64 | 81.6451 | 82.029 | 72.7684 | 73.0988 | | convnext_base | 64 | 80.0433 | 81.074 | 79.7035 | 80.7168 | | visformer_small | 128 | 77.1963 | 77.6896 | 75.1034 | 75.4646 | | beit_base_patch16_224 | 64 | 74.5128 | 74.5353 | 68.8509 | 68.8849 | | nfnet_l0 | 128 | 74.0349 | 77.0221 | 73.9104 | 76.5953 | | gmlp_s16_224 | 128 | 73.2685 | 73.8547 | 72.4792 | 73.0208 | | eca_botnext26ts_256 | 128 | 72.7699 | 73.7935 | 72.5323 | 73.5424 | | jx_nest_base | 32 | 72.1658 | 72.8174 | 63.5922 | 64.1123 | | cspdarknet53 | 64 | 71.1087 | 69.0098 | 70.6885 | 68.5118 | | volo_d1_224 | 64 | 70.5353 | 71.4076 | 68.2591 | 69.1839 | | botnet26t_256 | 128 | 69.98 | 69.0591 | 69.771 | 68.6946 | | vit_base_patch16_224 | 64 | 69.7539 | 69.6572 | 64.0009 | 63.9545 | | gernet_l | 128 | 69.2196 | 67.2462 | 68.9426 | 66.829 | | deit_base_distilled_patch16_224 | 64 | 66.9793 | 66.9572 | 63.7497 | 63.7385 | | repvgg_a2 | 128 | 66.1846 | 64.2209 | 65.8451 | 63.9399 | | gmixer_24_224 | 128 | 66.0192 | 66.5525 | 61.3279 | 61.9823 | | xcit_large_24_p8_224 | 5 | 60.7263 | 78.0806 | 58.0621 | 77.4042 | | tf_efficientnet_b0 | 128 | 59.7304 | 58.4102 | 59.7971 | 58.3392 | | twins_pcpvt_base | 64 | 59.0531 | 68.5729 | 54.7026 | 71.1183 | | fbnetc_100 | 128 | 57.8504 | 55.8367 | 58.0617 | 56.4342 | | rexnet_100 | 128 | 57.8403 | 56.2595 | 57.5706 | 55.9677 | | coat_lite_mini | 128 | 57.6942 | 58.2846 | 54.0633 | 54.5718 | | tinynet_a | 128 | 56.3532 | 55.1254 | 56.2176 | 55.1946 | | mobilevit_s | 64 | 56.0432 | 55.5573 | 54.433 | 53.7596 | | sebotnet33ts_256 | 64 | 50.45 | 49.4999 | 50.1737 | 
49.16 | | crossvit_9_240 | 128 | 48.8902 | 49.6442 | 43.9164 | 44.6908 | | spnasnet_100 | 128 | 48.5291 | 46.618 | 48.6425 | 46.4811 | | ghostnet_100 | 128 | 47.9938 | 57.2881 | 47.8285 | 54.8628 | | ese_vovnet19b_dw | 128 | 45.2378 | 44.5388 | 44.8207 | 44.129 | | mobilenetv2_100 | 128 | 44.4712 | 42.7257 | 44.5527 | 42.8295 | | selecsls42b | 128 | 42.3522 | 42.3434 | 42.2274 | 42.304 | | mnasnet_100 | 128 | 42.1794 | 40.6623 | 42.3015 | 40.6776 | | resmlp_12_224 | 128 | 41.6781 | 41.8082 | 37.5586 | 37.7228 | | mobilenetv3_large_100 | 128 | 40.2061 | 40.5582 | 40.2527 | 40.32 | | regnety_002 | 128 | 25.6615 | 29.6504 | 25.6012 | 31.4771 | | lcnet_050 | 128 | 17.516 | 20.2249 | 17.4373 | 20.0605 | +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ ~~~

Performance graphs

/data/home/williamwen/cluster/oneoff_cron_logs/day_100_10_04_23_performance_amp_441/timm_models_amp.png : ![](https://i.imgur.com/ZubyviE.png)

Build Summary

### Run name ###
day_100_10_04_23_performance_amp_441

### Commit hashes ###
pytorch commit: f55e72c0f6bd6da016aaa51de379e6ba6d7891cc
pytorch commit date: 2023-04-07 17:30:27+00:00
torchbench commit: 735f1927996c8d9ab81f0b0c05dd1ebdb26a6250
torchbench commit date: 2023-04-05 09:43:21-07:00

### TorchDynamo config flags ###

### Torch version ###
torch: 2.1.0a0+gitf55e72c

### Environment variables ###
TORCH_CUDA_ARCH_LIST = 8.0
CUDA_HOME = /usr/local/cuda-11.6
USE_LLVM = /usr/lib/llvm-10

### GPU details ###
CUDNN VERSION: 8401
Number CUDA Devices: 2
Device Name: NVIDIA A100-SXM4-40GB
Device Memory [GB]: 42.481549312

williamwen42 commented 1 year ago

Performance Dashboard for amp precision (2.0 release binary oneoff)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of forward pass and backward pass for training, and forward pass only for inference. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) Experimental setup does not have optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
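
As a rough illustration of the filtering step above, the sketch below drops models that fail accuracy and then takes the geometric mean of the remaining speedups, each normalized against eager latency. The record layout, the function name `geomean_speedup`, and the sample rows are hypothetical and are not taken from the benchmark harness.

```python
import math

def geomean_speedup(results):
    """Geometric mean of eager_ms / compiled_ms over models that pass accuracy.

    `results` is a list of (name, accuracy_status, eager_ms, compiled_ms)
    tuples. This layout is an assumption made for the example.
    """
    speedups = [
        eager_ms / compiled_ms
        for _, status, eager_ms, compiled_ms in results
        if status == "pass"  # models failing accuracy are removed first
    ]
    if not speedups:
        return float("nan")
    # Geometric mean computed in log space.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Hypothetical measurements, not values from the dashboard.
rows = [
    ("model_a", "pass", 100.0, 50.0),         # 2.0x speedup
    ("model_b", "pass", 80.0, 160.0),         # 0.5x speedup
    ("model_c", "fail_accuracy", 10.0, 1.0),  # excluded from the mean
]
print(f"{geomean_speedup(rows):.2f}x")
```

The geometric mean is the usual choice for averaging ratios, since a 2x speedup and a 0.5x slowdown cancel out instead of averaging to 1.25x.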

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 88%, 53/60 | 100%, 45/45 | 100%, 60/60 |
|       aot_eager        | 87%, 52/60 | 100%, 45/45 | 97%, 58/60  |
|        inductor        | 85%, 51/60 | 91%, 41/45  | 100%, 60/60 |
| inductor_no_cudagraphs | 87%, 52/60 | 96%, 43/45  | 100%, 60/60 |
+------------------------+------------+-------------+-------------+
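
Each passrate cell pairs a percentage with the passed/total counts. A minimal sketch of that formatting follows, assuming the percentage is rounded to the nearest integer, which agrees with the rows above (for example 53/60 is shown as 88%). The helper name `format_passrate` is hypothetical.

```python
def format_passrate(passed: int, total: int) -> str:
    """Format a cell such as '88%, 53/60' (rounding rule is an assumption)."""
    percent = round(100 * passed / total)
    return f"{percent}%, {passed}/{total}"

print(format_passrate(53, 60))
print(format_passrate(43, 45))
```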

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.00x    |    1.00x    |
|       aot_eager        |   1.00x    |    1.00x    |    1.00x    |
|        inductor        |   1.59x    |    1.58x    |    1.41x    |
| inductor_no_cudagraphs |   1.27x    |    1.50x    |    1.39x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    4.85    |    7.26     |    5.99     |
|       aot_eager        |    9.37    |    15.82    |    13.21    |
|        inductor        |   63.80    |    62.92    |   111.25    |
| inductor_no_cudagraphs |   64.01    |    72.27    |   110.32    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.86x    |    0.90x    |    0.88x    |
|        inductor        |   0.79x    |    0.91x    |    0.91x    |
| inductor_no_cudagraphs |   0.94x    |    1.05x    |    1.01x    |
+------------------------+------------+-------------+-------------+

Warnings

We flag models where: - accuracy fails - speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test) - compilation latency > 120 sec. - compression ratio < 0.9 Accuracy warnings ~~~ +-------------+-------------------------------+-----------------+------------------------+ | suite | name | inductor | inductor_no_cudagraphs | +-------------+-------------------------------+-----------------+------------------------+ | torchbench | hf_Longformer | fail_to_run | fail_to_run | | torchbench | moco | fail_to_run | fail_to_run | | torchbench | Background_Matting | eager_variation | eager_variation | | torchbench | vision_maskrcnn | eager_variation | eager_variation | | torchbench | tacotron2 | 0.0000 | 0.0000 | | torchbench | gat | 0.0000 | 0.0000 | | torchbench | gcn | 0.0000 | 0.0000 | | torchbench | llama | 0.0000 | 0.0000 | | torchbench | sage | 0.0000 | 0.0000 | | torchbench | torchrec_dlrm | 0.0000 | 0.0000 | | huggingface | DebertaV2ForQuestionAnswering | fail_to_run | pass | | huggingface | AlbertForQuestionAnswering | fail_accuracy | fail_accuracy | +-------------+-------------------------------+-----------------+------------------------+ ~~~ Performance speedup warnings ~~~ +-------------+-------------------------------+----------+------------------------+ | suite | name | inductor | inductor_no_cudagraphs | +-------------+-------------------------------+----------+------------------------+ | torchbench | dcgan | 1.4097 | 0.8227 | | torchbench | lennard_jones | 1.3901 | 0.8762 | | torchbench | soft_actor_critic | 1.0289 | 0.7237 | | torchbench | tts_angular | 0.9646 | 0.949 | | torchbench | timm_vovnet | 0.9395 | 0.9242 | | torchbench | nvidia_deeprecommender | 0.8715 | 1.0183 | | torchbench | timm_vision_transformer_large | 0.0 | 1.0813 | | torchbench | hf_Longformer | 0.0 | 0.0 | | torchbench | moco | 0.0 | 0.0 | | torchbench | gat | 0.0 | 0.0 | | torchbench | gcn | 0.0 | 0.0 | | torchbench | sage | 0.0 | 0.0 | | torchbench | 
tacotron2 | 0.0 | 0.0 | | torchbench | torchrec_dlrm | 0.0 | 0.0 | | huggingface | DebertaForMaskedLM | 0.9657 | 0.8392 | | huggingface | DebertaV2ForMaskedLM | 0.8861 | 0.6807 | | huggingface | DebertaV2ForQuestionAnswering | 0.8253 | 0.6939 | | huggingface | BlenderbotForCausalLM | 0.0 | 1.2351 | | huggingface | AllenaiLongformerBase | 0.0 | 0.0 | +-------------+-------------------------------+----------+------------------------+ ~~~ Compilation latency (sec) warnings ~~~ +-------------+--------------------------------+----------+------------------------+ | suite | name | inductor | inductor_no_cudagraphs | +-------------+--------------------------------+----------+------------------------+ | torchbench | hf_T5_large | 175.9942 | 175.5837 | | torchbench | phlippe_densenet | 165.4425 | 165.5889 | | torchbench | hf_BigBird | 151.3503 | 127.7386 | | torchbench | timm_efficientnet | 144.2798 | 145.5432 | | torchbench | mobilenet_v3_large | 141.1027 | 139.9887 | | torchbench | densenet121 | 139.3279 | 137.4203 | | torchbench | mobilenet_v2 | 133.2833 | 132.3333 | | torchbench | yolov3 | 121.0227 | 118.0911 | | torchbench | timm_vision_transformer_large | nan | 125.7471 | | huggingface | MobileBertForMaskedLM | 150.6225 | 148.9238 | | huggingface | MobileBertForQuestionAnswering | 140.7136 | 653.5987 | | huggingface | DebertaV2ForMaskedLM | 138.8626 | 75.8935 | | huggingface | DebertaV2ForQuestionAnswering | 137.5645 | 72.7915 | | huggingface | M2M100ForConditionalGeneration | 135.8697 | 137.9521 | | huggingface | MT5ForConditionalGeneration | 134.411 | 133.8593 | | huggingface | XGLMForCausalLM | 133.5907 | 132.7691 | | timm_models | rexnet_100 | 275.1515 | 276.6628 | | timm_models | hrnet_w18 | 255.7748 | 249.9176 | | timm_models | ghostnet_100 | 244.5686 | 243.1281 | | timm_models | fbnetv3_b | 178.1302 | 174.7349 | | timm_models | pnasnet5large | 167.4111 | 161.8981 | | timm_models | resnest101e | 166.679 | 168.3601 | | timm_models | mobilevit_s | 164.6195 | 
161.0406 |
| timm_models | gluon_inception_v3 | 162.3067 | 161.1515 |
| timm_models | adv_inception_v3 | 162.0021 | 163.0911 |
| timm_models | tinynet_a | 160.6803 | 156.4597 |
| timm_models | mobilenetv3_large_100 | 160.5822 | 153.4424 |
| timm_models | mixnet_l | 159.9061 | 158.8318 |
| timm_models | inception_v3 | 156.6721 | 159.4795 |
| timm_models | tf_mixnet_l | 156.3445 | 156.0664 |
| timm_models | res2net101_26w_4s | 153.8691 | 153.8122 |
| timm_models | twins_pcpvt_base | 149.9154 | 147.8376 |
| timm_models | tf_efficientnet_b0 | 149.8097 | 154.6713 |
| timm_models | fbnetc_100 | 136.5809 | 133.1516 |
| timm_models | spnasnet_100 | 136.5 | 137.2727 |
| timm_models | xcit_large_24_p8_224 | 135.2229 | 132.5646 |
| timm_models | mobilenetv2_100 | 130.7307 | 133.0086 |
| timm_models | mnasnet_100 | 126.2514 | 126.592 |
| timm_models | res2net50_14w_8s | 123.4541 | 126.3791 |
+-------------+--------------------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench | hf_GPT2_large | 0.8904 | 1.128 |
| torchbench | yolov3 | 0.8742 | 1.0155 |
| torchbench | timm_efficientnet | 0.8696 | 0.9417 |
| torchbench | speech_transformer | 0.8651 | 0.8682 |
| torchbench | timm_resnest | 0.8604 | 0.9668 |
| torchbench | shufflenet_v2_x1_0 | 0.8598 | 0.9587 |
| torchbench | timm_vision_transformer | 0.8593 | 0.8835 |
| torchbench | timm_regnet | 0.8507 | 0.9508 |
| torchbench | resnet152 | 0.8501 | 0.9397 |
| torchbench | Background_Matting | 0.8485 | 1.0406 |
| torchbench | hf_DistilBert | 0.8476 | 0.9945 |
| torchbench | hf_T5_large | 0.8201 | 1.168 |
| torchbench | pytorch_unet | 0.8134 | 0.9308 |
| torchbench | phlippe_densenet | 0.8058 | 0.8659 |
| torchbench | hf_Bart | 0.7933 | 0.9173 |
| torchbench | resnet50 | 0.7821 | 0.8839 |
| torchbench | dcgan | 0.7821 | 0.9645 |
| torchbench | demucs | 0.773 | 0.9656 |
| torchbench | squeezenet1_1 | 0.7722 | 0.908 |
| torchbench | pytorch_stargan | 0.7715 | 0.8893 |
| torchbench | timm_vovnet | 0.7529 | 0.8869 |
| torchbench | mnasnet1_0 | 0.7438 | 0.778 |
| torchbench | pytorch_struct | 0.7277 | 0.7362 |
| torchbench | vgg16 | 0.7227 | 0.9808 |
| torchbench | densenet121 | 0.7096 | 0.7998 |
| torchbench | alexnet | 0.7091 | 0.939 |
| torchbench | mobilenet_v3_large | 0.6984 | 0.8724 |
| torchbench | hf_BigBird | 0.6961 | 1.1191 |
| torchbench | resnext50_32x4d | 0.6682 | 0.772 |
| torchbench | nvidia_deeprecommender | 0.6585 | 0.8931 |
| torchbench | drq | 0.6379 | 0.9573 |
| torchbench | soft_actor_critic | 0.6066 | 0.9973 |
| torchbench | LearningToPaint | 0.5925 | 0.7463 |
| torchbench | pytorch_CycleGAN_and_pix2pix | 0.5904 | 0.6004 |
| torchbench | resnet18 | 0.5395 | 0.6097 |
| torchbench | lennard_jones | 0.5317 | 0.9997 |
| torchbench | hf_Reformer | 0.4538 | 0.8022 |
| torchbench | functorch_dp_cifar10 | 0.3991 | 0.4424 |
| torchbench | phlippe_resnet | 0.3169 | 0.3395 |
| huggingface | PegasusForCausalLM | 0.893 | 0.9864 |
| huggingface | DistilBertForMaskedLM | 0.8849 | 0.9624 |
| huggingface | TrOCRForCausalLM | 0.8836 | 0.9583 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.8729 | 0.9803 |
| huggingface | PegasusForConditionalGeneration | 0.8689 | 1.0689 |
| huggingface | MBartForConditionalGeneration | 0.8672 | 1.0307 |
| huggingface | BartForConditionalGeneration | 0.8456 | 1.0139 |
| huggingface | MegatronBertForCausalLM | 0.845 | 1.0962 |
| huggingface | BlenderbotSmallForCausalLM | 0.8184 | 0.9119 |
| huggingface | Speech2Text2ForCausalLM | 0.789 | 0.8779 |
| huggingface | M2M100ForConditionalGeneration | 0.7651 | 0.9908 |
| huggingface | MobileBertForMaskedLM | 0.7473 | 1.016 |
| huggingface | XGLMForCausalLM | 0.7117 | 0.9792 |
| huggingface | MobileBertForQuestionAnswering | 0.6569 | 0.8392 |
| huggingface | DebertaForMaskedLM | 0.5501 | 0.9978 |
| huggingface | DebertaV2ForMaskedLM | 0.5197 | 0.9665 |
| huggingface | DebertaV2ForQuestionAnswering | 0.487 | 0.9802 |
| huggingface | DebertaForQuestionAnswering | 0.4601 | 1.1526 |
| timm_models | hrnet_w18 | 0.8918 | 0.99 |
| timm_models | sebotnet33ts_256 | 0.891 | 1.1115 |
| timm_models | inception_v3 | 0.8904 | 1.0171 |
| timm_models | gluon_inception_v3 | 0.8904 | 1.0171 |
| timm_models | adv_inception_v3 | 0.8904 | 1.0171 |
| timm_models | dpn107 | 0.8833 | 0.9642 |
| timm_models | gluon_xception65 | 0.8831 | 0.9705 |
| timm_models | ghostnet_100 | 0.8807 | 0.977 |
| timm_models | spnasnet_100 | 0.8786 | 0.9451 |
| timm_models | mobilenetv3_large_100 | 0.877 | 0.9361 |
| timm_models | poolformer_m36 | 0.8768 | 1.1871 |
| timm_models | eca_botnext26ts_256 | 0.8738 | 1.0072 |
| timm_models | xcit_large_24_p8_224 | 0.8721 | 0.9732 |
| timm_models | res2net50_14w_8s | 0.8712 | 0.9607 |
| timm_models | res2net101_26w_4s | 0.871 | 0.9483 |
| timm_models | mixnet_l | 0.8687 | 0.9902 |
| timm_models | mnasnet_100 | 0.8683 | 0.9403 |
| timm_models | res2next50 | 0.866 | 0.9547 |
| timm_models | cait_m36_384 | 0.8632 | 0.989 |
| timm_models | fbnetc_100 | 0.8596 | 0.9535 |
| timm_models | pit_b_224 | 0.8578 | 1.0242 |
| timm_models | selecsls42b | 0.8576 | 0.9664 |
| timm_models | convnext_base | 0.8505 | 1.0338 |
| timm_models | gernet_l | 0.8499 | 0.9706 |
| timm_models | swsl_resnext101_32x16d | 0.8461 | 0.9786 |
| timm_models | coat_lite_mini | 0.8402 | 1.0202 |
| timm_models | botnet26t_256 | 0.8239 | 0.9779 |
| timm_models | lcnet_050 | 0.805 | 0.884 |
| timm_models | repvgg_a2 | 0.7738 | 0.9611 |
| timm_models | regnety_002 | 0.7602 | 0.8966 |
| timm_models | crossvit_9_240 | 0.7526 | 0.9898 |
| timm_models | swin_base_patch4_window7_224 | 0.7214 | 0.9045 |
| timm_models | jx_nest_base | 0.6693 | 0.9604 |
+-------------+-----------------------------------------+----------+------------------------+
~~~
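Every row in the warnings table above has an `inductor` ratio below about 0.9, while models with higher ratios (for example `hf_GPT2` at 0.9321) are absent. The sketch below shows how such a filter might look. The 0.9 cutoff and the helper name `flag_low_compression` are assumptions inferred from the data, not taken from the benchmark runner.

```python
# Hypothetical helper: the 0.9 threshold is inferred from the table,
# not confirmed against the actual benchmark tooling.
def flag_low_compression(rows, threshold=0.9, key="inductor"):
    """Return rows whose compression ratio is below `threshold`,
    sorted from highest to lowest ratio (the order used in the table)."""
    flagged = [row for row in rows if row[key] < threshold]
    return sorted(flagged, key=lambda row: row[key], reverse=True)


# A few torchbench rows copied from the Peak Memory Compression Ratio table.
rows = [
    {"suite": "torchbench", "name": "hf_GPT2", "inductor": 0.9321},
    {"suite": "torchbench", "name": "hf_GPT2_large", "inductor": 0.8904},
    {"suite": "torchbench", "name": "phlippe_resnet", "inductor": 0.3169},
]

for row in flag_low_compression(rows):
    print(f"{row['suite']:12s} {row['name']:20s} {row['inductor']:.4f}")
```

Rows whose ratio is `nan` (failed runs) compare false against the threshold, so this sketch leaves them out of the warnings.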

torchbench suite with amp precision

Performance speedup
~~~
+-----------------------------------+------+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------+------------------------+
| BERT_pytorch | 16 | 0.9922 | 0.8069 | 3.5997 | 2.1159 |
| functorch_dp_cifar10 | 64 | 0.9647 | 0.9154 | 3.5789 | 1.3412 |
| densenet121 | 4 | 0.9882 | 0.7174 | 2.7605 | 1.0121 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9728 | 0.8939 | 2.6642 | 1.8162 |
| hf_BigBird | 2 | 0.9603 | 0.7797 | 2.6313 | 1.6709 |
| hf_Albert | 8 | 0.9918 | 0.956 | 2.3383 | 2.2906 |
| hf_T5_large | 2 | 0.9737 | 0.8029 | 2.2331 | 1.8342 |
| phlippe_densenet | 128 | 0.9825 | 0.7652 | 2.0708 | 0.9888 |
| mobilenet_v3_large | 32 | 0.9988 | 0.772 | 2.0664 | 1.1584 |
| squeezenet1_1 | 32 | 0.9841 | 0.9338 | 2.063 | 1.2499 |
| dlrm | 1024 | 0.9526 | 0.827 | 1.9633 | 1.2072 |
| hf_GPT2 | 4 | 0.9997 | 0.9597 | 1.9349 | 1.8027 |
| hf_T5 | 8 | 0.9842 | 0.8532 | 1.9135 | 1.9983 |
| hf_Bert | 4 | 0.9947 | 0.8383 | 1.8512 | 1.5752 |
| phlippe_resnet | 128 | 0.9851 | 0.7556 | 1.8385 | 0.9817 |
| resnext50_32x4d | 8 | 0.9819 | 0.7161 | 1.7426 | 0.9686 |
| mnasnet1_0 | 32 | 0.9936 | 0.7307 | 1.7159 | 1.0948 |
| timm_vision_transformer | 32 | 0.9809 | 0.8515 | 1.711 | 1.3864 |
| hf_Bart | 4 | 0.9789 | 0.8422 | 1.678 | 1.5056 |
| hf_GPT2_large | 4 | 0.9828 | 0.9713 | 1.6777 | 1.7374 |
| shufflenet_v2_x1_0 | 128 | 0.9933 | 0.7467 | 1.673 | 1.2154 |
| speech_transformer | 32 | 0.9757 | 0.7876 | 1.6058 | 1.5851 |
| hf_Bert_large | 4 | 1.0027 | 0.8536 | 1.6025 | 1.5538 |
| resnet18 | 16 | 0.9871 | 0.76 | 1.5882 | 0.9764 |
| timm_resnest | 32 | 0.9937 | 0.8577 | 1.5568 | 1.4949 |
| fastNLP_Bert | 6 | 0.9862 | 0.798 | 1.5446 | 1.5062 |
| pytorch_struct | 200 | 0.9148 | 0.7762 | 1.5382 | 1.1431 |
| timm_nfnet | 128 | 0.986 | 0.9854 | 1.5349 | 1.468 |
| mobilenet_v2 | 96 | 0.9967 | 0.777 | 1.5261 | 1.5179 |
| drq | 1 | 0.9633 | 0.7385 | 1.5083 | 1.0341 |
| attention_is_all_you_need_pytorch | 256 | 0.9864 | 0.8339 | 1.5013 | 1.4689 |
| hf_DistilBert | 8 | 0.9963 | 0.9573 | 1.4862 | 1.4746 |
| timm_efficientnet | 32 | 0.9365 | 0.6228 | 1.4462 | 1.0716 |
| dcgan | 32 | 0.8588 | 0.6885 | 1.4097 | 0.8227 |
| lennard_jones | 1000 | 0.8643 | 0.7665 | 1.3901 | 0.8762 |
| pytorch_unet | 1 | 0.9963 | 0.2048 | 1.3577 | 1.3522 |
| LearningToPaint | 96 | 0.9851 | 0.7718 | 1.3021 | 1.0694 |
| pytorch_stargan | 16 | 0.9907 | 0.8009 | 1.2742 | 1.2469 |
| resnet152 | 32 | 0.9946 | 0.7479 | 1.2512 | 1.0055 |
| vgg16 | 64 | 0.9994 | 0.9983 | 1.2406 | 1.2536 |
| Super_SloMo | 6 | 0.997 | 0.1792 | 1.2323 | 1.2329 |
| Background_Matting | 4 | 0.9985 | 0.1369 | 1.2132 | 1.2076 |
| yolov3 | 16 | 0.9957 | 0.8061 | 1.1973 | 1.1979 |
| resnet50 | 32 | 0.994 | 0.7755 | 1.1916 | 1.0536 |
| hf_Reformer | 4 | 0.9857 | 0.963 | 1.1225 | 1.0582 |
| alexnet | 128 | 0.9989 | 0.9975 | 1.0872 | 1.1367 |
| demucs | 4 | 0.9987 | 1.0013 | 1.0425 | 1.0389 |
| soft_actor_critic | 256 | 0.8469 | 0.627 | 1.0289 | 0.7237 |
| timm_regnet | 32 | 0.9173 | 0.7724 | 0.9877 | 0.9643 |
| tts_angular | 64 | 0.9128 | 0.8758 | 0.9646 | 0.949 |
| timm_vovnet | 32 | 0.855 | 0.7008 | 0.9395 | 0.9242 |
| nvidia_deeprecommender | 256 | 0.9987 | 0.9986 | 0.8715 | 1.0183 |
| timm_vision_transformer_large | 32 | 0.9981 | 0.0 | 0.0 | 1.0813 |
| hf_Longformer | 2 | 1.0048 | 0.6888 | 0.0 | 0.0 |
| moco | 32 | 0.9358 | 0.0 | 0.0 | 0.0 |
| gat | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| gcn | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| sage | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| tacotron2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| torchrec_dlrm | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------------+
| hf_GPT2_large | 4 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 4 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 4 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| speech_transformer | 4 | pass | pass | pass | pass |
| phlippe_resnet | 4 | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass |
| pytorch_unet | 2 | pass | pass | pass | pass |
| resnet152 | 4 | pass | pass | pass | pass |
| resnet18 | 4 | pass | pass | pass | pass |
| resnet50 | 4 | pass | pass | pass | pass |
| resnext50_32x4d | 4 | pass | pass | pass | pass |
| shufflenet_v2_x1_0 | 4 | pass | pass | pass | pass |
| soft_actor_critic | 256 | pass | pass | pass | pass |
| squeezenet1_1 | 4 | pass | pass | pass | pass |
| nvidia_deeprecommender | 4 | pass | pass | pass | pass |
| timm_efficientnet | 4 | pass | pass | pass | pass |
| timm_nfnet | 4 | pass | pass | pass | pass |
| timm_regnet | 4 | pass | pass | pass | pass |
| timm_resnest | 4 | pass | pass | pass | pass |
| timm_vision_transformer | 4 | pass | pass | pass | pass |
| timm_vovnet | 4 | pass | pass | pass | pass |
| tts_angular | 4 | pass | pass | pass | pass |
| vgg16 | 4 | pass | pass | pass | pass |
| yolov3 | 4 | pass | pass | pass | pass |
| BERT_pytorch | 4 | fail_accuracy | pass | pass | pass |
| phlippe_densenet | 4 | pass | pass | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass |
| mobilenet_v3_large | 4 | pass | pass | pass | pass |
| hf_Albert | 4 | pass | pass | pass | pass |
| LearningToPaint | 4 | pass | pass | pass | pass |
| Super_SloMo | 4 | pass | pass | pass | pass |
| alexnet | 4 | pass | pass | pass | pass |
| attention_is_all_you_need_pytorch | 4 | pass | pass | pass | pass |
| dcgan | 4 | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass |
| densenet121 | 4 | pass | pass | pass | pass |
| mobilenet_v2 | 4 | pass | pass | pass | pass |
| drq | 1 | pass | pass | pass | pass |
| fastNLP_Bert | 4 | pass | pass | pass | pass |
| functorch_dp_cifar10 | 4 | pass | pass | pass | pass |
| dlrm | 4 | pass | pass | pass | pass |
| hf_Bart | 4 | pass | pass | pass | pass |
| hf_Reformer | 4 | pass | pass | pass | pass |
| hf_Bert | 4 | pass | pass | pass | pass |
| lennard_jones | 4 | pass | pass | pass | pass |
| hf_T5_base | 4 | pass | pass | pass | pass |
| hf_T5 | 4 | pass | pass | pass | pass |
| mnasnet1_0 | 4 | pass | pass | pass | pass |
| hf_GPT2 | 2 | pass | pass | pass | pass |
| hf_DistilBert | 4 | pass | pass | pass | pass |
| hf_BigBird | 4 | pass | pass | pass | pass |
| hf_Bert_large | 4 | pass | pass | pass | pass |
| hf_Longformer | 4 | pass | pass | fail_to_run | fail_to_run |
| moco | 4 | pass | fail_to_run | fail_to_run | fail_to_run |
| Background_Matting | 4 | eager_variation | eager_variation | eager_variation | eager_variation |
| vision_maskrcnn | 4 | fail_accuracy | 0.0000 | eager_variation | eager_variation |
| tacotron2 | 4 | fail_to_run | fail_to_run | 0.0000 | 0.0000 |
| gat | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| gcn | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| llama | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| sage | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| torchrec_dlrm | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
+-----------------------------------+-----+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------+------------------------+
| hf_T5_large | 2 | 27.0016 | 55.4844 | 175.9942 | 175.5837 |
| phlippe_densenet | 128 | 3.2932 | 7.1632 | 165.4425 | 165.5889 |
| hf_BigBird | 2 | 13.0269 | 37.7408 | 151.3503 | 127.7386 |
| timm_efficientnet | 32 | 4.9739 | 10.2058 | 144.2798 | 145.5432 |
| mobilenet_v3_large | 32 | 3.4446 | 8.179 | 141.1027 | 139.9887 |
| densenet121 | 4 | 7.6852 | 18.1939 | 139.3279 | 137.4203 |
| mobilenet_v2 | 96 | 3.1576 | 7.0356 | 133.2833 | 132.3333 |
| yolov3 | 16 | 5.0054 | 10.7865 | 121.0227 | 118.0911 |
| mnasnet1_0 | 32 | 3.1362 | 6.849 | 111.3499 | 111.0362 |
| hf_GPT2_large | 4 | 15.1052 | 30.3518 | 109.1564 | 104.9821 |
| resnet152 | 32 | 9.1608 | 20.5326 | 108.0717 | 106.4709 |
| timm_resnest | 32 | 1.8451 | 3.9997 | 95.9391 | 100.3221 |
| shufflenet_v2_x1_0 | 128 | 3.4764 | 7.7384 | 83.0937 | 84.7648 |
| speech_transformer | 32 | 5.972 | 13.784 | 78.4976 | 77.929 |
| attention_is_all_you_need_pytorch | 256 | 4.4253 | 11.1024 | 76.5471 | 74.6155 |
| timm_nfnet | 128 | 6.2417 | 11.1718 | 75.4307 | 73.0296 |
| timm_regnet | 32 | 6.8625 | 12.4395 | 72.4393 | 73.1837 |
| Background_Matting | 4 | 3.1076 | 11.5416 | 70.8216 | 67.8165 |
| BERT_pytorch | 16 | 4.9397 | 11.6625 | 70.4056 | 70.713 |
| resnet50 | 32 | 3.2582 | 7.0396 | 67.8751 | 65.1941 |
| hf_Bert_large | 4 | 10.4292 | 21.3581 | 65.8159 | 63.4937 |
| timm_vovnet | 32 | 3.6376 | 6.4162 | 64.4312 | 63.4021 |
| pytorch_unet | 1 | 1.5486 | 4.451 | 60.7753 | 58.7652 |
| functorch_dp_cifar10 | 64 | 1.2096 | 2.4342 | 57.3101 | 56.1893 |
| resnext50_32x4d | 8 | 3.2174 | 7.0465 | 54.085 | 54.1601 |
| timm_vision_transformer | 32 | 3.3304 | 7.4161 | 53.1279 | 51.8676 |
| hf_T5 | 8 | 5.9817 | 13.6137 | 52.3317 | 51.655 |
| fastNLP_Bert | 6 | 5.2394 | 11.3087 | 51.3959 | 51.806 |
| hf_Bart | 4 | 6.3516 | 13.9029 | 49.4887 | 50.6385 |
| hf_Reformer | 4 | 4.165 | 6.0667 | 48.277 | 43.817 |
| pytorch_stargan | 16 | 1.2151 | 3.2242 | 46.5096 | 47.096 |
| LearningToPaint | 96 | 1.4142 | 2.9085 | 46.3854 | 45.0346 |
| resnet18 | 16 | 1.3514 | 2.9053 | 45.4315 | 44.3227 |
| Super_SloMo | 6 | 2.7691 | 10.2726 | 43.8271 | 44.9401 |
| hf_GPT2 | 4 | 4.9339 | 9.6931 | 42.6598 | 43.531 |
| hf_Bert | 4 | 5.1149 | 10.5558 | 39.4039 | 40.363 |
| hf_Albert | 8 | 2.6288 | 8.0988 | 39.0277 | 39.6987 |
| pytorch_CycleGAN_and_pix2pix | 1 | 1.2908 | 2.9679 | 37.8462 | 36.5206 |
| phlippe_resnet | 128 | 1.3605 | 2.9018 | 32.9531 | 32.6658 |
| hf_DistilBert | 8 | 2.4936 | 5.2869 | 32.0276 | 30.0542 |
| demucs | 4 | 1.4353 | 2.1884 | 31.8417 | 29.9797 |
| squeezenet1_1 | 32 | 1.0522 | 1.7709 | 23.9868 | 25.4235 |
| pytorch_struct | 200 | 0.7499 | 1.3475 | 21.7295 | 21.2596 |
| vgg16 | 64 | 0.6368 | 1.1245 | 17.3685 | 17.052 |
| alexnet | 128 | 0.4866 | 0.7789 | 15.4395 | 15.4033 |
| nvidia_deeprecommender | 256 | 0.4753 | 0.8006 | 10.797 | 10.7915 |
| drq | 1 | 0.6622 | 1.0246 | 9.6767 | 9.3641 |
| dcgan | 32 | 0.4382 | 0.7187 | 9.1784 | 8.8061 |
| soft_actor_critic | 256 | 0.4318 | 0.6065 | 8.3155 | 7.8794 |
| dlrm | 1024 | 0.379 | 0.7845 | 7.8935 | 8.4439 |
| lennard_jones | 1000 | 0.3995 | 0.6023 | 7.0385 | 7.041 |
| tts_angular | 64 | 0.4509 | 0.5187 | 6.798 | 6.7437 |
| timm_vision_transformer_large | 32 | 9.4809 | nan | nan | 125.7471 |
| hf_Longformer | 2 | 9.8627 | 30.5952 | nan | nan |
| moco | 32 | 33.564 | nan | nan | nan |
| gat | 0 | nan | nan | nan | nan |
| gcn | 0 | nan | nan | nan | nan |
| sage | 0 | nan | nan | nan | nan |
| tacotron2 | 0 | nan | nan | nan | nan |
| torchrec_dlrm | 0 | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------+------------------------+
| Super_SloMo | 6 | 1.0014 | 0.822 | 1.208 | 1.208 |
| hf_Albert | 8 | 0.9599 | 0.9008 | 1.0863 | 1.2557 |
| fastNLP_Bert | 6 | 1.0003 | 0.8878 | 1.0496 | 1.1593 |
| hf_T5 | 8 | 0.9507 | 0.8891 | 1.0163 | 1.1719 |
| mobilenet_v2 | 96 | 0.9863 | 0.7657 | 1.0107 | 1.1025 |
| tts_angular | 64 | 0.9983 | 0.9983 | 0.9895 | 0.9983 |
| attention_is_all_you_need_pytorch | 256 | 0.9648 | 0.9066 | 0.9689 | 1.1266 |
| timm_nfnet | 128 | 0.9071 | 0.8753 | 0.9677 | 1.073 |
| dlrm | 1024 | 0.9995 | 0.9944 | 0.952 | 1.0009 |
| BERT_pytorch | 16 | 1.0003 | 0.8671 | 0.9428 | 1.1717 |
| hf_Bert | 4 | 0.9645 | 0.8353 | 0.9422 | 1.0258 |
| hf_Bert_large | 4 | 0.9845 | 0.8521 | 0.9402 | 1.0725 |
| hf_GPT2 | 4 | 0.9357 | 0.8198 | 0.9321 | 1.0713 |
| hf_GPT2_large | 4 | 0.9663 | 0.8303 | 0.8904 | 1.128 |
| yolov3 | 16 | 0.9837 | 0.846 | 0.8742 | 1.0155 |
| timm_efficientnet | 32 | 0.9846 | 0.7674 | 0.8696 | 0.9417 |
| speech_transformer | 32 | 0.9915 | 0.9 | 0.8651 | 0.8682 |
| timm_resnest | 32 | 0.9881 | 0.8984 | 0.8604 | 0.9668 |
| shufflenet_v2_x1_0 | 128 | 0.9549 | 0.8395 | 0.8598 | 0.9587 |
| timm_vision_transformer | 32 | 0.9907 | 0.9299 | 0.8593 | 0.8835 |
| timm_regnet | 32 | 0.9908 | 0.8523 | 0.8507 | 0.9508 |
| resnet152 | 32 | 0.9959 | 0.8912 | 0.8501 | 0.9397 |
| Background_Matting | 4 | 1.0127 | 0.6489 | 0.8485 | 1.0406 |
| hf_DistilBert | 8 | 0.9262 | 0.8146 | 0.8476 | 0.9945 |
| hf_T5_large | 2 | 0.9831 | 0.8302 | 0.8201 | 1.168 |
| pytorch_unet | 1 | 0.9953 | 0.7154 | 0.8134 | 0.9308 |
| phlippe_densenet | 128 | 0.9983 | 0.9982 | 0.8058 | 0.8659 |
| hf_Bart | 4 | 0.9087 | 0.7524 | 0.7933 | 0.9173 |
| resnet50 | 32 | 0.9894 | 0.8606 | 0.7821 | 0.8839 |
| dcgan | 32 | 0.9647 | 0.7957 | 0.7821 | 0.9645 |
| demucs | 4 | 0.9661 | 0.9657 | 0.773 | 0.9656 |
| squeezenet1_1 | 32 | 0.9666 | 0.9321 | 0.7722 | 0.908 |
| pytorch_stargan | 16 | 0.9914 | 0.969 | 0.7715 | 0.8893 |
| timm_vovnet | 32 | 0.9892 | 0.8166 | 0.7529 | 0.8869 |
| mnasnet1_0 | 32 | 0.9801 | 0.8971 | 0.7438 | 0.778 |
| pytorch_struct | 200 | 0.9992 | 0.5106 | 0.7277 | 0.7362 |
| vgg16 | 64 | 0.9923 | 0.7245 | 0.7227 | 0.9808 |
| densenet121 | 4 | 0.994 | 0.9823 | 0.7096 | 0.7998 |
| alexnet | 128 | 0.9454 | 0.7939 | 0.7091 | 0.939 |
| mobilenet_v3_large | 32 | 0.979 | 0.8383 | 0.6984 | 0.8724 |
| hf_BigBird | 2 | 0.9486 | 0.9264 | 0.6961 | 1.1191 |
| resnext50_32x4d | 8 | 0.9942 | 0.8441 | 0.6682 | 0.772 |
| nvidia_deeprecommender | 256 | 0.9176 | 0.8055 | 0.6585 | 0.8931 |
| drq | 1 | 0.9877 | 0.8852 | 0.6379 | 0.9573 |
| soft_actor_critic | 256 | 0.9995 | 0.9239 | 0.6066 | 0.9973 |
| LearningToPaint | 96 | 0.9192 | 0.7116 | 0.5925 | 0.7463 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9965 | 0.8796 | 0.5904 | 0.6004 |
| resnet18 | 16 | 0.9753 | 0.7786 | 0.5395 | 0.6097 |
| lennard_jones | 1000 | 0.9996 | 0.9997 | 0.5317 | 0.9997 |
| hf_Reformer | 4 | 0.8004 | 0.8004 | 0.4538 | 0.8022 |
| functorch_dp_cifar10 | 64 | 0.9953 | 0.8396 | 0.3991 | 0.4424 |
| phlippe_resnet | 128 | 0.9881 | 0.864 | 0.3169 | 0.3395 |
| timm_vision_transformer_large | 32 | 0.9992 | nan | nan | 0.9724 |
| hf_Longformer | 2 | 0.9511 | 0.893 | nan | nan |
| moco | 32 | 0.9994 | nan | nan | nan |
| gat | 0 | nan | nan | nan | nan |
| gcn | 0 | nan | nan | nan | nan |
| sage | 0 | nan | nan | nan | nan |
| tacotron2 | 0 | nan | nan | nan | nan |
| torchrec_dlrm | 0 | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------+------+----------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+----------+-----------+----------+------------------------+
| hf_GPT2_large | 4 | 212.5713 | 215.0116 | 124.5754 | 120.1776 |
| Background_Matting | 4 | 125.9211 | 918.9735 | 103.6754 | 104.2231 |
| hf_T5_large | 2 | 226.7349 | 274.3269 | 101.6134 | 123.3929 |
| hf_T5 | 8 | 182.5237 | 212.6779 | 93.9617 | 90.7301 |
| timm_nfnet | 128 | 119.6924 | 119.881 | 76.9207 | 80.3142 |
| hf_BigBird | 2 | 199.7247 | 282.9694 | 74.0526 | 119.391 |
| hf_Reformer | 4 | 82.1887 | 84.0477 | 72.3456 | 76.472 |
| Super_SloMo | 6 | 79.7687 | 444.0523 | 64.3392 | 64.3528 |
| yolov3 | 16 | 68.8495 | 84.9866 | 57.271 | 57.2379 |
| timm_regnet | 32 | 61.2759 | 72.6624 | 57.0 | 58.5843 |
| vgg16 | 64 | 66.2181 | 66.283 | 53.4235 | 52.888 |
| resnet152 | 32 | 65.944 | 88.1843 | 53.4196 | 69.4877 |
| hf_Bert_large | 4 | 83.431 | 96.3325 | 51.9581 | 52.8354 |
| demucs | 4 | 53.8259 | 53.7991 | 51.8241 | 51.8682 |
| attention_is_all_you_need_pytorch | 256 | 58.61 | 68.8829 | 36.1536 | 36.2749 |
| speech_transformer | 32 | 68.2645 | 84.1561 | 36.1351 | 40.846 |
| hf_Bart | 4 | 72.7967 | 91.1483 | 34.8988 | 57.3494 |
| fastNLP_Bert | 6 | 57.4997 | 70.1987 | 33.8033 | 34.6908 |
| mobilenet_v2 | 96 | 47.0762 | 60.4194 | 30.7384 | 30.9666 |
| pytorch_unet | 1 | 39.9373 | 194.3137 | 29.3036 | 29.4218 |
| hf_Albert | 8 | 70.2281 | 71.3937 | 29.1167 | 29.7844 |
| hf_GPT2 | 4 | 53.1371 | 50.6452 | 27.2817 | 27.0821 |
| timm_vovnet | 32 | 28.9548 | 35.2513 | 26.3159 | 26.7092 |
| hf_Bert | 4 | 41.7318 | 48.7282 | 22.4821 | 26.0855 |
| timm_efficientnet | 32 | 34.3449 | 51.7878 | 22.3151 | 30.4378 |
| resnet50 | 32 | 26.9137 | 33.4846 | 22.034 | 25.551 |
| hf_DistilBert | 8 | 33.6447 | 32.7199 | 21.5907 | 21.2494 |
| densenet121 | 4 | 60.6301 | 73.3102 | 19.1877 | 57.5659 |
| shufflenet_v2_x1_0 | 128 | 32.1091 | 42.4701 | 18.8091 | 25.2013 |
| timm_vision_transformer | 32 | 33.3921 | 38.255 | 18.3129 | 22.6728 |
| BERT_pytorch | 16 | 63.7573 | 78.4856 | 17.6744 | 25.7188 |
| timm_resnest | 32 | 24.3614 | 28.0114 | 15.3539 | 16.1654 |
| resnext50_32x4d | 8 | 20.0174 | 27.6844 | 12.9061 | 23.0576 |
| mnasnet1_0 | 32 | 23.5617 | 31.9797 | 12.9007 | 20.4177 |
| mobilenet_v3_large | 32 | 28.7084 | 36.8709 | 12.8173 | 24.9966 |
| nvidia_deeprecommender | 256 | 10.2273 | 10.2441 | 11.7005 | 10.039 |
| pytorch_stargan | 16 | 14.8667 | 18.3652 | 11.619 | 11.8572 |
| phlippe_densenet | 128 | 23.9634 | 30.7791 | 11.5006 | 23.8297 |
| alexnet | 128 | 9.8213 | 9.8453 | 9.0183 | 8.6404 |
| LearningToPaint | 96 | 11.366 | 15.1778 | 8.5461 | 10.5252 |
| tts_angular | 64 | 6.8889 | 7.1934 | 7.3287 | 6.7854 |
| resnet18 | 16 | 9.0945 | 11.8108 | 6.1737 | 9.2452 |
| pytorch_CycleGAN_and_pix2pix | 1 | 15.3745 | 17.0378 | 5.7449 | 8.4458 |
| squeezenet1_1 | 32 | 11.141 | 11.7973 | 5.4112 | 8.8339 |
| phlippe_resnet | 128 | 9.1401 | 12.0766 | 4.9692 | 9.2926 |
| pytorch_struct | 200 | 5.2106 | 6.0388 | 3.1803 | 4.7757 |
| functorch_dp_cifar10 | 64 | 10.6262 | 11.161 | 2.829 | 7.632 |
| soft_actor_critic | 256 | 1.8478 | 2.7498 | 2.3774 | 3.1232 |
| dlrm | 1024 | 4.9623 | 5.6626 | 2.134 | 3.5273 |
| drq | 1 | 3.4367 | 4.4024 | 2.1296 | 3.1579 |
| dcgan | 32 | 2.4622 | 3.1088 | 1.535 | 2.8966 |
| lennard_jones | 1000 | 1.8775 | 2.1456 | 1.1697 | 1.796 |
| timm_vision_transformer_large | 32 | 464.5636 | nan | nan | 428.4966 |
| hf_Longformer | 2 | 122.311 | 162.3272 | nan | nan |
| moco | 32 | 55.3786 | nan | nan | nan |
| gat | 0 | nan | nan | nan | nan |
| gcn | 0 | nan | nan | nan | nan |
| sage | 0 | nan | nan | nan | nan |
| tacotron2 | 0 | nan | nan | nan | nan |
| torchrec_dlrm | 0 | nan | nan | nan | nan |
+-----------------------------------+------+----------+-----------+----------+------------------------+
~~~
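The summary tables at the top of this comment report a geometric mean speedup per suite. Here is a minimal sketch of that kind of aggregation over a per-model speedup column. It assumes that failed runs (reported as `0.0` or `nan` in the tables) are dropped before averaging; whether the benchmark tooling treats failures this way is an assumption, and `geomean_speedup` is a hypothetical name.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of the positive, finite speedups.

    Entries of 0.0 or nan are treated as failed runs and skipped
    (an assumption, not confirmed against the benchmark tooling).
    """
    valid = [s for s in speedups if s > 0 and not math.isnan(s)]
    if not valid:
        return float("nan")
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# First three inductor speedups from the torchbench table, plus one failed run.
sample = [3.5997, 3.5789, 2.7605, 0.0]
print(f"{geomean_speedup(sample):.2f}x")
```

A geometric mean suits ratios like these because a 2x speedup and a 0.5x slowdown average to 1.0x, which an arithmetic mean would not give.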

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
| OPTForCausalLM | 2 | 0.9882 | 0.9043 | 2.4603 | 2.4844 |
| MobileBertForMaskedLM | 64 | 0.945 | 0.8098 | 2.4031 | 1.0739 |
| GPT2ForSequenceClassification | 4 | 0.9749 | 0.9511 | 2.2733 | 2.2876 |
| MobileBertForQuestionAnswering | 128 | 0.9492 | 0.8023 | 2.1662 | 1.1065 |
| MT5ForConditionalGeneration | 16 | 0.9868 | 0.8375 | 2.131 | 2.1327 |
| ElectraForQuestionAnswering | 64 | 0.9871 | 0.9754 | 2.1164 | 2.1086 |
| ElectraForCausalLM | 32 | 0.9814 | 0.9375 | 1.8425 | 1.8404 |
| XLNetLMHeadModel | 8 | 0.9952 | 0.9672 | 1.8089 | 1.8186 |
| LayoutLMForSequenceClassification | 16 | 0.9844 | 0.9706 | 1.801 | 1.7894 |
| RobertaForQuestionAnswering | 16 | 0.9842 | 0.9694 | 1.7883 | 1.7572 |
| BertForQuestionAnswering | 16 | 0.984 | 0.9695 | 1.7746 | 1.761 |
| XGLMForCausalLM | 8 | 1.0009 | 0.8353 | 1.7021 | 1.467 |
| RobertaForCausalLM | 16 | 0.9868 | 0.9619 | 1.6805 | 1.6658 |
| M2M100ForConditionalGeneration | 16 | 0.9694 | 0.8432 | 1.671 | 1.3683 |
| DistillGPT2 | 16 | 0.9866 | 0.9543 | 1.6568 | 1.6994 |
| AlbertForQuestionAnswering | 4 | 0.9997 | 0.8856 | 1.6476 | 1.6435 |
| PLBartForCausalLM | 8 | 0.985 | 0.9581 | 1.6399 | 1.6817 |
| AlbertForMaskedLM | 4 | 0.9996 | 0.8847 | 1.6394 | 1.6363 |
| T5Small | 4 | 0.979 | 0.8493 | 1.6338 | 1.7547 |
| T5ForConditionalGeneration | 4 | 0.9781 | 0.8491 | 1.6216 | 1.7275 |
| PLBartForConditionalGeneration | 4 | 0.9863 | 0.9462 | 1.6209 | 1.6515 |
| MegatronBertForQuestionAnswering | 8 | 0.9802 | 0.9611 | 1.6048 | 1.6287 |
| BertForMaskedLM | 16 | 0.9858 | 0.9609 | 1.5947 | 1.5825 |
| LayoutLMForMaskedLM | 16 | 0.9861 | 0.9622 | 1.5805 | 1.5935 |
| CamemBert | 16 | 0.9869 | 0.9635 | 1.5453 | 1.5353 |
| Speech2Text2ForCausalLM | 256 | 0.9717 | 0.9143 | 1.5334 | 1.5754 |
| BartForCausalLM | 4 | 0.9848 | 0.9561 | 1.5161 | 1.55 |
| YituTechConvBert | 16 | 0.9856 | 0.9579 | 1.5119 | 1.491 |
| MBartForCausalLM | 4 | 0.9827 | 0.9526 | 1.5088 | 1.5417 |
| MegatronBertForCausalLM | 4 | 0.9946 | 0.9099 | 1.4689 | 1.4965 |
| BartForConditionalGeneration | 2 | 0.9949 | 0.9698 | 1.4594 | 1.4429 |
| MBartForConditionalGeneration | 2 | 0.9964 | 0.9611 | 1.4485 | 1.4278 |
| DistilBertForQuestionAnswering | 256 | 0.9938 | 0.9868 | 1.4465 | 1.4456 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9967 | 0.9194 | 1.3618 | 1.4146 |
| PegasusForConditionalGeneration | 32 | 0.9954 | 0.9419 | 1.343 | 1.3484 |
| BlenderbotSmallForCausalLM | 64 | 0.9818 | 0.9067 | 1.2689 | 1.2793 |
| TrOCRForCausalLM | 32 | 0.9875 | 0.9527 | 1.2554 | 1.2906 |
| DistilBertForMaskedLM | 128 | 0.9924 | 0.9504 | 1.2082 | 1.2325 |
| PegasusForCausalLM | 32 | 0.9769 | 0.927 | 1.1827 | 1.2771 |
| DebertaForQuestionAnswering | 8 | 0.7931 | 0.697 | 1.0464 | 0.9605 |
| DebertaForMaskedLM | 4 | 0.7155 | 0.5797 | 0.9657 | 0.8392 |
| DebertaV2ForMaskedLM | 1 | 0.6824 | 0.5188 | 0.8861 | 0.6807 |
| DebertaV2ForQuestionAnswering | 2 | 0.6852 | 0.523 | 0.8253 | 0.6939 |
| BlenderbotForCausalLM | 4 | 0.9807 | 0.8479 | 0.0 | 1.2351 |
| AllenaiLongformerBase | 4 | 1.0039 | 0.6715 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
| BlenderbotForCausalLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| DebertaV2ForMaskedLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | pass |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | pass | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass |
| CamemBert | 1 | pass | pass | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | pass |
| DebertaV2ForQuestionAnswering | 1 | pass | pass | fail_to_run | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_accuracy | fail_accuracy |
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+---------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+---------+-----------+----------+------------------------+
| MobileBertForMaskedLM | 64 | 17.3446 | 40.2025 | 150.6225 | 148.9238 |
| MobileBertForQuestionAnswering | 128 | 17.3915 | 39.9163 | 140.7136 | 653.5987 |
| DebertaV2ForMaskedLM | 1 | 15.5696 | 27.5237 | 138.8626 | 75.8935 |
| DebertaV2ForQuestionAnswering | 2 | 15.4095 | 27.3197 | 137.5645 | 72.7915 |
| M2M100ForConditionalGeneration | 16 | 12.2124 | 26.5896 | 135.8697 | 137.9521 |
| MT5ForConditionalGeneration | 16 | 8.1726 | 18.3017 | 134.411 | 133.8593 |
| XGLMForCausalLM | 8 | 9.4942 | 21.0463 | 133.5907 | 132.7691 |
| XLNetLMHeadModel | 8 | 10.5172 | 27.7844 | 94.9186 | 94.954 |
| DebertaForMaskedLM | 4 | 7.3861 | 14.0928 | 86.1956 | 55.2549 |
| DebertaForQuestionAnswering | 8 | 7.2761 | 13.5437 | 82.1909 | 54.083 |
| MBartForConditionalGeneration | 2 | 11.8022 | 26.2283 | 81.8046 | 79.1892 |
| BartForConditionalGeneration | 2 | 11.619 | 26.1386 | 77.0969 | 76.5973 |
| PegasusForConditionalGeneration | 32 | 5.3819 | 19.4859 | 70.171 | 69.1206 |
| MegatronBertForQuestionAnswering | 8 | 10.5797 | 21.3237 | 69.5369 | 66.5752 |
| YituTechConvBert | 16 | 7.1759 | 15.7872 | 69.047 | 69.9462 |
| MegatronBertForCausalLM | 4 | 10.5922 | 21.7549 | 67.6891 | 66.815 |
| BlenderbotSmallForConditionalGeneration | 64 | 7.7023 | 17.2411 | 56.1683 | 56.4529 |
| T5Small | 4 | 5.5999 | 12.7517 | 52.2982 | 51.5535 |
| T5ForConditionalGeneration | 4 | 5.6543 | 12.8023 | 51.9082 | 51.0954 |
| ElectraForCausalLM | 32 | 5.3042 | 10.8613 | 50.7892 | 54.2764 |
| PLBartForConditionalGeneration | 4 | 6.2761 | 13.4427 | 49.7039 | 49.0915 |
| LayoutLMForSequenceClassification | 16 | 5.6218 | 11.1507 | 48.1803 | 47.1304 |
| ElectraForQuestionAnswering | 64 | 5.259 | 11.5491 | 43.4886 | 46.9035 |
| MBartForCausalLM | 4 | 5.7291 | 11.2827 | 42.565 | 40.9752 |
| BertForMaskedLM | 16 | 5.3048 | 10.981 | 40.7287 | 40.9174 |
| BertForQuestionAnswering | 16 | 5.2297 | 10.8328 | 40.6538 | 39.3134 |
| LayoutLMForMaskedLM | 16 | 5.5829 | 11.2154 | 39.796 | 42.1987 |
| RobertaForCausalLM | 16 | 5.2612 | 10.9255 | 39.5288 | 37.6301 |
| OPTForCausalLM | 2 | 4.7853 | 10.2633 | 39.2228 | 38.2162 |
| BartForCausalLM | 4 | 5.7122 | 11.0283 | 39.1731 | 40.0025 |
| PegasusForCausalLM | 32 | 5.686 | 11.2324 | 38.8494 | 38.2764 |
| TrOCRForCausalLM | 32 | 5.6649 | 10.9667 | 38.4776 | 37.8722 |
| GPT2ForSequenceClassification | 4 | 4.8679 | 9.946 | 37.9841 | 36.1413 |
| RobertaForQuestionAnswering | 16 | 5.2309 | 10.8488 | 37.8191 | 37.8725 |
| AlbertForMaskedLM | 4 | 2.2973 | 8.1403 | 37.8175 | 39.2703 |
| CamemBert | 16 | 5.257 | 10.8434 | 36.9008 | 39.1965 |
| DistilBertForQuestionAnswering | 256 | 2.5187 | 5.3678 | 35.8255 | 37.2627 |
| AlbertForQuestionAnswering | 4 | 2.359 | 8.1263 | 34.1647 | 35.0516 |
| DistilBertForMaskedLM | 128 | 2.5152 | 5.5412 | 33.8824 | 35.4997 |
| BlenderbotSmallForCausalLM | 64 | 3.8756 | 7.5187 | 31.0369 | 29.927 |
| DistillGPT2 | 16 | 2.5873 | 5.1278 | 30.1948 | 28.9185 |
| Speech2Text2ForCausalLM | 256 | 3.0315 | 5.7656 | 27.228 | 26.774 |
| PLBartForCausalLM | 8 | 3.0107 | 5.9704 | 26.8497 | 26.839 |
| BlenderbotForCausalLM | 4 | 11.0248 | 21.9303 | nan | 69.4999 |
| AllenaiLongformerBase | 4 | 9.7652 | 31.4218 | nan | nan |
+-----------------------------------------+-----+---------+-----------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
| ElectraForQuestionAnswering | 64 | 1.0014 | 0.9537 | 1.1387 | 1.195 |
| XLNetLMHeadModel | 8 | 0.9843 | 0.9603 | 1.1342 | 1.1342 |
| GPT2ForSequenceClassification | 4 | 1.0001 | 0.906 | 1.1139 | 1.2307 |
| OPTForCausalLM | 2 | 0.9999 | 0.9165 | 1.094 | 1.1346 |
| RobertaForQuestionAnswering | 16 | 1.0012 | 0.9279 | 1.0865 | 1.1724 |
| BertForQuestionAnswering | 16 | 1.0017 | 0.9284 | 1.0818 | 1.1729 |
| LayoutLMForSequenceClassification | 16 | 1.0014 | 0.9295 | 1.0583 | 1.1368 |
| RobertaForCausalLM | 16 | 0.9999 | 0.9209 | 1.0541 | 1.0519 |
| BertForMaskedLM | 16 | 0.9998 | 0.9207 | 1.0539 | 1.0518 |
| CamemBert | 16 | 1.0 | 0.9184 | 1.0511 | 1.0491 |
| YituTechConvBert | 16 | 1.0 | 0.9143 | 1.0402 | 1.0411 |
| T5ForConditionalGeneration | 4 | 0.9999 | 0.9516 | 1.0382 | 1.1813 |
| T5Small | 4 | 0.9999 | 0.9516 | 1.0382 | 1.1813 |
| DistilBertForQuestionAnswering | 256 | 1.0114 | 0.9556 | 1.0299 | 1.1479 |
| LayoutLMForMaskedLM | 16 | 0.9999 | 0.9211 | 1.0078 | 1.0518 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.7449 | 0.9734 | 1.3147 |
| ElectraForCausalLM | 32 | 1.0 | 0.8475 | 0.9731 | 0.9739 |
| DistillGPT2 | 16 | 1.0 | 0.8591 | 0.9682 | 1.0642 |
| PLBartForConditionalGeneration | 4 | 1.0001 | 0.9301 | 0.9649 | 1.052 |
| AlbertForMaskedLM | 4 | 1.0 | 0.7338 | 0.9574 | 1.268 |
| MegatronBertForQuestionAnswering | 8 | 1.0 | 0.904 | 0.953 | 1.1152 |
| MBartForCausalLM | 4 | 1.0 | 0.8937 | 0.9281 | 0.9912 |
| PLBartForCausalLM | 8 | 1.0 | 0.8677 | 0.9138 | 0.9886 |
| BartForCausalLM | 4 | 1.0 | 0.8936 | 0.9137 | 0.9749 |
| MT5ForConditionalGeneration | 16 | 0.9999 | 0.8495 | 0.9089 | 1.0018 |
| PegasusForCausalLM | 32 | 1.0 | 0.8822 | 0.893 | 0.9864 |
| DistilBertForMaskedLM | 128 | 1.0 | 0.8468 | 0.8849 | 0.9624 |
| TrOCRForCausalLM | 32 | 1.0 | 0.873 | 0.8836 | 0.9583 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8895 | 0.8729 | 0.9803 |
| PegasusForConditionalGeneration | 32 | 1.0 | 0.91 | 0.8689 | 1.0689 |
| MBartForConditionalGeneration | 2 | 1.0 | 0.8946 | 0.8672 | 1.0307 |
| BartForConditionalGeneration | 2 | 1.0 | 0.8987 | 0.8456 | 1.0139 |
| MegatronBertForCausalLM | 4 | 1.0 | 0.8644 | 0.845 | 1.0962 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8137 | 0.8184 | 0.9119 |
| Speech2Text2ForCausalLM | 256 | 1.0 | 0.8183 | 0.789 | 0.8779 |
| M2M100ForConditionalGeneration | 16 | 1.0 | 0.8084 | 0.7651 | 0.9908 |
| MobileBertForMaskedLM | 64 | 1.0 | 0.8769 | 0.7473 | 1.016 |
| XGLMForCausalLM | 8 | 1.0 | 0.7834 | 0.7117 | 0.9792 |
| MobileBertForQuestionAnswering | 128 | 1.0161 | 1.0064 | 0.6569 | 0.8392 |
| DebertaForMaskedLM | 4 | 0.9316 | 0.9156 | 0.5501 | 0.9978 |
| DebertaV2ForMaskedLM | 1 | 0.977 | 0.9068 | 0.5197 | 0.9665 |
| DebertaV2ForQuestionAnswering | 2 | 0.9762 | 0.9763 | 0.487 | 0.9802 |
| DebertaForQuestionAnswering | 8 | 0.9525 | 1.0537 | 0.4601 | 1.1526 |
| BlenderbotForCausalLM | 4 | 0.9978 | 0.9099 | nan | 0.999 |
| AllenaiLongformerBase | 4 | 0.9508 | 0.8684 | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs
| +-----------------------------------------+-----+----------+-----------+----------+------------------------+ | AlbertForMaskedLM | 4 | 266.0741 | 300.5957 | 162.3949 | 162.7744 | | AlbertForQuestionAnswering | 4 | 263.9479 | 297.7421 | 160.4003 | 160.7389 | | XLNetLMHeadModel | 8 | 281.0544 | 288.5521 | 155.1764 | 152.0193 | | DebertaV2ForQuestionAnswering | 2 | 156.3569 | 204.4826 | 130.8562 | 176.3497 | | DebertaV2ForMaskedLM | 1 | 152.8454 | 198.3433 | 122.8313 | 170.5682 | | PegasusForConditionalGeneration | 32 | 140.5407 | 147.003 | 113.8014 | 111.6971 | | TrOCRForCausalLM | 32 | 138.9949 | 143.9675 | 110.2195 | 106.7772 | | MBartForConditionalGeneration | 2 | 139.5138 | 144.5445 | 95.3065 | 101.5438 | | BartForConditionalGeneration | 2 | 138.9865 | 141.9254 | 94.5213 | 99.6872 | | MegatronBertForQuestionAnswering | 8 | 144.6633 | 147.2943 | 88.5423 | 87.1288 | | YituTechConvBert | 16 | 127.002 | 130.6102 | 83.1027 | 83.9486 | | BlenderbotSmallForConditionalGeneration | 64 | 114.5927 | 120.5037 | 81.1285 | 79.534 | | MobileBertForQuestionAnswering | 128 | 177.187 | 208.6011 | 80.8667 | 166.8054 | | CamemBert | 16 | 119.8016 | 122.7432 | 76.5838 | 77.1308 | | MBartForCausalLM | 4 | 115.3325 | 118.9225 | 75.7891 | 73.5422 | | M2M100ForConditionalGeneration | 16 | 128.6238 | 133.153 | 75.3887 | 98.9848 | | BartForCausalLM | 4 | 115.0677 | 118.3685 | 74.6634 | 73.1909 | | MobileBertForMaskedLM | 64 | 180.1248 | 213.0675 | 73.625 | 167.0572 | | PLBartForConditionalGeneration | 4 | 119.2534 | 123.0497 | 73.2631 | 72.1816 | | DebertaForQuestionAnswering | 8 | 95.34 | 108.4551 | 72.6669 | 78.8288 | | DistilBertForQuestionAnswering | 256 | 103.9379 | 104.6612 | 71.6685 | 71.6706 | | LayoutLMForMaskedLM | 16 | 114.0344 | 116.8081 | 71.225 | 70.7144 | | PLBartForCausalLM | 8 | 117.5152 | 117.9444 | 70.2469 | 68.9389 | | DistilBertForMaskedLM | 128 | 85.2454 | 88.9772 | 70.024 | 68.646 | | OPTForCausalLM | 2 | 170.5106 | 182.1011 | 69.414 | 68.1763 | | BertForMaskedLM | 
16 | 111.5972 | 114.3774 | 68.9088 | 69.4608 | | RobertaForCausalLM | 16 | 116.5124 | 119.445 | 68.5252 | 69.0099 | | DebertaForMaskedLM | 4 | 88.363 | 121.4233 | 66.3399 | 82.1714 | | T5Small | 4 | 106.3489 | 122.9907 | 64.4048 | 60.8002 | | T5ForConditionalGeneration | 4 | 106.5098 | 122.9733 | 64.2992 | 60.4403 | | DistillGPT2 | 16 | 107.1357 | 110.7336 | 63.7492 | 62.1923 | | MegatronBertForCausalLM | 4 | 88.8653 | 95.8709 | 59.3401 | 58.3729 | | PegasusForCausalLM | 32 | 71.0425 | 74.4433 | 59.0177 | 58.396 | | XGLMForCausalLM | 8 | 90.0649 | 110.7601 | 54.2954 | 80.4944 | | LayoutLMForSequenceClassification | 16 | 99.102 | 100.4559 | 54.2432 | 54.6501 | | ElectraForQuestionAnswering | 64 | 116.1258 | 117.7138 | 54.0483 | 55.3243 | | RobertaForQuestionAnswering | 16 | 96.963 | 98.5077 | 53.6013 | 54.3902 | | BertForQuestionAnswering | 16 | 96.6566 | 98.0127 | 53.5906 | 54.0604 | | ElectraForCausalLM | 32 | 89.58 | 93.7388 | 47.6422 | 48.9399 | | BlenderbotSmallForCausalLM | 64 | 59.1455 | 64.8676 | 47.4692 | 46.018 | | MT5ForConditionalGeneration | 16 | 94.1362 | 109.5801 | 44.2144 | 50.4218 | | GPT2ForSequenceClassification | 4 | 93.8274 | 96.1364 | 40.721 | 40.0718 | | Speech2Text2ForCausalLM | 256 | 55.0411 | 57.9019 | 35.1635 | 34.3504 | | BlenderbotForCausalLM | 4 | 112.6806 | 122.6812 | nan | 89.3649 | | AllenaiLongformerBase | 4 | 180.6334 | 271.5463 | nan | nan | +-----------------------------------------+-----+----------+-----------+----------+------------------------+ ~~~

timm_models suite with amp precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------+------------------------+ | tnt_s_patch16_224 | 128 | 0.9984 | 0.9975 | 3.0126 | 2.9689 | | xcit_large_24_p8_224 | 5 | 0.9916 | 0.8567 | 2.0536 | 1.5766 | | twins_pcpvt_base | 64 | 0.9964 | 0.9026 | 1.9888 | 1.6746 | | coat_lite_mini | 128 | 0.9967 | 0.9949 | 1.9441 | 1.9189 | | ghostnet_100 | 128 | 0.9921 | 0.7618 | 1.8488 | 1.6141 | | gmlp_s16_224 | 128 | 0.9946 | 1.0823 | 1.8427 | 1.8272 | | gmixer_24_224 | 128 | 0.9952 | 0.8887 | 1.7584 | 1.749 | | volo_d1_224 | 64 | 0.9943 | 0.9732 | 1.6883 | 1.6657 | | lcnet_050 | 128 | 0.9406 | 0.7365 | 1.6844 | 1.4333 | | crossvit_9_240 | 128 | 0.9907 | 0.7829 | 1.6438 | 1.6158 | | swin_base_patch4_window7_224 | 64 | 0.9906 | 0.9424 | 1.6135 | 1.607 | | convit_base | 64 | 0.998 | 0.9976 | 1.6129 | 1.6106 | | gluon_inception_v3 | 128 | 0.9965 | 0.8652 | 1.5319 | 1.5218 | | inception_v3 | 128 | 0.9962 | 0.8642 | 1.5309 | 1.519 | | adv_inception_v3 | 128 | 0.9964 | 0.8603 | 1.5307 | 1.5178 | | dla102 | 128 | 0.9956 | 0.8148 | 1.5256 | 1.5213 | | convnext_base | 64 | 0.9837 | 0.9843 | 1.4875 | 1.4696 | | nfnet_l0 | 128 | 0.9892 | 0.8141 | 1.4861 | 1.4363 | | sebotnet33ts_256 | 64 | 0.9567 | 0.7649 | 1.4808 | 1.5326 | | dm_nfnet_f0 | 128 | 0.9873 | 0.9852 | 1.4754 | 1.4281 | | eca_botnext26ts_256 | 128 | 0.9735 | 0.7194 | 1.4387 | 1.4237 | | mobilenetv3_large_100 | 128 | 0.949 | 0.7604 | 1.4347 | 1.3885 | | pit_b_224 | 64 | 0.9947 | 0.9925 | 1.4347 | 1.4287 | | resnest101e | 64 | 0.9942 | 0.8678 | 1.4338 | 1.3532 | | mnasnet_100 | 128 | 0.948 | 0.7407 | 1.429 | 1.4981 | | mobilevit_s | 64 | 0.9614 | 0.7305 | 1.4264 | 1.4403 | | regnety_002 | 128 | 0.9505 | 0.7097 | 1.4125 | 1.2311 | | selecsls42b | 128 | 0.9985 | 0.8117 | 1.4105 | 1.4118 | | botnet26t_256 | 128 | 
0.973 | 0.8519 | 1.4081 | 1.4225 | | res2net50_14w_8s | 128 | 0.9988 | 0.7899 | 1.3787 | 1.3566 | | res2next50 | 128 | 0.9991 | 0.8256 | 1.3711 | 1.3638 | | jx_nest_base | 32 | 0.9869 | 0.9851 | 1.3661 | 1.3574 | | mixer_b16_224 | 128 | 0.997 | 1.0181 | 1.3622 | 1.3601 | | hrnet_w18 | 128 | 0.9925 | 0.6446 | 1.3579 | 1.3448 | | mobilenetv2_100 | 128 | 0.9486 | 0.7368 | 1.3578 | 1.4448 | | spnasnet_100 | 128 | 0.9413 | 0.7389 | 1.3569 | 1.4187 | | ese_vovnet19b_dw | 128 | 0.9582 | 0.8331 | 1.3544 | 1.3722 | | beit_base_patch16_224 | 64 | 0.9964 | 0.9584 | 1.3519 | 1.352 | | fbnetc_100 | 128 | 0.9497 | 0.7394 | 1.3515 | 1.404 | | cait_m36_384 | 4 | 0.9948 | 0.9439 | 1.3501 | 1.3483 | | tf_efficientnet_b0 | 128 | 0.9603 | 0.6814 | 1.3498 | 1.384 | | poolformer_m36 | 64 | 0.9863 | 0.9834 | 1.3271 | 1.3175 | | fbnetv3_b | 128 | 0.949 | 0.7693 | 1.3139 | 1.2565 | | rexnet_100 | 128 | 0.9515 | 0.7031 | 1.2965 | 1.3361 | | resmlp_12_224 | 128 | 0.9931 | 0.8893 | 1.2598 | 1.2563 | | deit_base_distilled_patch16_224 | 64 | 0.9962 | 0.994 | 1.2546 | 1.2545 | | vit_base_patch16_224 | 64 | 0.9961 | 0.9936 | 1.2353 | 1.2352 | | tinynet_a | 128 | 0.9471 | 0.6782 | 1.2245 | 1.2324 | | cspdarknet53 | 64 | 0.9329 | 0.7858 | 1.2226 | 1.2588 | | tf_mixnet_l | 128 | 0.9758 | 0.8265 | 1.1838 | 1.191 | | mixnet_l | 128 | 0.9763 | 0.8206 | 1.1745 | 1.1816 | | visformer_small | 128 | 0.9962 | 0.9449 | 1.1736 | 1.1656 | | res2net101_26w_4s | 64 | 0.998 | 0.7839 | 1.1582 | 1.0901 | | pnasnet5large | 16 | 0.9853 | 0.9189 | 1.0927 | 1.1131 | | dpn107 | 32 | 0.932 | 0.8074 | 1.0901 | 1.1336 | | repvgg_a2 | 128 | 0.9348 | 0.7549 | 1.087 | 1.118 | | gluon_xception65 | 32 | 0.9921 | 0.8422 | 1.0751 | 1.0787 | | swsl_resnext101_32x16d | 32 | 0.9976 | 0.8426 | 1.0564 | 1.0211 | | gernet_l | 128 | 0.9354 | 0.7935 | 1.0215 | 1.0663 | | convmixer_768_32 | 32 | 0.9986 | 0.9645 | 1.0016 | 1.0027 | +---------------------------------+-----+--------+-----------+----------+------------------------+ ~~~ 
Accuracy ~~~ +---------------------------------+----+-------+---------------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------+---------------+----------+------------------------+ | adv_inception_v3 | 8 | pass | pass | pass | pass | | beit_base_patch16_224 | 8 | pass | pass | pass | pass | | mobilevit_s | 8 | pass | pass | pass | pass | | nfnet_l0 | 8 | pass | pass | pass | pass | | pit_b_224 | 8 | pass | pass | pass | pass | | pnasnet5large | 8 | pass | pass | pass | pass | | poolformer_m36 | 8 | pass | pass | pass | pass | | regnety_002 | 8 | pass | pass | pass | pass | | repvgg_a2 | 8 | pass | pass | pass | pass | | res2net101_26w_4s | 8 | pass | pass | pass | pass | | res2net50_14w_8s | 8 | pass | pass | pass | pass | | res2next50 | 8 | pass | pass | pass | pass | | resmlp_12_224 | 8 | pass | pass | pass | pass | | resnest101e | 8 | pass | pass | pass | pass | | rexnet_100 | 8 | pass | pass | pass | pass | | sebotnet33ts_256 | 8 | pass | pass | pass | pass | | selecsls42b | 8 | pass | pass | pass | pass | | spnasnet_100 | 8 | pass | pass | pass | pass | | swin_base_patch4_window7_224 | 8 | pass | pass | pass | pass | | swsl_resnext101_32x16d | 8 | pass | pass | pass | pass | | tf_efficientnet_b0 | 8 | pass | pass | pass | pass | | tf_mixnet_l | 8 | pass | pass | pass | pass | | tnt_s_patch16_224 | 8 | pass | pass | pass | pass | | twins_pcpvt_base | 8 | pass | pass | pass | pass | | visformer_small | 8 | pass | pass | pass | pass | | vit_base_patch16_224 | 8 | pass | pass | pass | pass | | volo_d1_224 | 8 | pass | pass | pass | pass | | xcit_large_24_p8_224 | 8 | pass | pass | pass | pass | | lcnet_050 | 8 | pass | fail_accuracy | pass | pass | | mobilenetv3_large_100 | 8 | pass | pass | pass | pass | | mobilenetv2_100 | 8 | pass | pass | pass | pass | | mnasnet_100 | 8 | pass | pass | pass | pass | | eca_botnext26ts_256 | 8 | pass | pass | pass | pass | | 
botnet26t_256 | 8 | pass | pass | pass | pass | | cait_m36_384 | 4 | pass | pass | pass | pass | | coat_lite_mini | 8 | pass | pass | pass | pass | | convit_base | 8 | pass | pass | pass | pass | | convmixer_768_32 | 8 | pass | pass | pass | pass | | convnext_base | 8 | pass | pass | pass | pass | | crossvit_9_240 | 8 | pass | pass | pass | pass | | cspdarknet53 | 8 | pass | pass | pass | pass | | deit_base_distilled_patch16_224 | 8 | pass | pass | pass | pass | | dla102 | 8 | pass | pass | pass | pass | | dm_nfnet_f0 | 8 | pass | pass | pass | pass | | dpn107 | 8 | pass | pass | pass | pass | | ese_vovnet19b_dw | 8 | pass | pass | pass | pass | | mixnet_l | 8 | pass | pass | pass | pass | | fbnetc_100 | 8 | pass | pass | pass | pass | | fbnetv3_b | 8 | pass | pass | pass | pass | | gernet_l | 8 | pass | pass | pass | pass | | ghostnet_100 | 8 | pass | pass | pass | pass | | gluon_inception_v3 | 8 | pass | pass | pass | pass | | gluon_xception65 | 8 | pass | pass | pass | pass | | gmixer_24_224 | 8 | pass | pass | pass | pass | | gmlp_s16_224 | 8 | pass | pass | pass | pass | | hrnet_w18 | 8 | pass | pass | pass | pass | | inception_v3 | 8 | pass | pass | pass | pass | | jx_nest_base | 8 | pass | pass | pass | pass | | mixer_b16_224 | 8 | pass | pass | pass | pass | | tinynet_a | 8 | pass | fail_accuracy | pass | pass | +---------------------------------+----+-------+---------------+----------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+---------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+-----+---------+-----------+----------+------------------------+ | rexnet_100 | 128 | 5.6993 | 11.1994 | 275.1515 | 276.6628 | | hrnet_w18 | 128 | 9.5989 | 36.4291 | 255.7748 | 249.9176 | | ghostnet_100 | 128 | 7.9291 | 14.9681 | 244.5686 | 243.1281 | | fbnetv3_b | 128 | 8.3509 | 16.9235 | 178.1302 | 174.7349 | | 
pnasnet5large | 16 | 8.2713 | 26.1052 | 167.4111 | 161.8981 | | resnest101e | 64 | 11.0908 | 24.5085 | 166.679 | 168.3601 | | mobilevit_s | 64 | 5.3554 | 11.389 | 164.6195 | 161.0406 | | gluon_inception_v3 | 128 | 5.6874 | 12.6224 | 162.3067 | 161.1515 | | adv_inception_v3 | 128 | 5.7431 | 12.661 | 162.0021 | 163.0911 | | tinynet_a | 128 | 5.9901 | 12.2575 | 160.6803 | 156.4597 | | mobilenetv3_large_100 | 128 | 4.2468 | 8.4014 | 160.5822 | 153.4424 | | mixnet_l | 128 | 8.7634 | 16.282 | 159.9061 | 158.8318 | | inception_v3 | 128 | 6.0812 | 12.5695 | 156.6721 | 159.4795 | | tf_mixnet_l | 128 | 9.4942 | 16.827 | 156.3445 | 156.0664 | | res2net101_26w_4s | 64 | 10.8721 | 25.1048 | 153.8691 | 153.8122 | | twins_pcpvt_base | 64 | 10.5615 | 23.4808 | 149.9154 | 147.8376 | | tf_efficientnet_b0 | 128 | 5.1005 | 10.495 | 149.8097 | 154.6713 | | fbnetc_100 | 128 | 4.9279 | 9.256 | 136.5809 | 133.1516 | | spnasnet_100 | 128 | 5.0268 | 9.2387 | 136.5 | 137.2727 | | xcit_large_24_p8_224 | 5 | 13.4196 | 28.5272 | 135.2229 | 132.5646 | | mobilenetv2_100 | 128 | 4.0544 | 7.9001 | 130.7307 | 133.0086 | | mnasnet_100 | 128 | 4.1163 | 8.1373 | 126.2514 | 126.592 | | res2net50_14w_8s | 128 | 9.4369 | 22.5034 | 123.4541 | 126.3791 | | cait_m36_384 | 4 | 14.6993 | 32.7877 | 117.823 | 115.3478 | | swin_base_patch4_window7_224 | 64 | 8.833 | 19.181 | 112.5906 | 109.616 | | regnety_002 | 128 | 4.8494 | 8.8289 | 109.1437 | 108.2611 | | sebotnet33ts_256 | 64 | 4.2077 | 8.8238 | 107.283 | 106.4203 | | cspdarknet53 | 64 | 5.7987 | 10.8478 | 102.8756 | 102.9829 | | poolformer_m36 | 64 | 7.6214 | 13.8123 | 102.5988 | 101.2815 | | dpn107 | 32 | 9.768 | 19.4637 | 102.452 | 99.7743 | | eca_botnext26ts_256 | 128 | 3.0752 | 6.8258 | 101.9253 | 99.7253 | | dla102 | 128 | 6.2123 | 14.0217 | 99.5575 | 98.5763 | | lcnet_050 | 128 | 2.5388 | 4.9787 | 96.5038 | 100.1957 | | gluon_xception65 | 32 | 7.8211 | 16.9041 | 96.4024 | 96.3856 | | botnet26t_256 | 128 | 3.0199 | 5.9755 | 93.8917 | 91.0777 | | 
selecsls42b | 128 | 2.5041 | 5.3691 | 93.0398 | 90.9342 | | res2next50 | 128 | 5.0693 | 12.1397 | 90.5335 | 87.4438 | | coat_lite_mini | 128 | 3.3219 | 8.3018 | 90.2117 | 91.3832 | | crossvit_9_240 | 128 | 5.8162 | 13.2613 | 87.8356 | 88.4924 | | jx_nest_base | 32 | 6.6893 | 14.8517 | 85.221 | 83.1627 | | gernet_l | 128 | 5.0745 | 8.8521 | 82.5061 | 82.9219 | | nfnet_l0 | 128 | 5.3011 | 11.0453 | 81.5994 | 79.7887 | | ese_vovnet19b_dw | 128 | 2.5455 | 4.5574 | 77.4739 | 79.0887 | | volo_d1_224 | 64 | 5.3472 | 11.8284 | 75.533 | 75.7515 | | dm_nfnet_f0 | 128 | 5.9807 | 11.4623 | 74.5388 | 74.8234 | | tnt_s_patch16_224 | 128 | 6.9195 | 17.0106 | 69.4291 | 71.1891 | | visformer_small | 128 | 2.6158 | 6.0686 | 68.4679 | 67.1496 | | swsl_resnext101_32x16d | 32 | 6.1312 | 13.6653 | 65.6469 | 62.8965 | | repvgg_a2 | 128 | 4.861 | 9.2605 | 62.4885 | 61.1467 | | gmlp_s16_224 | 128 | 5.6677 | 11.9838 | 61.9518 | 62.6159 | | convnext_base | 64 | 6.6326 | 12.5169 | 61.102 | 59.6916 | | gmixer_24_224 | 128 | 5.8246 | 12.8653 | 53.639 | 52.8536 | | convit_base | 64 | 3.689 | 8.5796 | 49.5443 | 49.2829 | | pit_b_224 | 64 | 3.3955 | 7.9829 | 47.546 | 47.3044 | | deit_base_distilled_patch16_224 | 64 | 3.2802 | 7.1145 | 45.2268 | 43.3123 | | resmlp_12_224 | 128 | 2.8034 | 5.2855 | 42.2218 | 41.8847 | | vit_base_patch16_224 | 64 | 3.091 | 7.0077 | 41.5768 | 40.4922 | | convmixer_768_32 | 32 | 1.678 | 6.8306 | 39.8388 | 37.2196 | | beit_base_patch16_224 | 64 | 3.9267 | 9.2819 | 36.842 | 34.9824 | | mixer_b16_224 | 128 | 2.7715 | 5.8739 | 34.7621 | 33.8914 | +---------------------------------+-----+---------+-----------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------+------------------------+ | gmlp_s16_224 | 128 | 0.9951 
| 0.9727 | 1.1858 | 1.2049 | | pnasnet5large | 16 | 1.059 | 0.9907 | 1.1712 | 1.2836 | | gmixer_24_224 | 128 | 0.9928 | 0.9706 | 1.1129 | 1.1596 | | convit_base | 64 | 0.9967 | 0.8482 | 1.0948 | 1.157 | | mobilenetv2_100 | 128 | 0.9865 | 0.7647 | 1.0266 | 1.1179 | | dm_nfnet_f0 | 128 | 0.9742 | 0.8946 | 1.013 | 1.0845 | | resmlp_12_224 | 128 | 0.9826 | 0.9506 | 1.0099 | 1.0351 | | tinynet_a | 128 | 0.9892 | 0.7906 | 0.9984 | 1.0721 | | resnest101e | 64 | 0.9947 | 0.9986 | 0.9972 | 1.0876 | | tf_efficientnet_b0 | 128 | 0.9863 | 0.7735 | 0.9872 | 1.0728 | | tnt_s_patch16_224 | 128 | 0.9947 | 0.9729 | 0.9834 | 1.0506 | | convmixer_768_32 | 32 | 0.9981 | 0.9795 | 0.9762 | 0.9854 | | rexnet_100 | 128 | 0.9898 | 0.7866 | 0.9747 | 1.0457 | | twins_pcpvt_base | 64 | 0.9961 | 0.9232 | 0.9729 | 1.0539 | | mobilevit_s | 64 | 0.9929 | 0.7794 | 0.9557 | 1.0057 | | dla102 | 128 | 0.9634 | 0.9151 | 0.9536 | 1.0326 | | mixer_b16_224 | 128 | 0.9919 | 0.9569 | 0.951 | 0.9948 | | vit_base_patch16_224 | 64 | 0.9949 | 0.9316 | 0.9362 | 0.955 | | deit_base_distilled_patch16_224 | 64 | 0.9942 | 0.9313 | 0.9353 | 0.9528 | | visformer_small | 128 | 0.9896 | 0.9236 | 0.9348 | 1.0194 | | tf_mixnet_l | 128 | 0.9905 | 0.858 | 0.9346 | 1.0675 | | beit_base_patch16_224 | 64 | 0.9949 | 0.9303 | 0.9285 | 0.989 | | fbnetv3_b | 128 | 0.9857 | 0.7935 | 0.9228 | 0.9793 | | nfnet_l0 | 128 | 0.9892 | 0.8404 | 0.9215 | 0.9952 | | volo_d1_224 | 64 | 0.9959 | 0.9469 | 0.9131 | 0.9727 | | cspdarknet53 | 64 | 0.9909 | 0.8538 | 0.9097 | 1.0328 | | ese_vovnet19b_dw | 128 | 0.9861 | 0.8968 | 0.9047 | 0.9903 | | hrnet_w18 | 128 | 0.9909 | 0.9196 | 0.8918 | 0.99 | | sebotnet33ts_256 | 64 | 0.9925 | 0.7116 | 0.891 | 1.1115 | | inception_v3 | 128 | 0.9825 | 0.8621 | 0.8904 | 1.0171 | | gluon_inception_v3 | 128 | 0.9825 | 0.8621 | 0.8904 | 1.0171 | | adv_inception_v3 | 128 | 0.9825 | 0.8621 | 0.8904 | 1.0171 | | dpn107 | 32 | 0.9932 | 0.904 | 0.8833 | 0.9642 | | gluon_xception65 | 32 | 0.9954 | 0.8841 | 0.8831 | 
0.9705 | | ghostnet_100 | 128 | 0.9748 | 0.8689 | 0.8807 | 0.977 | | spnasnet_100 | 128 | 0.9796 | 0.8826 | 0.8786 | 0.9451 | | mobilenetv3_large_100 | 128 | 0.9777 | 0.8424 | 0.877 | 0.9361 | | poolformer_m36 | 64 | 0.9981 | 0.9485 | 0.8768 | 1.1871 | | eca_botnext26ts_256 | 128 | 0.9881 | 0.7722 | 0.8738 | 1.0072 | | xcit_large_24_p8_224 | 5 | 0.9983 | 0.8871 | 0.8721 | 0.9732 | | res2net50_14w_8s | 128 | 0.9912 | 0.9074 | 0.8712 | 0.9607 | | res2net101_26w_4s | 64 | 0.9937 | 0.9132 | 0.871 | 0.9483 | | mixnet_l | 128 | 0.99 | 0.8469 | 0.8687 | 0.9902 | | mnasnet_100 | 128 | 0.9777 | 0.8719 | 0.8683 | 0.9403 | | res2next50 | 128 | 0.9913 | 0.9106 | 0.866 | 0.9547 | | cait_m36_384 | 4 | 0.9998 | 0.913 | 0.8632 | 0.989 | | fbnetc_100 | 128 | 0.9819 | 0.8512 | 0.8596 | 0.9535 | | pit_b_224 | 64 | 0.9969 | 0.8011 | 0.8578 | 1.0242 | | selecsls42b | 128 | 0.9806 | 0.8786 | 0.8576 | 0.9664 | | convnext_base | 64 | 1.001 | 0.924 | 0.8505 | 1.0338 | | gernet_l | 128 | 0.9781 | 0.8499 | 0.8499 | 0.9706 | | swsl_resnext101_32x16d | 32 | 0.998 | 0.8688 | 0.8461 | 0.9786 | | coat_lite_mini | 128 | 1.0337 | 0.9207 | 0.8402 | 1.0202 | | botnet26t_256 | 128 | 0.9842 | 0.8676 | 0.8239 | 0.9779 | | lcnet_050 | 128 | 0.9447 | 0.7712 | 0.805 | 0.884 | | repvgg_a2 | 128 | 0.9761 | 0.7778 | 0.7738 | 0.9611 | | regnety_002 | 128 | 0.9523 | 0.8281 | 0.7602 | 0.8966 | | crossvit_9_240 | 128 | 0.9851 | 0.8711 | 0.7526 | 0.9898 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9204 | 0.7214 | 0.9045 | | jx_nest_base | 32 | 0.9985 | 0.8927 | 0.6693 | 0.9604 | +---------------------------------+-----+--------+-----------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------+------------------------+ | convmixer_768_32 | 32 | 300.4785 | 
311.1037 | 300.2453 | 299.2399 | | hrnet_w18 | 128 | 280.5567 | 433.2557 | 205.2586 | 207.4694 | | pnasnet5large | 16 | 198.6957 | 212.7856 | 179.4515 | 176.8271 | | tf_mixnet_l | 128 | 193.8395 | 228.9561 | 159.9145 | 159.006 | | mixnet_l | 128 | 185.4911 | 220.5022 | 153.9461 | 153.0711 | | cait_m36_384 | 4 | 173.9893 | 181.402 | 123.7248 | 123.6517 | | resnest101e | 64 | 165.259 | 188.3443 | 113.9577 | 121.5095 | | dla102 | 128 | 172.3799 | 210.6412 | 112.5554 | 113.0406 | | swsl_resnext101_32x16d | 32 | 118.6215 | 140.3636 | 111.8232 | 115.6837 | | poolformer_m36 | 64 | 146.8757 | 147.1298 | 108.979 | 109.8046 | | tnt_s_patch16_224 | 128 | 323.4275 | 323.8514 | 107.2166 | 108.8224 | | adv_inception_v3 | 128 | 160.5676 | 185.9636 | 104.6174 | 105.5461 | | inception_v3 | 128 | 160.6964 | 185.2429 | 104.585 | 105.4366 | | gluon_inception_v3 | 128 | 160.8065 | 185.1928 | 104.5689 | 105.2862 | | res2net50_14w_8s | 128 | 140.935 | 177.9539 | 102.2278 | 103.6838 | | convit_base | 64 | 163.2652 | 163.0411 | 100.9347 | 101.0903 | | dpn107 | 32 | 113.7021 | 131.014 | 97.1951 | 93.4699 | | gluon_xception65 | 32 | 99.8143 | 117.2946 | 92.136 | 91.6258 | | res2next50 | 128 | 125.9044 | 152.2697 | 91.7131 | 92.2184 | | swin_base_patch4_window7_224 | 64 | 147.6003 | 154.6088 | 90.4329 | 90.7476 | | dm_nfnet_f0 | 128 | 128.6559 | 128.8463 | 85.7176 | 88.8292 | | mixer_b16_224 | 128 | 116.6407 | 114.2483 | 85.6271 | 85.5345 | | res2net101_26w_4s | 64 | 100.7719 | 126.5222 | 85.1775 | 91.9015 | | fbnetv3_b | 128 | 115.2918 | 142.0549 | 83.2112 | 87.1757 | | pit_b_224 | 64 | 118.7083 | 119.0224 | 82.2987 | 82.554 | | convnext_base | 64 | 124.4767 | 123.9587 | 82.125 | 83.2952 | | visformer_small | 128 | 91.2132 | 96.1167 | 77.5302 | 77.9409 | | nfnet_l0 | 128 | 112.9672 | 136.6466 | 75.1601 | 77.8623 | | beit_base_patch16_224 | 64 | 101.5011 | 105.6243 | 74.93 | 74.7418 | | gmlp_s16_224 | 128 | 137.4137 | 126.3103 | 74.3719 | 74.8948 | | eca_botnext26ts_256 | 128 | 108.6756 | 
147.1409 | 73.6353 | 74.3115 | | jx_nest_base | 32 | 101.4118 | 101.6103 | 73.2177 | 73.8287 | | cspdarknet53 | 64 | 94.887 | 112.5543 | 72.5498 | 70.3858 | | gernet_l | 128 | 77.7055 | 91.5737 | 71.2669 | 68.2328 | | volo_d1_224 | 64 | 121.1454 | 123.3736 | 71.2254 | 72.128 | | botnet26t_256 | 128 | 101.7933 | 116.3549 | 70.4664 | 69.7069 | | vit_base_patch16_224 | 64 | 86.9746 | 87.1374 | 70.1112 | 69.9958 | | deit_base_distilled_patch16_224 | 64 | 84.9154 | 85.0308 | 67.423 | 67.449 | | repvgg_a2 | 128 | 77.6178 | 96.1671 | 66.8631 | 64.9452 | | gmixer_24_224 | 128 | 118.032 | 131.9766 | 66.8484 | 67.1371 | | xcit_large_24_p8_224 | 5 | 144.0041 | 166.1195 | 62.3256 | 77.9301 | | tf_efficientnet_b0 | 128 | 84.6526 | 119.5026 | 60.2613 | 58.8375 | | twins_pcpvt_base | 64 | 117.7882 | 141.6634 | 60.2158 | 69.1968 | | rexnet_100 | 128 | 80.0842 | 108.2213 | 58.7168 | 56.9406 | | fbnetc_100 | 128 | 82.7335 | 106.2835 | 58.1617 | 55.9695 | | coat_lite_mini | 128 | 112.9578 | 113.2281 | 58.0038 | 58.6439 | | mobilevit_s | 64 | 84.5601 | 111.3721 | 56.9916 | 56.5112 | | tinynet_a | 128 | 73.5434 | 102.6301 | 56.8566 | 56.5248 | | sebotnet33ts_256 | 64 | 80.5053 | 100.537 | 51.9924 | 50.285 | | crossvit_9_240 | 128 | 82.4008 | 104.1625 | 49.7802 | 50.4592 | | spnasnet_100 | 128 | 70.4153 | 89.5947 | 48.8766 | 46.7056 | | ghostnet_100 | 128 | 90.5123 | 117.7778 | 48.5371 | 55.7042 | | mobilenetv2_100 | 128 | 65.4555 | 84.3222 | 45.8039 | 43.0103 | | ese_vovnet19b_dw | 128 | 64.5312 | 74.3313 | 45.7309 | 45.0924 | | mnasnet_100 | 128 | 64.216 | 82.3891 | 42.6134 | 40.6277 | | selecsls42b | 128 | 60.0775 | 73.8697 | 42.506 | 42.4269 | | resmlp_12_224 | 128 | 53.4379 | 59.7054 | 42.1271 | 42.2446 | | mobilenetv3_large_100 | 128 | 61.3007 | 76.4433 | 40.5678 | 41.9268 | | regnety_002 | 128 | 41.2991 | 55.4889 | 26.5131 | 30.3718 | | lcnet_050 | 128 | 31.6951 | 40.4419 | 17.6729 | 20.7928 | 
+---------------------------------+-----+----------+-----------+----------+------------------------+ ~~~

Performance graphs

/data/home/williamwen/cluster/oneoff_cron_logs/day_100_10_04_23_performance_amp_229/huggingface_amp.png : ![](https://i.imgur.com/mUGyDSe.png)

/data/home/williamwen/cluster/oneoff_cron_logs/day_100_10_04_23_performance_amp_229/timm_models_amp.png : ![](https://i.imgur.com/Me5BqVt.png)

/data/home/williamwen/cluster/oneoff_cron_logs/day_100_10_04_23_performance_amp_229/torchbench_amp.png : ![](https://i.imgur.com/f0dDr9F.png)

Build Summary

### Run name ###
day_100_10_04_23_performance_amp_229

### Commit hashes ###
pytorch commit: f55e72c0f6bd6da016aaa51de379e6ba6d7891cc
pytorch commit date: 2023-04-07 17:30:27+00:00
torchbench commit: 735f1927996c8d9ab81f0b0c05dd1ebdb26a6250
torchbench commit date: 2023-04-05 09:43:21-07:00

### TorchDynamo config flags ###

### Torch version ###
torch: 2.1.0a0+gitf55e72c

### Environment variables ###
TORCH_CUDA_ARCH_LIST = 8.0
CUDA_HOME = /usr/local/cuda-11.6
USE_LLVM = /usr/lib/llvm-10

### GPU details ###
CUDNN VERSION: 8401
Number CUDA Devices: 2
Device Name: NVIDIA A100-SXM4-40GB
Device Memory [GB]: 42.481549312

williamwen42 commented 1 year ago

Performance Dashboard for float32 precision (2.0 release binary oneoff)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs. For training, each experiment runs one iteration of the forward and backward pass; for inference, it runs the forward pass only. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
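As a rough illustration of the aggregation described above, the sketch below drops models that fail the accuracy check and then takes the geometric mean of the per-model speedups. It is not the dashboard's actual code: the record layout, the function names, and the definition of speedup as eager latency divided by compiled latency are assumptions made for this example, and the numbers are invented.

```python
import math


def geomean(values):
    """Geometric mean of a non-empty sequence of positive numbers."""
    values = list(values)
    if not values:
        raise ValueError("geomean needs at least one value")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def suite_geomean_speedup(results):
    """Geometric mean speedup over the models that pass accuracy.

    Each record is a dict with hypothetical keys:
    name, accuracy, eager_ms, compiled_ms.
    Speedup is assumed to be eager latency / compiled latency.
    """
    passing = [r for r in results if r["accuracy"] == "pass"]
    return geomean(r["eager_ms"] / r["compiled_ms"] for r in passing)


# Invented numbers, not taken from the tables in this report.
results = [
    {"name": "model_a", "accuracy": "pass", "eager_ms": 100.0, "compiled_ms": 50.0},
    {"name": "model_b", "accuracy": "pass", "eager_ms": 80.0, "compiled_ms": 40.0},
    # Excluded from the mean because it fails the accuracy check.
    {"name": "model_c", "accuracy": "fail_accuracy", "eager_ms": 90.0, "compiled_ms": 10.0},
]

print(f"{suite_geomean_speedup(results):.2f}x")
```

A geometric mean is the usual choice for averaging ratios, since a 2x speedup and a 0.5x slowdown then cancel out to 1x.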

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 82%, 50/61 | 100%, 46/46 | 100%, 60/60 |
|       aot_eager        | 77%, 47/61 | 100%, 46/46 | 100%, 60/60 |
|        inductor        | 74%, 45/61 | 93%, 43/46  | 100%, 60/60 |
| inductor_no_cudagraphs | 75%, 46/61 | 98%, 45/46  | 100%, 60/60 |
+------------------------+------------+-------------+-------------+

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.00x    |    1.00x    |
|       aot_eager        |   1.00x    |    1.00x    |    1.00x    |
|        inductor        |   1.32x    |    1.22x    |    1.23x    |
| inductor_no_cudagraphs |   1.18x    |    1.22x    |    1.23x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    3.64    |    4.90     |    4.08     |
|       aot_eager        |    7.69    |    11.32    |    9.93     |
|        inductor        |   59.38    |    51.27    |   100.75    |
| inductor_no_cudagraphs |   58.66    |    47.84    |    99.83    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.99x    |    1.00x    |    1.00x    |
|       aot_eager        |   0.88x    |    0.92x    |    0.89x    |
|        inductor        |   0.81x    |    0.84x    |    0.92x    |
| inductor_no_cudagraphs |   0.97x    |    0.98x    |    1.02x    |
+------------------------+------------+-------------+-------------+

Warnings

We flag models where:
- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings

~~~
+-------------+-------------------------------+------------------------+-----------------+
| suite       | name                          | inductor_no_cudagraphs | inductor        |
+-------------+-------------------------------+------------------------+-----------------+
| torchbench  | moco                          | fail_to_run            | fail_to_run     |
| torchbench  | resnet50_quantized_qat        | fail_to_run            | fail_to_run     |
| torchbench  | mobilenet_v2_quantized_qat    | fail_to_run            | fail_to_run     |
| torchbench  | hf_Longformer                 | fail_to_run            | fail_to_run     |
| torchbench  | Background_Matting            | eager_variation        | eager_variation |
| torchbench  | Super_SloMo                   | eager_variation        | eager_variation |
| torchbench  | alexnet                       | eager_variation        | eager_variation |
| torchbench  | pytorch_CycleGAN_and_pix2pix  | eager_variation        | eager_variation |
| torchbench  | pytorch_unet                  | eager_variation        | eager_variation |
| torchbench  | vgg16                         | eager_variation        | eager_variation |
| torchbench  | vision_maskrcnn               | eager_variation        | eager_variation |
| torchbench  | tacotron2                     | 0.0000                 | 0.0000          |
| torchbench  | gat                           | 0.0000                 | 0.0000          |
| torchbench  | gcn                           | 0.0000                 | 0.0000          |
| torchbench  | llama                         | 0.0000                 | 0.0000          |
| torchbench  | sage                          | 0.0000                 | 0.0000          |
| torchbench  | torchrec_dlrm                 | 0.0000                 | 0.0000          |
| huggingface | DebertaV2ForQuestionAnswering | pass                   | fail_to_run     |
+-------------+-------------------------------+------------------------+-----------------+
~~~

Performance speedup warnings

~~~
+-------------+-------------------------------+------------------------+----------+
| suite       | name                          | inductor_no_cudagraphs | inductor |
+-------------+-------------------------------+------------------------+----------+
| torchbench  | lennard_jones                 | 0.883                  | 1.3312   |
| torchbench  | dcgan                         | 0.8639                 | 1.2259   |
| torchbench  | soft_actor_critic             | 0.794                  | 1.0635   |
| torchbench  | timm_vovnet                   | 0.9712                 | 0.9419   |
| torchbench  | nvidia_deeprecommender        | 0.9666                 | 0.7988   |
| torchbench  | gat                           | 0.0                    | 0.0      |
| torchbench  | tacotron2                     | 0.0                    | 0.0      |
| torchbench  | sage                          | 0.0                    | 0.0      |
| torchbench  | gcn                           | 0.0                    | 0.0      |
| torchbench  | hf_GPT2_large                 | 1.3826                 | 0.0      |
| torchbench  | moco                          | 0.0                    | 0.0      |
| torchbench  | hf_Longformer                 | 0.0                    | 0.0      |
| torchbench  | resnet50_quantized_qat        | 0.0                    | 0.0      |
| torchbench  | mobilenet_v2_quantized_qat    | 0.0                    | 0.0      |
| torchbench  | torchrec_dlrm                 | 0.0                    | 0.0      |
| huggingface | DebertaForMaskedLM            | 0.8475                 | 0.8603   |
| huggingface | DebertaV2ForQuestionAnswering | 0.7018                 | 0.7474   |
| huggingface | DebertaV2ForMaskedLM          | 0.6235                 | 0.7285   |
| huggingface | BlenderbotForCausalLM         | 1.0243                 | 0.0      |
| huggingface | AllenaiLongformerBase         | 0.0                    | 0.0      |
| timm_models | resmlp_12_224                 | 0.9299                 | 0.9303   |
+-------------+-------------------------------+------------------------+----------+
~~~

Compilation latency (sec) warnings

~~~
+-------------+-------------------------------+------------------------+----------+
| suite       | name                          | inductor_no_cudagraphs | inductor |
+-------------+-------------------------------+------------------------+----------+
| torchbench  | phlippe_densenet              | 164.3401               | 164.7971 |
| torchbench  | hf_T5_large                   | 147.2658               | 149.7604 |
| torchbench  | timm_efficientnet             | 133.6245               | 135.2297 |
| torchbench  | mobilenet_v3_large            | 127.5956               | 131.7941 |
| torchbench  | hf_BigBird                    | 115.2682               | 131.5293 |
| torchbench  | mobilenet_v2                  | 121.9159               | 125.891  |
| torchbench  | densenet121                   | 128.3598               | 125.1659 |
| huggingface | MT5ForConditionalGeneration   | 124.5428               | 123.6133 |
| huggingface | DebertaV2ForMaskedLM          | 56.2815                | 122.3893 |
| huggingface | DebertaV2ForQuestionAnswering | 54.1742                | 120.5788 |
| timm_models | rexnet_100                    | 273.9646               | 276.7602 |
| timm_models | hrnet_w18                     | 225.2791               | 233.938  |
| timm_models | ghostnet_100                  | 229.9564               | 231.1308 |
| timm_models | mobilevit_s                   | 178.0214               | 183.8375 |
| timm_models | fbnetv3_b                     | 157.2946               | 160.7021 |
| timm_models | gluon_inception_v3            | 146.5848               | 153.5986 |
| timm_models | inception_v3                  | 148.1424               | 152.5033 |
| timm_models | mobilenetv3_large_100         | 143.8063               | 152.1616 |
| timm_models | adv_inception_v3              | 152.1385               | 149.3431 |
| timm_models | tinynet_a                     | 146.7822               | 148.3809 |
| timm_models | tf_efficientnet_b0            | 143.9997               | 146.624  |
| timm_models | mixnet_l                      | 138.3035               | 146.5976 |
| timm_models | pnasnet5large                 | 147.4728               | 146.1797 |
| timm_models | resnest101e                   | 147.637                | 145.8378 |
| timm_models | tf_mixnet_l                   | 147.7897               | 141.5892 |
| timm_models | res2net101_26w_4s             | 139.8832               | 133.0005 |
| timm_models | fbnetc_100                    | 130.5764               | 132.591  |
| timm_models | spnasnet_100                  | 131.7982               | 129.5098 |
| timm_models | mobilenetv2_100               | 123.3439               | 126.203  |
+-------------+-------------------------------+------------------------+----------+
~~~

Peak Memory Compression Ratio warnings

~~~
+-------------+-----------------------------------------+------------------------+----------+
| suite       | name                                    | inductor_no_cudagraphs | inductor |
+-------------+-----------------------------------------+------------------------+----------+
| torchbench  | pytorch_stargan                         | 1.0715                 | 0.8997   |
| torchbench  | timm_resnest                            | 1.0032                 | 0.8975   |
| torchbench  | resnet152                               | 0.9666                 | 0.8892   |
| torchbench  | timm_vision_transformer                 | 0.9267                 | 0.8846   |
| torchbench  | hf_T5                                   | 1.1711                 | 0.8774   |
| torchbench  | timm_nfnet                              | 1.1331                 | 0.8734   |
| torchbench  | timm_regnet                             | 0.982                  | 0.8628   |
| torchbench  | phlippe_densenet                        | 0.9199                 | 0.8562   |
| torchbench  | pytorch_unet                            | 0.9923                 | 0.8501   |
| torchbench  | mobilenet_v3_large                      | 0.9276                 | 0.8424   |
| torchbench  | resnet50                                | 0.9405                 | 0.8404   |
| torchbench  | speech_transformer                      | 0.844                  | 0.84     |
| torchbench  | alexnet                                 | 1.0006                 | 0.8346   |
| torchbench  | hf_DistilBert                           | 0.9835                 | 0.8317   |
| torchbench  | dcgan                                   | 0.9932                 | 0.8287   |
| torchbench  | resnext50_32x4d                         | 0.919                  | 0.8236   |
| torchbench  | hf_T5_large                             | 1.1293                 | 0.8219   |
| torchbench  | squeezenet1_1                           | 0.9868 
| 0.8131 | | torchbench | mnasnet1_0 | 0.8757 | 0.8117 | | torchbench | hf_Bart | 1.0392 | 0.794 | | torchbench | attention_is_all_you_need_pytorch | 0.9488 | 0.7913 | | torchbench | demucs | 0.9979 | 0.7557 | | torchbench | timm_vovnet | 0.952 | 0.7515 | | torchbench | pytorch_struct | 0.7355 | 0.726 | | torchbench | pytorch_CycleGAN_and_pix2pix | 0.7115 | 0.6919 | | torchbench | hf_BigBird | 1.0398 | 0.6819 | | torchbench | vgg16 | 0.9999 | 0.6712 | | torchbench | nvidia_deeprecommender | 0.9844 | 0.6651 | | torchbench | drq | 0.9801 | 0.6493 | | torchbench | densenet121 | 0.7937 | 0.647 | | torchbench | LearningToPaint | 0.8274 | 0.6394 | | torchbench | resnet18 | 0.6981 | 0.6283 | | torchbench | soft_actor_critic | 0.9997 | 0.6192 | | torchbench | lennard_jones | 1.0 | 0.5322 | | torchbench | phlippe_resnet | 0.4791 | 0.4452 | | torchbench | functorch_dp_cifar10 | 0.423 | 0.3989 | | torchbench | hf_Reformer | 0.7865 | 0.3861 | | huggingface | DistillGPT2 | 0.9667 | 0.8492 | | huggingface | MegatronBertForCausalLM | 1.0256 | 0.8374 | | huggingface | MBartForCausalLM | 0.9142 | 0.8347 | | huggingface | BartForCausalLM | 0.9141 | 0.8345 | | huggingface | PLBartForCausalLM | 0.9244 | 0.8267 | | huggingface | MBartForConditionalGeneration | 0.9794 | 0.8239 | | huggingface | BlenderbotSmallForConditionalGeneration | 0.9055 | 0.8181 | | huggingface | PegasusForConditionalGeneration | 0.9645 | 0.8166 | | huggingface | BartForConditionalGeneration | 0.9794 | 0.8137 | | huggingface | DistilBertForMaskedLM | 0.8955 | 0.7997 | | huggingface | PegasusForCausalLM | 0.8904 | 0.7939 | | huggingface | MT5ForConditionalGeneration | 0.9119 | 0.7921 | | huggingface | TrOCRForCausalLM | 0.8833 | 0.7827 | | huggingface | AlbertForQuestionAnswering | 1.2465 | 0.7775 | | huggingface | AlbertForMaskedLM | 1.2096 | 0.7684 | | huggingface | M2M100ForConditionalGeneration | 0.9016 | 0.7498 | | huggingface | BlenderbotSmallForCausalLM | 0.8426 | 0.7233 | | huggingface | 
Speech2Text2ForCausalLM | 0.8199 | 0.7096 | | huggingface | XGLMForCausalLM | 0.9136 | 0.7095 | | huggingface | MobileBertForMaskedLM | 0.736 | 0.5643 | | huggingface | DebertaV2ForMaskedLM | 0.988 | 0.5506 | | huggingface | DebertaForMaskedLM | 1.0107 | 0.5429 | | huggingface | MobileBertForQuestionAnswering | 0.5574 | 0.4677 | | huggingface | DebertaForQuestionAnswering | 1.1413 | 0.4577 | | huggingface | DebertaV2ForQuestionAnswering | 0.9759 | 0.4556 | | timm_models | volo_d1_224 | 0.9634 | 0.8975 | | timm_models | ese_vovnet19b_dw | 1.0127 | 0.8974 | | timm_models | gluon_xception65 | 0.9923 | 0.8947 | | timm_models | fbnetc_100 | 0.9847 | 0.8935 | | timm_models | mixnet_l | 1.0338 | 0.8918 | | timm_models | lcnet_050 | 0.9552 | 0.881 | | timm_models | dm_nfnet_f0 | 1.1277 | 0.8735 | | timm_models | gmlp_s16_224 | 0.8743 | 0.8656 | | timm_models | swin_base_patch4_window7_224 | 0.988 | 0.8653 | | timm_models | botnet26t_256 | 0.9861 | 0.8623 | | timm_models | gernet_l | 0.998 | 0.8613 | | timm_models | twins_pcpvt_base | 0.9398 | 0.86 | | timm_models | jx_nest_base | 0.9832 | 0.8479 | | timm_models | sebotnet33ts_256 | 1.0449 | 0.8393 | | timm_models | crossvit_9_240 | 0.9829 | 0.82 | | timm_models | poolformer_m36 | 1.1099 | 0.8195 | | timm_models | regnety_002 | 0.9579 | 0.8013 | | timm_models | pit_b_224 | 0.9905 | 0.7981 | | timm_models | repvgg_a2 | 1.005 | 0.7788 | | timm_models | convnext_base | 0.9504 | 0.7585 | | timm_models | coat_lite_mini | 0.9347 | 0.7543 | +-------------+-----------------------------------------+------------------------+----------+ ~~~

torchbench suite with float32 precision

Performance speedup
~~~
+-----------------------------------+------+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------+------------------------+
| functorch_dp_cifar10 | 64 | 0.9892 | 1.0233 | 2.4578 | 1.2468 |
| densenet121 | 4 | 0.9914 | 0.7178 | 2.3994 | 1.0269 |
| hf_BigBird | 2 | 0.9584 | 0.8102 | 2.3453 | 1.6393 |
| BERT_pytorch | 16 | 0.9915 | 0.8688 | 1.8793 | 1.8898 |
| phlippe_densenet | 128 | 0.9859 | 0.7984 | 1.7597 | 1.0625 |
| dlrm | 1024 | 0.9863 | 0.9297 | 1.7536 | 1.1839 |
| hf_Albert | 8 | 0.9993 | 0.9991 | 1.64 | 1.6547 |
| mobilenet_v3_large | 32 | 0.9973 | 0.8255 | 1.601 | 1.1435 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.991 | 0.9867 | 1.529 | 1.4943 |
| phlippe_resnet | 128 | 0.9954 | 0.7643 | 1.5247 | 1.0439 |
| squeezenet1_1 | 32 | 0.9867 | 0.9881 | 1.5138 | 1.2732 |
| speech_transformer | 32 | 0.9843 | 0.869 | 1.4552 | 1.4468 |
| hf_T5_large | 2 | 0.9846 | 0.8446 | 1.4357 | 1.5018 |
| timm_nfnet | 128 | 0.991 | 0.99 | 1.4282 | 1.3882 |
| fastNLP_Bert | 6 | 0.9821 | 0.9599 | 1.387 | 1.373 |
| hf_T5 | 8 | 0.9897 | 0.8148 | 1.3856 | 1.3987 |
| pytorch_struct | 200 | 0.9834 | 0.7408 | 1.3682 | 1.0671 |
| timm_resnest | 32 | 0.9944 | 0.8806 | 1.3642 | 1.344 |
| shufflenet_v2_x1_0 | 128 | 0.9953 | 0.7791 | 1.3535 | 1.216 |
| hf_GPT2 | 4 | 0.9829 | 0.9545 | 1.3502 | 1.3962 |
| mobilenet_v2 | 96 | 0.9975 | 0.8432 | 1.3448 | 1.3512 |
| lennard_jones | 1000 | 0.9286 | 0.8268 | 1.3312 | 0.883 |
| resnext50_32x4d | 8 | 0.9954 | 0.7545 | 1.3049 | 0.9658 |
| resnet18 | 16 | 0.9946 | 0.7806 | 1.2596 | 1.0033 |
| mnasnet1_0 | 32 | 0.992 | 0.7753 | 1.2441 | 1.0429 |
| dcgan | 32 | 0.9294 | 0.7511 | 1.2259 | 0.8639 |
| drq | 1 | 0.9611 | 0.7341 | 1.2247 | 0.9858 |
| pytorch_stargan | 16 | 0.9956 | 0.9596 | 1.2061 | 1.1967 |
| hf_Bart | 4 | 0.9974 | 0.8982 | 1.1787 | 1.4096 |
| pytorch_unet | 1 | 0.9973 | 0.2734 | 1.1752 | 1.1744 |
| vgg16 | 64 | 0.9995 | 0.9988 | 1.154 | 1.1605 |
| hf_Bert_large | 4 | 0.973 | 0.9585 | 1.149 | 1.1431 |
| LearningToPaint | 96 | 0.9919 | 0.8502 | 1.1467 | 1.0756 |
| hf_DistilBert | 8 | 0.9866 | 0.9315 | 1.1449 | 1.1652 |
| yolov3 | 16 | 0.9972 | 0.8491 | 1.1405 | 1.1465 |
| hf_Bert | 4 | 0.9974 | 0.9115 | 1.1393 | 1.1418 |
| Super_SloMo | 6 | 0.9985 | 0.2444 | 1.1385 | 1.1381 |
| timm_efficientnet | 32 | 0.9443 | 0.7005 | 1.1246 | 1.1065 |
| resnet50 | 32 | 0.9962 | 0.8766 | 1.1175 | 1.1243 |
| timm_vision_transformer | 32 | 0.9914 | 0.9794 | 1.1042 | 1.1013 |
| hf_Reformer | 4 | 0.991 | 0.9912 | 1.1042 | 1.0966 |
| alexnet | 128 | 0.9989 | 0.9963 | 1.0719 | 1.1119 |
| Background_Matting | 4 | 0.9992 | 0.1875 | 1.0695 | 1.0644 |
| soft_actor_critic | 256 | 0.9479 | 0.67 | 1.0635 | 0.794 |
| attention_is_all_you_need_pytorch | 256 | 0.9891 | 0.9599 | 1.0624 | 1.0642 |
| timm_regnet | 32 | 0.9516 | 0.8724 | 1.0564 | 1.0494 |
| resnet152 | 32 | 0.9956 | 0.8188 | 1.054 | 1.0427 |
| demucs | 4 | 0.9992 | 0.9988 | 1.0287 | 1.0306 |
| tts_angular | 64 | 0.9984 | 0.9825 | 0.9983 | 0.994 |
| timm_vovnet | 32 | 0.8891 | 0.8204 | 0.9419 | 0.9712 |
| nvidia_deeprecommender | 256 | 0.9988 | 0.9653 | 0.7988 | 0.9666 |
| gat | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| tacotron2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| sage | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| gcn | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| hf_GPT2_large | 4 | 0.9845 | 0.9627 | 0.0 | 1.3826 |
| moco | 32 | 0.9532 | 0.0 | 0.0 | 0.0 |
| hf_Longformer | 2 | 1.011 | 0.6513 | 0.0 | 0.0 |
| resnet50_quantized_qat | 32 | 1.0004 | 0.8255 | 0.0 | 0.0 |
| mobilenet_v2_quantized_qat | 96 | 1.0007 | 0.8393 | 0.0 | 0.0 |
| torchrec_dlrm | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------------+
| hf_GPT2_large | 4 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 4 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 4 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| soft_actor_critic | 256 | pass | pass | pass | pass |
| nvidia_deeprecommender | 4 | pass | pass | pass | pass |
| phlippe_densenet | 4 | pass | pass | pass | pass |
| phlippe_resnet | 4 | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass |
| resnet18 | 4 | pass | pass | pass | pass |
| resnet50 | 4 | pass | pass | pass | pass |
| resnext50_32x4d | 4 | pass | pass | pass | pass |
| shufflenet_v2_x1_0 | 4 | pass | pass | pass | pass |
| speech_transformer | 4 | pass | pass | pass | pass |
| mobilenet_v2 | 4 | pass | pass | pass | pass |
| squeezenet1_1 | 4 | pass | pass | pass | pass |
| timm_efficientnet | 4 | pass | pass | pass | pass |
| timm_nfnet | 4 | pass | pass | pass | pass |
| timm_regnet | 4 | pass | pass | pass | pass |
| timm_resnest | 4 | pass | pass | pass | pass |
| timm_vision_transformer | 4 | pass | pass | pass | pass |
| timm_vovnet | 4 | pass | pass | pass | pass |
| tts_angular | 4 | pass | pass | pass | pass |
| yolov3 | 4 | pass | pass | pass | pass |
| mobilenet_v3_large | 4 | pass | pass | pass | pass |
| resnet152 | 4 | pass | pass | pass | pass |
| mnasnet1_0 | 4 | pass | pass | pass | pass |
| functorch_dp_cifar10 | 4 | pass | pass | pass | pass |
| BERT_pytorch | 4 | pass | pass | pass | pass |
| LearningToPaint | 4 | pass | pass | pass | pass |
| lennard_jones | 4 | pass | pass | pass | pass |
| dcgan | 4 | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass |
| densenet121 | 4 | pass | pass | pass | pass |
| dlrm | 4 | pass | pass | pass | pass |
| drq | 1 | pass | pass | pass | pass |
| fastNLP_Bert | 4 | pass | pass | pass | pass |
| attention_is_all_you_need_pytorch | 4 | pass | pass | pass | pass |
| hf_Albert | 4 | pass | pass | pass | pass |
| hf_Bert | 4 | pass | pass | pass | pass |
| hf_Bert_large | 4 | pass | pass | pass | pass |
| hf_BigBird | 4 | pass | pass | pass | pass |
| hf_DistilBert | 4 | pass | pass | pass | pass |
| hf_GPT2 | 2 | pass | pass | pass | pass |
| hf_Reformer | 4 | pass | pass | pass | pass |
| hf_T5 | 4 | pass | pass | pass | pass |
| hf_T5_base | 4 | pass | pass | pass | pass |
| hf_Bart | 4 | pass | pass | pass | pass |
| moco | 4 | pass | fail_to_run | fail_to_run | fail_to_run |
| resnet50_quantized_qat | 4 | pass | fail_to_run | fail_to_run | fail_to_run |
| mobilenet_v2_quantized_qat | 4 | pass | fail_to_run | fail_to_run | fail_to_run |
| hf_Longformer | 4 | pass | pass | fail_to_run | fail_to_run |
| Background_Matting | 4 | eager_variation | eager_variation | eager_variation | eager_variation |
| Super_SloMo | 4 | eager_variation | eager_variation | eager_variation | eager_variation |
| alexnet | 4 | eager_variation | eager_variation | eager_variation | eager_variation |
| pytorch_CycleGAN_and_pix2pix | 1 | eager_variation | eager_variation | eager_variation | eager_variation |
| pytorch_unet | 2 | eager_variation | eager_variation | eager_variation | eager_variation |
| vgg16 | 4 | eager_variation | eager_variation | eager_variation | eager_variation |
| vision_maskrcnn | 4 | eager_variation | eager_variation | eager_variation | eager_variation |
| tacotron2 | 4 | fail_to_run | fail_to_run | 0.0000 | 0.0000 |
| gat | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| gcn | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| llama | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| sage | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| torchrec_dlrm | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
+-----------------------------------+-----+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------+------------------------+
| phlippe_densenet | 128 | 2.0434 | 5.2908 | 164.7971 | 164.3401 |
| hf_T5_large | 2 | 21.4732 | 45.4038 | 149.7604 | 147.2658 |
| timm_efficientnet | 32 | 3.4574 | 8.2183 | 135.2297 | 133.6245 |
| mobilenet_v3_large | 32 | 2.1226 | 5.5377 | 131.7941 | 127.5956 |
| hf_BigBird | 2 | 10.4138 | 32.39 | 131.5293 | 115.2682 |
| mobilenet_v2 | 96 | 1.8645 | 5.2314 | 125.891 | 121.9159 |
| densenet121 | 4 | 5.0182 | 14.1483 | 125.1659 | 128.3598 |
| yolov3 | 16 | 3.3578 | 8.249 | 106.548 | 107.9441 |
| mnasnet1_0 | 32 | 1.8866 | 5.04 | 102.6443 | 107.335 |
| resnet152 | 32 | 5.8926 | 15.6119 | 96.0523 | 94.6267 |
| timm_resnest | 32 | 1.201 | 2.8365 | 94.2961 | 92.8541 |
| shufflenet_v2_x1_0 | 128 | 2.1886 | 5.9156 | 75.8802 | 76.4391 |
| speech_transformer | 32 | 3.8319 | 9.5836 | 71.0866 | 66.3364 |
| timm_regnet | 32 | 4.7255 | 9.2573 | 67.3946 | 65.5732 |
| timm_nfnet | 128 | 4.3111 | 8.7662 | 67.0176 | 65.5873 |
| attention_is_all_you_need_pytorch | 256 | 2.9718 | 8.4526 | 66.3485 | 66.4413 |
| resnet50 | 32 | 1.9673 | 5.2497 | 63.072 | 60.4255 |
| timm_vovnet | 32 | 2.6075 | 5.0828 | 60.6609 | 57.895 |
| Background_Matting | 4 | 1.77 | 9.3554 | 58.4763 | 61.5001 |
| BERT_pytorch | 16 | 3.2058 | 8.2991 | 58.3789 | 58.1252 |
| functorch_dp_cifar10 | 64 | 0.7285 | 1.66 | 55.4975 | 51.3286 |
| hf_Bert_large | 4 | 7.0004 | 15.0229 | 53.0323 | 51.7548 |
| pytorch_unet | 1 | 1.0032 | 3.6044 | 51.237 | 52.2397 |
| resnext50_32x4d | 8 | 1.9993 | 5.3057 | 50.0594 | 48.2653 |
| hf_T5 | 8 | 4.2815 | 11.0157 | 44.8169 | 42.0625 |
| pytorch_stargan | 16 | 0.8089 | 2.5276 | 43.9681 | 43.702 |
| resnet18 | 16 | 0.8488 | 2.0641 | 43.1969 | 41.8598 |
| hf_Bart | 4 | 3.7981 | 9.5703 | 42.0159 | 40.7236 |
| timm_vision_transformer | 32 | 1.9661 | 5.0041 | 41.5039 | 39.7002 |
| LearningToPaint | 96 | 0.8934 | 2.1626 | 41.3686 | 42.7328 |
| fastNLP_Bert | 6 | 3.4399 | 7.8405 | 40.772 | 40.5171 |
| hf_Reformer | 4 | 3.4828 | 5.0013 | 39.1851 | 37.2445 |
| Super_SloMo | 6 | 2.0205 | 8.1186 | 36.8793 | 37.0915 |
| hf_GPT2 | 4 | 3.2502 | 7.0724 | 35.8465 | 35.201 |
| hf_Albert | 8 | 1.8409 | 6.2215 | 33.981 | 35.9237 |
| hf_Bert | 4 | 3.4755 | 7.5101 | 33.0238 | 31.2422 |
| phlippe_resnet | 128 | 0.8666 | 2.0713 | 30.4201 | 29.2593 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.7983 | 2.3081 | 29.3072 | 29.8332 |
| hf_DistilBert | 8 | 1.5105 | 3.87 | 27.3705 | 25.8568 |
| demucs | 4 | 0.7699 | 1.2502 | 25.426 | 25.2651 |
| pytorch_struct | 200 | 0.4584 | 0.8927 | 24.9754 | 24.2444 |
| squeezenet1_1 | 32 | 0.6277 | 1.0986 | 22.1159 | 21.7369 |
| alexnet | 128 | 0.3074 | 0.5007 | 13.7286 | 13.5414 |
| vgg16 | 64 | 0.3428 | 0.6748 | 13.6218 | 13.8409 |
| drq | 1 | 0.4692 | 0.7062 | 9.6357 | 7.7873 |
| nvidia_deeprecommender | 256 | 0.3126 | 0.5004 | 9.3936 | 9.0937 |
| soft_actor_critic | 256 | 0.3123 | 0.4303 | 7.539 | 6.1807 |
| dcgan | 32 | 0.2867 | 0.5134 | 6.6982 | 6.32 |
| dlrm | 1024 | 0.3075 | 0.6226 | 6.668 | 6.5399 |
| tts_angular | 64 | 0.2707 | 0.3275 | 5.1372 | 4.9307 |
| lennard_jones | 1000 | 0.2497 | 0.376 | 4.9453 | 5.3687 |
| hf_GPT2_large | 4 | 10.5783 | 23.7133 | nan | 85.4726 |
| hf_Longformer | 2 | 7.3288 | 28.3443 | nan | nan |
| mobilenet_v2_quantized_qat | 96 | 2.5628 | 10.6738 | nan | nan |
| resnet50_quantized_qat | 32 | 2.4272 | 10.1255 | nan | nan |
| moco | 32 | 30.4064 | nan | nan | nan |
| gat | 0 | nan | nan | nan | nan |
| gcn | 0 | nan | nan | nan | nan |
| sage | 0 | nan | nan | nan | nan |
| tacotron2 | 0 | nan | nan | nan | nan |
| torchrec_dlrm | 0 | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------+------------------------+
| Super_SloMo | 6 | 1.0073 | 0.9035 | 1.3192 | 1.3192 |
| mobilenet_v2 | 96 | 1.0002 | 0.7663 | 1.1503 | 1.2552 |
| fastNLP_Bert | 6 | 1.0002 | 0.9109 | 1.1054 | 1.2176 |
| hf_Albert | 8 | 1.0 | 0.9487 | 1.0049 | 1.1711 |
| hf_Bert | 4 | 1.0 | 0.892 | 0.9854 | 0.9889 |
| timm_efficientnet | 32 | 1.0014 | 0.7903 | 0.9844 | 1.0585 |
| tts_angular | 64 | 1.0 | 1.0 | 0.9819 | 1.0 |
| shufflenet_v2_x1_0 | 128 | 1.0 | 0.9152 | 0.9673 | 1.0646 |
| hf_Bert_large | 4 | 1.0 | 0.8872 | 0.9556 | 1.0278 |
| dlrm | 1024 | 1.0 | 0.9945 | 0.9522 | 1.001 |
| yolov3 | 16 | 0.9999 | 0.8557 | 0.9276 | 1.1038 |
| BERT_pytorch | 16 | 1.0 | 0.8854 | 0.913 | 1.1114 |
| Background_Matting | 4 | 1.0027 | 0.8166 | 0.9124 | 1.0422 |
| hf_GPT2 | 4 | 1.0 | 0.8882 | 0.9095 | 1.1129 |
| pytorch_stargan | 16 | 1.0 | 1.0123 | 0.8997 | 1.0715 |
| timm_resnest | 32 | 1.0022 | 0.9221 | 0.8975 | 1.0032 |
| resnet152 | 32 | 1.0002 | 0.9113 | 0.8892 | 0.9666 |
| timm_vision_transformer | 32 | 1.0001 | 0.9359 | 0.8846 | 0.9267 |
| hf_T5 | 8 | 1.0 | 0.9409 | 0.8774 | 1.1711 |
| timm_nfnet | 128 | 0.9114 | 0.8889 | 0.8734 | 1.1331 |
| timm_regnet | 32 | 1.0004 | 0.866 | 0.8628 | 0.982 |
| phlippe_densenet | 128 | 1.0 | 0.9031 | 0.8562 | 0.9199 |
| pytorch_unet | 1 | 1.0005 | 0.8208 | 0.8501 | 0.9923 |
| mobilenet_v3_large | 32 | 1.0 | 0.8899 | 0.8424 | 0.9276 |
| resnet50 | 32 | 1.0004 | 0.8706 | 0.8404 | 0.9405 |
| speech_transformer | 32 | 0.9961 | 0.9115 | 0.84 | 0.844 |
| alexnet | 128 | 1.0003 | 0.877 | 0.8346 | 1.0006 |
| hf_DistilBert | 8 | 1.0 | 0.899 | 0.8317 | 0.9835 |
| dcgan | 32 | 1.0 | 0.8428 | 0.8287 | 0.9932 |
| resnext50_32x4d | 8 | 0.999 | 0.888 | 0.8236 | 0.919 |
| hf_T5_large | 2 | 1.0 | 0.8482 | 0.8219 | 1.1293 |
| squeezenet1_1 | 32 | 0.9994 | 0.8302 | 0.8131 | 0.9868 |
| mnasnet1_0 | 32 | 1.0021 | 0.9062 | 0.8117 | 0.8757 |
| hf_Bart | 4 | 1.0 | 0.8676 | 0.794 | 1.0392 |
| attention_is_all_you_need_pytorch | 256 | 1.0021 | 0.9238 | 0.7913 | 0.9488 |
| demucs | 4 | 0.9981 | 0.9982 | 0.7557 | 0.9979 |
| timm_vovnet | 32 | 1.0014 | 0.7568 | 0.7515 | 0.952 |
| pytorch_struct | 200 | 1.0 | 0.5108 | 0.726 | 0.7355 |
| pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.9023 | 0.6919 | 0.7115 |
| hf_BigBird | 2 | 0.9886 | 0.9851 | 0.6819 | 1.0398 |
| vgg16 | 64 | 0.9999 | 0.6744 | 0.6712 | 0.9999 |
| nvidia_deeprecommender | 256 | 1.0002 | 0.8886 | 0.6651 | 0.9844 |
| drq | 1 | 1.0 | 0.98 | 0.6493 | 0.9801 |
| densenet121 | 4 | 1.0027 | 0.7954 | 0.647 | 0.7937 |
| LearningToPaint | 96 | 0.9989 | 0.7184 | 0.6394 | 0.8274 |
| resnet18 | 16 | 0.9996 | 0.8022 | 0.6283 | 0.6981 |
| soft_actor_critic | 256 | 0.9999 | 0.9689 | 0.6192 | 0.9997 |
| lennard_jones | 1000 | 1.0 | 1.0 | 0.5322 | 1.0 |
| phlippe_resnet | 128 | 1.0 | 0.8597 | 0.4452 | 0.4791 |
| functorch_dp_cifar10 | 64 | 1.0 | 0.9209 | 0.3989 | 0.423 |
| hf_Reformer | 4 | 0.7852 | 0.7852 | 0.3861 | 0.7865 |
| hf_GPT2_large | 4 | 1.0 | 0.8611 | nan | 1.1216 |
| hf_Longformer | 2 | 0.9991 | 0.9645 | nan | nan |
| resnet50_quantized_qat | 32 | 1.0003 | 0.9473 | nan | nan |
| mobilenet_v2_quantized_qat | 96 | 1.0002 | 0.8329 | nan | nan |
| moco | 32 | 1.0125 | nan | nan | nan |
| gat | 0 | nan | nan | nan | nan |
| gcn | 0 | nan | nan | nan | nan |
| sage | 0 | nan | nan | nan | nan |
| tacotron2 | 0 | nan | nan | nan | nan |
| torchrec_dlrm | 0 | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------+------+----------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+----------+-----------+----------+------------------------+
| Background_Matting | 4 | 183.5563 | 976.4087 | 171.3196 | 172.2008 |
| timm_nfnet | 128 | 194.864 | 195.9181 | 135.1297 | 140.6248 |
| hf_T5_large | 2 | 198.1343 | 227.2955 | 134.8755 | 131.7 |
| hf_T5 | 8 | 184.5011 | 223.9152 | 133.1855 | 131.5264 |
| Super_SloMo | 6 | 118.7721 | 484.8472 | 104.1504 | 104.1616 |
| vgg16 | 64 | 107.124 | 107.332 | 92.8585 | 92.3837 |
| yolov3 | 16 | 99.6302 | 116.8732 | 87.0073 | 86.6359 |
| timm_regnet | 32 | 96.4354 | 105.4046 | 86.8946 | 87.7459 |
| hf_BigBird | 2 | 193.3227 | 232.3756 | 82.4634 | 117.5966 |
| hf_Bert_large | 4 | 94.3999 | 95.8017 | 80.1398 | 80.4483 |
| resnet152 | 32 | 83.352 | 101.1892 | 78.5245 | 79.6654 |
| hf_Reformer | 4 | 82.5132 | 82.6397 | 74.1087 | 74.6232 |
| demucs | 4 | 75.0165 | 74.9495 | 72.8915 | 72.7964 |
| attention_is_all_you_need_pytorch | 256 | 72.808 | 75.0754 | 67.9447 | 67.7891 |
| mobilenet_v2 | 96 | 69.5288 | 82.273 | 51.5659 | 51.3613 |
| pytorch_unet | 1 | 58.4208 | 212.9134 | 49.5746 | 49.592 |
| hf_Bart | 4 | 54.9569 | 83.361 | 46.4361 | 45.5966 |
| hf_Albert | 8 | 76.0351 | 76.115 | 45.9599 | 45.9271 |
| fastNLP_Bert | 6 | 60.5081 | 61.9936 | 42.7522 | 43.2022 |
| timm_vovnet | 32 | 42.175 | 45.7721 | 39.7809 | 38.6882 |
| hf_GPT2 | 4 | 51.0283 | 52.3553 | 37.0157 | 35.9351 |
| speech_transformer | 32 | 53.7805 | 56.5039 | 36.6872 | 37.2397 |
| hf_DistilBert | 8 | 40.1388 | 42.587 | 34.2391 | 33.4453 |
| resnet50 | 32 | 38.3426 | 43.4971 | 34.0744 | 33.8635 |
| hf_Bert | 4 | 39.3833 | 42.5515 | 33.7546 | 34.0765 |
| timm_efficientnet | 32 | 38.7203 | 57.9944 | 32.23 | 33.6749 |
| timm_vision_transformer | 32 | 30.1458 | 30.2235 | 26.5782 | 26.8362 |
| shufflenet_v2_x1_0 | 128 | 34.0006 | 43.8624 | 24.7304 | 27.5297 |
| BERT_pytorch | 16 | 55.4602 | 62.765 | 24.5789 | 24.3929 |
| timm_resnest | 32 | 31.1804 | 35.2712 | 22.6863 | 23.0978 |
| densenet121 | 4 | 56.9023 | 66.2191 | 21.3421 | 46.2307 |
| mnasnet1_0 | 32 | 26.2314 | 33.4751 | 20.6817 | 25.2216 |
| pytorch_stargan | 16 | 23.7977 | 24.6021 | 19.5231 | 19.7364 |
| mobilenet_v3_large | 32 | 27.518 | 32.2362 | 17.6835 | 23.7602 |
| resnext50_32x4d | 8 | 20.4098 | 28.2847 | 15.448 | 22.2046 |
| phlippe_densenet | 128 | 25.2362 | 30.8328 | 13.9288 | 23.1727 |
| LearningToPaint | 96 | 13.7963 | 16.1152 | 12.1798 | 12.7358 |
| alexnet | 128 | 12.2762 | 12.2953 | 11.437 | 11.0193 |
| nvidia_deeprecommender | 256 | 8.5635 | 8.8596 | 10.7008 | 8.8471 |
| tts_angular | 64 | 9.5202 | 9.6973 | 10.1497 | 9.3571 |
| pytorch_CycleGAN_and_pix2pix | 1 | 12.6895 | 13.4687 | 8.0956 | 8.438 |
| resnet18 | 16 | 9.3312 | 12.102 | 7.306 | 9.184 |
| squeezenet1_1 | 32 | 9.4014 | 9.9142 | 7.0637 | 7.7263 |
| phlippe_resnet | 128 | 8.1671 | 10.7136 | 5.3025 | 7.7777 |
| functorch_dp_cifar10 | 64 | 7.9513 | 7.7878 | 3.2445 | 6.6713 |
| pytorch_struct | 200 | 3.9275 | 5.0782 | 2.7707 | 3.5213 |
| drq | 1 | 2.2941 | 3.0523 | 2.6294 | 2.6745 |
| dlrm | 1024 | 3.9623 | 3.9053 | 2.2843 | 3.1055 |
| dcgan | 32 | 1.9172 | 2.4214 | 1.5442 | 2.139 |
| soft_actor_critic | 256 | 1.3241 | 1.7093 | 1.1223 | 1.4966 |
| lennard_jones | 1000 | 1.2416 | 1.4443 | 0.9161 | 1.324 |
| hf_GPT2_large | 4 | 243.5783 | 249.4878 | nan | 173.7771 |
| hf_Longformer | 2 | 138.4195 | 227.2282 | nan | nan |
| mobilenet_v2_quantized_qat | 96 | 143.9017 | 173.1822 | nan | nan |
| resnet50_quantized_qat | 32 | 89.0956 | 108.4712 | nan | nan |
| moco | 32 | 64.7334 | nan | nan | nan |
| gat | 0 | nan | nan | nan | nan |
| gcn | 0 | nan | nan | nan | nan |
| sage | 0 | nan | nan | nan | nan |
| tacotron2 | 0 | nan | nan | nan | nan |
| torchrec_dlrm | 0 | nan | nan | nan | nan |
+-----------------------------------+------+----------+-----------+----------+------------------------+
~~~

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
| OPTForCausalLM | 2 | 0.9951 | 0.9287 | 1.7924 | 1.8314 |
| GPT2ForSequenceClassification | 4 | 0.9882 | 0.962 | 1.6283 | 1.6549 |
| XLNetLMHeadModel | 8 | 0.9977 | 0.9647 | 1.6185 | 1.6244 |
| MobileBertForMaskedLM | 64 | 0.9297 | 0.8159 | 1.6089 | 1.2759 |
| GoogleFnet | 16 | 0.9866 | 0.9521 | 1.5586 | 1.5449 |
| MT5ForConditionalGeneration | 16 | 0.9955 | 0.886 | 1.4886 | 1.4741 |
| ElectraForCausalLM | 32 | 0.9871 | 0.9259 | 1.4355 | 1.4274 |
| ElectraForQuestionAnswering | 64 | 0.9896 | 0.9805 | 1.4239 | 1.4119 |
| DistillGPT2 | 16 | 0.9944 | 0.9386 | 1.3715 | 1.4236 |
| XGLMForCausalLM | 8 | 1.0018 | 0.9453 | 1.3505 | 1.3069 |
| LayoutLMForSequenceClassification | 16 | 0.9882 | 0.9772 | 1.2796 | 1.2844 |
| RobertaForQuestionAnswering | 16 | 0.9881 | 0.9761 | 1.2709 | 1.2635 |
| BertForQuestionAnswering | 16 | 0.9881 | 0.9761 | 1.2694 | 1.2621 |
| RobertaForCausalLM | 16 | 0.9905 | 0.9606 | 1.2672 | 1.2605 |
| AlbertForQuestionAnswering | 4 | 0.9996 | 1.0013 | 1.2531 | 1.2525 |
| AlbertForMaskedLM | 4 | 1.0008 | 0.9991 | 1.2504 | 1.248 |
| T5ForConditionalGeneration | 4 | 0.9853 | 0.8135 | 1.2391 | 1.3141 |
| T5Small | 4 | 0.9849 | 0.8164 | 1.2368 | 1.3147 |
| PLBartForCausalLM | 8 | 0.9959 | 0.9501 | 1.2293 | 1.2584 |
| PLBartForConditionalGeneration | 4 | 0.9954 | 0.9545 | 1.2248 | 1.2388 |
| MobileBertForQuestionAnswering | 128 | 0.9388 | 0.8869 | 1.2088 | 1.3298 |
| YituTechConvBert | 16 | 0.9902 | 0.9596 | 1.2023 | 1.1982 |
| CamemBert | 16 | 0.99 | 0.9584 | 1.1903 | 1.188 |
| BertForMaskedLM | 16 | 0.99 | 0.9583 | 1.1882 | 1.1886 |
| MegatronBertForQuestionAnswering | 8 | 0.985 | 0.9705 | 1.1852 | 1.2016 |
| LayoutLMForMaskedLM | 16 | 0.9905 | 0.9597 | 1.1844 | 1.2016 |
| DistilBertForQuestionAnswering | 256 | 0.998 | 0.9919 | 1.1479 | 1.1475 |
| Speech2Text2ForCausalLM | 256 | 0.9949 | 0.9142 | 1.1457 | 1.18 |
| BartForCausalLM | 4 | 0.9894 | 0.9552 | 1.1388 | 1.1687 |
| MBartForCausalLM | 4 | 0.9957 | 0.9552 | 1.1336 | 1.1575 |
| MegatronBertForCausalLM | 4 | 0.9755 | 0.963 | 1.1307 | 1.1548 |
| MBartForConditionalGeneration | 2 | 0.9915 | 0.9769 | 1.0933 | 1.1074 |
| BartForConditionalGeneration | 2 | 0.9909 | 0.9745 | 1.0908 | 1.1125 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9927 | 0.9307 | 1.067 | 1.0876 |
| DebertaForQuestionAnswering | 8 | 0.8483 | 0.7994 | 1.0467 | 1.0367 |
| M2M100ForConditionalGeneration | 16 | 0.9955 | 0.9589 | 1.0401 | 1.096 |
| DistilBertForMaskedLM | 128 | 0.9963 | 0.9483 | 1.0303 | 1.0545 |
| TrOCRForCausalLM | 32 | 0.9956 | 0.9481 | 1.0262 | 1.0554 |
| PegasusForConditionalGeneration | 32 | 0.9917 | 0.9611 | 1.0146 | 1.0469 |
| PegasusForCausalLM | 32 | 0.9913 | 0.9384 | 0.9987 | 1.0299 |
| BlenderbotSmallForCausalLM | 64 | 0.9936 | 0.9001 | 0.9902 | 1.041 |
| DebertaForMaskedLM | 4 | 0.7215 | 0.593 | 0.8603 | 0.8475 |
| DebertaV2ForQuestionAnswering | 2 | 0.7042 | 0.635 | 0.7474 | 0.7018 |
| DebertaV2ForMaskedLM | 1 | 0.6643 | 0.5353 | 0.7285 | 0.6235 |
| BlenderbotForCausalLM | 4 | 0.9966 | 0.9254 | 0.0 | 1.0243 |
| AllenaiLongformerBase | 4 | 0.9965 | 0.5955 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
| BlenderbotForCausalLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| DebertaV2ForMaskedLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| PegasusForCausalLM | 1 | pass | pass | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | pass |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass |
| CamemBert | 1 | pass | pass | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | pass | pass |
| GoogleFnet | 1 | pass | pass | pass | pass |
| DebertaV2ForQuestionAnswering | 1 | pass | pass | fail_to_run | pass |
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+---------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+---------+-----------+----------+------------------------+
| MT5ForConditionalGeneration | 16 | 6.0306 | 14.9876 | 123.6133 | 124.5428 |
| DebertaV2ForMaskedLM | 1 | 11.0895 | 20.1937 | 122.3893 | 56.2815 |
| DebertaV2ForQuestionAnswering | 2 | 11.0277 | 19.613 | 120.5788 | 54.1742 |
| MobileBertForMaskedLM | 64 | 14.9969 | 31.3009 | 118.6652 | 116.5587 |
| M2M100ForConditionalGeneration | 16 | 7.3935 | 18.2064 | 112.9 | 111.2294 |
| MobileBertForQuestionAnswering | 128 | 15.1632 | 30.7802 | 111.0429 | 111.8068 |
| XGLMForCausalLM | 8 | 5.9312 | 14.3923 | 107.2883 | 107.2729 |
| DebertaForQuestionAnswering | 8 | 5.8017 | 10.9267 | 82.5257 | 44.9883 |
| XLNetLMHeadModel | 8 | 7.1862 | 21.3912 | 81.8546 | 80.7136 |
| DebertaForMaskedLM | 4 | 5.8912 | 11.5427 | 75.1533 | 43.9535 |
| MBartForConditionalGeneration | 2 | 7.5776 | 18.0258 | 62.0776 | 60.3802 |
| BartForConditionalGeneration | 2 | 7.4367 | 18.0763 | 59.3816 | 58.8567 |
| PegasusForConditionalGeneration | 32 | 4.3856 | 14.7653 | 58.6293 | 58.4029 |
| YituTechConvBert | 16 | 4.8974 | 11.2509 | 54.7476 | 54.8362 |
| MegatronBertForCausalLM | 4 | 6.9317 | 15.3855 | 54.7446 | 53.8561 |
| MegatronBertForQuestionAnswering | 8 | 6.8451 | 15.2455 | 53.911 | 53.2993 |
| BlenderbotSmallForConditionalGeneration | 64 | 5.0077 | 11.7142 | 45.1028 | 43.5075 |
| T5ForConditionalGeneration | 4 | 4.0546 | 10.1857 | 44.8251 | 43.684 |
| T5Small | 4 | 4.0669 | 10.201 | 44.6733 | 44.2553 |
| ElectraForCausalLM | 32 | 3.49 | 7.6327 | 43.5182 | 42.8592 |
| LayoutLMForSequenceClassification | 16 | 3.5895 | 7.8777 | 39.8379 | 40.2257 |
| PLBartForConditionalGeneration | 4 | 3.7711 | 9.302 | 39.3145 | 39.2942 |
| ElectraForQuestionAnswering | 64 | 3.4177 | 7.5813 | 38.1008 | 35.9447 |
| BertForQuestionAnswering | 16 | 3.4449 | 7.5345 | 35.3922 | 32.0624 |
| LayoutLMForMaskedLM | 16 | 3.666 | 7.9313 | 34.5733 | 33.0403 |
| BertForMaskedLM | 16 | 3.4448 | 7.5144 | 33.4232 | 32.3938 |
| AlbertForMaskedLM | 4 | 1.7264 | 5.8716 | 33.3856 | 32.6207 |
| MBartForCausalLM | 4 | 3.1934 | 7.2127 | 33.2318 | 33.3187 |
| DistilBertForQuestionAnswering | 256 | 1.6929 | 3.7184 | 32.9286 | 33.0869 |
| PegasusForCausalLM | 32 | 3.0876 | 7.1433 | 32.3449 | 31.9244 |
| BartForCausalLM | 4 | 3.1263 | 7.2331 | 31.3449 | 30.4841 |
| OPTForCausalLM | 2 | 2.9414 | 7.0093 | 30.9589 | 30.6932 |
| CamemBert | 16 | 3.4923 | 7.6519 | 30.7026 | 30.5741 |
| GPT2ForSequenceClassification | 4 | 3.3 | 7.1728 | 30.6641 | 29.6677 |
| DistilBertForMaskedLM | 128 | 1.6488 | 3.7467 | 30.5608 | 29.8045 |
| RobertaForCausalLM | 16 | 3.4806 | 7.56 | 30.4575 | 30.9289 |
| AlbertForQuestionAnswering | 4 | 1.7361 | 5.8229 | 29.9435 | 29.1805 |
| RobertaForQuestionAnswering | 16 | 3.4405 | 7.4946 | 29.8235 | 29.7004 |
| TrOCRForCausalLM | 32 | 3.1085 | 7.1161 | 29.7668 | 28.9086 |
| GoogleFnet | 16 | 1.9636 | 3.9306 | 27.9837 | 
26.8862 | | DistillGPT2 | 16 | 1.8164 | 3.8511 | 25.6592 | 25.3368 | | BlenderbotSmallForCausalLM | 64 | 2.1887 | 4.7941 | 23.9729 | 23.4394 | | PLBartForCausalLM | 8 | 1.7353 | 3.8075 | 22.4415 | 21.8727 | | Speech2Text2ForCausalLM | 256 | 1.7042 | 3.8062 | 20.8635 | 22.2097 | | BlenderbotForCausalLM | 4 | 6.2057 | 14.3338 | nan | 53.9075 | | AllenaiLongformerBase | 4 | 7.3138 | 27.7093 | nan | nan | +-----------------------------------------+-----+---------+-----------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------+------------------------+ | GoogleFnet | 16 | 1.0 | 0.9205 | 1.1428 | 1.1437 | | XLNetLMHeadModel | 8 | 1.0 | 0.9738 | 1.0737 | 1.0737 | | ElectraForQuestionAnswering | 64 | 1.0 | 0.9522 | 1.0207 | 1.0758 | | GPT2ForSequenceClassification | 4 | 1.0 | 0.8955 | 1.0169 | 1.1459 | | OPTForCausalLM | 2 | 0.9999 | 0.9236 | 1.0142 | 1.094 | | RobertaForQuestionAnswering | 16 | 1.0 | 0.9325 | 0.9949 | 1.0711 | | BertForQuestionAnswering | 16 | 1.0 | 0.9325 | 0.9949 | 1.0711 | | LayoutLMForSequenceClassification | 16 | 1.0001 | 0.9327 | 0.994 | 1.0557 | | BertForMaskedLM | 16 | 1.0 | 0.9392 | 0.9843 | 0.9848 | | RobertaForCausalLM | 16 | 1.0 | 0.9389 | 0.9841 | 0.9847 | | CamemBert | 16 | 1.0 | 0.9372 | 0.9815 | 0.982 | | YituTechConvBert | 16 | 1.0 | 0.9351 | 0.9445 | 0.945 | | DistilBertForQuestionAnswering | 256 | 1.0 | 0.9594 | 0.9362 | 1.0349 | | LayoutLMForMaskedLM | 16 | 1.0 | 0.9393 | 0.9249 | 0.9848 | | T5Small | 4 | 1.0 | 0.9589 | 0.9202 | 1.0871 | | T5ForConditionalGeneration | 4 | 1.0 | 0.9589 | 0.9202 | 1.0871 | | MegatronBertForQuestionAnswering | 8 | 1.0 | 0.9167 | 0.915 | 1.063 | | ElectraForCausalLM | 32 | 1.0 | 0.8827 | 0.9094 | 0.9099 | | 
PLBartForConditionalGeneration | 4 | 0.9999 | 0.9321 | 0.9018 | 0.9919 | | DistillGPT2 | 16 | 1.0 | 0.8755 | 0.8492 | 0.9667 | | MegatronBertForCausalLM | 4 | 1.0 | 0.8909 | 0.8374 | 1.0256 | | MBartForCausalLM | 4 | 1.0 | 0.9069 | 0.8347 | 0.9142 | | BartForCausalLM | 4 | 1.0 | 0.9067 | 0.8345 | 0.9141 | | PLBartForCausalLM | 8 | 1.0 | 0.8876 | 0.8267 | 0.9244 | | MBartForConditionalGeneration | 2 | 1.0 | 0.882 | 0.8239 | 0.9794 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8954 | 0.8181 | 0.9055 | | PegasusForConditionalGeneration | 32 | 1.0 | 0.9169 | 0.8166 | 0.9645 | | BartForConditionalGeneration | 2 | 1.0 | 0.8824 | 0.8137 | 0.9794 | | DistilBertForMaskedLM | 128 | 1.0 | 0.883 | 0.7997 | 0.8955 | | PegasusForCausalLM | 32 | 1.0 | 0.8813 | 0.7939 | 0.8904 | | MT5ForConditionalGeneration | 16 | 1.0006 | 0.869 | 0.7921 | 0.9119 | | TrOCRForCausalLM | 32 | 1.0 | 0.8737 | 0.7827 | 0.8833 | | AlbertForQuestionAnswering | 4 | 1.0 | 0.9399 | 0.7775 | 1.2465 | | AlbertForMaskedLM | 4 | 1.0 | 0.9222 | 0.7684 | 1.2096 | | M2M100ForConditionalGeneration | 16 | 1.0 | 0.843 | 0.7498 | 0.9016 | | BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8375 | 0.7233 | 0.8426 | | Speech2Text2ForCausalLM | 256 | 1.0 | 0.8419 | 0.7096 | 0.8199 | | XGLMForCausalLM | 8 | 1.0 | 0.818 | 0.7095 | 0.9136 | | MobileBertForMaskedLM | 64 | 1.0 | 0.8258 | 0.5643 | 0.736 | | DebertaV2ForMaskedLM | 1 | 0.9877 | 0.9876 | 0.5506 | 0.988 | | DebertaForMaskedLM | 4 | 0.9751 | 0.9598 | 0.5429 | 1.0107 | | MobileBertForQuestionAnswering | 128 | 1.0 | 0.9908 | 0.4677 | 0.5574 | | DebertaForQuestionAnswering | 8 | 0.9614 | 1.0317 | 0.4577 | 1.1413 | | DebertaV2ForQuestionAnswering | 2 | 0.9762 | 0.9724 | 0.4556 | 0.9759 | | BlenderbotForCausalLM | 4 | 1.0005 | 1.0017 | nan | 1.001 | | AllenaiLongformerBase | 4 | 0.9986 | 0.9301 | nan | nan | +-----------------------------------------+-----+--------+-----------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ 
+-----------------------------------------+-----+----------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+----------+-----------+----------+------------------------+ | AlbertForMaskedLM | 4 | 389.0115 | 389.2913 | 310.855 | 312.6066 | | AlbertForQuestionAnswering | 4 | 386.6616 | 385.6037 | 308.6096 | 309.324 | | XLNetLMHeadModel | 8 | 378.8146 | 391.4038 | 233.8937 | 232.714 | | PegasusForConditionalGeneration | 32 | 173.7936 | 179.6325 | 171.3355 | 166.1784 | | TrOCRForCausalLM | 32 | 169.986 | 176.7528 | 164.9937 | 160.7474 | | MegatronBertForQuestionAnswering | 8 | 175.0354 | 177.6833 | 146.1871 | 143.9478 | | DebertaV2ForQuestionAnswering | 2 | 150.0347 | 166.2364 | 141.55 | 151.3088 | | MBartForConditionalGeneration | 2 | 149.9696 | 152.0281 | 136.274 | 135.587 | | BartForConditionalGeneration | 2 | 150.516 | 154.1089 | 135.9758 | 133.4038 | | YituTechConvBert | 16 | 156.8592 | 161.6919 | 129.2601 | 129.6307 | | DistilBertForQuestionAnswering | 256 | 145.52 | 146.5274 | 127.2631 | 127.1935 | | DebertaV2ForMaskedLM | 1 | 133.0088 | 167.0629 | 123.1884 | 141.8007 | | DistilBertForMaskedLM | 128 | 122.8925 | 129.7185 | 119.6985 | 116.5623 | | LayoutLMForMaskedLM | 16 | 138.3723 | 142.863 | 116.1603 | 114.3909 | | MobileBertForQuestionAnswering | 128 | 150.8937 | 158.9379 | 115.2885 | 122.3066 | | CamemBert | 16 | 137.3197 | 141.6983 | 114.6608 | 114.6942 | | BertForMaskedLM | 16 | 135.9518 | 140.3558 | 113.6014 | 113.5406 | | RobertaForCausalLM | 16 | 144.2961 | 148.6017 | 113.1714 | 113.6962 | | BlenderbotSmallForConditionalGeneration | 64 | 120.1667 | 129.791 | 112.0214 | 109.6656 | | M2M100ForConditionalGeneration | 16 | 119.428 | 122.0898 | 111.9917 | 109.8338 | | MBartForCausalLM | 4 | 123.8205 | 129.0186 | 109.3868 | 107.3908 | | BartForCausalLM | 4 | 125.0899 | 129.0296 | 108.9578 | 106.2221 | | PLBartForConditionalGeneration | 4 | 
121.8425 | 126.4535 | 100.5217 | 97.4683 | | PLBartForCausalLM | 8 | 120.6186 | 126.9439 | 97.6729 | 93.818 | | MobileBertForMaskedLM | 64 | 131.454 | 153.5142 | 95.4707 | 111.4202 | | OPTForCausalLM | 2 | 173.7386 | 184.1215 | 95.4297 | 93.1427 | | MegatronBertForCausalLM | 4 | 104.076 | 105.1513 | 90.5093 | 88.4321 | | LayoutLMForSequenceClassification | 16 | 114.87 | 116.26 | 89.1693 | 88.8637 | | ElectraForQuestionAnswering | 64 | 126.3775 | 127.5094 | 88.6793 | 88.5631 | | BertForQuestionAnswering | 16 | 112.4422 | 113.6181 | 88.5957 | 88.4051 | | RobertaForQuestionAnswering | 16 | 112.6914 | 114.0163 | 88.1282 | 88.5579 | | DistillGPT2 | 16 | 121.1643 | 128.8744 | 87.9638 | 84.6725 | | PegasusForCausalLM | 32 | 85.9548 | 90.78 | 86.953 | 83.362 | | T5ForConditionalGeneration | 4 | 105.7331 | 128.1551 | 84.5827 | 79.4515 | | T5Small | 4 | 105.898 | 128.076 | 84.5623 | 79.4429 | | DebertaForQuestionAnswering | 8 | 97.7999 | 102.5552 | 79.4854 | 79.0613 | | ElectraForCausalLM | 32 | 107.3826 | 114.4564 | 73.9921 | 74.2814 | | XGLMForCausalLM | 8 | 81.788 | 85.2463 | 71.0978 | 73.876 | | DebertaForMaskedLM | 4 | 84.2589 | 106.5771 | 70.5148 | 73.9413 | | GoogleFnet | 16 | 103.2382 | 106.9232 | 65.9896 | 66.0227 | | BlenderbotSmallForCausalLM | 64 | 64.6973 | 72.4413 | 65.0582 | 62.1556 | | GPT2ForSequenceClassification | 4 | 103.6259 | 106.3697 | 62.8925 | 61.8215 | | MT5ForConditionalGeneration | 16 | 88.3505 | 100.0412 | 59.4994 | 59.9986 | | Speech2Text2ForCausalLM | 256 | 63.8424 | 69.6142 | 56.6166 | 54.0307 | | BlenderbotForCausalLM | 4 | 102.5556 | 110.4037 | nan | 94.2818 | | AllenaiLongformerBase | 4 | 249.5502 | 417.8179 | nan | nan | +-----------------------------------------+-----+----------+-----------+----------+------------------------+ ~~~

timm_models suite with float32 precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------+------------------------+ | ghostnet_100 | 128 | 0.9933 | 0.7712 | 1.7307 | 1.7072 | | tnt_s_patch16_224 | 128 | 0.9988 | 0.9982 | 1.71 | 1.704 | | lcnet_050 | 128 | 0.9516 | 0.7499 | 1.5224 | 1.5354 | | coat_lite_mini | 128 | 0.9983 | 0.9983 | 1.4452 | 1.3993 | | convit_base | 64 | 0.9981 | 0.9969 | 1.4426 | 1.443 | | dm_nfnet_f0 | 128 | 0.9912 | 0.9893 | 1.4254 | 1.3935 | | nfnet_l0 | 128 | 0.9933 | 0.7786 | 1.4172 | 1.3946 | | gmlp_s16_224 | 128 | 0.9949 | 1.0696 | 1.3669 | 1.3665 | | xcit_large_24_p8_224 | 5 | 0.9944 | 0.9545 | 1.3646 | 1.2864 | | gmixer_24_224 | 128 | 0.9961 | 0.8277 | 1.3569 | 1.3554 | | volo_d1_224 | 64 | 0.9954 | 0.9714 | 1.3563 | 1.3468 | | hrnet_w18 | 128 | 0.9947 | 0.7018 | 1.3469 | 1.3173 | | dla102 | 128 | 0.997 | 0.8733 | 1.3469 | 1.3466 | | crossvit_9_240 | 128 | 0.9912 | 0.8044 | 1.3385 | 1.3293 | | sebotnet33ts_256 | 64 | 0.9707 | 0.7635 | 1.3311 | 1.3487 | | adv_inception_v3 | 128 | 0.9973 | 0.8947 | 1.3083 | 1.3036 | | inception_v3 | 128 | 0.9974 | 0.8949 | 1.3075 | 1.3043 | | gluon_inception_v3 | 128 | 0.9974 | 0.8923 | 1.3073 | 1.3043 | | res2net50_14w_8s | 128 | 0.9992 | 0.8676 | 1.3059 | 1.298 | | mobilenetv2_100 | 128 | 0.9654 | 0.8228 | 1.2848 | 1.3199 | | mobilenetv3_large_100 | 128 | 0.9637 | 0.8127 | 1.2834 | 1.3128 | | twins_pcpvt_base | 64 | 0.9919 | 0.9876 | 1.2576 | 1.2314 | | tf_efficientnet_b0 | 128 | 0.9745 | 0.7141 | 1.2569 | 1.2752 | | botnet26t_256 | 128 | 0.9814 | 0.8942 | 1.2548 | 1.2648 | | eca_botnext26ts_256 | 128 | 0.9856 | 0.7508 | 1.244 | 1.2311 | | fbnetv3_b | 128 | 0.9629 | 0.8267 | 1.2345 | 1.2668 | | resnest101e | 64 | 0.9967 | 0.922 | 1.2341 | 1.2042 | | ese_vovnet19b_dw | 128 | 0.9736 | 0.8891 | 1.2197 | 1.2312 | | mnasnet_100 | 128 | 
0.9651 | 0.8086 | 1.2169 | 1.2509 | | rexnet_100 | 128 | 0.9684 | 0.7308 | 1.2136 | 1.232 | | selecsls42b | 128 | 0.999 | 0.8635 | 1.2059 | 1.1993 | | fbnetc_100 | 128 | 0.966 | 0.821 | 1.205 | 1.2343 | | mobilevit_s | 64 | 0.9732 | 0.7151 | 1.2046 | 1.1926 | | regnety_002 | 128 | 0.9266 | 0.775 | 1.2044 | 1.2023 | | res2next50 | 128 | 0.9992 | 0.9119 | 1.1955 | 1.1763 | | jx_nest_base | 32 | 0.9888 | 0.9828 | 1.1901 | 1.1858 | | cait_m36_384 | 4 | 0.9957 | 0.9958 | 1.189 | 1.1915 | | pit_b_224 | 64 | 0.9963 | 0.9941 | 1.1869 | 1.1808 | | spnasnet_100 | 128 | 0.96 | 0.8037 | 1.1864 | 1.2203 | | tinynet_a | 128 | 0.9638 | 0.7008 | 1.1818 | 1.2082 | | cspdarknet53 | 64 | 0.9501 | 0.8277 | 1.1728 | 1.1978 | | poolformer_m36 | 64 | 0.9891 | 0.9857 | 1.1708 | 1.1666 | | swin_base_patch4_window7_224 | 64 | 0.9943 | 0.972 | 1.1702 | 1.1689 | | tf_mixnet_l | 128 | 0.9824 | 0.8382 | 1.1605 | 1.1664 | | dpn107 | 32 | 0.9589 | 0.9096 | 1.1563 | 1.1879 | | mixnet_l | 128 | 0.9822 | 0.8338 | 1.1492 | 1.1552 | | pnasnet5large | 16 | 0.9906 | 0.9535 | 1.1415 | 1.158 | | repvgg_a2 | 128 | 0.961 | 0.873 | 1.1363 | 1.1442 | | res2net101_26w_4s | 64 | 0.9989 | 0.8949 | 1.1226 | 1.1357 | | mixer_b16_224 | 128 | 0.9977 | 0.9997 | 1.1062 | 1.1092 | | beit_base_patch16_224 | 64 | 0.9982 | 0.9793 | 1.0974 | 1.1031 | | convnext_base | 64 | 0.9924 | 0.9906 | 1.0909 | 1.0866 | | deit_base_distilled_patch16_224 | 64 | 0.9977 | 0.9964 | 1.0856 | 1.0854 | | vit_base_patch16_224 | 64 | 0.9986 | 0.9966 | 1.0812 | 1.0823 | | swsl_resnext101_32x16d | 32 | 0.9986 | 0.9244 | 1.0667 | 1.0377 | | convmixer_768_32 | 32 | 0.9989 | 0.9904 | 1.0592 | 1.061 | | gernet_l | 128 | 0.9644 | 0.8881 | 1.0517 | 1.061 | | visformer_small | 128 | 0.9974 | 0.9528 | 1.0458 | 1.0157 | | gluon_xception65 | 32 | 0.9959 | 0.9063 | 1.0272 | 1.0345 | | resmlp_12_224 | 128 | 0.9935 | 0.8928 | 0.9303 | 0.9299 | +---------------------------------+-----+--------+-----------+----------+------------------------+ ~~~ Accuracy ~~~ 
+---------------------------------+----+-------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------+-----------+----------+------------------------+ | adv_inception_v3 | 8 | pass | pass | pass | pass | | beit_base_patch16_224 | 8 | pass | pass | pass | pass | | mobilenetv3_large_100 | 8 | pass | pass | pass | pass | | mobilevit_s | 8 | pass | pass | pass | pass | | nfnet_l0 | 8 | pass | pass | pass | pass | | pit_b_224 | 8 | pass | pass | pass | pass | | pnasnet5large | 8 | pass | pass | pass | pass | | poolformer_m36 | 8 | pass | pass | pass | pass | | regnety_002 | 8 | pass | pass | pass | pass | | repvgg_a2 | 8 | pass | pass | pass | pass | | res2net101_26w_4s | 8 | pass | pass | pass | pass | | res2net50_14w_8s | 8 | pass | pass | pass | pass | | res2next50 | 8 | pass | pass | pass | pass | | resmlp_12_224 | 8 | pass | pass | pass | pass | | resnest101e | 8 | pass | pass | pass | pass | | rexnet_100 | 8 | pass | pass | pass | pass | | sebotnet33ts_256 | 8 | pass | pass | pass | pass | | selecsls42b | 8 | pass | pass | pass | pass | | spnasnet_100 | 8 | pass | pass | pass | pass | | swin_base_patch4_window7_224 | 8 | pass | pass | pass | pass | | swsl_resnext101_32x16d | 8 | pass | pass | pass | pass | | tf_efficientnet_b0 | 8 | pass | pass | pass | pass | | tf_mixnet_l | 8 | pass | pass | pass | pass | | tinynet_a | 8 | pass | pass | pass | pass | | tnt_s_patch16_224 | 8 | pass | pass | pass | pass | | twins_pcpvt_base | 8 | pass | pass | pass | pass | | visformer_small | 8 | pass | pass | pass | pass | | vit_base_patch16_224 | 8 | pass | pass | pass | pass | | volo_d1_224 | 8 | pass | pass | pass | pass | | mobilenetv2_100 | 8 | pass | pass | pass | pass | | mnasnet_100 | 8 | pass | pass | pass | pass | | mixnet_l | 8 | pass | pass | pass | pass | | eca_botnext26ts_256 | 8 | pass | pass | pass | pass | | botnet26t_256 | 8 | pass | pass | pass | 
pass | | cait_m36_384 | 4 | pass | pass | pass | pass | | coat_lite_mini | 8 | pass | pass | pass | pass | | convit_base | 8 | pass | pass | pass | pass | | convmixer_768_32 | 8 | pass | pass | pass | pass | | convnext_base | 8 | pass | pass | pass | pass | | crossvit_9_240 | 8 | pass | pass | pass | pass | | cspdarknet53 | 8 | pass | pass | pass | pass | | deit_base_distilled_patch16_224 | 8 | pass | pass | pass | pass | | dla102 | 8 | pass | pass | pass | pass | | dm_nfnet_f0 | 8 | pass | pass | pass | pass | | dpn107 | 8 | pass | pass | pass | pass | | ese_vovnet19b_dw | 8 | pass | pass | pass | pass | | mixer_b16_224 | 8 | pass | pass | pass | pass | | fbnetc_100 | 8 | pass | pass | pass | pass | | fbnetv3_b | 8 | pass | pass | pass | pass | | gernet_l | 8 | pass | pass | pass | pass | | ghostnet_100 | 8 | pass | pass | pass | pass | | gluon_inception_v3 | 8 | pass | pass | pass | pass | | gluon_xception65 | 8 | pass | pass | pass | pass | | gmixer_24_224 | 8 | pass | pass | pass | pass | | gmlp_s16_224 | 8 | pass | pass | pass | pass | | hrnet_w18 | 8 | pass | pass | pass | pass | | inception_v3 | 8 | pass | pass | pass | pass | | jx_nest_base | 8 | pass | pass | pass | pass | | lcnet_050 | 8 | pass | pass | pass | pass | | xcit_large_24_p8_224 | 8 | pass | pass | pass | pass | +---------------------------------+----+-------+-----------+----------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------+------------------------+ | rexnet_100 | 128 | 4.0105 | 8.6583 | 276.7602 | 273.9646 | | hrnet_w18 | 128 | 8.6475 | 31.4324 | 233.938 | 225.2791 | | ghostnet_100 | 128 | 5.5447 | 11.9806 | 231.1308 | 229.9564 | | mobilevit_s | 64 | 3.6748 | 8.5316 | 183.8375 | 178.0214 | | fbnetv3_b | 128 | 6.0199 | 12.8531 | 160.7021 | 
157.2946 | | gluon_inception_v3 | 128 | 3.8215 | 9.6792 | 153.5986 | 146.5848 | | inception_v3 | 128 | 3.8261 | 9.8092 | 152.5033 | 148.1424 | | mobilenetv3_large_100 | 128 | 2.979 | 6.4432 | 152.1616 | 143.8063 | | adv_inception_v3 | 128 | 3.8456 | 9.9342 | 149.3431 | 152.1385 | | tinynet_a | 128 | 4.0906 | 9.1665 | 148.3809 | 146.7822 | | tf_efficientnet_b0 | 128 | 3.5411 | 8.0063 | 146.624 | 143.9997 | | mixnet_l | 128 | 6.2324 | 12.6212 | 146.5976 | 138.3035 | | pnasnet5large | 16 | 7.5545 | 22.6856 | 146.1797 | 147.4728 | | resnest101e | 64 | 7.464 | 18.3107 | 145.8378 | 147.637 | | tf_mixnet_l | 128 | 6.8421 | 13.3153 | 141.5892 | 147.7897 | | res2net101_26w_4s | 64 | 6.8763 | 19.2526 | 133.0005 | 139.8832 | | fbnetc_100 | 128 | 3.5554 | 7.5188 | 132.591 | 130.5764 | | spnasnet_100 | 128 | 3.5338 | 7.4684 | 129.5098 | 131.7982 | | mobilenetv2_100 | 128 | 2.8945 | 6.1392 | 126.203 | 123.3439 | | twins_pcpvt_base | 64 | 6.2758 | 15.7811 | 118.1931 | 116.9325 | | mnasnet_100 | 128 | 2.9329 | 5.887 | 113.2946 | 112.9517 | | sebotnet33ts_256 | 64 | 3.1461 | 7.1507 | 113.2927 | 110.2737 | | res2net50_14w_8s | 128 | 5.8385 | 17.6618 | 111.7137 | 112.39 | | xcit_large_24_p8_224 | 5 | 7.8301 | 20.0694 | 109.2289 | 111.1661 | | regnety_002 | 128 | 3.3296 | 6.642 | 101.3164 | 102.7874 | | swin_base_patch4_window7_224 | 64 | 6.1619 | 14.7808 | 100.7511 | 102.5781 | | eca_botnext26ts_256 | 128 | 2.489 | 5.8238 | 98.2307 | 96.8318 | | lcnet_050 | 128 | 1.6455 | 3.8924 | 96.5549 | 95.9596 | | cait_m36_384 | 4 | 8.8692 | 21.8164 | 95.3837 | 93.2648 | | cspdarknet53 | 64 | 4.2361 | 8.6512 | 93.2061 | 91.5792 | | dpn107 | 32 | 7.2081 | 16.1371 | 91.1458 | 88.0163 | | dla102 | 128 | 4.0386 | 10.9094 | 89.4054 | 89.3383 | | selecsls42b | 128 | 1.4854 | 4.1342 | 85.9912 | 85.7256 | | botnet26t_256 | 128 | 2.3637 | 4.8399 | 85.0983 | 85.4387 | | poolformer_m36 | 64 | 5.1194 | 10.2649 | 85.0233 | 83.9623 | | gluon_xception65 | 32 | 4.9185 | 12.9028 | 83.8195 | 83.2555 | | 
res2next50 | 128 | 3.2575 | 9.3991 | 81.1746 | 79.9079 | | gernet_l | 128 | 3.7093 | 7.2589 | 78.4188 | 77.4023 | | crossvit_9_240 | 128 | 3.7687 | 9.7221 | 77.9705 | 76.6655 | | coat_lite_mini | 128 | 2.2574 | 5.7788 | 76.9479 | 76.6717 | | nfnet_l0 | 128 | 3.7411 | 8.4894 | 72.7622 | 69.9893 | | ese_vovnet19b_dw | 128 | 1.7335 | 3.6326 | 71.2505 | 75.9827 | | jx_nest_base | 32 | 4.49 | 11.0734 | 70.3396 | 70.0794 | | dm_nfnet_f0 | 128 | 4.4244 | 9.2973 | 66.0111 | 64.7937 | | volo_d1_224 | 64 | 3.3102 | 8.892 | 63.4175 | 62.6754 | | tnt_s_patch16_224 | 128 | 4.1952 | 11.5155 | 57.9447 | 56.8855 | | repvgg_a2 | 128 | 3.5682 | 7.1054 | 56.6497 | 56.0347 | | visformer_small | 128 | 1.7805 | 4.6255 | 56.6172 | 57.9289 | | swsl_resnext101_32x16d | 32 | 4.0933 | 10.7145 | 54.425 | 54.5669 | | gmlp_s16_224 | 128 | 3.4351 | 7.9731 | 50.2895 | 50.0782 | | convnext_base | 64 | 4.2918 | 8.671 | 44.082 | 43.2773 | | convit_base | 64 | 2.2987 | 6.5112 | 41.0221 | 39.5729 | | gmixer_24_224 | 128 | 3.3967 | 8.7836 | 40.9155 | 42.8216 | | pit_b_224 | 64 | 2.4085 | 5.6843 | 40.0445 | 39.8593 | | resmlp_12_224 | 128 | 1.7517 | 3.5746 | 35.2694 | 36.3512 | | convmixer_768_32 | 32 | 1.4819 | 5.8924 | 33.1188 | 30.0091 | | deit_base_distilled_patch16_224 | 64 | 2.0282 | 4.9177 | 30.6342 | 29.0745 | | beit_base_patch16_224 | 64 | 2.6669 | 6.3452 | 29.3731 | 28.95 | | vit_base_patch16_224 | 64 | 2.0335 | 4.8715 | 27.7207 | 28.0849 | | mixer_b16_224 | 128 | 1.5802 | 3.897 | 26.6254 | 26.6312 | +---------------------------------+-----+--------+-----------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------+------------------------+ | pnasnet5large | 16 | 1.0694 | 1.0163 | 1.2139 | 1.3259 | | mobilenetv2_100 | 128 | 1.0001 | 
0.7668 | 1.1613 | 1.2656 | | tinynet_a | 128 | 1.0001 | 0.7829 | 1.0278 | 1.0949 | | convit_base | 64 | 1.0 | 0.8836 | 1.0277 | 1.1097 | | rexnet_100 | 128 | 1.0 | 0.7887 | 1.0166 | 1.0838 | | fbnetv3_b | 128 | 1.0 | 0.8078 | 0.9954 | 1.003 | | tf_efficientnet_b0 | 128 | 1.0 | 0.7729 | 0.9912 | 1.0929 | | convmixer_768_32 | 32 | 1.0 | 0.9865 | 0.9839 | 0.9959 | | selecsls42b | 128 | 1.0003 | 0.9762 | 0.9761 | 1.0215 | | resmlp_12_224 | 128 | 1.0 | 0.9524 | 0.9691 | 0.9868 | | dla102 | 128 | 0.9831 | 0.9212 | 0.9641 | 1.048 | | tf_mixnet_l | 128 | 0.9999 | 0.86 | 0.9609 | 1.1153 | | mixer_b16_224 | 128 | 1.0 | 0.9704 | 0.9604 | 1.0065 | | gmixer_24_224 | 128 | 1.0 | 0.9767 | 0.9576 | 0.9917 | | ghostnet_100 | 128 | 1.0005 | 0.9033 | 0.9565 | 1.0646 | | resnest101e | 64 | 1.0 | 0.9586 | 0.9562 | 1.0502 | | cspdarknet53 | 64 | 1.0 | 0.8713 | 0.9538 | 1.1043 | | xcit_large_24_p8_224 | 5 | 0.9995 | 0.9182 | 0.9329 | 1.0105 | | hrnet_w18 | 128 | 0.9997 | 0.9301 | 0.9322 | 1.0108 | | dpn107 | 32 | 1.0001 | 0.9526 | 0.9283 | 1.0017 | | mobilevit_s | 64 | 1.0 | 0.7758 | 0.9255 | 0.9901 | | beit_base_patch16_224 | 64 | 1.0 | 0.9562 | 0.9248 | 0.9992 | | tnt_s_patch16_224 | 128 | 1.0001 | 0.9808 | 0.9221 | 1.0036 | | spnasnet_100 | 128 | 0.9996 | 0.9208 | 0.9175 | 0.976 | | inception_v3 | 128 | 1.0002 | 0.8727 | 0.917 | 1.0689 | | gluon_inception_v3 | 128 | 1.0002 | 0.8727 | 0.917 | 1.0689 | | adv_inception_v3 | 128 | 1.0002 | 0.8727 | 0.917 | 1.0689 | | res2net101_26w_4s | 64 | 1.0 | 0.9279 | 0.9164 | 1.002 | | eca_botnext26ts_256 | 128 | 1.0 | 0.7715 | 0.916 | 1.0173 | | mobilenetv3_large_100 | 128 | 0.9996 | 0.8846 | 0.9144 | 0.9851 | | vit_base_patch16_224 | 64 | 1.0 | 0.9453 | 0.9119 | 0.9949 | | mnasnet_100 | 128 | 0.9998 | 0.9135 | 0.9103 | 0.9738 | | nfnet_l0 | 128 | 0.9999 | 0.8322 | 0.9098 | 0.998 | | deit_base_distilled_patch16_224 | 64 | 1.0005 | 0.9469 | 0.9075 | 0.9905 | | swsl_resnext101_32x16d | 32 | 1.0001 | 0.9085 | 0.9075 | 1.0 | | res2net50_14w_8s | 128 | 
1.0001 | 0.9171 | 0.9054 | 1.0181 | | res2next50 | 128 | 1.0002 | 0.9196 | 0.9014 | 1.0134 | | visformer_small | 128 | 1.0004 | 0.9421 | 0.9007 | 0.9926 | | cait_m36_384 | 4 | 1.0001 | 0.935 | 0.9005 | 0.988 | | volo_d1_224 | 64 | 0.9999 | 0.9243 | 0.8975 | 0.9634 | | ese_vovnet19b_dw | 128 | 0.9999 | 0.8975 | 0.8974 | 1.0127 | | gluon_xception65 | 32 | 1.0 | 0.8967 | 0.8947 | 0.9923 | | fbnetc_100 | 128 | 0.9999 | 0.8607 | 0.8935 | 0.9847 | | mixnet_l | 128 | 1.0 | 0.8479 | 0.8918 | 1.0338 | | lcnet_050 | 128 | 1.0004 | 0.786 | 0.881 | 0.9552 | | dm_nfnet_f0 | 128 | 0.9113 | 0.8857 | 0.8735 | 1.1277 | | gmlp_s16_224 | 128 | 1.0 | 0.9822 | 0.8656 | 0.8743 | | swin_base_patch4_window7_224 | 64 | 0.9999 | 0.9295 | 0.8653 | 0.988 | | botnet26t_256 | 128 | 1.0 | 0.8666 | 0.8623 | 0.9861 | | gernet_l | 128 | 1.0 | 0.8663 | 0.8613 | 0.998 | | twins_pcpvt_base | 64 | 1.0005 | 0.921 | 0.86 | 0.9398 | | jx_nest_base | 32 | 1.002 | 0.8971 | 0.8479 | 0.9832 | | sebotnet33ts_256 | 64 | 1.0 | 0.7135 | 0.8393 | 1.0449 | | crossvit_9_240 | 128 | 1.0001 | 0.8744 | 0.82 | 0.9829 | | poolformer_m36 | 64 | 0.9998 | 0.9517 | 0.8195 | 1.1099 | | regnety_002 | 128 | 1.0001 | 0.8225 | 0.8013 | 0.9579 | | pit_b_224 | 64 | 1.0001 | 0.7934 | 0.7981 | 0.9905 | | repvgg_a2 | 128 | 1.0005 | 0.827 | 0.7788 | 1.005 | | convnext_base | 64 | 1.0001 | 0.9147 | 0.7585 | 0.9504 | | coat_lite_mini | 128 | 1.0111 | 0.8823 | 0.7543 | 0.9347 | +---------------------------------+-----+--------+-----------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------+------------------------+ | convmixer_768_32 | 32 | 355.5362 | 358.4894 | 335.1907 | 334.7495 | | hrnet_w18 | 128 | 376.6619 | 531.5092 | 277.106 | 283.2083 | | tnt_s_patch16_224 | 128 | 468.9229 
| 469.0802 | 273.724 | 274.8511 | | pnasnet5large | 16 | 292.2616 | 303.402 | 253.8925 | 249.7514 | | convnext_base | 64 | 272.9205 | 273.5949 | 248.4064 | 249.464 | | tf_mixnet_l | 128 | 258.3502 | 302.7072 | 218.2655 | 217.1794 | | mixnet_l | 128 | 248.7933 | 292.7808 | 212.1573 | 211.2961 | | res2next50 | 128 | 251.5109 | 275.733 | 210.3495 | 214.2467 | | swsl_resnext101_32x16d | 32 | 217.2013 | 234.3006 | 203.2743 | 208.8541 | | resnest101e | 64 | 250.3652 | 269.8928 | 201.5444 | 207.2365 | | swin_base_patch4_window7_224 | 64 | 235.842 | 241.6752 | 200.5019 | 200.8853 | | dla102 | 128 | 250.2675 | 285.5803 | 185.2588 | 185.3246 | | cait_m36_384 | 4 | 216.5972 | 216.3744 | 181.4181 | 181.1613 | | gluon_xception65 | 32 | 180.2627 | 198.163 | 174.8302 | 173.9422 | | adv_inception_v3 | 128 | 223.1727 | 248.7867 | 170.856 | 170.8174 | | gluon_inception_v3 | 128 | 223.358 | 249.417 | 170.5258 | 170.6001 | | inception_v3 | 128 | 223.2336 | 248.7032 | 170.3955 | 170.6673 | | res2net50_14w_8s | 128 | 212.8003 | 244.9183 | 162.7145 | 163.8175 | | dpn107 | 32 | 192.5966 | 203.2194 | 159.6021 | 155.7298 | | eca_botnext26ts_256 | 128 | 195.4949 | 255.6489 | 154.2469 | 155.9393 | | mixer_b16_224 | 128 | 161.6875 | 161.3904 | 147.5941 | 147.4979 | | poolformer_m36 | 64 | 172.914 | 173.454 | 146.0974 | 146.7028 | | dm_nfnet_f0 | 128 | 196.5281 | 196.8797 | 136.8117 | 139.7913 | | convit_base | 64 | 196.4037 | 196.9776 | 136.3484 | 136.2296 | | res2net101_26w_4s | 64 | 151.0843 | 168.0723 | 134.6003 | 132.7344 | | coat_lite_mini | 128 | 192.8658 | 192.9047 | 133.2406 | 137.5414 | | pit_b_224 | 64 | 158.1907 | 158.4419 | 132.7388 | 133.4912 | | gernet_l | 128 | 141.4057 | 153.6261 | 129.9314 | 128.6654 | | fbnetv3_b | 128 | 160.2543 | 186.4211 | 124.7924 | 121.7648 | | visformer_small | 128 | 127.6969 | 133.7669 | 121.6278 | 125.3865 | | beit_base_patch16_224 | 64 | 129.1954 | 132.0865 | 119.4448 | 117.2841 | | nfnet_l0 | 128 | 169.7502 | 216.1149 | 118.6176 | 120.6885 | | 
gmlp_s16_224 | 128 | 162.7403 | 151.7253 | 118.6085 | 118.6111 | | botnet26t_256 | 128 | 149.5536 | 164.1675 | 116.972 | 116.0351 | | volo_d1_224 | 64 | 152.7537 | 156.5245 | 112.349 | 112.8275 | | deit_base_distilled_patch16_224 | 64 | 120.9586 | 121.3401 | 111.6025 | 111.4072 | | vit_base_patch16_224 | 64 | 120.6468 | 120.9155 | 111.1462 | 111.4671 | | repvgg_a2 | 128 | 125.9267 | 138.7671 | 107.1892 | 106.1287 | | gmixer_24_224 | 128 | 145.6243 | 175.0378 | 106.8543 | 106.8778 | | twins_pcpvt_base | 64 | 135.0741 | 135.6805 | 106.5016 | 109.0349 | | xcit_large_24_p8_224 | 5 | 134.8545 | 141.691 | 105.0561 | 105.6414 | | cspdarknet53 | 64 | 129.2247 | 148.3894 | 104.7582 | 102.4885 | | tf_efficientnet_b0 | 128 | 131.7866 | 180.0391 | 102.1566 | 100.7133 | | jx_nest_base | 32 | 120.9773 | 121.961 | 100.7285 | 100.9481 | | mobilevit_s | 64 | 120.5948 | 164.0299 | 97.1536 | 98.3099 | | fbnetc_100 | 128 | 120.3775 | 141.7194 | 96.512 | 94.2228 | | rexnet_100 | 128 | 117.2615 | 155.5388 | 93.4263 | 92.2906 | | tinynet_a | 128 | 107.8594 | 148.1958 | 87.7753 | 85.9203 | | sebotnet33ts_256 | 64 | 115.1202 | 146.1043 | 83.7935 | 82.7258 | | spnasnet_100 | 128 | 102.476 | 122.6691 | 82.978 | 80.7174 | | ese_vovnet19b_dw | 128 | 98.8806 | 108.2632 | 78.9117 | 78.2674 | | selecsls42b | 128 | 91.7843 | 106.1433 | 76.0327 | 76.4712 | | mnasnet_100 | 128 | 95.4617 | 113.7311 | 75.6008 | 73.7201 | | mobilenetv2_100 | 128 | 95.4738 | 112.1577 | 71.6766 | 69.9374 | | crossvit_9_240 | 128 | 94.805 | 117.0483 | 70.3365 | 70.7082 | | resmlp_12_224 | 128 | 63.9366 | 71.3055 | 68.4751 | 68.3251 | | ghostnet_100 | 128 | 110.1656 | 142.0036 | 63.1794 | 64.1466 | | mobilenetv3_large_100 | 128 | 83.1833 | 98.7643 | 62.4827 | 61.1277 | | regnety_002 | 128 | 48.5115 | 58.118 | 37.483 | 38.8874 | | lcnet_050 | 128 | 36.8528 | 46.8385 | 23.0214 | 22.8623 | +---------------------------------+-----+----------+-----------+----------+------------------------+ ~~~

Performance graphs

/data/home/williamwen/cluster/oneoff_cron_logs/day_100_10_04_23_performance_float32_549/huggingface_float32.png : ![](https://i.imgur.com/CBQOV9k.png)
/data/home/williamwen/cluster/oneoff_cron_logs/day_100_10_04_23_performance_float32_549/torchbench_float32.png : ![](https://i.imgur.com/kAg6wQw.png)
/data/home/williamwen/cluster/oneoff_cron_logs/day_100_10_04_23_performance_float32_549/timm_models_float32.png : ![](https://i.imgur.com/TNY6xRx.png)

Build Summary

### Run name ###
day_100_10_04_23_performance_float32_549

### Commit hashes ###
pytorch commit: f55e72c0f6bd6da016aaa51de379e6ba6d7891cc
pytorch commit date: 2023-04-07 17:30:27+00:00
torchbench commit: 735f1927996c8d9ab81f0b0c05dd1ebdb26a6250
torchbench commit date: 2023-04-05 09:43:21-07:00

### TorchDynamo config flags ###

### Torch version ###
torch: 2.1.0a0+gitf55e72c

### Environment variables ###
TORCH_CUDA_ARCH_LIST = 8.0
CUDA_HOME = /usr/local/cuda-11.6
USE_LLVM = /usr/lib/llvm-10

### GPU details ###
CUDNN VERSION: 8401
Number CUDA Devices: 2
Device Name: NVIDIA A100-SXM4-40GB
Device Memory [GB]: 42.481549312

williamwen42 commented 1 year ago

Performance Dashboard for amp precision (inductor max-autotune comparison on timm models, small)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
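As a rough illustration of the two steps described above (drop models that fail accuracy, then normalize against native PyTorch), here is a minimal sketch of how a geometric mean speedup could be computed. The record layout, the field names, and the latency numbers are all hypothetical, and treating only the exact status `pass` as passing is an assumption; this is not the dashboard's actual code.

```python
from statistics import geometric_mean

def geomean_speedup(results):
    """Geometric mean of eager_ms / compiled_ms over models that pass accuracy.

    `results` is a list of dicts with hypothetical keys:
    'name', 'accuracy', 'eager_ms', 'compiled_ms'.
    """
    # Assumption: only the exact status "pass" counts as passing.
    passing = [r for r in results if r["accuracy"] == "pass"]
    if not passing:
        return float("nan")
    # Speedup is eager latency divided by compiled latency (higher is better).
    speedups = [r["eager_ms"] / r["compiled_ms"] for r in passing]
    return geometric_mean(speedups)

# Made-up example data, for illustration only.
example_results = [
    {"name": "model_a", "accuracy": "pass", "eager_ms": 100.0, "compiled_ms": 50.0},
    {"name": "model_b", "accuracy": "fail_to_run", "eager_ms": 90.0, "compiled_ms": 30.0},
    {"name": "model_c", "accuracy": "pass", "eager_ms": 80.0, "compiled_ms": 10.0},
]

example_geomean = geomean_speedup(example_results)
print(f"{example_geomean:.2f}x")
```

Here `model_b` is excluded because it fails accuracy, so the result is the geometric mean of the remaining two speedups (2x and 8x).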

Passrate

+-------------------------------------+-------------+
|              Compiler               | timm_models |
+-------------------------------------+-------------+
|              inductor               |  100%, 2/2  |
|       inductor_no_cudagraphs        |  100%, 2/2  |
|        inductor_max_autotune        |  100%, 2/2  |
| inductor_max_autotune_no_cudagraphs |  100%, 2/2  |
+-------------------------------------+-------------+

Geometric mean speedup

+-------------------------------------+-------------+
|              Compiler               | timm_models |
+-------------------------------------+-------------+
|              inductor               |    2.54x    |
|       inductor_no_cudagraphs        |    2.20x    |
|        inductor_max_autotune        |    2.72x    |
| inductor_max_autotune_no_cudagraphs |    2.33x    |
+-------------------------------------+-------------+

Mean compilation time (seconds)

+-------------------------------------+-------------+
|              Compiler               | timm_models |
+-------------------------------------+-------------+
|              inductor               |   106.20    |
|       inductor_no_cudagraphs        |    69.36    |
|        inductor_max_autotune        |   748.14    |
| inductor_max_autotune_no_cudagraphs |    81.81    |
+-------------------------------------+-------------+

Peak memory footprint compression ratio (higher is better)

+-------------------------------------+-------------+
|              Compiler               | timm_models |
+-------------------------------------+-------------+
|              inductor               |    0.90x    |
|       inductor_no_cudagraphs        |    1.03x    |
|        inductor_max_autotune        |    0.91x    |
| inductor_max_autotune_no_cudagraphs |    1.04x    |
+-------------------------------------+-------------+

Warnings

We flag models where:
- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Compilation latency (sec) warnings
~~~
+-------------+----------------------+----------+------------------------+
| suite       | name                 | inductor | inductor_no_cudagraphs |
+-------------+----------------------+----------+------------------------+
| timm_models | xcit_large_24_p8_224 | 138.9123 | 88.2145                |
+-------------+----------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+----------------------+----------+------------------------+
| suite       | name                 | inductor | inductor_no_cudagraphs |
+-------------+----------------------+----------+------------------------+
| timm_models | xcit_large_24_p8_224 | 0.8225   | 1.0063                 |
+-------------+----------------------+----------+------------------------+
~~~
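The flagging rules listed above can be sketched as a small helper. This is only an illustration of the stated thresholds, not the dashboard's actual code: the function name `warning_flags` and the treatment of any status starting with `pass` as a pass are assumptions. The sample values are the `xcit_large_24_p8_224` numbers reported in this comment.

```python
# Thresholds quoted in the "Warnings" text of the dashboard.
MIN_SPEEDUP = 0.95        # speedup below this is flagged
MAX_COMPILE_SEC = 120.0   # compilation latency above this is flagged
MIN_COMPRESSION = 0.9     # peak-memory compression ratio below this is flagged


def warning_flags(speedup, compile_sec, compression_ratio, accuracy="pass"):
    """Return the warning categories one (model, compiler) result triggers.

    Hypothetical helper: statuses such as "pass" and "pass_due_to_skip"
    are assumed to count as passing; anything else is flagged.
    """
    flags = []
    if not accuracy.startswith("pass"):
        flags.append("accuracy")
    if speedup < MIN_SPEEDUP:
        flags.append("speedup")
    if compile_sec > MAX_COMPILE_SEC:
        flags.append("compilation latency")
    if compression_ratio < MIN_COMPRESSION:
        flags.append("compression ratio")
    return flags


# xcit_large_24_p8_224, amp, values taken from the tables in this comment.
xcit_inductor = warning_flags(2.127, 138.9123, 0.8225)
xcit_no_cudagraphs = warning_flags(1.6285, 88.2145, 1.0063)

print(xcit_inductor)       # flagged for compile time and memory
print(xcit_no_cudagraphs)  # no flags
```

Applied to the `inductor` column, the helper flags the same two categories under which the model appears in the warning tables; the `inductor_no_cudagraphs` column passes all four checks.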

timm_models suite with amp precision

Performance speedup
~~~
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| name                 | bs  | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| tnt_s_patch16_224    | 128 | 3.0267   | 2.9763                 | 3.3506                | 3.2962                              |
| xcit_large_24_p8_224 | 5   | 2.127    | 1.6285                 | 2.2127                | 1.6412                              |
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
~~~

Accuracy
~~~
+----------------------+----+----------+------------------------+-----------------------+-------------------------------------+
| name                 | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+----------------------+----+----------+------------------------+-----------------------+-------------------------------------+
| tnt_s_patch16_224    | 8  | pass     | pass                   | pass                  | pass                                |
| xcit_large_24_p8_224 | 8  | pass     | pass                   | pass                  | pass                                |
+----------------------+----+----------+------------------------+-----------------------+-------------------------------------+
~~~

Compilation latency (sec)
~~~
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| name                 | bs  | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| xcit_large_24_p8_224 | 5   | 138.9123 | 88.2145                | 911.3296              | 100.6539                            |
| tnt_s_patch16_224    | 128 | 73.4942  | 50.505                 | 584.9524              | 62.9705                             |
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
~~~

Peak Memory Compression Ratio
~~~
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| name                 | bs  | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| tnt_s_patch16_224    | 128 | 0.9834   | 1.0597                 | 0.986                 | 1.0597                              |
| xcit_large_24_p8_224 | 5   | 0.8225   | 1.0063                 | 0.826                 | 1.0104                              |
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
~~~

Absolute latency (ms)
~~~
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| name                 | bs  | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| tnt_s_patch16_224    | 128 | 106.616  | 108.5939               | 96.367                | 98.0578                             |
| xcit_large_24_p8_224 | 5   | 60.855   | 79.6217                | 58.0096               | 79.91                               |
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
~~~
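For readers checking the summary tables against the per-model numbers, the sketch below recomputes the "Geometric mean speedup" and "Mean compilation time" rows from the two timm models in this comment. The dictionary layout and the `geomean` helper are illustrative choices, not the dashboard's own code; the input values are copied from the per-model tables above.

```python
import math

# Per-model values from this comment, ordered as
# [tnt_s_patch16_224, xcit_large_24_p8_224].
speedup = {
    "inductor": [3.0267, 2.127],
    "inductor_no_cudagraphs": [2.9763, 1.6285],
    "inductor_max_autotune": [3.3506, 2.2127],
    "inductor_max_autotune_no_cudagraphs": [3.2962, 1.6412],
}
compile_sec = {
    "inductor": [73.4942, 138.9123],
    "inductor_no_cudagraphs": [50.505, 88.2145],
    "inductor_max_autotune": [584.9524, 911.3296],
    "inductor_max_autotune_no_cudagraphs": [62.9705, 100.6539],
}


def geomean(values):
    """Geometric mean: n-th root of the product of n positive values."""
    return math.prod(values) ** (1.0 / len(values))


def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


for compiler in speedup:
    print(
        f"{compiler}: "
        f"{geomean(speedup[compiler]):.2f}x speedup, "
        f"{mean(compile_sec[compiler]):.2f} s compile"
    )
```

With these inputs the rounded results agree with the summary tables of this comment (for example 2.54x and 106.20 s for `inductor`), which suggests the summary rows are a geometric mean of per-model speedups and an arithmetic mean of per-model compile times.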

Performance graphs

/data/home/williamwen/cluster/oneoff_cron_logs/day_101_11_04_23_performance_amp_691/timm_models_amp.png : ![](https://i.imgur.com/ppcLqkA.png)

Build Summary

### Run name ###
day_101_11_04_23_performance_amp_691

### Commit hashes ###
pytorch commit: f55e72c0f6bd6da016aaa51de379e6ba6d7891cc
pytorch commit date: 2023-04-07 17:30:27+00:00
torchbench commit: 735f1927996c8d9ab81f0b0c05dd1ebdb26a6250
torchbench commit date: 2023-04-05 09:43:21-07:00

### TorchDynamo config flags ###

### Torch version ###
torch: 2.1.0a0+gitf55e72c

### Environment variables ###
TORCH_CUDA_ARCH_LIST = 8.0
CUDA_HOME = /usr/local/cuda-11.6
USE_LLVM = /usr/lib/llvm-10

### GPU details ###
CUDNN VERSION: 8401
Number CUDA Devices: 2
Device Name: NVIDIA A100-SXM4-40GB
Device Memory [GB]: 42.481549312

williamwen42 commented 1 year ago

Performance Dashboard for amp precision (inductor max-autotune comparison on timm models, small, ran locally)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+-------------------------------------+-------------+
|              Compiler               | timm_models |
+-------------------------------------+-------------+
|              inductor               |  100%, 2/2  |
|       inductor_no_cudagraphs        |  100%, 2/2  |
|        inductor_max_autotune        |  100%, 2/2  |
| inductor_max_autotune_no_cudagraphs |  100%, 2/2  |
+-------------------------------------+-------------+

Geometric mean speedup

+-------------------------------------+-------------+
|              Compiler               | timm_models |
+-------------------------------------+-------------+
|              inductor               |    2.31x    |
|       inductor_no_cudagraphs        |    2.01x    |
|        inductor_max_autotune        |    3.04x    |
| inductor_max_autotune_no_cudagraphs |    2.39x    |
+-------------------------------------+-------------+

Mean compilation time (seconds)

+-------------------------------------+-------------+
|              Compiler               | timm_models |
+-------------------------------------+-------------+
|              inductor               |   108.16    |
|       inductor_no_cudagraphs        |    68.81    |
|        inductor_max_autotune        |   890.39    |
| inductor_max_autotune_no_cudagraphs |    83.96    |
+-------------------------------------+-------------+

Peak memory footprint compression ratio (higher is better)

+-------------------------------------+-------------+
|              Compiler               | timm_models |
+-------------------------------------+-------------+
|              inductor               |    0.90x    |
|       inductor_no_cudagraphs        |    1.03x    |
|        inductor_max_autotune        |    0.91x    |
| inductor_max_autotune_no_cudagraphs |    1.04x    |
+-------------------------------------+-------------+

Warnings

We flag models where:
- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Compilation latency (sec) warnings
~~~
+-------------+----------------------+----------+------------------------+
| suite       | name                 | inductor | inductor_no_cudagraphs |
+-------------+----------------------+----------+------------------------+
| timm_models | xcit_large_24_p8_224 | 143.0577 | 89.0471                |
+-------------+----------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+----------------------+----------+------------------------+
| suite       | name                 | inductor | inductor_no_cudagraphs |
+-------------+----------------------+----------+------------------------+
| timm_models | xcit_large_24_p8_224 | 0.8223   | 1.0062                 |
+-------------+----------------------+----------+------------------------+
~~~

timm_models suite with amp precision

Performance speedup
~~~
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| name                 | bs  | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| tnt_s_patch16_224    | 128 | 2.4789   | 2.4537                 | 3.5366                | 3.5032                              |
| xcit_large_24_p8_224 | 5   | 2.1495   | 1.6444                 | 2.6118                | 1.6356                              |
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
~~~

Accuracy
~~~
+----------------------+----+----------+------------------------+-----------------------+-------------------------------------+
| name                 | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+----------------------+----+----------+------------------------+-----------------------+-------------------------------------+
| tnt_s_patch16_224    | 8  | pass     | pass                   | pass                  | pass                                |
| xcit_large_24_p8_224 | 8  | pass     | pass                   | pass                  | pass                                |
+----------------------+----+----------+------------------------+-----------------------+-------------------------------------+
~~~

Compilation latency (sec)
~~~
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| name                 | bs  | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| xcit_large_24_p8_224 | 5   | 143.0577 | 89.0471                | 1044.1055             | 103.8899                            |
| tnt_s_patch16_224    | 128 | 73.2719  | 48.5673                | 736.6731              | 64.0348                             |
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
~~~

Peak Memory Compression Ratio
~~~
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| name                 | bs  | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| tnt_s_patch16_224    | 128 | 0.9834   | 1.0597                 | 0.986                 | 1.0597                              |
| xcit_large_24_p8_224 | 5   | 0.8223   | 1.0062                 | 0.8257                | 1.0118                              |
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
~~~

Absolute latency (ms)
~~~
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| name                 | bs  | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| tnt_s_patch16_224    | 128 | 146.9206 | 148.3703               | 103.1044              | 104.003                             |
| xcit_large_24_p8_224 | 5   | 61.2006  | 79.4134                | 58.8175               | 78.9549                             |
+----------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
~~~
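The summary says speedup is normalized against native PyTorch. A rough way to sanity-check the per-model numbers in this comment is to back out the eager baseline each column implies. The sketch below assumes that speedup = eager latency / compiled latency and that the "Absolute latency (ms)" table reports the compiled latency; neither is stated explicitly in the dashboard, so treat the result as an estimate. The name `implied_eager_ms` is hypothetical.

```python
# (speedup, compiled latency in ms) per model and compiler,
# copied from the locally run amp tables in this comment.
results = {
    "tnt_s_patch16_224": {
        "inductor": (2.4789, 146.9206),
        "inductor_max_autotune": (3.5366, 103.1044),
    },
    "xcit_large_24_p8_224": {
        "inductor": (2.1495, 61.2006),
        "inductor_max_autotune": (2.6118, 58.8175),
    },
}

# Assumption: speedup = eager_ms / compiled_ms, so eager_ms = speedup * compiled_ms.
implied_eager_ms = {
    model: {
        compiler: speedup * compiled_ms
        for compiler, (speedup, compiled_ms) in per_compiler.items()
    }
    for model, per_compiler in results.items()
}

for model, per_compiler in implied_eager_ms.items():
    for compiler, eager_ms in per_compiler.items():
        print(f"{model} / {compiler}: implied eager latency {eager_ms:.1f} ms")
```

Under these assumptions the two columns imply nearly the same eager baseline for `tnt_s_patch16_224` (about 364 ms), but noticeably different ones for `xcit_large_24_p8_224` (roughly 132 ms versus 154 ms). If the assumptions hold, part of the 2.61x `inductor_max_autotune` figure for that model may reflect variation in the baseline measurement rather than faster compiled code, since the compiled latency itself only moves from 61.2 ms to 58.8 ms.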

Performance graphs

/data/home/williamwen/cluster/oneoff_cron_logs/day_103_13_04_23_performance_amp_325/timm_models_amp.png : ![](https://i.imgur.com/09FVtIQ.png)

Build Summary

### Run name ###
day_103_13_04_23_performance_amp_325

### Commit hashes ###
pytorch commit: 75f55ca63bd5623352c8eda8e31ff76ee5c960a7
pytorch commit date: 2023-04-13 00:45:48+00:00
torchbench commit: cd89d490ecbcca7d8ca50324522b31a1a198c753
torchbench commit date: 2023-04-13 11:05:33-07:00

### TorchDynamo config flags ###

### Torch version ###
torch: 2.1.0a0+git75f55ca

### Environment variables ###
TORCH_CUDA_ARCH_LIST = 8.0
CUDA_HOME = /usr/local/cuda-11.6
USE_LLVM = /usr/lib/llvm-10

### GPU details ###
CUDNN VERSION: 8401
Number CUDA Devices: 2
Device Name: NVIDIA A100-SXM4-40GB
Device Memory [GB]: 42.481549312

williamwen42 commented 1 year ago

Performance Dashboard for amp precision (inductor max-autotune comparison on all suites, with warm start)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+-------------------------------------+------------+-------------+-------------+
|              Compiler               | torchbench | huggingface | timm_models |
+-------------------------------------+------------+-------------+-------------+
|              inductor               | 85%, 51/60 | 91%, 41/45  | 100%, 60/60 |
|       inductor_no_cudagraphs        | 87%, 52/60 | 96%, 43/45  | 100%, 60/60 |
|        inductor_max_autotune        | 78%, 47/60 | 91%, 41/45  | 98%, 59/60  |
| inductor_max_autotune_no_cudagraphs | 82%, 49/60 | 96%, 43/45  | 100%, 60/60 |
+-------------------------------------+------------+-------------+-------------+

Geometric mean speedup

+-------------------------------------+------------+-------------+-------------+
|              Compiler               | torchbench | huggingface | timm_models |
+-------------------------------------+------------+-------------+-------------+
|              inductor               |   1.61x    |    1.60x    |    1.40x    |
|       inductor_no_cudagraphs        |   1.29x    |    1.51x    |    1.39x    |
|        inductor_max_autotune        |   1.61x    |    1.63x    |    1.44x    |
| inductor_max_autotune_no_cudagraphs |   1.35x    |    1.58x    |    1.42x    |
+-------------------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+-------------------------------------+------------+-------------+-------------+
|              Compiler               | torchbench | huggingface | timm_models |
+-------------------------------------+------------+-------------+-------------+
|              inductor               |   56.68    |    59.65    |    79.10    |
|       inductor_no_cudagraphs        |   30.39    |    42.67    |    46.98    |
|        inductor_max_autotune        |   257.92   |   186.71    |   381.29    |
| inductor_max_autotune_no_cudagraphs |   37.42    |    56.47    |    56.80    |
+-------------------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+-------------------------------------+------------+-------------+-------------+
|              Compiler               | torchbench | huggingface | timm_models |
+-------------------------------------+------------+-------------+-------------+
|              inductor               |   0.79x    |    0.91x    |    0.91x    |
|       inductor_no_cudagraphs        |   1.07x    |    1.06x    |    1.05x    |
|        inductor_max_autotune        |   0.76x    |    0.89x    |    0.91x    |
| inductor_max_autotune_no_cudagraphs |   1.07x    |    1.06x    |    1.05x    |
+-------------------------------------+------------+-------------+-------------+

Warnings

We flag models where:
- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings
~~~
+-------------+-------------------------------+-----------------+------------------------+
| suite       | name                          | inductor        | inductor_no_cudagraphs |
+-------------+-------------------------------+-----------------+------------------------+
| torchbench  | hf_Longformer                 | fail_to_run     | fail_to_run            |
| torchbench  | moco                          | fail_to_run     | fail_to_run            |
| torchbench  | Background_Matting            | eager_variation | eager_variation        |
| torchbench  | gat                           | 0.0000          | 0.0000                 |
| torchbench  | gcn                           | 0.0000          | 0.0000                 |
| torchbench  | llama                         | 0.0000          | 0.0000                 |
| torchbench  | sage                          | 0.0000          | 0.0000                 |
| torchbench  | tacotron2                     | 0.0000          | 0.0000                 |
| torchbench  | torchrec_dlrm                 | 0.0000          | 0.0000                 |
| huggingface | DebertaV2ForQuestionAnswering | fail_to_run     | pass                   |
| huggingface | AlbertForQuestionAnswering    | fail_accuracy   | fail_accuracy          |
+-------------+-------------------------------+-----------------+------------------------+
~~~

Performance speedup warnings
~~~
+-------------+-------------------------------+----------+------------------------+
| suite       | name                          | inductor | inductor_no_cudagraphs |
+-------------+-------------------------------+----------+------------------------+
| torchbench  | lennard_jones                 | 1.4551   | 0.8351                 |
| torchbench  | dcgan                         | 1.3676   | 0.84                   |
| torchbench  | soft_actor_critic             | 1.0041   | 0.8306                 |
| torchbench  | timm_vovnet                   | 0.9088   | 0.9047                 |
| torchbench  | nvidia_deeprecommender        | 0.8719   | 1.0185                 |
| torchbench  | timm_vision_transformer_large | 0.0      | 1.084                  |
| torchbench  | gat                           | 0.0      | 0.0                    |
| torchbench  | gcn                           | 0.0      | 0.0                    |
| torchbench  | hf_Longformer                 | 0.0      | 0.0                    |
| torchbench  | moco                          | 0.0      | 0.0                    |
| torchbench  | sage                          | 0.0      | 0.0                    |
| torchbench  | tacotron2                     | 0.0      | 0.0                    |
| torchbench  | torchrec_dlrm                 | 0.0      | 0.0                    |
| huggingface | DebertaForMaskedLM            | 1.0944   | 0.9127                 |
| huggingface | DebertaV2ForMaskedLM          | 1.0122   | 0.7354                 |
| huggingface | DebertaV2ForQuestionAnswering | 0.9377   | 0.7692                 |
| huggingface | BlenderbotForCausalLM         | 0.0      | 1.1121                 |
| huggingface | AllenaiLongformerBase         | 0.0      | 0.0                    |
+-------------+-------------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+-------------+--------------------------------+----------+------------------------+
| suite       | name                           | inductor | inductor_no_cudagraphs |
+-------------+--------------------------------+----------+------------------------+
| torchbench  | hf_T5_large                    | 173.0714 | 132.2447               |
| torchbench  | phlippe_densenet               | 160.9749 | 30.0922                |
| torchbench  | hf_BigBird                     | 149.3215 | 103.7268               |
| torchbench  | densenet121                    | 133.1323 | 73.2249                |
| torchbench  | mobilenet_v2                   | 122.4436 | 30.0659                |
| huggingface | MobileBertForMaskedLM          | 144.5755 | 102.6639               |
| huggingface | MobileBertForQuestionAnswering | 142.8046 | 101.7791               |
| huggingface | DebertaV2ForMaskedLM           | 140.6903 | 57.1807                |
| huggingface | DebertaV2ForQuestionAnswering  | 140.1457 | 61.8347                |
| huggingface | M2M100ForConditionalGeneration | 137.1882 | 71.3113                |
| huggingface | MT5ForConditionalGeneration    | 133.1446 | 48.8807                |
| huggingface | XGLMForCausalLM                | 121.334  | 58.2817                |
| timm_models | rexnet_100                     | 224.9662 | 43.8651                |
| timm_models | hrnet_w18                      | 192.1312 | 151.1129               |
| timm_models | pnasnet5large                  | 158.5366 | 110.0134               |
| timm_models | ghostnet_100                   | 153.8495 | 53.7871                |
| timm_models | res2net101_26w_4s              | 150.918  | 87.7778                |
| timm_models | twins_pcpvt_base               | 147.4936 | 70.2721                |
| timm_models | adv_inception_v3               | 145.5702 | 52.6549                |
| timm_models | fbnetv3_b                      | 132.0557 | 60.2455                |
| timm_models | xcit_large_24_p8_224           | 126.2305 | 86.2627                |
| timm_models | resnest101e                    | 124.9575 | 79.8068                |
| timm_models | tinynet_a                      | 120.3408 | 43.3221                |
+-------------+--------------------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite       | name                                    | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench  | hf_GPT2_large                           | 0.8904   | 1.1718                 |
| torchbench  | yolov3                                  | 0.8748   | 1.0642                 |
| torchbench  | timm_efficientnet                       | 0.8701   | 1.0972                 |
| torchbench  | resnet152                               | 0.8697   | 1.0021                 |
| torchbench  | speech_transformer                      | 0.8681   | 1.0968                 |
| torchbench  | shufflenet_v2_x1_0                      | 0.8627   | 1.0886                 |
| torchbench  | timm_resnest                            | 0.8616   | 1.0911                 |
| torchbench  | Super_SloMo                             | 0.8614   | 1.2225                 |
| torchbench  | timm_vision_transformer                 | 0.8593   | 0.9978                 |
| torchbench  | timm_regnet                             | 0.8513   | 1.0004                 |
| torchbench  | Background_Matting                      | 0.8485   | 1.0482                 |
| torchbench  | hf_DistilBert                           | 0.8476   | 1.0783                 |
| torchbench  | hf_Bert                                 | 0.8411   | 1.0767                 |
| torchbench  | resnet50                                | 0.8353   | 1.0021                 |
| torchbench  | hf_Bert_large                           | 0.8302   | 1.0916                 |
| torchbench  | hf_T5_large                             | 0.8201   | 1.1919                 |
| torchbench  | timm_vovnet                             | 0.8185   | 1.0133                 |
| torchbench  | pytorch_unet                            | 0.8134   | 1.0094                 |
| torchbench  | phlippe_densenet                        | 0.8058   | 1.0057                 |
| torchbench  | dcgan                                   | 0.7955   | 0.9998                 |
| torchbench  | hf_Bart                                 | 0.793    | 1.0113                 |
| torchbench  | squeezenet1_1                           | 0.7867   | 1.0815                 |
| torchbench  | mobilenet_v3_large                      | 0.7849   | 1.0                    |
| torchbench  | demucs                                  | 0.7826   | 0.9998                 |
| torchbench  | pytorch_stargan                         | 0.7715   | 1.0716                 |
| torchbench  | alexnet                                 | 0.7396   | 1.0013                 |
| torchbench  | vgg16                                   | 0.7227   | 0.9886                 |
| torchbench  | mnasnet1_0                              | 0.7144   | 1.0027                 |
| torchbench  | densenet121                             | 0.7071   | 0.9989                 |
| torchbench  | pytorch_struct                          | 0.697    | 1.0                    |
| torchbench  | hf_BigBird                              | 0.6949   | 1.1929                 |
| torchbench  | nvidia_deeprecommender                  | 0.6857   | 0.9711                 |
| torchbench  | resnext50_32x4d                         | 0.6786   | 1.0                    |
| torchbench  | drq                                     | 0.6429   | 0.9687                 |
| torchbench  | soft_actor_critic                       | 0.6067   | 0.9974                 |
| torchbench  | pytorch_CycleGAN_and_pix2pix            | 0.6065   | 1.0224                 |
| torchbench  | LearningToPaint                         | 0.5925   | 0.9944                 |
| torchbench  | resnet18                                | 0.5891   | 0.9931                 |
| torchbench  | lennard_jones                           | 0.5317   | 1.0001                 |
| torchbench  | hf_Reformer                             | 0.4539   | 1.0027                 |
| torchbench  | functorch_dp_cifar10                    | 0.3991   | 1.0609                 |
| torchbench  | phlippe_resnet                          | 0.3169   | 1.008                  |
| huggingface | ElectraForCausalLM                      | 0.8941   | 0.9739                 |
| huggingface | PegasusForCausalLM                      | 0.893    | 0.9864                 |
| huggingface | DistilBertForMaskedLM                   | 0.8849   | 0.9624                 |
| huggingface | TrOCRForCausalLM                        | 0.8836   | 0.9583                 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.8729   | 0.9803                 |
| huggingface | PegasusForConditionalGeneration         | 0.8689   | 1.0689                 |
| huggingface | MBartForConditionalGeneration           | 0.8574   | 1.0307                 |
| huggingface | BartForConditionalGeneration            | 0.8456   | 1.0139                 |
| huggingface | MegatronBertForCausalLM                 | 0.845    | 1.0961                 |
| huggingface | BlenderbotSmallForCausalLM              | 0.8184   | 0.9119                 |
| huggingface | Speech2Text2ForCausalLM                 | 0.789    | 0.8779                 |
| huggingface | M2M100ForConditionalGeneration          | 0.7651   | 0.9908                 |
| huggingface | MobileBertForMaskedLM                   | 0.752    | 1.016                  |
| huggingface | XGLMForCausalLM                         | 0.7117   | 0.9792                 |
| huggingface | MobileBertForQuestionAnswering          | 0.6569   | 0.8579                 |
| huggingface | DebertaForMaskedLM                      | 0.5646   | 1.0748                 |
| huggingface | DebertaV2ForMaskedLM                    | 0.5187   | 0.9894                 |
| huggingface | DebertaForQuestionAnswering             | 0.4867   | 1.2209                 |
| huggingface | DebertaV2ForQuestionAnswering           | 0.4855   | 1.0041                 |
| timm_models | ghostnet_100                            | 0.8976   | 1.0514                 |
| timm_models | hrnet_w18                               | 0.8918   | 1.0121                 |
| timm_models | sebotnet33ts_256                        | 0.891    | 1.1401                 |
| timm_models | inception_v3                            | 0.8904   | 1.0459                 |
| timm_models | adv_inception_v3                        | 0.8904   | 1.0459                 |
| timm_models | gluon_inception_v3                      | 0.8904   | 1.0459                 |
| timm_models | mobilenetv3_large_100                   | 0.8881   | 1.0046                 |
| timm_models | dpn107                                  | 0.8833   | 0.9977                 |
| timm_models | gluon_xception65                        | 0.8832   | 0.9998                 |
| timm_models | spnasnet_100                            | 0.8786   | 1.0063                 |
| timm_models | selecsls42b                             | 0.8785   | 1.0139                 |
| timm_models | poolformer_m36                          | 0.8768   | 1.1916                 |
| timm_models | eca_botnext26ts_256                     | 0.8738   | 1.0257                 |
| timm_models | res2net50_14w_8s                        | 0.8712   | 0.9828                 |
| timm_models | res2net101_26w_4s                       | 0.871    | 0.9822                 |
| timm_models | mixnet_l                                | 0.8687   | 1.0134                 |
| timm_models | mnasnet_100                             | 0.8683   | 1.0074                 |
| timm_models | res2next50                              | 0.866    | 0.9759                 |
| timm_models | cait_m36_384                            | 0.8636   | 1.0068                 |
| timm_models | fbnetc_100                              | 0.8596   | 1.0104                 |
| timm_models | pit_b_224                               | 0.8578   | 1.0382                 |
| timm_models | convnext_base                           | 0.8505   | 1.0373                 |
| timm_models | gernet_l                                | 0.8499   | 1.0005                 |
| timm_models | swsl_resnext101_32x16d                  | 0.8477   | 1.0007                 |
| timm_models | coat_lite_mini                          | 0.8402   | 1.0437                 |
| timm_models | lcnet_050                               | 0.8273   | 1.0008                 |
| timm_models | botnet26t_256                           | 0.8239   | 1.0                    |
| timm_models | xcit_large_24_p8_224                    | 0.8228   | 1.0079                 |
| timm_models | regnety_002                             | 0.8165   | 1.0004                 |
| timm_models | repvgg_a2                               | 0.7738   | 1.0131                 |
| timm_models | crossvit_9_240                          | 0.7526   | 1.0019                 |
| timm_models | swin_base_patch4_window7_224            | 0.7214   | 0.9303                 |
| timm_models | jx_nest_base                            | 0.6693   | 0.9905                 |
+-------------+-----------------------------------------+----------+------------------------+
~~~

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+----------+------------------------+-----------------------+-------------------------------------+ | name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs | +-----------------------------------+------+----------+------------------------+-----------------------+-------------------------------------+ | functorch_dp_cifar10 | 64 | 3.7683 | 1.4164 | 3.8468 | 1.4282 | | BERT_pytorch | 16 | 3.2988 | 2.2199 | 3.3208 | 2.2676 | | pytorch_CycleGAN_and_pix2pix | 1 | 2.9018 | 1.8137 | 2.3668 | 1.8208 | | densenet121 | 4 | 2.7444 | 1.0715 | 2.7296 | 1.0748 | | hf_BigBird | 2 | 2.6731 | 1.7162 | 2.6194 | 1.7758 | | hf_T5_large | 2 | 2.4329 | 2.0038 | 2.5267 | 2.1386 | | hf_Albert | 8 | 2.3975 | 2.3005 | 2.3685 | 2.3139 | | dlrm | 1024 | 2.2374 | 1.1677 | 2.0264 | 1.2637 | | squeezenet1_1 | 32 | 2.1124 | 1.3397 | 2.0063 | 1.4167 | | phlippe_densenet | 128 | 2.0806 | 1.029 | 2.087 | 1.0705 | | mobilenet_v3_large | 32 | 2.0468 | 1.211 | 2.0609 | 1.2376 | | pytorch_struct | 200 | 1.9681 | 1.1189 | 2.0867 | 1.4794 | | hf_T5 | 8 | 1.9589 | 1.9749 | 2.0015 | 2.0269 | | hf_Bert | 4 | 1.9253 | 1.7074 | 1.9583 | 1.7481 | | hf_Bart | 4 | 1.8693 | 1.557 | 1.7178 | 1.6633 | | hf_GPT2 | 4 | 1.8582 | 1.9035 | 2.0785 | 2.0666 | | phlippe_resnet | 128 | 1.832 | 1.0113 | 1.8121 | 1.0676 | | hf_GPT2_large | 4 | 1.7265 | 1.7904 | 0.0 | 1.9209 | | resnext50_32x4d | 8 | 1.7133 | 0.9962 | 1.7077 | 0.9997 | | mnasnet1_0 | 32 | 1.7066 | 1.0654 | 1.6934 | 1.114 | | speech_transformer | 32 | 1.6683 | 1.6225 | 1.8907 | 1.8447 | | shufflenet_v2_x1_0 | 128 | 1.6325 | 1.2108 | 1.6143 | 1.2218 | | hf_Bert_large | 4 | 1.6266 | 1.6462 | 1.6564 | 1.6987 | | resnet18 | 16 | 1.5869 | 0.9844 | 1.5787 | 1.0058 | | hf_DistilBert | 8 | 1.5777 | 1.5038 | 1.4896 | 1.5064 | | timm_vision_transformer | 32 | 1.5761 | 1.4238 | 1.7334 | 1.5843 | | timm_resnest | 32 | 1.573 | 1.5269 | 1.5745 | 1.5469 | | 
timm_nfnet | 128 | 1.5573 | 1.5076 | 1.5801 | 1.5156 | | attention_is_all_you_need_pytorch | 256 | 1.5525 | 1.554 | 1.7288 | 1.711 | | fastNLP_Bert | 6 | 1.5451 | 1.5499 | 1.6927 | 1.6834 | | mobilenet_v2 | 96 | 1.5203 | 1.5083 | 1.5202 | 1.5225 | | drq | 1 | 1.4926 | 1.0524 | 1.4722 | 1.1479 | | lennard_jones | 1000 | 1.4551 | 0.8351 | 1.3814 | 1.07 | | timm_efficientnet | 32 | 1.3786 | 1.0638 | 1.3933 | 1.0552 | | dcgan | 32 | 1.3676 | 0.84 | 1.4554 | 0.8373 | | pytorch_unet | 1 | 1.3593 | 1.3532 | 1.3587 | 1.3564 | | LearningToPaint | 96 | 1.3205 | 1.0678 | 1.3599 | 1.1066 | | pytorch_stargan | 16 | 1.281 | 1.2489 | 1.267 | 1.2428 | | Super_SloMo | 6 | 1.2511 | 1.2343 | 1.2587 | 1.2411 | | vgg16 | 64 | 1.2412 | 1.2537 | 1.2509 | 1.2643 | | Background_Matting | 4 | 1.2119 | 1.2059 | 1.2177 | 1.2108 | | resnet152 | 32 | 1.205 | 1.0171 | 1.1848 | 1.037 | | yolov3 | 16 | 1.1977 | 1.1969 | 1.2052 | 1.2084 | | resnet50 | 32 | 1.1807 | 1.0715 | 1.1844 | 1.0776 | | hf_Reformer | 4 | 1.1415 | 1.0689 | 1.1457 | 1.0826 | | alexnet | 128 | 1.089 | 1.1351 | 1.1322 | 1.1834 | | demucs | 4 | 1.039 | 1.0374 | 1.0363 | 1.0385 | | soft_actor_critic | 256 | 1.0041 | 0.8306 | 1.1711 | 0.8459 | | timm_regnet | 32 | 1.001 | 0.9535 | 1.0103 | 0.9605 | | tts_angular | 64 | 0.9571 | 0.9597 | 0.9585 | 0.9524 | | timm_vovnet | 32 | 0.9088 | 0.9047 | 0.9139 | 0.9079 | | nvidia_deeprecommender | 256 | 0.8719 | 1.0185 | 0.9331 | 1.1032 | | timm_vision_transformer_large | 32 | 0.0 | 1.084 | 0.0 | 1.1625 | | gat | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | gcn | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | sage | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | tacotron2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | torchrec_dlrm | 0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+----------+------------------------+-----------------------+-------------------------------------+ ~~~ Accuracy ~~~ 
+-----------------------------------+-----+------------------+------------------------+-----------------------+-------------------------------------+
| name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------------+-----------------------+-------------------------------------+
| hf_GPT2_large | 4 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 4 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 4 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_efficientnet | 4 | pass | pass | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass |
| pytorch_unet | 2 | pass | pass | pass | pass |
| resnet152 | 4 | pass | pass | pass | pass |
| resnet18 | 4 | pass | pass | pass | pass |
| resnet50 | 4 | pass | pass | pass | pass |
| resnext50_32x4d | 4 | pass | pass | pass | pass |
| shufflenet_v2_x1_0 | 4 | pass | pass | pass | pass |
| soft_actor_critic | 256 | pass | pass | pass | pass |
| speech_transformer | 4 | pass | pass | pass | pass |
| timm_nfnet | 4 | pass | pass | pass | pass |
| nvidia_deeprecommender | 4 | pass | pass | pass | pass |
| timm_regnet | 4 | pass | pass | pass | pass |
| timm_resnest | 4 | pass | pass | pass | pass |
| timm_vision_transformer | 4 | pass | pass | pass | pass |
| timm_vovnet | 4 | pass | pass | pass | pass |
| tts_angular | 4 | pass | pass | pass | pass |
| vgg16 | 4 | pass | pass | pass | pass |
| yolov3 | 4 | pass | pass | pass | pass |
| drq | 1 | pass | pass | fail_accuracy | fail_accuracy |
| phlippe_resnet | 4 | pass | pass | fail_accuracy | fail_accuracy |
| squeezenet1_1 | 4 | pass | pass | fail_accuracy | fail_accuracy |
| vision_maskrcnn | 4 | pass | pass | pass | 0.0000 |
| phlippe_densenet | 4 | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | pass |
| mobilenet_v3_large | 4 | pass | pass | pass | pass |
| hf_Albert | 4 | pass | pass | pass | pass |
| BERT_pytorch | 4 | pass | pass | pass | pass |
| LearningToPaint | 4 | pass | pass | pass | pass |
| Super_SloMo | 4 | pass | pass | pass | pass |
| alexnet | 4 | pass | pass | pass | pass |
| attention_is_all_you_need_pytorch | 4 | pass | pass | pass | pass |
| dcgan | 4 | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass |
| mobilenet_v2 | 4 | pass | pass | pass | pass |
| dlrm | 4 | pass | pass | pass | pass |
| fastNLP_Bert | 4 | pass | pass | pass | pass |
| functorch_dp_cifar10 | 4 | pass | pass | pass | pass |
| densenet121 | 4 | pass | pass | pass | pass |
| hf_Bart | 4 | pass | pass | pass | pass |
| hf_Reformer | 4 | pass | pass | pass | pass |
| mnasnet1_0 | 4 | pass | pass | pass | pass |
| lennard_jones | 4 | pass | pass | pass | pass |
| hf_Bert | 4 | pass | pass | pass | pass |
| hf_T5 | 4 | pass | pass | pass | pass |
| hf_T5_base | 4 | pass | pass | pass | pass |
| hf_GPT2 | 2 | pass | pass | pass | pass |
| hf_DistilBert | 4 | pass | pass | pass | pass |
| hf_BigBird | 4 | pass | pass | pass | pass |
| hf_Bert_large | 4 | pass | pass | pass | pass |
| hf_Longformer | 4 | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| moco | 4 | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| Background_Matting | 4 | eager_variation | eager_variation | eager_variation | eager_variation |
| gat | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| gcn | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| llama | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| sage | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| tacotron2 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| torchrec_dlrm | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
+-----------------------------------+-----+------------------+------------------------+-----------------------+-------------------------------------+
~~~
Compilation latency (sec)
~~~
+-----------------------------------+------+----------+------------------------+-----------------------+-------------------------------------+
| name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+-----------------------------------+------+----------+------------------------+-----------------------+-------------------------------------+
| hf_T5_large | 2 | 173.0714 | 132.2447 | 416.242 | 177.2517 |
| phlippe_densenet | 128 | 160.9749 | 30.0922 | 535.7692 | 32.8417 |
| hf_BigBird | 2 | 149.3215 | 103.7268 | 498.612 | 130.1443 |
| densenet121 | 4 | 133.1323 | 73.2249 | 733.9116 | 76.8014 |
| mobilenet_v2 | 96 | 122.4436 | 30.0659 | 339.5142 | 31.3981 |
| timm_efficientnet | 32 | 111.0412 | 36.4138 | 266.193 | 38.9304 |
| mnasnet1_0 | 32 | 110.6588 | 29.0324 | 378.7044 | 29.7176 |
| mobilenet_v3_large | 32 | 109.9406 | 31.6035 | 344.9076 | 34.1691 |
| hf_GPT2_large | 4 | 108.8581 | 77.1005 | nan | 96.8448 |
| yolov3 | 16 | 92.0432 | 43.0049 | 284.2655 | 45.0176 |
| speech_transformer | 32 | 80.649 | 39.1877 | 807.794 | 52.1384 |
| shufflenet_v2_x1_0 | 128 | 79.0767 | 32.8056 | 225.9032 | 32.968 |
| attention_is_all_you_need_pytorch | 256 | 75.6929 | 34.7959 | 634.072 | 46.5701 |
| BERT_pytorch | 16 | 71.7002 | 33.6489 | 320.9033 | 42.5153 |
| resnet152 | 32 | 71.3272 | 69.3953 | 177.8991 | 73.8442 |
| Background_Matting | 4 | 70.9447 | 30.4912 | 132.6704 | 33.0487 |
| timm_nfnet | 128 | 67.1746 | 37.2659 | 271.9672 | 39.586 |
| timm_regnet | 32 | 61.764 | 38.7132 | 296.823 | 41.0341 |
| hf_Bert_large | 4 | 61.6953 | 55.8367 | 289.4748 | 72.8998 |
| timm_resnest | 32 | 60.5023 | 19.6536 | 114.3721 | 20.7796 |
| functorch_dp_cifar10 | 64 | 56.865 | 15.62 | 134.2888 | 16.4183 |
| hf_Bart | 4 | 50.0551 | 39.499 | 157.2438 | 50.1397 |
| fastNLP_Bert | 6 | 49.5644 | 32.8088 | 424.9577 | 42.0206 |
| hf_Reformer | 4 | 49.2329 | 15.8472 | 187.3093 | 17.8342 |
| hf_T5 | 8 | 49.1459 | 33.6421 | 247.7637 | 46.7628 |
| timm_vision_transformer | 32 | 48.9995 | 24.0413 | 385.9061 | 31.4757 |
| pytorch_stargan | 16 | 47.5214 | 14.5602 | 33.711 | 14.0179 |
| pytorch_unet | 1 | 45.6827 | 16.5621 | 138.7921 | 17.9691 |
| LearningToPaint | 96 | 43.5813 | 16.6341 | 208.533 | 17.3278 |
| Super_SloMo | 6 | 43.1872 | 29.9183 | 130.8438 | 31.9602 |
| resnext50_32x4d | 8 | 42.9535 | 28.0696 | 248.8633 | 28.8052 |
| hf_GPT2 | 4 | 41.8947 | 28.9462 | 175.7557 | 35.6423 |
| hf_Albert | 8 | 40.9266 | 23.7433 | 369.308 | 33.0731 |
| hf_Bert | 4 | 36.9631 | 29.8651 | 87.4618 | 39.0034 |
| pytorch_CycleGAN_and_pix2pix | 1 | 36.7504 | 16.6227 | 94.3367 | 16.3521 |
| timm_vovnet | 32 | 35.6387 | 26.284 | 252.3846 | 27.6386 |
| resnet18 | 16 | 30.4133 | 15.5566 | 174.7456 | 15.7017 |
| resnet50 | 32 | 28.5418 | 28.5283 | 29.5055 | 29.8457 |
| demucs | 4 | 27.9362 | 12.1266 | 66.4479 | 11.9827 |
| hf_DistilBert | 8 | 27.6289 | 17.9827 | 46.8685 | 22.3509 |
| phlippe_resnet | 128 | 24.4142 | 13.4551 | 158.3286 | 15.0748 |
| squeezenet1_1 | 32 | 21.7804 | 11.6383 | 145.0019 | 11.9146 |
| pytorch_struct | 200 | 20.3197 | 6.8722 | 353.2137 | 8.9324 |
| alexnet | 128 | 15.5657 | 8.5006 | 169.6181 | 8.8066 |
| vgg16 | 64 | 12.7831 | 8.3682 | 165.2221 | 9.5088 |
| drq | 1 | 10.9657 | 6.7722 | 263.0723 | 8.6234 |
| nvidia_deeprecommender | 256 | 10.344 | 6.318 | 231.7501 | 7.0803 |
| soft_actor_critic | 256 | 9.3882 | 6.4124 | 166.2186 | 6.0958 |
| dlrm | 1024 | 8.8048 | 6.2748 | 322.0582 | 7.5683 |
| dcgan | 32 | 8.4643 | 6.4032 | 34.7612 | 6.6723 |
| lennard_jones | 1000 | 7.0308 | 6.2225 | 141.9293 | 5.9564 |
| tts_angular | 64 | 6.0732 | 5.0228 | 5.2435 | 5.007 |
| timm_vision_transformer_large | 32 | nan | 73.4664 | nan | 106.1873 |
| gat | 0 | nan | nan | nan | nan |
| gcn | 0 | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan |
| sage | 0 | nan | nan | nan | nan |
| tacotron2 | 0 | nan | nan | nan | nan |
| torchrec_dlrm | 0 | nan | nan | nan | nan |
+-----------------------------------+------+----------+------------------------+-----------------------+-------------------------------------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------+------+----------+------------------------+-----------------------+-------------------------------------+
| name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+-----------------------------------+------+----------+------------------------+-----------------------+-------------------------------------+
| hf_Albert | 8 | 1.0378 | 1.3253 | 0.9955 | 1.3253 |
| hf_T5 | 8 | 1.0163 | 1.2478 | 0.9988 | 1.2478 |
| mobilenet_v2 | 96 | 1.0102 | 1.1747 | 1.0102 | 1.1747 |
| tts_angular | 64 | 0.9904 | 1.0 | 0.9896 | 1.0 |
| timm_nfnet | 128 | 0.9689 | 1.1079 | 0.9619 | 1.1066 |
| attention_is_all_you_need_pytorch | 256 | 0.9689 | 1.1774 | 1.0017 | 1.1736 |
| fastNLP_Bert | 6 | 0.9575 | 1.2381 | 0.9595 | 1.2381 |
| dlrm | 1024 | 0.9525 | 1.0009 | 0.9466 | 1.0009 |
| BERT_pytorch | 16 | 0.9428 | 1.3212 | 0.9428 | 1.3212 |
| hf_GPT2 | 4 | 0.9321 | 1.1566 | 0.932 | 1.1772 |
| hf_GPT2_large | 4 | 0.8904 | 1.1718 | nan | 1.1777 |
| yolov3 | 16 | 0.8748 | 1.0642 | 0.8723 | 1.0736 |
| timm_efficientnet | 32 | 0.8701 | 1.0972 | 0.9259 | 1.1033 |
| resnet152 | 32 | 0.8697 | 1.0021 | 0.8286 | 1.0 |
| speech_transformer | 32 | 0.8681 | 1.0968 | 0.8618 | 1.0967 |
| shufflenet_v2_x1_0 | 128 | 0.8627 | 1.0886 | 0.8631 | 1.1038 |
| timm_resnest | 32 | 0.8616 | 1.0911 | 0.8431 | 1.1309 |
| Super_SloMo | 6 | 0.8614 | 1.2225 | 0.8606 | 1.2225 |
| timm_vision_transformer | 32 | 0.8593 | 0.9978 | 0.8357 | 0.9978 |
| timm_regnet | 32 | 0.8513 | 1.0004 | 0.8485 | 1.0005 |
| Background_Matting | 4 | 0.8485 | 1.0482 | 0.8333 | 1.0482 |
| hf_DistilBert | 8 | 0.8476 | 1.0783 | 0.8456 | 1.0783 |
| hf_Bert | 4 | 0.8411 | 1.0767 | 0.8411 | 1.0767 |
| resnet50 | 32 | 0.8353 | 1.0021 | 0.8368 | 1.0001 |
| hf_Bert_large | 4 | 0.8302 | 1.0916 | 0.8302 | 1.0916 |
| hf_T5_large | 2 | 0.8201 | 1.1919 | 0.8201 | 1.1919 |
| timm_vovnet | 32 | 0.8185 | 1.0133 | 0.7426 | 1.0135 |
| pytorch_unet | 1 | 0.8134 | 1.0094 | 0.7708 | 1.0094 |
| phlippe_densenet | 128 | 0.8058 | 1.0057 | 0.7988 | 1.0056 |
| dcgan | 32 | 0.7955 | 0.9998 | 0.1811 | 0.9998 |
| hf_Bart | 4 | 0.793 | 1.0113 | 0.7623 | 1.0102 |
| squeezenet1_1 | 32 | 0.7867 | 1.0815 | 0.763 | 1.0814 |
| mobilenet_v3_large | 32 | 0.7849 | 1.0 | 0.698 | 1.0 |
| demucs | 4 | 0.7826 | 0.9998 | 0.7662 | 0.9998 |
| pytorch_stargan | 16 | 0.7715 | 1.0716 | 0.7743 | 1.0716 |
| alexnet | 128 | 0.7396 | 1.0013 | 0.7396 | 1.0397 |
| vgg16 | 64 | 0.7227 | 0.9886 | 0.7228 | 1.0332 |
| mnasnet1_0 | 32 | 0.7144 | 1.0027 | 0.7485 | 1.0034 |
| densenet121 | 4 | 0.7071 | 0.9989 | 0.7107 | 1.0012 |
| pytorch_struct | 200 | 0.697 | 1.0 | 0.9395 | 1.0001 |
| hf_BigBird | 2 | 0.6949 | 1.1929 | 0.694 | 1.1929 |
| nvidia_deeprecommender | 256 | 0.6857 | 0.9711 | 0.6857 | 1.0001 |
| resnext50_32x4d | 8 | 0.6786 | 1.0 | 0.6565 | 1.0016 |
| drq | 1 | 0.6429 | 0.9687 | 0.1818 | 1.035 |
| soft_actor_critic | 256 | 0.6067 | 0.9974 | 0.1108 | 0.9974 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.6065 | 1.0224 | 0.5458 | 1.0224 |
| LearningToPaint | 96 | 0.5925 | 0.9944 | 0.6015 | 0.9944 |
| resnet18 | 16 | 0.5891 | 0.9931 | 0.5364 | 0.9931 |
| lennard_jones | 1000 | 0.5317 | 1.0001 | 0.0648 | 1.0587 |
| hf_Reformer | 4 | 0.4539 | 1.0027 | 0.4622 | 1.0027 |
| functorch_dp_cifar10 | 64 | 0.3991 | 1.0609 | 0.4626 | 1.0609 |
| phlippe_resnet | 128 | 0.3169 | 1.008 | 0.3272 | 1.008 |
| timm_vision_transformer_large | 32 | nan | 0.9723 | nan | 0.9791 |
| gat | 0 | nan | nan | nan | nan |
| gcn | 0 | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan |
| sage | 0 | nan | nan | nan | nan |
| tacotron2 | 0 | nan | nan | nan | nan |
| torchrec_dlrm | 0 | nan | nan | nan | nan |
+-----------------------------------+------+----------+------------------------+-----------------------+-------------------------------------+
~~~
Absolute latency (ms)
~~~
+-----------------------------------+------+----------+------------------------+-----------------------+-------------------------------------+
| name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+-----------------------------------+------+----------+------------------------+-----------------------+-------------------------------------+
| hf_GPT2_large | 4 | 120.9949 | 116.6407 | nan | 108.8995 |
| Background_Matting | 4 | 103.5854 | 104.166 | 103.2133 | 103.9099 |
| hf_T5_large | 2 | 94.0877 | 112.042 | 90.1229 | 105.0678 |
| hf_T5 | 8 | 91.4678 | 90.5267 | 89.5872 | 88.9915 |
| timm_nfnet | 128 | 75.3214 | 78.3379 | 74.6814 | 77.7218 |
| hf_BigBird | 2 | 74.517 | 113.6677 | 73.379 | 112.2197 |
| hf_Reformer | 4 | 71.5373 | 75.7153 | 70.6722 | 74.8327 |
| Super_SloMo | 6 | 63.3501 | 64.2984 | 63.1509 | 64.0087 |
| yolov3 | 16 | 57.2038 | 57.2744 | 56.834 | 56.5797 |
| timm_regnet | 32 | 55.6464 | 58.7053 | 55.2539 | 57.8637 |
| vgg16 | 64 | 53.3929 | 52.7979 | 52.9403 | 52.317 |
| resnet152 | 32 | 53.2036 | 63.0864 | 53.1877 | 62.3167 |
| demucs | 4 | 51.5908 | 51.6758 | 51.8294 | 51.7754 |
| hf_Bert_large | 4 | 50.6631 | 50.0943 | 49.6293 | 48.9384 |
| speech_transformer | 32 | 35.982 | 36.0 | 31.6151 | 31.3294 |
| attention_is_all_you_need_pytorch | 256 | 35.4362 | 35.6083 | 31.7431 | 31.9067 |
| hf_Bart | 4 | 34.0784 | 35.0328 | 31.4883 | 32.9934 |
| fastNLP_Bert | 6 | 33.4188 | 33.9433 | 31.2706 | 31.3598 |
| mobilenet_v2 | 96 | 30.8429 | 31.165 | 30.8653 | 30.844 |
| pytorch_unet | 1 | 29.2184 | 29.4466 | 29.2677 | 29.3277 |
| hf_Albert | 8 | 29.0675 | 29.6172 | 28.819 | 29.442 |
| timm_vovnet | 32 | 26.9919 | 27.541 | 27.0308 | 27.3558 |
| hf_GPT2 | 4 | 25.9285 | 25.7391 | 23.4441 | 23.5272 |
| timm_efficientnet | 32 | 23.2412 | 30.5885 | 23.1287 | 30.8732 |
| resnet50 | 32 | 22.1006 | 25.0252 | 22.1314 | 24.5177 |
| hf_Bert | 4 | 21.8011 | 23.871 | 21.3681 | 23.5383 |
| hf_DistilBert | 8 | 21.2775 | 20.8865 | 21.0167 | 20.8359 |
| densenet121 | 4 | 19.8637 | 51.8501 | 20.3152 | 51.2219 |
| shufflenet_v2_x1_0 | 128 | 18.7948 | 25.566 | 18.7733 | 25.155 |
| timm_vision_transformer | 32 | 17.9678 | 19.9034 | 16.3517 | 18.1088 |
| BERT_pytorch | 16 | 17.0725 | 24.6427 | 16.3854 | 24.0209 |
| timm_resnest | 32 | 15.2495 | 15.7393 | 15.2485 | 15.5518 |
| mnasnet1_0 | 32 | 13.0466 | 21.6279 | 13.3825 | 19.9102 |
| mobilenet_v3_large | 32 | 12.9163 | 22.2031 | 13.1215 | 21.7979 |
| resnext50_32x4d | 8 | 11.9392 | 20.6573 | 12.0657 | 20.1557 |
| pytorch_stargan | 16 | 11.77 | 11.9393 | 11.6463 | 11.8596 |
| nvidia_deeprecommender | 256 | 11.707 | 10.0214 | 10.9381 | 9.2529 |
| phlippe_densenet | 128 | 11.4082 | 22.9777 | 11.4472 | 22.7842 |
| alexnet | 128 | 9.0139 | 8.6398 | 8.6663 | 8.295 |
| LearningToPaint | 96 | 8.5569 | 10.7769 | 8.3681 | 10.2813 |
| tts_angular | 64 | 6.5413 | 6.5988 | 6.5713 | 6.5622 |
| resnet18 | 16 | 5.84 | 9.4998 | 5.6093 | 9.0532 |
| pytorch_CycleGAN_and_pix2pix | 1 | 5.7967 | 7.6938 | 5.7312 | 7.5394 |
| squeezenet1_1 | 32 | 5.4322 | 7.7193 | 5.1811 | 7.2148 |
| phlippe_resnet | 128 | 5.0226 | 8.9907 | 5.038 | 8.785 |
| functorch_dp_cifar10 | 64 | 2.7787 | 7.303 | 2.806 | 7.1296 |
| drq | 1 | 2.4704 | 3.1289 | 2.4997 | 2.9058 |
| pytorch_struct | 200 | 2.453 | 4.1952 | 2.3217 | 3.3137 |
| dlrm | 1024 | 2.1539 | 3.6077 | 2.1012 | 3.4372 |
| soft_actor_critic | 256 | 1.9309 | 2.1787 | 1.3703 | 2.0865 |
| dcgan | 32 | 1.5488 | 2.5518 | 1.5306 | 2.5165 |
| lennard_jones | 1000 | 1.1898 | 2.4098 | 1.1487 | 1.5424 |
| timm_vision_transformer_large | 32 | nan | 429.9693 | nan | 398.9973 |
| gat | 0 | nan | nan | nan | nan |
| gcn | 0 | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan |
| sage | 0 | nan | nan | nan | nan |
| tacotron2 | 0 | nan | nan | nan | nan |
| torchrec_dlrm | 0 | nan | nan | nan | nan |
+-----------------------------------+------+----------+------------------------+-----------------------+-------------------------------------+
~~~

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+-----------------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| MobileBertForMaskedLM | 64 | 2.8379 | 1.1804 | 2.542 | 1.3537 |
| MobileBertForQuestionAnswering | 128 | 2.7692 | 1.1808 | 1.4973 | 1.3123 |
| GPT2ForSequenceClassification | 4 | 2.3231 | 2.3576 | 2.3887 | 2.4404 |
| OPTForCausalLM | 2 | 2.2847 | 2.3123 | 2.3442 | 2.3729 |
| MT5ForConditionalGeneration | 16 | 2.2568 | 1.9766 | 2.283 | 2.1255 |
| ElectraForQuestionAnswering | 64 | 2.1765 | 2.1411 | 2.1955 | 2.1551 |
| ElectraForCausalLM | 32 | 1.8447 | 1.8684 | 1.851 | 1.883 |
| LayoutLMForSequenceClassification | 16 | 1.8334 | 1.8149 | 1.8365 | 1.8208 |
| XLNetLMHeadModel | 8 | 1.8198 | 1.8157 | 1.849 | 1.8362 |
| BertForQuestionAnswering | 16 | 1.8037 | 1.805 | 1.8094 | 1.8103 |
| RobertaForQuestionAnswering | 16 | 1.8007 | 1.8073 | 1.8071 | 1.8103 |
| M2M100ForConditionalGeneration | 16 | 1.7192 | 1.4136 | 1.5253 | 1.5079 |
| RobertaForCausalLM | 16 | 1.6812 | 1.6989 | 1.6865 | 1.7033 |
| DistillGPT2 | 16 | 1.6794 | 1.7193 | 1.9253 | 1.9826 |
| T5Small | 4 | 1.6666 | 1.8066 | 1.7059 | 1.8308 |
| T5ForConditionalGeneration | 4 | 1.6661 | 1.805 | 1.7018 | 1.8589 |
| MegatronBertForQuestionAnswering | 8 | 1.6556 | 1.6779 | 1.6706 | 1.6946 |
| AlbertForQuestionAnswering | 4 | 1.6425 | 1.6439 | 1.6396 | 1.6462 |
| XGLMForCausalLM | 8 | 1.6414 | 1.567 | 1.686 | 1.6278 |
| AlbertForMaskedLM | 4 | 1.6328 | 1.6345 | 1.6189 | 1.6393 |
| LayoutLMForMaskedLM | 16 | 1.6184 | 1.6414 | 1.6147 | 1.6395 |
| BertForMaskedLM | 16 | 1.5979 | 1.6155 | 1.6036 | 1.6168 |
| PLBartForConditionalGeneration | 4 | 1.5907 | 1.6254 | 1.7091 | 1.7489 |
| CamemBert | 16 | 1.5473 | 1.5626 | 1.6282 | 1.6418 |
| MegatronBertForCausalLM | 4 | 1.5329 | 1.5639 | 1.5751 | 1.6277 |
| YituTechConvBert | 16 | 1.523 | 1.5246 | 1.6573 | 1.6564 |
| PLBartForCausalLM | 8 | 1.4744 | 1.505 | 1.6825 | 1.7212 |
| BartForCausalLM | 4 | 1.4667 | 1.499 | 1.6073 | 1.6423 |
| MBartForCausalLM | 4 | 1.4619 | 1.4935 | 1.6016 | 1.6392 |
| DistilBertForQuestionAnswering | 256 | 1.4549 | 1.4549 | 1.4664 | 1.469 |
| BartForConditionalGeneration | 2 | 1.4517 | 1.477 | 1.5394 | 1.5653 |
| MBartForConditionalGeneration | 2 | 1.4393 | 1.4681 | 1.5284 | 1.5603 |
| Speech2Text2ForCausalLM | 256 | 1.423 | 1.4552 | 1.438 | 1.4968 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.3494 | 1.3821 | 1.4451 | 1.4879 |
| PegasusForConditionalGeneration | 32 | 1.2481 | 1.2879 | 1.3445 | 1.3718 |
| TrOCRForCausalLM | 32 | 1.2403 | 1.2717 | 1.3742 | 1.4124 |
| DistilBertForMaskedLM | 128 | 1.2142 | 1.2433 | 1.2239 | 1.2496 |
| BlenderbotSmallForCausalLM | 64 | 1.2128 | 1.2526 | 1.3706 | 1.402 |
| DebertaForQuestionAnswering | 8 | 1.1834 | 1.0615 | 1.2014 | 1.0904 |
| PegasusForCausalLM | 32 | 1.1733 | 1.2101 | 1.3076 | 1.3469 |
| DebertaForMaskedLM | 4 | 1.0944 | 0.9127 | 1.1479 | 0.9382 |
| DebertaV2ForMaskedLM | 1 | 1.0122 | 0.7354 | 1.0227 | 0.7303 |
| DebertaV2ForQuestionAnswering | 2 | 0.9377 | 0.7692 | 0.9557 | 0.7836 |
| BlenderbotForCausalLM | 4 | 0.0 | 1.1121 | 0.0 | 1.1282 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
~~~
Accuracy
~~~
+-----------------------------------------+----+------------------+------------------------+-----------------------+-------------------------------------+
| name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+-----------------------------------------+----+------------------+------------------------+-----------------------+-------------------------------------+
| BlenderbotForCausalLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| DebertaV2ForMaskedLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | pass |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | pass | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass |
| CamemBert | 1 | pass | pass | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | pass |
| DebertaV2ForQuestionAnswering | 1 | fail_to_run | pass | fail_to_run | pass |
| AlbertForQuestionAnswering | 1 | fail_accuracy | fail_accuracy | fail_accuracy | fail_accuracy |
+-----------------------------------------+----+------------------+------------------------+-----------------------+-------------------------------------+
~~~
Compilation latency (sec)
~~~
+-----------------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+-----------------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| MobileBertForMaskedLM | 64 | 144.5755 | 102.6639 | 546.5416 | 139.5054 |
| MobileBertForQuestionAnswering | 128 | 142.8046 | 101.7791 | 537.4145 | 138.2121 |
| DebertaV2ForMaskedLM | 1 | 140.6903 | 57.1807 | 466.6899 | 76.2266 |
| DebertaV2ForQuestionAnswering | 2 | 140.1457 | 61.8347 | 348.7994 | 79.0066 |
| M2M100ForConditionalGeneration | 16 | 137.1882 | 71.3113 | 217.4307 | 97.9629 |
| MT5ForConditionalGeneration | 16 | 133.1446 | 48.8807 | 483.0852 | 67.2957 |
| XGLMForCausalLM | 8 | 121.334 | 58.2817 | 212.8778 | 78.0456 |
| XLNetLMHeadModel | 8 | 95.8362 | 78.7795 | 252.4958 | 104.9234 |
| MBartForConditionalGeneration | 2 | 82.4087 | 70.6402 | 109.5138 | 96.8137 |
| DebertaForMaskedLM | 4 | 79.7105 | 33.6328 | 158.0097 | 42.3338 |
| DebertaForQuestionAnswering | 8 | 76.661 | 33.5332 | 201.8764 | 41.2801 |
| BartForConditionalGeneration | 2 | 75.1442 | 68.2448 | 217.2458 | 94.2897 |
| PegasusForConditionalGeneration | 32 | 67.9314 | 62.7712 | 104.6898 | 88.1643 |
| MegatronBertForQuestionAnswering | 8 | 66.518 | 57.6726 | 134.8533 | 78.1936 |
| YituTechConvBert | 16 | 66.375 | 45.547 | 209.9396 | 57.9979 |
| MegatronBertForCausalLM | 4 | 61.8968 | 58.287 | 103.0386 | 78.2287 |
| BlenderbotSmallForConditionalGeneration | 64 | 53.1261 | 46.1383 | 74.7762 | 64.5962 |
| ElectraForCausalLM | 32 | 50.6319 | 33.6646 | 386.2907 | 42.0994 |
| T5ForConditionalGeneration | 4 | 50.1657 | 37.1969 | 233.8668 | 49.591 |
| PLBartForConditionalGeneration | 4 | 47.3857 | 37.9846 | 85.0962 | 50.7367 |
| ElectraForQuestionAnswering | 64 | 42.7933 | 32.2662 | 241.3952 | 40.2974 |
| LayoutLMForSequenceClassification | 16 | 41.7579 | 34.0668 | 160.4051 | 41.8708 |
| BertForMaskedLM | 16 | 38.7295 | 30.3644 | 214.0867 | 39.9959 |
| OPTForCausalLM | 2 | 37.5386 | 30.2833 | 99.0268 | 38.8445 |
| AlbertForMaskedLM | 4 | 37.4582 | 24.6813 | 300.1565 | 34.2281 |
| MBartForCausalLM | 4 | 36.7723 | 30.4993 | 48.9687 | 40.8436 |
| PegasusForCausalLM | 32 | 36.5871 | 30.4338 | 94.427 | 40.2854 |
| GPT2ForSequenceClassification | 4 | 35.9461 | 28.2024 | 188.6827 | 37.4743 |
| TrOCRForCausalLM | 32 | 35.7626 | 29.3602 | 186.3087 | 39.7047 |
| DistilBertForQuestionAnswering | 256 | 35.6994 | 19.9854 | 184.4006 | 23.3917 |
| LayoutLMForMaskedLM | 16 | 35.3459 | 33.2269 | 40.5109 | 40.9049 |
| T5Small | 4 | 34.8294 | 37.3824 | 47.6236 | 48.8331 |
| BartForCausalLM | 4 | 34.3041 | 29.607 | 168.1618 | 39.0359 |
| DistilBertForMaskedLM | 128 | 33.9409 | 18.8863 | 206.8346 | 23.0878 |
| CamemBert | 16 | 33.6916 | 31.2407 | 68.9291 | 40.2634 |
| RobertaForCausalLM | 16 | 33.5368 | 31.5115 | 44.2814 | 41.1254 |
| BertForQuestionAnswering | 16 | 31.8795 | 31.2068 | 92.8413 | 39.9058 |
| RobertaForQuestionAnswering | 16 | 30.7016 | 32.569 | 40.1121 | 40.9837 |
| BlenderbotSmallForCausalLM | 64 | 28.3922 | 22.7183 | 155.0844 | 27.9117 |
| DistillGPT2 | 16 | 28.2467 | 16.8918 | 139.5229 | 21.4802 |
| AlbertForQuestionAnswering | 4 | 24.8724 | 23.6732 | 88.2348 | 33.7614 |
| Speech2Text2ForCausalLM | 256 | 24.1175 | 18.0462 | 131.2631 | 23.6207 |
| PLBartForCausalLM | 8 | 24.0539 | 18.9783 | 66.2115 | 23.7083 |
| BlenderbotForCausalLM | 4 | nan | 56.4431 | nan | 75.0952 |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan |
+-----------------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+-----------------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| XLNetLMHeadModel | 8 | 1.1551 | 1.1551 | 1.1551 | 1.1551 |
| ElectraForQuestionAnswering | 64 | 1.1376 | 1.195 | 1.1104 | 1.195 |
| GPT2ForSequenceClassification | 4 | 1.1139 | 1.23 | 1.1135 | 1.23 |
| OPTForCausalLM | 2 | 1.0939 | 1.1343 | 1.094 | 1.1343 |
| BertForQuestionAnswering | 16 | 1.0607 | 1.1729 | 1.0607 | 1.1729 |
| RobertaForQuestionAnswering | 16 | 1.0603 | 1.1724 | 1.0603 | 1.1724 |
| LayoutLMForSequenceClassification | 16 | 1.0583 | 1.1734 | 1.0583 | 1.1736 |
| T5Small | 4 | 1.0382 | 1.1813 | 1.0382 | 1.1813 |
| T5ForConditionalGeneration | 4 | 1.0382 | 1.1813 | 1.0356 | 1.1813 |
| DistilBertForQuestionAnswering | 256 | 1.0299 | 1.1486 | 1.0418 | 1.1486 |
| LayoutLMForMaskedLM | 16 | 1.0078 | 1.0517 | 1.0078 | 1.0517 |
| RobertaForCausalLM | 16 | 1.0077 | 1.052 | 1.0077 | 1.052 |
| BertForMaskedLM | 16 | 1.0075 | 1.0518 | 0.9463 | 1.0518 |
| CamemBert | 16 | 1.0035 | 1.0492 | 0.9417 | 1.0492 |
| YituTechConvBert | 16 | 0.9911 | 1.0411 | 0.9911 | 1.0411 |
| AlbertForQuestionAnswering | 4 | 0.9729 | 1.3147 | 0.9729 | 1.3147 |
| DistillGPT2 | 16 | 0.9682 | 1.0641 | 0.9682 | 1.0641 |
| PLBartForConditionalGeneration | 4 | 0.9649 | 1.0521 | 0.9294 | 1.0521 |
| MegatronBertForQuestionAnswering | 8 | 0.953 | 1.1152 | 0.953 | 1.1152 |
| AlbertForMaskedLM | 4 | 0.9501 | 1.268 | 0.9501 | 1.268 |
| MBartForCausalLM | 4 | 0.9281 | 0.9912 | 0.9281 | 0.9912 |
| PLBartForCausalLM | 8 | 0.914 | 0.9887 | 0.8439 | 0.9887 |
| BartForCausalLM | 4 | 0.9137 | 0.9749 | 0.8818 | 0.9749 |
| MT5ForConditionalGeneration | 16 | 0.9089 | 1.0018 | 0.8222 | 1.0018 |
| ElectraForCausalLM | 32 | 0.8941 | 0.9739 | 0.8941 | 0.9739 |
| PegasusForCausalLM | 32 | 0.893 | 0.9864 | 0.893 | 0.9864 |
| DistilBertForMaskedLM | 128 | 0.8849 | 0.9624 | 0.8045 | 0.9624 |
| TrOCRForCausalLM | 32 | 0.8836 | 0.9583 | 0.8836 | 0.9583 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.8729 | 0.9803 | 0.816 | 0.9803 |
| PegasusForConditionalGeneration | 32 | 0.8689 | 1.0689 | 0.8687 | 1.0689 |
| MBartForConditionalGeneration | 2 | 0.8574 | 1.0307 | 0.8574 | 1.0307 |
| BartForConditionalGeneration | 2 | 0.8456 | 1.0139 | 0.8456 | 1.0139 |
| MegatronBertForCausalLM | 4 | 0.845 | 1.0961 | 0.845 | 1.0961 |
| BlenderbotSmallForCausalLM | 64 | 0.8184 | 0.9119 | 0.7355 | 0.9119 |
| Speech2Text2ForCausalLM | 256 | 0.789 | 0.8779 | 0.7143 | 0.8779 |
| M2M100ForConditionalGeneration | 16 | 0.7651 | 0.9908 | 0.7651 | 0.9908 |
| MobileBertForMaskedLM | 64 | 0.752 | 1.016 | 0.7654 | 1.016 |
| XGLMForCausalLM | 8 | 0.7117 | 0.9792 | 0.7117 | 0.9792 |
| MobileBertForQuestionAnswering | 128 | 0.6569 | 0.8579 | 0.6505 | 0.8579 |
| DebertaForMaskedLM | 4 | 0.5646 | 1.0748 | 0.5649 | 1.0733 |
| DebertaV2ForMaskedLM | 1 | 0.5187 | 0.9894 | 0.5129 | 1.0005 |
| DebertaForQuestionAnswering | 8 | 0.4867 | 1.2209 | 0.487 | 1.218 |
| DebertaV2ForQuestionAnswering | 2 | 0.4855 | 1.0041 | 0.4806 | 1.0036 |
| BlenderbotForCausalLM | 4 | nan | 0.999 | nan | 0.999 |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan |
+-----------------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
~~~
Absolute latency (ms)
~~~
+-----------------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs |
+-----------------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
| AlbertForMaskedLM | 4 | 163.1963 | 163.1587 | 165.2836 | 162.4534 |
| AlbertForQuestionAnswering | 4 | 160.9963 | 160.7677 | 161.433 | 160.4504 |
| XLNetLMHeadModel | 8 | 153.1104 | 153.6633 | 151.6684 | 152.5373 |
| DebertaV2ForQuestionAnswering | 2 | 116.3646 | 157.4995 | 112.4073 | 136.5503 |
| PegasusForConditionalGeneration | 32 | 109.4586 | 107.0701 | 102.1534 | 99.8395 |
| TrOCRForCausalLM | 32 | 108.956 | 106.1477 | 98.8934 | 95.5395 |
| DebertaV2ForMaskedLM | 1 | 106.0534 | 143.0314 | 104.3972 | 141.3999 |
| MBartForConditionalGeneration | 2 | 93.8608 | 92.1148 | 87.9214 | 86.2895 |
| BartForConditionalGeneration | 2 | 93.0417 | 91.187 | 87.7946 | 86.2112 |
| MegatronBertForQuestionAnswering | 8 | 85.885 | 84.7233 | 84.9406 | 83.9433 |
| YituTechConvBert | 16 | 82.2597 | 82.3462 | 75.6132 | 75.5079 |
| BlenderbotSmallForConditionalGeneration | 64 | 80.032 | 78.3281 | 74.7315 | 73.1472 |
| CamemBert | 16 | 76.4407 | 75.7046 | 72.7593 | 71.9841 |
| MBartForCausalLM | 4 | 74.3929 | 72.7671 | 67.9001 | 66.3893 |
| M2M100ForConditionalGeneration | 16 | 74.2504 | 76.7499 | 71.7425 | 70.8671 |
| BartForCausalLM | 4 | 74.0096 | 72.4829 | 67.6696 | 66.1328 |
| PLBartForConditionalGeneration | 4 | 71.6926 | 70.4056 | 66.6692 | 65.2681 |
| DistilBertForQuestionAnswering | 256 | 71.2658 | 71.5186 | 70.6492 | 70.572 |
| DistilBertForMaskedLM | 128 | 69.9579 | 68.3028 | 69.4614 | 67.6883 |
| MobileBertForQuestionAnswering | 128 | 69.8735 | 151.3741 | 113.2246 | 133.5794 |
| LayoutLMForMaskedLM | 16 | 69.6713 | 68.9603 | 69.6797 | 68.6248 |
| PLBartForCausalLM | 8 | 69.5727 | 68.1384 | 60.9211 | 59.6209 |
| BertForMaskedLM | 16 | 68.7292 | 67.9596 | 68.6083 | 67.97 |
| RobertaForCausalLM | 16 | 68.3556 | 67.7435 | 68.2043 | 67.5193 |
| OPTForCausalLM | 2 | 68.2014 | 67.3931 | 66.4461 | 65.6217 |
| DebertaForQuestionAnswering | 8 | 64.2309 | 71.1741 | 63.2963 | 69.3185 |
| DistillGPT2 | 16 | 63.1898 | 61.4512 | 54.8378 | 53.2618 |
| T5ForConditionalGeneration | 4 | 62.6908 | 58.7865 | 61.1997 | 57.4435 |
| T5Small | 4 | 62.6659 | 58.7947 | 61.1709 | 57.1797 |
| MobileBertForMaskedLM | 64 | 60.8558 | 153.5208 | 68.6975 | 131.0284 |
| DebertaForMaskedLM | 4 | 58.2843 | 68.9715 | 56.5048 | 66.6704 |
| PegasusForCausalLM | 32 | 58.1039 | 56.5348 | 52.1993 | 50.7132 |
| MegatronBertForCausalLM | 4 | 56.6531 | 55.5386 | 54.9803 | 53.7762 |
| XGLMForCausalLM | 8 | 53.3488 | 56.2894 | 51.1424 | 53.1789 |
| LayoutLMForSequenceClassification | 16 | 53.3059 | 53.9067 | 53.1758 | 53.649 |
| RobertaForQuestionAnswering | 16 | 53.0766 | 53.3038 | 52.8924 | 52.7856 |
| BertForQuestionAnswering | 16 | 52.7161 | 52.7784 | 52.5468 | 52.4791 |
| ElectraForQuestionAnswering | 64 | 52.5425 | 53.4447 | 52.1755 | 53.1431 |
| ElectraForCausalLM | 32 | 47.7277 | 47.1817 | 47.551 | 46.6989 |
| BlenderbotSmallForCausalLM | 64 | 46.5247 | 45.4596 | 41.3807 | 40.2216 |
| MT5ForConditionalGeneration | 16 | 41.7186 | 47.6859 | 40.5796 | 44.2481 |
| GPT2ForSequenceClassification | 4 | 39.3523 | 38.805 | 38.2517 | 37.8337 |
| Speech2Text2ForCausalLM | 256 | 34.6072 | 33.8303 | 34.2895 | 33.5782 |
| BlenderbotForCausalLM | 4 | nan | 81.9093 | nan | 80.6264 |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan |
+-----------------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+
~~~

timm_models suite with amp precision

Performance speedup ~~~ +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ | name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs | +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ | tnt_s_patch16_224 | 128 | 3.026 | 2.9865 | 3.3514 | 3.3072 | | xcit_large_24_p8_224 | 5 | 2.0139 | 1.6339 | 2.4595 | 1.6378 | | twins_pcpvt_base | 64 | 1.9894 | 1.7401 | 2.1625 | 1.8586 | | coat_lite_mini | 128 | 1.9522 | 1.9288 | 2.0839 | 2.0577 | | gmlp_s16_224 | 128 | 1.8654 | 1.8486 | 1.8895 | 1.8629 | | ghostnet_100 | 128 | 1.8548 | 1.6132 | 1.8609 | 1.6433 | | gmixer_24_224 | 128 | 1.7826 | 1.7619 | 1.9152 | 1.8867 | | volo_d1_224 | 64 | 1.7045 | 1.6798 | 1.7626 | 1.7374 | | crossvit_9_240 | 128 | 1.6634 | 1.6403 | 1.8579 | 1.8229 | | swin_base_patch4_window7_224 | 64 | 1.6379 | 1.6272 | 1.744 | 1.731 | | convit_base | 64 | 1.6182 | 1.6148 | 1.7199 | 1.7183 | | lcnet_050 | 128 | 1.6062 | 1.3809 | 1.6373 | 1.3716 | | gluon_inception_v3 | 128 | 1.5356 | 1.525 | 1.5428 | 1.5267 | | adv_inception_v3 | 128 | 1.534 | 1.5217 | 1.5427 | 1.5317 | | inception_v3 | 128 | 1.5338 | 1.5212 | 1.5417 | 1.5298 | | dla102 | 128 | 1.5293 | 1.5277 | 1.535 | 1.5308 | | convnext_base | 64 | 1.5259 | 1.5017 | 1.5366 | 1.5175 | | nfnet_l0 | 128 | 1.5072 | 1.4537 | 1.5108 | 1.4558 | | dm_nfnet_f0 | 128 | 1.5035 | 1.4557 | 1.5205 | 1.4665 | | sebotnet33ts_256 | 64 | 1.4938 | 1.5199 | 1.5026 | 1.5275 | | pit_b_224 | 64 | 1.4442 | 1.4382 | 1.6208 | 1.6125 | | eca_botnext26ts_256 | 128 | 1.4347 | 1.4163 | 1.4388 | 1.4222 | | resnest101e | 64 | 1.4289 | 1.356 | 1.4328 | 1.358 | | mobilevit_s | 64 | 1.4186 | 1.4164 | 1.4727 | 1.4845 | | selecsls42b | 128 | 1.4075 | 1.4066 | 1.4143 | 1.4126 | | botnet26t_256 | 128 | 1.3956 | 1.4138 | 1.4006 | 1.4196 | | mobilenetv3_large_100 | 128 
| 1.3932 | 1.3941 | 1.4065 | 1.3966 | | mnasnet_100 | 128 | 1.3907 | 1.4058 | 1.3931 | 1.4508 | | jx_nest_base | 32 | 1.3891 | 1.3766 | 1.5705 | 1.5543 | | regnety_002 | 128 | 1.3792 | 1.2145 | 1.3874 | 1.2139 | | res2net50_14w_8s | 128 | 1.3741 | 1.3534 | 1.3968 | 1.3763 | | res2next50 | 128 | 1.3689 | 1.3591 | 1.3681 | 1.3583 | | mixer_b16_224 | 128 | 1.3627 | 1.365 | 1.3983 | 1.3978 | | cait_m36_384 | 4 | 1.357 | 1.3535 | 1.4529 | 1.4558 | | beit_base_patch16_224 | 64 | 1.3548 | 1.357 | 1.4633 | 1.4652 | | poolformer_m36 | 64 | 1.3537 | 1.3425 | 1.3518 | 1.3438 | | mobilenetv2_100 | 128 | 1.3515 | 1.4039 | 1.3522 | 1.4 | | hrnet_w18 | 128 | 1.3475 | 1.3503 | 1.3873 | 1.3799 | | ese_vovnet19b_dw | 128 | 1.3401 | 1.3596 | 1.354 | 1.374 | | tf_efficientnet_b0 | 128 | 1.3298 | 1.3638 | 1.3289 | 1.3611 | | spnasnet_100 | 128 | 1.3072 | 1.3647 | 1.3132 | 1.3681 | | fbnetc_100 | 128 | 1.2879 | 1.3661 | 1.3194 | 1.3669 | | rexnet_100 | 128 | 1.2836 | 1.3222 | 1.2894 | 1.3234 | | fbnetv3_b | 128 | 1.2737 | 1.2946 | 1.2817 | 1.3094 | | resmlp_12_224 | 128 | 1.2729 | 1.2682 | 1.4107 | 1.4059 | | deit_base_distilled_patch16_224 | 64 | 1.2603 | 1.2604 | 1.3269 | 1.3254 | | vit_base_patch16_224 | 64 | 1.2399 | 1.2395 | 1.3509 | 1.351 | | tinynet_a | 128 | 1.2046 | 1.2237 | 1.2066 | 1.2342 | | cspdarknet53 | 64 | 1.2031 | 1.2386 | 1.2138 | 1.2477 | | tf_mixnet_l | 128 | 1.1816 | 1.188 | 1.1874 | 1.1934 | | visformer_small | 128 | 1.1765 | 1.1682 | 1.2089 | 1.2015 | | mixnet_l | 128 | 1.1655 | 1.1752 | 1.1761 | 1.1822 | | res2net101_26w_4s | 64 | 1.1612 | 1.0781 | 1.1667 | 1.0969 | | pnasnet5large | 16 | 1.1218 | 1.1364 | 1.1377 | 1.1553 | | gluon_xception65 | 32 | 1.0792 | 1.0805 | 1.0901 | 1.0935 | | dpn107 | 32 | 1.0693 | 1.1112 | 1.0704 | 1.1096 | | repvgg_a2 | 128 | 1.0667 | 1.1006 | 1.0748 | 1.1042 | | swsl_resnext101_32x16d | 32 | 1.059 | 1.022 | 1.0586 | 1.0208 | | gernet_l | 128 | 1.0229 | 1.0491 | 1.0293 | 1.0556 | | convmixer_768_32 | 32 | 1.0015 | 1.0022 | 1.0081 | 
1.0086 | +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+----------+------------------------+-----------------------+-------------------------------------+ | name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs | +---------------------------------+----+----------+------------------------+-----------------------+-------------------------------------+ | adv_inception_v3 | 8 | pass | pass | pass | pass | | beit_base_patch16_224 | 8 | pass | pass | pass | pass | | mobilenetv3_large_100 | 8 | pass | pass | pass | pass | | mobilevit_s | 8 | pass | pass | pass | pass | | nfnet_l0 | 8 | pass | pass | pass | pass | | pit_b_224 | 8 | pass | pass | pass | pass | | pnasnet5large | 8 | pass | pass | pass | pass | | poolformer_m36 | 8 | pass | pass | pass | pass | | regnety_002 | 8 | pass | pass | pass | pass | | repvgg_a2 | 8 | pass | pass | pass | pass | | res2net101_26w_4s | 8 | pass | pass | pass | pass | | res2net50_14w_8s | 8 | pass | pass | pass | pass | | res2next50 | 8 | pass | pass | pass | pass | | resmlp_12_224 | 8 | pass | pass | pass | pass | | resnest101e | 8 | pass | pass | pass | pass | | rexnet_100 | 8 | pass | pass | pass | pass | | sebotnet33ts_256 | 8 | pass | pass | pass | pass | | selecsls42b | 8 | pass | pass | pass | pass | | spnasnet_100 | 8 | pass | pass | pass | pass | | swsl_resnext101_32x16d | 8 | pass | pass | pass | pass | | tf_efficientnet_b0 | 8 | pass | pass | pass | pass | | tf_mixnet_l | 8 | pass | pass | pass | pass | | tinynet_a | 8 | pass | pass | pass | pass | | tnt_s_patch16_224 | 8 | pass | pass | pass | pass | | twins_pcpvt_base | 8 | pass | pass | pass | pass | | visformer_small | 8 | pass | pass | pass | pass | | vit_base_patch16_224 | 8 | pass | pass | pass | pass | | volo_d1_224 | 8 | pass | pass | pass | pass | | 
xcit_large_24_p8_224 | 8 | pass | pass | pass | pass | | mobilenetv2_100 | 8 | pass | pass | pass | pass | | mnasnet_100 | 8 | pass | pass | pass | pass | | mixnet_l | 8 | pass | pass | pass | pass | | eca_botnext26ts_256 | 8 | pass | pass | pass | pass | | botnet26t_256 | 8 | pass | pass | pass | pass | | cait_m36_384 | 4 | pass | pass | pass | pass | | coat_lite_mini | 8 | pass | pass | pass | pass | | convit_base | 8 | pass | pass | pass | pass | | convmixer_768_32 | 8 | pass | pass | pass | pass | | convnext_base | 8 | pass | pass | pass | pass | | crossvit_9_240 | 8 | pass | pass | pass | pass | | cspdarknet53 | 8 | pass | pass | pass | pass | | deit_base_distilled_patch16_224 | 8 | pass | pass | pass | pass | | dla102 | 8 | pass | pass | pass | pass | | dm_nfnet_f0 | 8 | pass | pass | pass | pass | | dpn107 | 8 | pass | pass | pass | pass | | ese_vovnet19b_dw | 8 | pass | pass | pass | pass | | mixer_b16_224 | 8 | pass | pass | pass | pass | | fbnetc_100 | 8 | pass | pass | pass | pass | | fbnetv3_b | 8 | pass | pass | pass | pass | | gernet_l | 8 | pass | pass | pass | pass | | ghostnet_100 | 8 | pass | pass | pass | pass | | gluon_inception_v3 | 8 | pass | pass | pass | pass | | gluon_xception65 | 8 | pass | pass | pass | pass | | gmixer_24_224 | 8 | pass | pass | pass | pass | | gmlp_s16_224 | 8 | pass | pass | pass | pass | | hrnet_w18 | 8 | pass | pass | pass | pass | | inception_v3 | 8 | pass | pass | pass | pass | | jx_nest_base | 8 | pass | pass | pass | pass | | lcnet_050 | 8 | pass | pass | pass | pass | | swin_base_patch4_window7_224 | 8 | pass | pass | fail_accuracy | pass | +---------------------------------+----+----------+------------------------+-----------------------+-------------------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ | name | bs | inductor | inductor_no_cudagraphs | 
inductor_max_autotune | inductor_max_autotune_no_cudagraphs | +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ | rexnet_100 | 128 | 224.9662 | 43.8651 | 549.7045 | 49.6792 | | hrnet_w18 | 128 | 192.1312 | 151.1129 | 538.2087 | 169.1894 | | pnasnet5large | 16 | 158.5366 | 110.0134 | 425.2666 | 119.9122 | | ghostnet_100 | 128 | 153.8495 | 53.7871 | 597.803 | 58.2259 | | res2net101_26w_4s | 64 | 150.918 | 87.7778 | 430.3811 | 97.7569 | | twins_pcpvt_base | 64 | 147.4936 | 70.2721 | 1360.7422 | 104.1596 | | adv_inception_v3 | 128 | 145.5702 | 52.6549 | 390.6734 | 59.3601 | | fbnetv3_b | 128 | 132.0557 | 60.2455 | 405.0947 | 66.0463 | | xcit_large_24_p8_224 | 5 | 126.2305 | 86.2627 | 930.2928 | 115.641 | | resnest101e | 64 | 124.9575 | 79.8068 | 252.5227 | 86.9188 | | tinynet_a | 128 | 120.3408 | 43.3221 | 320.7028 | 47.1822 | | mobilevit_s | 64 | 118.2427 | 57.282 | 1212.6895 | 69.4255 | | cait_m36_384 | 4 | 115.4874 | 84.669 | 867.7116 | 132.8021 | | mixnet_l | 128 | 113.4047 | 51.9441 | 486.8732 | 57.2357 | | swin_base_patch4_window7_224 | 64 | 106.1148 | 63.544 | 759.2707 | 89.0701 | | res2net50_14w_8s | 128 | 100.3224 | 80.5095 | 479.9018 | 90.3053 | | poolformer_m36 | 64 | 94.9903 | 62.2564 | 225.6091 | 66.3959 | | fbnetc_100 | 128 | 93.989 | 36.047 | 377.2218 | 40.1527 | | dpn107 | 32 | 90.7086 | 64.0835 | 373.3056 | 69.6322 | | coat_lite_mini | 128 | 90.1127 | 33.6073 | 1208.8042 | 44.5213 | | cspdarknet53 | 64 | 86.3952 | 40.8298 | 227.6929 | 45.2179 | | crossvit_9_240 | 128 | 85.881 | 41.9985 | 1078.4562 | 63.9877 | | gluon_xception65 | 32 | 85.8334 | 57.6158 | 204.0599 | 62.532 | | jx_nest_base | 32 | 82.1427 | 52.5359 | 818.2768 | 76.7963 | | dla102 | 128 | 75.4119 | 52.0476 | 239.3686 | 57.9405 | | tf_mixnet_l | 128 | 71.7949 | 52.5263 | 74.7729 | 57.3458 | | regnety_002 | 128 | 71.2927 | 31.3177 | 299.6921 | 34.7583 | | botnet26t_256 | 128 | 68.366 | 29.0346 | 
507.1324 | 34.5135 | | tnt_s_patch16_224 | 128 | 67.6963 | 48.9934 | 453.8764 | 77.1973 | | sebotnet33ts_256 | 64 | 66.4868 | 37.2848 | 632.0998 | 47.2274 | | volo_d1_224 | 64 | 63.0412 | 39.7819 | 884.4952 | 62.0114 | | gmlp_s16_224 | 128 | 61.0161 | 38.1389 | 152.6563 | 54.1933 | | nfnet_l0 | 128 | 60.6763 | 35.0895 | 208.7935 | 39.0511 | | convnext_base | 64 | 60.0963 | 41.6565 | 420.167 | 51.3909 | | tf_efficientnet_b0 | 128 | 57.7291 | 38.1859 | 205.9058 | 41.6089 | | gluon_inception_v3 | 128 | 54.2118 | 53.0126 | 57.1827 | 61.1203 | | inception_v3 | 128 | 54.1294 | 52.9654 | 56.6752 | 58.1736 | | gernet_l | 128 | 51.0487 | 31.4274 | 191.511 | 35.3015 | | gmixer_24_224 | 128 | 50.865 | 37.4809 | 277.4413 | 54.7143 | | eca_botnext26ts_256 | 128 | 49.9108 | 31.6743 | 272.9045 | 35.6355 | | convit_base | 64 | 48.3676 | 30.4493 | 335.4578 | 45.7735 | | mobilenetv3_large_100 | 128 | 48.3606 | 33.7093 | 127.4584 | 37.67 | | swsl_resnext101_32x16d | 32 | 48.1704 | 46.9654 | 104.9527 | 51.188 | | mnasnet_100 | 128 | 47.4842 | 31.3142 | 178.561 | 33.5369 | | res2next50 | 128 | 47.4303 | 46.5696 | 121.2247 | 49.7811 | | ese_vovnet19b_dw | 128 | 46.6961 | 22.2385 | 158.4995 | 24.6631 | | pit_b_224 | 64 | 46.5782 | 29.299 | 800.5762 | 44.3182 | | visformer_small | 128 | 46.3822 | 25.6801 | 358.1299 | 30.8381 | | deit_base_distilled_patch16_224 | 64 | 43.4 | 24.8982 | 210.4033 | 37.4958 | | mobilenetv2_100 | 128 | 41.6236 | 31.8541 | 90.446 | 34.6492 | | lcnet_050 | 128 | 40.7226 | 23.6077 | 141.2591 | 25.3412 | | resmlp_12_224 | 128 | 40.1924 | 18.6476 | 128.4408 | 24.8915 | | dm_nfnet_f0 | 128 | 38.1535 | 38.9289 | 40.2725 | 42.1344 | | beit_base_patch16_224 | 64 | 37.0688 | 27.5769 | 272.031 | 40.8913 | | spnasnet_100 | 128 | 36.105 | 35.8612 | 60.9458 | 39.5017 | | convmixer_768_32 | 32 | 35.6133 | 30.5962 | 101.1971 | 31.817 | | vit_base_patch16_224 | 64 | 33.7325 | 25.0333 | 43.3898 | 36.6412 | | repvgg_a2 | 128 | 32.9142 | 30.5913 | 164.1643 | 33.4591 | | 
selecsls42b | 128 | 31.5101 | 27.2454 | 155.84 | 29.7677 | | mixer_b16_224 | 128 | 31.2014 | 20.8052 | 206.4417 | 29.0151 | +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ | name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs | +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ | gmlp_s16_224 | 128 | 1.1848 | 1.2358 | 1.1831 | 1.2358 | | pnasnet5large | 16 | 1.1712 | 1.3207 | 1.1522 | 1.3201 | | gmixer_24_224 | 128 | 1.1117 | 1.1923 | 1.1144 | 1.1923 | | convit_base | 64 | 1.0948 | 1.1869 | 1.098 | 1.1869 | | mobilenetv2_100 | 128 | 1.0431 | 1.1739 | 1.0267 | 1.1739 | | dm_nfnet_f0 | 128 | 1.013 | 1.0932 | 1.013 | 1.0932 | | resmlp_12_224 | 128 | 1.0079 | 1.1048 | 1.0093 | 1.1048 | | tinynet_a | 128 | 0.9984 | 1.1113 | 0.999 | 1.111 | | rexnet_100 | 128 | 0.9977 | 1.0864 | 0.9744 | 1.0862 | | resnest101e | 64 | 0.9972 | 1.1047 | 0.9933 | 1.1047 | | tf_efficientnet_b0 | 128 | 0.9871 | 1.1078 | 0.9876 | 1.1074 | | tnt_s_patch16_224 | 128 | 0.9834 | 1.066 | 0.986 | 1.066 | | convmixer_768_32 | 32 | 0.9762 | 0.9999 | 0.9657 | 0.9999 | | twins_pcpvt_base | 64 | 0.9729 | 1.0909 | 0.9763 | 1.0909 | | mobilevit_s | 64 | 0.9557 | 1.0236 | 0.9263 | 1.0236 | | dla102 | 128 | 0.9536 | 1.0437 | 0.9528 | 1.0434 | | mixer_b16_224 | 128 | 0.9501 | 1.0133 | 0.9466 | 1.0133 | | vit_base_patch16_224 | 64 | 0.9362 | 0.9867 | 0.9362 | 0.9867 | | deit_base_distilled_patch16_224 | 64 | 0.9353 | 0.9863 | 0.9072 | 0.9863 | | visformer_small | 128 | 0.9348 | 1.0408 | 0.9245 | 1.0408 | | tf_mixnet_l | 128 | 0.9346 | 1.0921 | 0.9343 | 1.092 | | beit_base_patch16_224 | 64 | 0.9308 | 
1.0156 | 0.9307 | 1.0156 | | fbnetv3_b | 128 | 0.9228 | 1.0004 | 0.917 | 1.0069 | | nfnet_l0 | 128 | 0.9215 | 1.0065 | 0.9101 | 1.0065 | | volo_d1_224 | 64 | 0.9131 | 1.0077 | 0.9089 | 1.0078 | | cspdarknet53 | 64 | 0.9097 | 1.0569 | 0.9098 | 1.0569 | | ese_vovnet19b_dw | 128 | 0.9047 | 1.0046 | 0.8976 | 1.0046 | | ghostnet_100 | 128 | 0.8976 | 1.0514 | 0.8408 | 1.05 | | hrnet_w18 | 128 | 0.8918 | 1.0121 | 0.889 | 1.0144 | | sebotnet33ts_256 | 64 | 0.891 | 1.1401 | 0.9207 | 1.1401 | | inception_v3 | 128 | 0.8904 | 1.0459 | 0.8902 | 1.0459 | | adv_inception_v3 | 128 | 0.8904 | 1.0459 | 0.8902 | 1.0459 | | gluon_inception_v3 | 128 | 0.8904 | 1.0459 | 0.8902 | 1.0459 | | mobilenetv3_large_100 | 128 | 0.8881 | 1.0046 | 0.865 | 1.0046 | | dpn107 | 32 | 0.8833 | 0.9977 | 0.8676 | 0.9977 | | gluon_xception65 | 32 | 0.8832 | 0.9998 | 0.8833 | 0.9998 | | spnasnet_100 | 128 | 0.8786 | 1.0063 | 0.8788 | 1.0063 | | selecsls42b | 128 | 0.8785 | 1.0139 | 0.8473 | 1.0145 | | poolformer_m36 | 64 | 0.8768 | 1.1916 | 0.8592 | 1.1916 | | eca_botnext26ts_256 | 128 | 0.8738 | 1.0257 | 0.8738 | 1.0257 | | res2net50_14w_8s | 128 | 0.8712 | 0.9828 | 0.8501 | 0.983 | | res2net101_26w_4s | 64 | 0.871 | 0.9822 | 0.8506 | 0.9822 | | mixnet_l | 128 | 0.8687 | 1.0134 | 0.8686 | 1.0134 | | mnasnet_100 | 128 | 0.8683 | 1.0074 | 0.8684 | 1.0074 | | res2next50 | 128 | 0.866 | 0.9759 | 0.866 | 0.9759 | | cait_m36_384 | 4 | 0.8636 | 1.0068 | 0.8637 | 1.0073 | | fbnetc_100 | 128 | 0.8596 | 1.0104 | 0.8597 | 1.0104 | | pit_b_224 | 64 | 0.8578 | 1.0382 | 0.8566 | 1.0382 | | convnext_base | 64 | 0.8505 | 1.0373 | 0.8317 | 1.0373 | | gernet_l | 128 | 0.8499 | 1.0005 | 0.8497 | 1.0005 | | swsl_resnext101_32x16d | 32 | 0.8477 | 1.0007 | 0.8477 | 1.0007 | | coat_lite_mini | 128 | 0.8402 | 1.0437 | 0.8501 | 1.0437 | | lcnet_050 | 128 | 0.8273 | 1.0008 | 0.8174 | 1.0008 | | botnet26t_256 | 128 | 0.8239 | 1.0 | 0.824 | 1.0 | | xcit_large_24_p8_224 | 5 | 0.8228 | 1.0079 | 0.8263 | 1.0124 | | regnety_002 | 128 | 
0.8165 | 1.0004 | 0.7848 | 1.0004 | | repvgg_a2 | 128 | 0.7738 | 1.0131 | 0.7738 | 1.0131 | | crossvit_9_240 | 128 | 0.7526 | 1.0019 | 0.7524 | 1.0019 | | swin_base_patch4_window7_224 | 64 | 0.7214 | 0.9303 | 0.7297 | 0.9303 | | jx_nest_base | 32 | 0.6693 | 0.9905 | 0.6705 | 0.9905 | +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ | name | bs | inductor | inductor_no_cudagraphs | inductor_max_autotune | inductor_max_autotune_no_cudagraphs | +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ | convmixer_768_32 | 32 | 300.0291 | 299.9443 | 297.8893 | 297.7707 | | hrnet_w18 | 128 | 208.3656 | 206.8841 | 201.1329 | 202.9471 | | pnasnet5large | 16 | 174.5393 | 173.23 | 172.1158 | 169.8113 | | tf_mixnet_l | 128 | 160.135 | 159.2144 | 159.3293 | 158.4715 | | mixnet_l | 128 | 155.1819 | 154.0876 | 153.9369 | 153.1759 | | cait_m36_384 | 4 | 123.0772 | 123.3186 | 114.7842 | 114.6307 | | resnest101e | 64 | 114.7176 | 121.007 | 114.3349 | 120.8923 | | dla102 | 128 | 112.2881 | 112.3634 | 111.9551 | 112.1236 | | swsl_resnext101_32x16d | 32 | 111.9813 | 115.7084 | 111.6536 | 116.0687 | | poolformer_m36 | 64 | 106.9875 | 107.7497 | 107.0367 | 107.6677 | | tnt_s_patch16_224 | 128 | 106.6506 | 108.0809 | 96.3865 | 97.5664 | | adv_inception_v3 | 128 | 104.3768 | 105.2563 | 103.742 | 104.443 | | gluon_inception_v3 | 128 | 104.3702 | 104.9805 | 103.8921 | 105.0778 | | inception_v3 | 128 | 104.3155 | 105.2835 | 103.8708 | 104.6451 | | res2net50_14w_8s | 128 | 102.1548 | 103.7138 | 100.6208 | 102.0375 | | convit_base | 64 | 100.682 | 100.7395 | 94.6361 | 94.6613 | | dpn107 | 32 | 99.1128 | 95.5076 | 98.9991 | 95.5365 | | res2next50 | 128 | 
91.9468 | 92.5855 | 91.9398 | 92.7046 | | gluon_xception65 | 32 | 91.748 | 91.625 | 90.8134 | 90.5599 | | swin_base_patch4_window7_224 | 64 | 89.0227 | 89.5743 | 83.7603 | 84.1753 | | fbnetv3_b | 128 | 85.9862 | 84.4588 | 85.4503 | 83.5889 | | mixer_b16_224 | 128 | 85.6864 | 85.5349 | 83.5185 | 83.6657 | | res2net101_26w_4s | 64 | 85.5716 | 92.931 | 84.9911 | 91.5106 | | dm_nfnet_f0 | 128 | 84.0895 | 86.9625 | 83.2779 | 86.2461 | | pit_b_224 | 64 | 81.6561 | 82.0702 | 72.7687 | 73.1262 | | convnext_base | 64 | 80.0661 | 81.453 | 79.5788 | 80.357 | | visformer_small | 128 | 77.2749 | 77.8055 | 75.2202 | 75.6284 | | beit_base_patch16_224 | 64 | 74.6681 | 74.5177 | 69.5741 | 69.1183 | | nfnet_l0 | 128 | 74.0506 | 76.767 | 73.9031 | 76.7771 | | eca_botnext26ts_256 | 128 | 73.7967 | 74.732 | 73.5375 | 74.313 | | cspdarknet53 | 64 | 73.5508 | 71.4896 | 72.9836 | 70.9455 | | gmlp_s16_224 | 128 | 73.5253 | 73.9639 | 72.5233 | 73.5151 | | jx_nest_base | 32 | 71.9518 | 72.9016 | 63.6997 | 64.5556 | | gernet_l | 128 | 71.1357 | 69.3325 | 70.6905 | 68.8367 | | botnet26t_256 | 128 | 71.0828 | 70.029 | 70.7261 | 69.7494 | | volo_d1_224 | 64 | 70.5427 | 71.4547 | 68.3347 | 69.2369 | | vit_base_patch16_224 | 64 | 69.7831 | 69.7801 | 64.1791 | 64.0354 | | repvgg_a2 | 128 | 68.0045 | 65.9019 | 67.5598 | 65.7685 | | deit_base_distilled_patch16_224 | 64 | 67.5126 | 67.0506 | 63.7581 | 63.7308 | | gmixer_24_224 | 128 | 66.039 | 66.6008 | 61.471 | 62.2492 | | tf_efficientnet_b0 | 128 | 61.184 | 59.6981 | 61.3009 | 59.8215 | | fbnetc_100 | 128 | 61.1237 | 57.6254 | 59.6898 | 57.5408 | | xcit_large_24_p8_224 | 5 | 60.9077 | 76.2594 | 58.2081 | 77.2834 | | rexnet_100 | 128 | 59.3765 | 57.5236 | 59.0595 | 57.6692 | | twins_pcpvt_base | 64 | 59.1109 | 68.5273 | 54.6172 | 63.0969 | | tinynet_a | 128 | 57.7272 | 56.954 | 57.7552 | 56.4225 | | coat_lite_mini | 128 | 57.6648 | 58.3542 | 54.0857 | 54.7672 | | mobilevit_s | 64 | 57.2611 | 57.544 | 55.2786 | 54.8024 | | sebotnet33ts_256 | 64 | 
51.564 | 50.6513 | 51.2031 | 50.5236 | | spnasnet_100 | 128 | 50.7093 | 48.5572 | 50.4032 | 48.4586 | | crossvit_9_240 | 128 | 49.1771 | 49.8118 | 44.0418 | 44.8204 | | ghostnet_100 | 128 | 48.4646 | 55.8543 | 48.3651 | 54.7459 | | ese_vovnet19b_dw | 128 | 46.1538 | 45.5122 | 45.7289 | 45.0294 | | mobilenetv2_100 | 128 | 45.9662 | 44.2884 | 45.9857 | 44.5279 | | mnasnet_100 | 128 | 43.762 | 43.3129 | 43.6721 | 41.9699 | | selecsls42b | 128 | 42.5461 | 42.6063 | 42.3623 | 42.4158 | | mobilenetv3_large_100 | 128 | 41.7814 | 41.7146 | 41.3158 | 41.6747 | | resmlp_12_224 | 128 | 41.6567 | 41.8046 | 37.6055 | 37.7592 | | regnety_002 | 128 | 26.9701 | 31.0044 | 26.8701 | 30.6785 | | lcnet_050 | 128 | 18.6049 | 21.6797 | 18.1966 | 21.8385 | +---------------------------------+-----+----------+------------------------+-----------------------+-------------------------------------+ ~~~

Performance graphs

/data/home/williamwen/cluster/oneoff_cron_logs/day_104_14_04_23_performance_amp_147/torchbench_amp.png : ![](https://i.imgur.com/UilgJpr.png)
/data/home/williamwen/cluster/oneoff_cron_logs/day_104_14_04_23_performance_amp_147/huggingface_amp.png : ![](https://i.imgur.com/Noiy3J2.png)
/data/home/williamwen/cluster/oneoff_cron_logs/day_104_14_04_23_performance_amp_147/timm_models_amp.png : ![](https://i.imgur.com/LJmD4OF.png)

Build Summary

### Run name ###
day_104_14_04_23_performance_amp_147

### Commit hashes ###
pytorch commit: 75f55ca63bd5623352c8eda8e31ff76ee5c960a7
pytorch commit date: 2023-04-13 00:45:48+00:00
torchbench commit: cd89d490ecbcca7d8ca50324522b31a1a198c753
torchbench commit date: 2023-04-13 11:05:33-07:00

### TorchDynamo config flags ###

### Torch version ###
torch: 2.1.0a0+git75f55ca

### Environment variables ###
TORCH_CUDA_ARCH_LIST = 8.0
CUDA_HOME = /usr/local/cuda-11.6
USE_LLVM = /usr/lib/llvm-10

### GPU details ###
CUDNN VERSION: 8401
Number CUDA Devices: 2
Device Name: NVIDIA A100-SXM4-40GB
Device Memory [GB]: 42.481549312

williamwen42 commented 1 year ago

Performance Dashboard for amp precision (Python 3.11)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and peak memory footprint reduction ratio.

Caveats
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) Experimental setup does not have an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
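As a rough illustration of how a "Geometric mean speedup" figure can be derived once failing models are dropped, here is a minimal sketch. It is not the dashboard's actual code: the function name, the model names and the choice to skip non-positive speedups (0.0 is used in these reports to mark a failed performance run) are assumptions made for the example.

```python
from statistics import geometric_mean


def geomean_speedup(speedups, accuracy):
    """Geometric mean of per-model speedups, restricted to models that pass accuracy.

    speedups: dict mapping model name -> speedup over eager (hypothetical input shape)
    accuracy: dict mapping model name -> accuracy status string, e.g. "pass"
    """
    kept = [
        s
        for name, s in speedups.items()
        # Assumption: drop models that fail accuracy and runs recorded as 0.0.
        if accuracy.get(name) == "pass" and s > 0
    ]
    return geometric_mean(kept) if kept else float("nan")


# Hypothetical models, not taken from the report.
speedups = {"model_a": 2.0, "model_b": 0.5, "model_c": 4.0}
accuracy = {"model_a": "pass", "model_b": "pass", "model_c": "fail_accuracy"}
result = geomean_speedup(speedups, accuracy)
```

Only `model_a` and `model_b` contribute here, so a 2x speedup and a 2x slowdown cancel out. This is why a geometric mean is a natural fit for ratios such as speedups.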

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 87%, 55/63 | 100%, 45/45 | 98%, 60/61  |
|       aot_eager        | 87%, 55/63 | 100%, 45/45 | 98%, 60/61  |
|        inductor        | 83%, 52/63 | 93%, 42/45  | 97%, 59/61  |
| inductor_no_cudagraphs | 84%, 53/63 | 98%, 44/45  | 98%, 60/61  |
+------------------------+------------+-------------+-------------+

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.00x    |    1.00x    |
|       aot_eager        |   1.00x    |    1.00x    |    1.00x    |
|        inductor        |   1.62x    |    1.65x    |    1.46x    |
| inductor_no_cudagraphs |   1.30x    |    1.58x    |    1.40x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    4.08    |    6.46     |    5.00     |
|       aot_eager        |    8.77    |    14.66    |    11.60    |
|        inductor        |   53.21    |    53.30    |    90.52    |
| inductor_no_cudagraphs |   58.03    |    52.81    |   102.89    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.00x    |    1.00x    |
|       aot_eager        |   0.99x    |    0.96x    |    1.00x    |
|        inductor        |   1.03x    |    0.98x    |    1.01x    |
| inductor_no_cudagraphs |   1.00x    |    1.01x    |    1.00x    |
+------------------------+------------+-------------+-------------+

Summary Statistics Diff

For each relevant compiler, we compare the summary statistics for the 2 most recent reports that actually run the compiler.

Current report name: /data/home/williamwen/cluster/oneoff_cron_logs/day_047_16_02_23_performance_amp_184
Previous report name: /data/home/williamwen/cluster/oneoff_cron_logs/day_040_09_02_23_performance_amp_322

Passrate diff
~~~
+------------------------+-------------+------------+------------+
|        compiler        |    suite    | prev_value | cur_value  |
+------------------------+-------------+------------+------------+
|        inductor        | torchbench  | 86%, 51/59 | 83%, 49/59 |
|        inductor        | huggingface | 91%, 41/45 | 91%, 41/45 |
|        inductor        | timm_models | 95%, 58/61 | 98%, 60/61 |
| inductor_no_cudagraphs | torchbench  | 86%, 51/59 | 83%, 49/59 |
| inductor_no_cudagraphs | huggingface | 98%, 44/45 | 96%, 43/45 |
| inductor_no_cudagraphs | timm_models | 95%, 58/61 | 98%, 60/61 |
+------------------------+-------------+------------+------------+
~~~

Geometric mean speedup diff
~~~
+------------------------+-------------+------------+-----------+
|        compiler        |    suite    | prev_value | cur_value |
+------------------------+-------------+------------+-----------+
|        inductor        | torchbench  |   1.60x    |   1.54x   |
|        inductor        | huggingface |   1.59x    |   1.57x   |
|        inductor        | timm_models |   1.37x    |   1.36x   |
| inductor_no_cudagraphs | torchbench  |   1.35x    |   1.28x   |
| inductor_no_cudagraphs | huggingface |   1.51x    |   1.51x   |
| inductor_no_cudagraphs | timm_models |   1.37x    |   1.34x   |
+------------------------+-------------+------------+-----------+
~~~
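For readers who want to quantify the passrate diff, the sketch below parses cells of the form `86%, 51/59` and reports the change in percentage points. The helper names are hypothetical and this is not how the dashboard itself computes the diff. It only assumes the cell format visible in the table above.

```python
import re


def parse_passrate(cell):
    """Parse a cell such as '86%, 51/59' into (passed, total)."""
    m = re.fullmatch(r"\s*(\d+)%,\s*(\d+)/(\d+)\s*", cell)
    if m is None:
        raise ValueError(f"unrecognized passrate cell: {cell!r}")
    return int(m.group(2)), int(m.group(3))


def passrate_delta(prev_cell, cur_cell):
    """Change in pass fraction (current minus previous), in percentage points."""
    p_pass, p_total = parse_passrate(prev_cell)
    c_pass, c_total = parse_passrate(cur_cell)
    return 100.0 * (c_pass / c_total - p_pass / p_total)


# inductor / torchbench row from the passrate diff table.
delta = passrate_delta("86%, 51/59", "83%, 49/59")
```

Working from the raw counts rather than the rounded percentages avoids compounding rounding error when the two reports cover different numbers of models.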

Warnings

We flag models where:
- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings
~~~
+-------------+-------------------------------+---------------+------------------------+
|    suite    |             name              |   inductor    | inductor_no_cudagraphs |
+-------------+-------------------------------+---------------+------------------------+
| torchbench  |        vision_maskrcnn        |  infra_error  |      infra_error       |
| torchbench  |        DALLE2_pytorch         |  infra_error  |      infra_error       |
| torchbench  |   detectron2_fcos_r_50_fpn    |  infra_error  |      infra_error       |
| torchbench  |              drq              |  infra_error  |      infra_error       |
| torchbench  |        pytorch_struct         |  infra_error  |      infra_error       |
| torchbench  |       soft_actor_critic       |  infra_error  |      infra_error       |
| torchbench  |       timm_efficientdet       |  infra_error  |      infra_error       |
| torchbench  |         torchrec_dlrm         |  infra_error  |      infra_error       |
| torchbench  |         hf_Longformer         |  fail_to_run  |      fail_to_run       |
| torchbench  |             llama             | fail_accuracy |     fail_accuracy      |
| huggingface | DebertaV2ForQuestionAnswering |  infra_error  |          pass          |
| timm_models |          convit_base          |  fail_to_run  |      fail_to_run       |
| timm_models |         cait_m36_384          |      OOM      |          pass          |
+-------------+-------------------------------+---------------+------------------------+
~~~

Performance speedup warnings
~~~
+-------------+-------------------------------+----------+------------------------+
|    suite    |             name              | inductor | inductor_no_cudagraphs |
+-------------+-------------------------------+----------+------------------------+
| torchbench  |       phlippe_densenet        | 1.917714 |        0.901889        |
| torchbench  |             dcgan             | 1.514868 |        0.895629        |
| torchbench  |       basic_gnn_edgecnn       | 1.316141 |          0.0           |
| torchbench  |   detectron2_fcos_r_50_fpn    |   0.0    |          0.0           |
| torchbench  |       timm_efficientdet       |   0.0    |          0.0           |
| torchbench  |       soft_actor_critic       |   0.0    |          0.0           |
| torchbench  |        pytorch_struct         |   0.0    |          0.0           |
| torchbench  |              drq              |   0.0    |          0.0           |
| torchbench  |         hf_Longformer         |   0.0    |          0.0           |
| torchbench  |        DALLE2_pytorch         |   0.0    |          0.0           |
| torchbench  |             moco              |   0.0    |          0.0           |
| torchbench  |             dlrm              |   0.0    |        1.226872        |
| torchbench  | timm_vision_transformer_large |   0.0    |        0.993088        |
| torchbench  |         torchrec_dlrm         |   0.0    |          0.0           |
| huggingface |      LayoutLMForMaskedLM      |   0.0    |        1.604682        |
| huggingface |     AllenaiLongformerBase     |   0.0    |          0.0           |
+-------------+-------------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+-------------+--------------------------------+------------+------------------------+
|    suite    |              name              |  inductor  | inductor_no_cudagraphs |
+-------------+--------------------------------+------------+------------------------+
| torchbench  |          hf_T5_large           | 164.083421 |       164.456502       |
| torchbench  |           hf_BigBird           | 163.680048 |       126.065458       |
| torchbench  |          densenet121           | 113.059074 |       124.351767       |
| torchbench  |       timm_efficientnet        | 110.753224 |       133.045389       |
| torchbench  |        phlippe_densenet        | 101.337409 |       153.191907       |
| huggingface | MobileBertForQuestionAnswering | 134.464177 |       134.520693       |
| huggingface |     MobileBertForMaskedLM      | 132.344597 |       133.560424       |
| timm_models |           hrnet_w18            | 215.300548 |       235.439188       |
| timm_models |           rexnet_100           | 198.835597 |       287.845509       |
| timm_models |          ghostnet_100          | 182.636547 |       232.169208       |
| timm_models |         pnasnet5large          | 147.549783 |       156.780616       |
| timm_models |          resnest101e           | 138.563704 |       158.368072       |
| timm_models |          mobilevit_s           | 135.482809 |       157.381025       |
| timm_models |           fbnetv3_b            | 131.883844 |       162.084463       |
| timm_models |       gluon_inception_v3       | 131.07415  |       158.07422        |
| timm_models |          tf_mixnet_l           | 128.487929 |       147.30566        |
| timm_models |       res2net101_26w_4s        | 128.054408 |       138.097871       |
| timm_models |        adv_inception_v3        | 126.647641 |       150.583826       |
| timm_models |          inception_v3          | 126.359993 |       154.997186       |
| timm_models |           tinynet_a            | 123.713257 |       143.520164       |
| timm_models |            mixnet_l            | 122.645935 |       147.809105       |
| timm_models |       tf_efficientnet_b0       | 119.095017 |       135.472967       |
| timm_models |     mobilenetv3_large_100      | 116.468401 |       141.481082       |
| timm_models |           fbnetc_100           | 105.643603 |       134.635195       |
| timm_models |          spnasnet_100          | 104.156516 |       131.219595       |
+-------------+--------------------------------+------------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+-----------------------------------------+----------+------------------------+
|    suite    |                  name                   | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench  |            basic_gnn_edgecnn            | 1.268084 |          0.0           |
| torchbench  |             pytorch_stargan             | 0.893437 |        0.889299        |
| torchbench  |                resnet50                 | 0.890619 |        0.887016        |
| torchbench  |               timm_vovnet               | 0.888781 |        0.887004        |
| torchbench  |         timm_vision_transformer         | 0.85232  |        0.846964        |
| torchbench  |           speech_transformer            | 0.846621 |        0.844683        |
| torchbench  |           mobilenet_v3_large            | 0.788748 |        0.78255         |
| torchbench  |               mnasnet1_0                | 0.784332 |        0.774557        |
| torchbench  |             resnext50_32x4d             | 0.780881 |        0.771616        |
| torchbench  |              squeezenet1_1              | 0.776372 |        0.775402        |
| torchbench  |             LearningToPaint             | 0.757111 |         0.7482         |
| torchbench  |            phlippe_densenet             | 0.729494 |        0.713997        |
| torchbench  |               densenet121               | 0.691168 |        0.670463        |
| torchbench  |                resnet18                 | 0.618876 |        0.61026         |
| torchbench  |      pytorch_CycleGAN_and_pix2pix       | 0.603505 |        0.600365        |
| torchbench  |          functorch_dp_cifar10           | 0.453125 |        0.444502        |
| torchbench  |             phlippe_resnet              | 0.378591 |        0.36166         |
| torchbench  |        detectron2_fcos_r_50_fpn         |   0.0    |          0.0           |
| torchbench  |            timm_efficientdet            |   0.0    |          0.0           |
| torchbench  |            soft_actor_critic            |   0.0    |          0.0           |
| torchbench  |             pytorch_struct              |   0.0    |          0.0           |
| torchbench  |                   drq                   |   0.0    |          0.0           |
| torchbench  |      timm_vision_transformer_large      |   0.0    |        0.973508        |
| torchbench  |             DALLE2_pytorch              |   0.0    |          0.0           |
| torchbench  |                  moco                   |   0.0    |          0.0           |
| torchbench  |              hf_Longformer              |   0.0    |          0.0           |
| torchbench  |                  dlrm                   |   0.0    |        1.000856        |
| torchbench  |              torchrec_dlrm              |   0.0    |          0.0           |
| huggingface |            TrOCRForCausalLM             | 0.87395  |        0.881037        |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.864986 |        0.897783        |
| huggingface |            PLBartForCausalLM            |  0.863   |        0.860945        |
| huggingface |           ElectraForCausalLM            | 0.861134 |        0.93223         |
| huggingface |     MobileBertForQuestionAnswering      | 0.857907 |        0.857131        |
| huggingface |          DistilBertForMaskedLM          | 0.851792 |        0.849938        |
| huggingface |       BlenderbotSmallForCausalLM        | 0.804903 |        0.803499        |
| huggingface |         Speech2Text2ForCausalLM         | 0.77739  |        0.775883        |
| huggingface |           LayoutLMForMaskedLM           |   0.0    |        0.924424        |
| huggingface |          AllenaiLongformerBase          |   0.0    |          0.0           |
| timm_models |             crossvit_9_240              | 0.871764 |        0.870197        |
| timm_models |               regnety_002               | 0.866936 |        0.862637        |
| timm_models |                lcnet_050                | 0.843427 |        0.838246        |
| timm_models |              jx_nest_base               | 0.733958 |        0.732922        |
+-------------+-----------------------------------------+----------+------------------------+
~~~

Metrics over time

/data/home/williamwen/cluster/oneoff_cron_logs/day_144_24_05_23_performance_amp_246/comp_time_over_time.png : ![](https://i.imgur.com/T2xkoJ3.png)

/data/home/williamwen/cluster/oneoff_cron_logs/day_144_24_05_23_performance_amp_246/geomean_over_time.png : ![](https://i.imgur.com/HqwlSDO.png)

/data/home/williamwen/cluster/oneoff_cron_logs/day_144_24_05_23_performance_amp_246/passrate_over_time.png : ![](https://i.imgur.com/GiwRAfw.png)

/data/home/williamwen/cluster/oneoff_cron_logs/day_144_24_05_23_performance_amp_246/memory_over_time.png : ![](https://i.imgur.com/c9NzzFp.png)

Recent Regressions

For each relevant compiler, we compare the most recent 2 reports (that actually run the compiler) to find previously unflagged models that are now flagged as problematic (according to the 'Warnings' section).

### Regressions for torchbench ###
Current report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/williamwen/cluster/oneoff_cron_logs/day_047_16_02_23_performance_amp_184
Previous report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/williamwen/cluster/oneoff_cron_logs/day_040_09_02_23_performance_amp_322
Current report name (compiler: inductor, suite: torchbench): /data/home/williamwen/cluster/oneoff_cron_logs/day_047_16_02_23_performance_amp_184
Previous report name (compiler: inductor, suite: torchbench): /data/home/williamwen/cluster/oneoff_cron_logs/day_040_09_02_23_performance_amp_322

Accuracy regressions
~~~
+------------------------+------------------------------+-------------+---------------+
| compiler               | name                         | prev_status | cur_status    |
+------------------------+------------------------------+-------------+---------------+
| inductor_no_cudagraphs | pytorch_CycleGAN_and_pix2pix | pass        | fail_accuracy |
| inductor_no_cudagraphs | phlippe_resnet               | pass        | fail_accuracy |
| inductor               | pytorch_CycleGAN_and_pix2pix | pass        | fail_accuracy |
| inductor               | phlippe_resnet               | pass        | fail_accuracy |
+------------------------+------------------------------+-------------+---------------+
~~~
Performance speedup regressions
~~~
+------------------------+-------------------------------+-------------+------------+
| compiler               | name                          | prev_status | cur_status |
+------------------------+-------------------------------+-------------+------------+
| inductor_no_cudagraphs | lennard_jones                 | 1.0479      | 0.9232     |
| inductor_no_cudagraphs | tts_angular                   | 1.0306      | 0.8749     |
| inductor_no_cudagraphs | dlrm                          | 1.3205      | 0.0        |
| inductor_no_cudagraphs | timm_vision_transformer_large | 1.103       | 0.0        |
| inductor               | tts_angular                   | 1.0226      | 0.9018     |
| inductor               | hf_Longformer                 | 1.6303      | 0.0        |
| inductor               | timm_vision_transformer_large | 1.074       | 0.0        |
+------------------------+-------------------------------+-------------+------------+
~~~
Compilation latency (sec) regressions
~~~
+------------------------+--------------------+-------------+------------+
| compiler               | name               | prev_status | cur_status |
+------------------------+--------------------+-------------+------------+
| inductor_no_cudagraphs | phlippe_densenet   | 35.5879     | 177.3165   |
| inductor_no_cudagraphs | timm_efficientnet  | 75.8329     | 152.4158   |
| inductor_no_cudagraphs | densenet121        | 82.9904     | 147.3931   |
| inductor_no_cudagraphs | mobilenet_v3_large | 65.4633     | 145.8978   |
| inductor_no_cudagraphs | hf_GPT2_large      | 78.3684     | 140.4915   |
| inductor_no_cudagraphs | mobilenet_v2       | 34.0223     | 139.8573   |
| inductor_no_cudagraphs | yolov3             | 52.7001     | 125.7029   |
| inductor_no_cudagraphs | hf_BigBird         | 117.2348    | 120.1973   |
| inductor               | phlippe_densenet   | 37.1538     | 171.3755   |
| inductor               | timm_efficientnet  | 76.7066     | 150.8667   |
| inductor               | mobilenet_v3_large | 68.574      | 148.1743   |
| inductor               | densenet121        | 78.3237     | 142.1128   |
| inductor               | hf_GPT2_large      | 80.0068     | 141.2736   |
| inductor               | mobilenet_v2       | 34.8149     | 136.3726   |
| inductor               | yolov3             | 53.8302     | 124.8586   |
+------------------------+--------------------+-------------+------------+
~~~
Peak Memory Compression Ratio regressions
~~~
+------------------------+-------------------------------+-------------+------------+
| compiler               | name                          | prev_status | cur_status |
+------------------------+-------------------------------+-------------+------------+
| inductor_no_cudagraphs | speech_transformer            | 1.0888      | 0.869      |
| inductor_no_cudagraphs | squeezenet1_1                 | 1.1148      | 0.7678     |
| inductor_no_cudagraphs | LearningToPaint               | 0.9966      | 0.7466     |
| inductor_no_cudagraphs | phlippe_densenet              | 1.0062      | 0.7179     |
| inductor_no_cudagraphs | densenet121                   | 0.9945      | 0.6035     |
| inductor_no_cudagraphs | pytorch_CycleGAN_and_pix2pix  | 1.0224      | 0.6004     |
| inductor_no_cudagraphs | phlippe_resnet                | 1.0037      | 0.3443     |
| inductor_no_cudagraphs | timm_vision_transformer_large | 0.9762      | 0.0        |
| inductor_no_cudagraphs | dlrm                          | 1.0009      | 0.0        |
| inductor               | shufflenet_v2_x1_0            | 0.9343      | 0.8656     |
| inductor               | speech_transformer            | 1.0825      | 0.8651     |
+------------------------+-------------------------------+-------------+------------+
~~~

### Regressions for huggingface ###
Current report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/williamwen/cluster/oneoff_cron_logs/day_047_16_02_23_performance_amp_184
Previous report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/williamwen/cluster/oneoff_cron_logs/day_040_09_02_23_performance_amp_322
Current report name (compiler: inductor, suite: huggingface): /data/home/williamwen/cluster/oneoff_cron_logs/day_047_16_02_23_performance_amp_184
Previous report name (compiler: inductor, suite: huggingface): /data/home/williamwen/cluster/oneoff_cron_logs/day_040_09_02_23_performance_amp_322

Accuracy regressions
~~~
+------------------------+-------------------------------+-------------+---------------+
| compiler               | name                          | prev_status | cur_status    |
+------------------------+-------------------------------+-------------+---------------+
| inductor_no_cudagraphs | DebertaV2ForQuestionAnswering | pass        | fail_accuracy |
+------------------------+-------------------------------+-------------+---------------+
~~~
Performance speedup regressions
~~~
+------------------------+-------------------------------+-------------+------------+
| compiler               | name                          | prev_status | cur_status |
+------------------------+-------------------------------+-------------+------------+
| inductor_no_cudagraphs | DebertaForMaskedLM            | 0.9907      | 0.9352     |
| inductor_no_cudagraphs | AllenaiLongformerBase         | 1.6455      | 0.0        |
| inductor               | DebertaV2ForQuestionAnswering | 1.0834      | 0.9392     |
+------------------------+-------------------------------+-------------+------------+
~~~
Compilation latency (sec) regressions
~~~
+------------------------+--------------------------------+-------------+------------+
| compiler               | name                           | prev_status | cur_status |
+------------------------+--------------------------------+-------------+------------+
| inductor_no_cudagraphs | MobileBertForMaskedLM          | 113.7818    | 140.0078   |
| inductor_no_cudagraphs | MT5ForConditionalGeneration    | 89.1825     | 135.7902   |
| inductor_no_cudagraphs | MobileBertForQuestionAnswering | 104.0908    | 135.3323   |
| inductor_no_cudagraphs | M2M100ForConditionalGeneration | 92.5092     | 121.2122   |
| inductor               | MobileBertForMaskedLM          | 115.4381    | 142.8343   |
| inductor               | MT5ForConditionalGeneration    | 92.2199     | 138.8022   |
| inductor               | MobileBertForQuestionAnswering | 108.0764    | 138.2921   |
| inductor               | M2M100ForConditionalGeneration | 96.6702     | 127.9952   |
+------------------------+--------------------------------+-------------+------------+
~~~
Peak Memory Compression Ratio regressions
~~~
+------------------------+-----------------------+-------------+------------+
| compiler               | name                  | prev_status | cur_status |
+------------------------+-----------------------+-------------+------------+
| inductor_no_cudagraphs | AllenaiLongformerBase | 0.9124      | 0.0        |
+------------------------+-----------------------+-------------+------------+
~~~

### Regressions for timm_models ###
Current report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/williamwen/cluster/oneoff_cron_logs/day_047_16_02_23_performance_amp_184
Previous report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/williamwen/cluster/oneoff_cron_logs/day_040_09_02_23_performance_amp_322
Current report name (compiler: inductor, suite: timm_models): /data/home/williamwen/cluster/oneoff_cron_logs/day_047_16_02_23_performance_amp_184
Previous report name (compiler: inductor, suite: timm_models): /data/home/williamwen/cluster/oneoff_cron_logs/day_040_09_02_23_performance_amp_322

Performance speedup regressions
~~~
+----------+---------------+-------------+------------+
| compiler | name          | prev_status | cur_status |
+----------+---------------+-------------+------------+
| inductor | pnasnet5large | 1.1413      | 0.9452     |
+----------+---------------+-------------+------------+
~~~
Compilation latency (sec) regressions
~~~
+------------------------+-----------------------+-------------+------------+
| compiler               | name                  | prev_status | cur_status |
+------------------------+-----------------------+-------------+------------+
| inductor_no_cudagraphs | rexnet_100            | 55.9265     | 315.721    |
| inductor_no_cudagraphs | ghostnet_100          | 62.387      | 255.9795   |
| inductor_no_cudagraphs | fbnetv3_b             | 65.65       | 184.9282   |
| inductor_no_cudagraphs | tf_mixnet_l           | 61.3387     | 173.799    |
| inductor_no_cudagraphs | mobilenetv3_large_100 | 41.7463     | 173.3434   |
| inductor_no_cudagraphs | tinynet_a             | 49.8997     | 172.5106   |
| inductor_no_cudagraphs | mobilevit_s           | 79.2543     | 171.8832   |
| inductor_no_cudagraphs | mixnet_l              | 58.8432     | 171.804    |
| inductor_no_cudagraphs | inception_v3          | 58.888      | 171.7195   |
| inductor_no_cudagraphs | resnest101e           | 100.691     | 171.5599   |
| inductor_no_cudagraphs | adv_inception_v3      | 61.8915     | 171.0537   |
| inductor_no_cudagraphs | gluon_inception_v3    | 58.6345     | 169.8777   |
| inductor_no_cudagraphs | tf_efficientnet_b0    | 45.0207     | 167.3042   |
| inductor_no_cudagraphs | xcit_large_24_p8_224  | 105.4047    | 160.8476   |
| inductor_no_cudagraphs | res2net101_26w_4s     | 92.7082     | 159.3268   |
| inductor_no_cudagraphs | twins_pcpvt_base      | 117.3875    | 157.5274   |
| inductor_no_cudagraphs | pnasnet5large         | 100.0627    | 152.5736   |
| inductor_no_cudagraphs | fbnetc_100            | 41.5344     | 152.4179   |
| inductor_no_cudagraphs | spnasnet_100          | 40.7063     | 151.848    |
| inductor_no_cudagraphs | mobilenetv2_100       | 35.9906     | 141.7804   |
| inductor_no_cudagraphs | mnasnet_100           | 36.0892     | 134.8647   |
| inductor_no_cudagraphs | res2net50_14w_8s      | 85.8919     | 131.2387   |
| inductor_no_cudagraphs | cait_m36_384          | 107.5555    | 126.2237   |
| inductor               | rexnet_100            | 59.9265     | 309.363    |
| inductor               | ghostnet_100          | 63.4552     | 258.8826   |
| inductor               | fbnetv3_b             | 66.6791     | 186.0991   |
| inductor               | tinynet_a             | 50.1639     | 173.8911   |
| inductor               | adv_inception_v3      | 59.5714     | 173.5962   |
| inductor               | mixnet_l              | 61.2409     | 173.3327   |
| inductor               | resnest101e           | 102.515     | 173.1165   |
| inductor               | mobilenetv3_large_100 | 42.9618     | 173.0226   |
| inductor               | mobilevit_s           | 79.8525     | 172.8851   |
| inductor               | inception_v3          | 59.6333     | 172.2805   |
| inductor               | tf_mixnet_l           | 62.0863     | 172.127    |
| inductor               | gluon_inception_v3    | 59.1854     | 168.4864   |
| inductor               | tf_efficientnet_b0    | 45.4103     | 161.3287   |
| inductor               | xcit_large_24_p8_224  | 108.3117    | 160.3484   |
| inductor               | res2net101_26w_4s     | 93.3066     | 159.9451   |
| inductor               | twins_pcpvt_base      | 119.6717    | 158.7075   |
| inductor               | pnasnet5large         | 105.0551    | 158.0835   |
| inductor               | spnasnet_100          | 42.2941     | 150.974    |
| inductor               | fbnetc_100            | 42.1337     | 150.8096   |
| inductor               | mobilenetv2_100       | 36.7172     | 142.5709   |
| inductor               | mnasnet_100           | 37.6126     | 134.0404   |
| inductor               | res2net50_14w_8s      | 92.4949     | 133.6794   |
| inductor               | cait_m36_384          | 112.9868    | 127.3313   |
+------------------------+-----------------------+-------------+------------+
~~~
Peak Memory Compression Ratio regressions
~~~
+------------------------+--------------+-------------+------------+
| compiler               | name         | prev_status | cur_status |
+------------------------+--------------+-------------+------------+
| inductor_no_cudagraphs | regnety_002  | 1.0009      | 0.8625     |
| inductor_no_cudagraphs | lcnet_050    | 1.0001      | 0.8411     |
| inductor               | ghostnet_100 | 0.9077      | 0.8805     |
+------------------------+--------------+-------------+------------+
~~~
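
The comparison described at the top of this section (a model counts as a regression when it is flagged in the current report but was not flagged in the previous one) can be sketched roughly as below. This is an illustration only, not the dashboard's actual code: `find_regressions` and `speedup_is_flagged` are hypothetical names, and the 0.95 speedup cutoff is an assumption that happens to be consistent with the rows shown above, not a confirmed setting. The three real rows are copied from the torchbench "Performance speedup regressions" table for `inductor`; `hypothetical_model` is made up to show a model that is not reported.

```python
from typing import Callable, Dict, List, Tuple


def find_regressions(
    prev: Dict[str, float],
    cur: Dict[str, float],
    is_flagged: Callable[[float], bool],
) -> List[Tuple[str, float, float]]:
    """Return (model, prev_value, cur_value) for models flagged now but not before."""
    regressions = []
    for name, cur_value in cur.items():
        if name not in prev:
            # No earlier data point, so there is nothing to compare against.
            continue
        prev_value = prev[name]
        if is_flagged(cur_value) and not is_flagged(prev_value):
            regressions.append((name, prev_value, cur_value))
    return regressions


def speedup_is_flagged(speedup: float) -> bool:
    # ASSUMPTION: speedups below 0.95 are treated as problematic.
    # The real cutoff is not stated in this report.
    return speedup < 0.95


# Values from the torchbench speedup regression table (compiler: inductor),
# plus one made-up row that should NOT be reported.
prev_speedup = {
    "tts_angular": 1.0226,
    "hf_Longformer": 1.6303,
    "timm_vision_transformer_large": 1.074,
    "hypothetical_model": 1.20,  # not from the report
}
cur_speedup = {
    "tts_angular": 0.9018,
    "hf_Longformer": 0.0,
    "timm_vision_transformer_large": 0.0,
    "hypothetical_model": 1.25,  # not from the report
}

regressions = find_regressions(prev_speedup, cur_speedup, speedup_is_flagged)
for name, prev_value, cur_value in regressions:
    print(f"{name}: {prev_value} -> {cur_value}")
```

The same function would cover the other metrics by swapping the predicate, e.g. a "greater than some latency limit" check for compilation time. A model that was already flagged in the previous report is deliberately skipped, which matches the "previously unflagged" wording above.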

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+----------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +-----------------------------------+------+----------+-----------+----------+------------------------+ | functorch_dp_cifar10 | 64 | 0.990581 | 0.54021 | 3.712228 | 1.413733 | | BERT_pytorch | 16 | 1.006936 | 0.450595 | 3.463346 | 2.108239 | | hf_BigBird | 2 | 0.981411 | 0.415691 | 2.859171 | 1.640458 | | basic_gnn_gin | 1 | 1.032703 | 0.568248 | 2.732045 | 1.429185 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.984672 | 0.500195 | 2.477184 | 1.870084 | | densenet121 | 4 | 0.996921 | 0.423079 | 2.444519 | 1.088812 | | hf_T5_large | 2 | 1.017803 | 0.439253 | 2.437908 | 2.023823 | | hf_Albert | 8 | 1.000671 | 0.702945 | 2.363868 | 2.341655 | | hf_Bert | 4 | 1.028005 | 0.463149 | 2.009287 | 1.640587 | | mobilenet_v3_large | 32 | 1.003432 | 0.504719 | 1.990287 | 1.1577 | | timm_efficientnet | 32 | 1.013296 | 0.456372 | 1.983019 | 1.114798 | | phlippe_densenet | 128 | 0.995252 | 0.491501 | 1.917714 | 0.901889 | | hf_GPT2 | 4 | 1.017823 | 0.571805 | 1.888901 | 1.90254 | | squeezenet1_1 | 32 | 0.98282 | 0.597015 | 1.86335 | 1.298258 | | hf_GPT2_large | 4 | 0.998962 | 0.753518 | 1.741626 | 1.719248 | | lennard_jones | 1000 | 0.915411 | 0.438693 | 1.738416 | 0.97385 | | phlippe_resnet | 128 | 0.98917 | 0.5133 | 1.726683 | 1.09865 | | resnext50_32x4d | 8 | 0.994479 | 0.422441 | 1.726142 | 0.996377 | | hf_T5 | 8 | 0.999187 | 0.832055 | 1.714156 | 1.730936 | | hf_Bert_large | 4 | 1.035508 | 0.46677 | 1.677549 | 1.626562 | | basic_gnn_sage | 1 | 1.029749 | 0.547532 | 1.677 | 1.355606 | | hf_Bart | 4 | 1.016383 | 0.452196 | 1.662917 | 1.214776 | | timm_resnest | 32 | 0.997901 | 0.696343 | 1.650129 | 1.531564 | | mnasnet1_0 | 32 | 0.994613 | 0.477843 | 1.648648 | 1.068923 | | timm_nfnet | 128 | 0.999422 | 0.989363 | 1.622877 | 1.526521 | | resnet18 | 16 | 0.993864 | 0.473616 | 1.594736 | 
1.029539 | | attention_is_all_you_need_pytorch | 256 | 1.003547 | 0.479328 | 1.593045 | 1.705072 | | mobilenet_v2 | 96 | 0.999291 | 0.681107 | 1.581873 | 1.386965 | | shufflenet_v2_x1_0 | 128 | 0.997039 | 0.529758 | 1.57826 | 1.231394 | | timm_vision_transformer | 32 | 0.995796 | 0.462421 | 1.572057 | 1.290214 | | dcgan | 32 | 0.934627 | 0.462908 | 1.514868 | 0.895629 | | hf_DistilBert | 8 | 0.999675 | 0.659354 | 1.502309 | 1.544261 | | fastNLP_Bert | 6 | 0.972396 | 0.512643 | 1.482898 | 1.453191 | | speech_transformer | 32 | 0.998451 | 0.444351 | 1.44784 | 1.570706 | | LearningToPaint | 96 | 0.99264 | 0.534404 | 1.363809 | 1.075582 | | pytorch_unet | 1 | 0.999237 | 0.231561 | 1.352636 | 1.330816 | | basic_gnn_edgecnn | 1 | 0.991719 | 0.721736 | 1.316141 | 0.0 | | basic_gnn_gcn | 1 | 0.933433 | 0.502323 | 1.305415 | 1.217759 | | pytorch_stargan | 16 | 0.995532 | 0.512034 | 1.289811 | 1.262911 | | timm_vovnet | 32 | 1.022856 | 0.586446 | 1.271282 | 1.185148 | | vgg16 | 64 | 0.999728 | 0.990964 | 1.261118 | 1.251211 | | yolov3 | 16 | 0.999093 | 0.698116 | 1.239094 | 1.221426 | | resnet50 | 32 | 0.998222 | 0.529793 | 1.238431 | 1.084923 | | resnet152 | 32 | 0.997807 | 0.477685 | 1.211381 | 1.038721 | | Background_Matting | 4 | 0.999552 | 0.155649 | 1.19138 | 1.181734 | | hf_Reformer | 4 | 0.995598 | 0.8563 | 1.174101 | 1.14423 | | timm_regnet | 32 | 1.014454 | 0.631259 | 1.153648 | 1.062594 | | alexnet | 128 | 0.998507 | 0.968282 | 1.141633 | 1.137853 | | Super_SloMo | 6 | 0.999032 | 0.207671 | 1.124 | 1.09898 | | demucs | 4 | 0.999256 | 0.999251 | 1.060159 | 1.038659 | | tts_angular | 64 | 0.974489 | 0.830191 | 0.990276 | 1.000686 | | nvidia_deeprecommender | 256 | 0.998672 | 0.999366 | 0.979605 | 1.018991 | | detectron2_fcos_r_50_fpn | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | timm_efficientdet | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | soft_actor_critic | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | pytorch_struct | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | drq | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 2 | 
1.016201 | 0.413757 | 0.0 | 0.0 | | DALLE2_pytorch | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 32 | 0.982534 | 0.0 | 0.0 | 0.0 | | dlrm | 1024 | 0.964046 | 0.515002 | 0.0 | 1.226872 | | timm_vision_transformer_large | 32 | 1.000023 | 0.979788 | 0.0 | 0.993088 | | torchrec_dlrm | 0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+----------+-----------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+----+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +-----------------------------------+----+------------------+------------------+------------------+------------------------+ | Background_Matting | 4 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 4 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 4 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_GPT2_large | 4 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | shufflenet_v2_x1_0 | 4 | pass | pass | pass | pass | | moco | 4 | pass | pass | pass | pass | | phlippe_densenet | 4 | pass | pass | pass | pass | | phlippe_resnet | 4 | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | | resnet152 | 4 | pass | pass | pass | pass | | resnet18 | 4 | pass | pass | pass | pass | | resnet50 | 4 | pass | pass | pass | pass | | resnext50_32x4d | 4 | pass | pass | pass | pass | | squeezenet1_1 | 4 | pass | pass | pass | pass | | speech_transformer | 32 | pass | pass | pass | pass | | mobilenet_v2 | 4 | pass | pass | pass | pass | | timm_efficientnet | 4 | pass | pass | pass | pass | | timm_nfnet | 4 | pass | pass | pass | pass | | timm_regnet | 4 | pass | pass | 
pass | pass | | timm_resnest | 4 | pass | pass | pass | pass | | timm_vision_transformer | 4 | pass | pass | pass | pass | | timm_vovnet | 4 | pass | pass | pass | pass | | tts_angular | 4 | pass | pass | pass | pass | | vgg16 | 4 | pass | pass | pass | pass | | yolov3 | 4 | pass | pass | pass | pass | | BERT_pytorch | 4 | fail_accuracy | pass | pass | pass | | mobilenet_v3_large | 4 | pass | pass | pass | pass | | nvidia_deeprecommender | 4 | pass | pass | pass | pass | | mnasnet1_0 | 4 | pass | pass | pass | pass | | dlrm | 4 | pass | pass | pass | pass | | LearningToPaint | 4 | pass | pass | pass | pass | | Super_SloMo | 4 | pass | pass | pass | pass | | alexnet | 4 | pass | pass | pass | pass | | attention_is_all_you_need_pytorch | 4 | pass | pass | pass | pass | | basic_gnn_edgecnn | 1 | pass | pass | pass | pass | | basic_gnn_gcn | 1 | pass | pass | pass | pass | | basic_gnn_gin | 1 | pass | pass | pass | pass | | lennard_jones | 4 | pass | pass | pass | pass | | dcgan | 4 | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | | densenet121 | 4 | pass | pass | pass | pass | | basic_gnn_sage | 1 | pass | pass | pass | pass | | fastNLP_Bert | 4 | pass | pass | pass | pass | | hf_DistilBert | 4 | pass | pass | pass | pass | | functorch_dp_cifar10 | 4 | pass | pass | pass | pass | | hf_T5 | 4 | pass | pass | pass | pass | | hf_Reformer | 4 | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | | hf_T5_base | 4 | pass | pass | pass | pass | | hf_BigBird | 4 | pass | pass | pass | pass | | hf_Bert_large | 4 | pass | pass | pass | pass | | hf_Bert | 4 | pass | pass | pass | pass | | hf_Bart | 4 | pass | pass | pass | pass | | hf_Albert | 4 | pass | pass | pass | pass | | vision_maskrcnn | 1 | pass | pass | infra_error | infra_error | | DALLE2_pytorch | 0 | infra_error | infra_error | infra_error | infra_error | | detectron2_fcos_r_50_fpn | 0 | infra_error | infra_error | infra_error | infra_error | | drq | 0 | infra_error | 
infra_error | infra_error | infra_error | | pytorch_struct | 0 | infra_error | infra_error | infra_error | infra_error | | soft_actor_critic | 0 | infra_error | infra_error | infra_error | infra_error | | timm_efficientdet | 0 | infra_error | infra_error | infra_error | infra_error | | torchrec_dlrm | 0 | infra_error | infra_error | infra_error | infra_error | | hf_Longformer | 4 | pass | pass | fail_to_run | fail_to_run | | llama | 4 | fail_accuracy | pass | fail_accuracy | fail_accuracy | +-----------------------------------+----+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+-----------+-----------+------------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +-----------------------------------+------+-----------+-----------+------------+------------------------+ | hf_T5_large | 2 | 31.439603 | 57.802168 | 164.083421 | 164.456502 | | hf_BigBird | 2 | 14.210805 | 39.864807 | 163.680048 | 126.065458 | | densenet121 | 4 | 6.08875 | 17.663866 | 113.059074 | 124.351767 | | timm_efficientnet | 32 | 3.993969 | 9.531764 | 110.753224 | 133.045389 | | mobilenet_v3_large | 32 | 2.413839 | 6.663535 | 102.873356 | 106.446296 | | hf_GPT2_large | 4 | 13.364166 | 27.884342 | 101.458251 | 101.483187 | | phlippe_densenet | 128 | 2.547705 | 6.22488 | 101.337409 | 153.191907 | | yolov3 | 16 | 2.648541 | 9.837036 | 94.954118 | 110.717039 | | resnet152 | 32 | 6.50623 | 17.380757 | 94.415224 | 98.678362 | | mobilenet_v2 | 96 | 2.221713 | 6.185012 | 92.460311 | 101.414281 | | timm_resnest | 32 | 1.375978 | 3.281833 | 72.12699 | 92.236499 | | mnasnet1_0 | 32 | 2.233631 | 6.156288 | 70.783563 | 85.075042 | | speech_transformer | 32 | 4.408621 | 12.618386 | 70.360211 | 71.684556 | | timm_nfnet | 128 | 5.738571 | 9.848938 | 64.911987 | 69.544176 | | shufflenet_v2_x1_0 | 128 | 2.541028 | 6.612881 | 64.884759 | 69.163346 | | 
timm_regnet | 32 | 6.148549 | 11.355753 | 63.242359 | 66.675486 | | hf_Bert_large | 4 | 8.977086 | 19.504087 | 60.793473 | 60.410969 | | Background_Matting | 4 | 2.332617 | 9.612557 | 58.472384 | 66.595764 | | attention_is_all_you_need_pytorch | 256 | 3.505044 | 9.522789 | 57.67811 | 58.422907 | | BERT_pytorch | 16 | 3.713043 | 10.090656 | 52.195144 | 53.596293 | | timm_vovnet | 32 | 3.107935 | 6.053281 | 49.165858 | 57.570949 | | resnet50 | 32 | 2.346468 | 7.000923 | 48.652712 | 55.526006 | | fastNLP_Bert | 6 | 4.164716 | 10.128299 | 48.572427 | 46.713766 | | hf_T5 | 8 | 4.915321 | 11.419154 | 48.113126 | 45.577417 | | pytorch_unet | 1 | 1.076603 | 3.689207 | 47.830672 | 57.432964 | | hf_Bart | 4 | 4.546189 | 11.59483 | 47.061006 | 49.216481 | | hf_Reformer | 4 | 4.48975 | 6.167701 | 46.172077 | 41.411991 | | resnext50_32x4d | 8 | 2.366839 | 5.909192 | 42.762865 | 45.120019 | | functorch_dp_cifar10 | 64 | 0.887442 | 2.125804 | 42.667529 | 54.300915 | | Super_SloMo | 6 | 2.285815 | 8.035609 | 41.629987 | 43.139268 | | hf_GPT2 | 4 | 4.747487 | 8.90541 | 39.97558 | 40.067813 | | pytorch_stargan | 16 | 0.865556 | 2.822374 | 39.396471 | 45.636277 | | timm_vision_transformer | 32 | 2.302068 | 5.710583 | 39.236317 | 41.092706 | | resnet18 | 16 | 0.946753 | 2.344216 | 38.098066 | 44.582691 | | hf_Albert | 8 | 2.360897 | 7.744948 | 36.59468 | 39.796835 | | LearningToPaint | 96 | 1.035538 | 2.504758 | 35.489835 | 40.383513 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.922841 | 2.633678 | 34.938649 | 37.547208 | | hf_Bert | 4 | 4.12986 | 9.508273 | 34.485988 | 36.734931 | | hf_DistilBert | 8 | 2.77078 | 4.502302 | 29.687752 | 30.247295 | | phlippe_resnet | 128 | 0.953731 | 2.422928 | 28.932409 | 33.693714 | | demucs | 4 | 1.100834 | 1.852201 | 27.642722 | 27.286456 | | squeezenet1_1 | 32 | 0.694507 | 1.502967 | 20.689336 | 24.304327 | | basic_gnn_gcn | 1 | 0.709566 | 0.960833 | 16.970181 | 15.818721 | | basic_gnn_edgecnn | 1 | 1.282825 | 2.266287 | 16.496915 | 0.0 | | vgg16 | 64 
| 0.420619 | 0.901943 | 15.236677 | 16.48442 | | alexnet | 128 | -0.653623 | 0.619124 | 14.853093 | 15.645378 | | nvidia_deeprecommender | 256 | 0.322201 | 0.617053 | 12.085651 | 10.625166 | | basic_gnn_sage | 1 | 0.617571 | 0.798499 | 11.174681 | 9.523117 | | dcgan | 32 | 0.338477 | 0.591528 | 11.05737 | 10.258963 | | basic_gnn_gin | 1 | 0.620047 | 0.898669 | 9.574421 | 9.717836 | | lennard_jones | 1000 | 0.258804 | 0.486293 | 8.655805 | 8.538321 | | tts_angular | 64 | 0.300747 | 0.39467 | 8.265909 | 9.247066 | | dlrm | 1024 | 0.372334 | 1.616226 | 0.0 | 9.201727 | | timm_vision_transformer_large | 32 | 7.192703 | 17.534214 | 0.0 | 109.763888 | | hf_Longformer | 2 | 7.310068 | 32.69951 | 0.0 | 0.0 | | moco | 32 | 24.68456 | 0.0 | 0.0 | 0.0 | | DALLE2_pytorch | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | detectron2_fcos_r_50_fpn | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | drq | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | pytorch_struct | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | soft_actor_critic | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | timm_efficientdet | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | torchrec_dlrm | 0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+-----------+-----------+------------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+----------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +-----------------------------------+------+----------+-----------+----------+------------------------+ | basic_gnn_gcn | 1 | 1.0 | 1.168533 | 3.650016 | 3.059778 | | basic_gnn_sage | 1 | 1.028514 | 1.0 | 2.008225 | 1.65962 | | basic_gnn_gin | 1 | 1.002292 | 0.974245 | 1.983976 | 1.788851 | | basic_gnn_edgecnn | 1 | 1.000679 | 1.136944 | 1.268084 | 0.0 | | hf_Albert | 8 | 0.999911 | 0.973313 | 1.264965 | 1.262069 | | Super_SloMo | 6 | 1.011702 | 1.01433 | 1.213417 | 1.22401 | | hf_BigBird | 2 | 1.004615 | 0.994082 | 1.160882 | 1.155953 | | BERT_pytorch | 16 | 1.000289 | 1.004587 | 
1.11184 | 1.096388 | | mobilenet_v2 | 96 | 0.999635 | 0.95159 | 1.108167 | 1.102092 | | fastNLP_Bert | 6 | 1.000268 | 0.991456 | 1.098417 | 1.08374 | | hf_T5_large | 2 | 0.999967 | 1.015386 | 1.093518 | 1.100758 | | hf_GPT2_large | 4 | 0.99993 | 0.968515 | 1.081592 | 1.118083 | | timm_nfnet | 128 | 0.91347 | 0.989046 | 1.076428 | 1.072757 | | lennard_jones | 1000 | 1.0 | 1.000112 | 1.068689 | 0.999804 | | hf_GPT2 | 4 | 1.000028 | 0.959248 | 1.065758 | 1.101962 | | hf_T5 | 8 | 0.999954 | 0.992418 | 1.043161 | 1.10217 | | Background_Matting | 4 | 1.011173 | 0.705627 | 1.039805 | 1.039524 | | yolov3 | 16 | 0.999819 | 0.994149 | 1.02349 | 1.022172 | | dcgan | 32 | 1.0 | 1.015395 | 1.006245 | 0.999846 | | hf_Reformer | 4 | 1.0 | 1.0 | 1.005698 | 1.0 | | hf_Bert_large | 4 | 1.0 | 0.989943 | 1.004658 | 1.003581 | | attention_is_all_you_need_pytorch | 256 | 1.003319 | 1.001208 | 1.003191 | 1.018958 | | demucs | 4 | 1.000058 | 1.000184 | 1.001963 | 0.999831 | | tts_angular | 64 | 1.0 | 1.0 | 0.996691 | 1.0 | | shufflenet_v2_x1_0 | 128 | 1.002631 | 1.003486 | 0.995438 | 0.985483 | | vgg16 | 64 | 1.0 | 0.999917 | 0.99064 | 0.988421 | | hf_Bert | 4 | 1.000248 | 0.985513 | 0.975042 | 0.968575 | | nvidia_deeprecommender | 256 | 1.0 | 0.970936 | 0.973313 | 0.971137 | | hf_DistilBert | 8 | 0.999326 | 0.982577 | 0.972378 | 0.967066 | | timm_resnest | 32 | 0.999633 | 1.100041 | 0.958733 | 0.952261 | | timm_regnet | 32 | 0.999704 | 0.999869 | 0.952921 | 0.950449 | | timm_efficientnet | 32 | 0.999872 | 0.958033 | 0.94973 | 0.94381 | | alexnet | 128 | 1.000735 | 1.001243 | 0.94399 | 0.938735 | | resnet152 | 32 | 0.999192 | 1.001836 | 0.943303 | 0.939124 | | pytorch_unet | 1 | 1.000597 | 0.866079 | 0.930504 | 0.93105 | | hf_Bart | 4 | 1.000543 | 0.921673 | 0.911767 | 0.941756 | | pytorch_stargan | 16 | 0.998426 | 1.050024 | 0.893437 | 0.889299 | | resnet50 | 32 | 1.000517 | 1.003551 | 0.890619 | 0.887016 | | timm_vovnet | 32 | 1.001787 | 1.001536 | 0.888781 | 0.887004 | | 
timm_vision_transformer | 32 | 0.999677 | 1.004442 | 0.85232 | 0.846964 |
| speech_transformer | 32 | 0.999301 | 1.000002 | 0.846621 | 0.844683 |
| mobilenet_v3_large | 32 | 1.002749 | 0.996106 | 0.788748 | 0.78255 |
| mnasnet1_0 | 32 | 0.997135 | 0.999441 | 0.784332 | 0.774557 |
| resnext50_32x4d | 8 | 1.000255 | 1.001975 | 0.780881 | 0.771616 |
| squeezenet1_1 | 32 | 1.000239 | 0.998744 | 0.776372 | 0.775402 |
| LearningToPaint | 96 | 1.0 | 1.001431 | 0.757111 | 0.7482 |
| phlippe_densenet | 128 | 1.0 | 0.999753 | 0.729494 | 0.713997 |
| densenet121 | 4 | 0.999194 | 0.982488 | 0.691168 | 0.670463 |
| resnet18 | 16 | 0.999306 | 0.999031 | 0.618876 | 0.61026 |
| pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.9865 | 0.603505 | 0.600365 |
| functorch_dp_cifar10 | 64 | 1.0 | 0.999337 | 0.453125 | 0.444502 |
| phlippe_resnet | 128 | 1.000654 | 1.000288 | 0.378591 | 0.36166 |
| detectron2_fcos_r_50_fpn | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| timm_efficientdet | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| soft_actor_critic | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| pytorch_struct | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| drq | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| timm_vision_transformer_large | 32 | 0.999978 | 1.003925 | 0.0 | 0.973508 |
| DALLE2_pytorch | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| moco | 32 | 0.977628 | 0.0 | 0.0 | 0.0 |
| hf_Longformer | 2 | 0.999106 | 0.981581 | 0.0 | 0.0 |
| dlrm | 1024 | 1.0 | 1.000239 | 0.0 | 1.000856 |
| torchrec_dlrm | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+----------+-----------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------+------+------------+------------+------------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+------------+------------+------------+------------------------+
| hf_GPT2_large | 4 | 209.743514 | 278.263137 | 119.851999 | 121.387135 |
| Background_Matting | 4 | 127.166333 | 815.382554 | 106.692103 | 107.651865 |
| hf_T5 | 8 | 179.867163 | 214.081589 | 103.929014 | 103.976291 |
| hf_T5_large | 2 | 220.827315 | 534.394825 | 92.848731 | 122.636986 |
| timm_nfnet | 128 | 118.48404 | 118.522766 | 72.383207 | 76.617107 |
| Super_SloMo | 6 | 81.272053 | 389.501557 | 71.989336 | 73.631438 |
| hf_Reformer | 4 | 81.199195 | 94.346731 | 68.901985 | 70.726586 |
| hf_BigBird | 2 | 200.190482 | 463.53716 | 66.188665 | 118.239417 |
| yolov3 | 16 | 69.449507 | 100.042606 | 56.058524 | 56.917833 |
| vgg16 | 64 | 66.053084 | 66.675128 | 52.395551 | 52.8455 |
| resnet152 | 32 | 62.922322 | 131.994108 | 52.076394 | 61.520554 |
| demucs | 4 | 53.378007 | 53.4464 | 50.519243 | 51.323677 |
| timm_regnet | 32 | 57.432232 | 90.482038 | 49.597881 | 54.684078 |
| hf_Bert_large | 4 | 80.805967 | 176.396209 | 49.089284 | 51.675919 |
| speech_transformer | 32 | 59.786829 | 132.323443 | 41.385651 | 38.71846 |
| fastNLP_Bert | 6 | 58.055502 | 108.546454 | 37.595166 | 38.576146 |
| attention_is_all_you_need_pytorch | 256 | 56.749064 | 110.79778 | 33.845332 | 35.132171 |
| hf_Bart | 4 | 54.824691 | 117.855948 | 32.748684 | 53.885642 |
| mobilenet_v2 | 96 | 49.590641 | 73.289949 | 31.269572 | 35.868631 |
| pytorch_unet | 1 | 40.562235 | 175.148977 | 29.931212 | 30.529352 |
| hf_Albert | 8 | 68.56327 | 97.30032 | 28.961743 | 29.81502 |
| hf_GPT2 | 4 | 48.076233 | 87.964772 | 25.598932 | 28.113427 |
| densenet121 | 4 | 51.575704 | 142.20561 | 21.708126 | 50.003876 |
| shufflenet_v2_x1_0 | 128 | 33.597051 | 62.070596 | 20.963371 | 26.907238 |
| resnet50 | 32 | 26.38896 | 49.775186 | 20.806872 | 24.357018 |
| hf_Bert | 4 | 41.596632 | 89.431368 | 20.777042 | 25.918155 |
| hf_DistilBert | 8 | 31.808461 | 47.486587 | 20.757692 | 21.297847 |
| timm_vovnet | 32 | 25.020608 | 44.16279 | 19.903423 | 21.885843 |
| timm_efficientnet | 32 | 35.170023 | 76.950871 | 17.270957 | 31.628223 |
| BERT_pytorch | 16 | 52.476098 | 119.2507 | 15.477095 | 25.364927 |
| timm_vision_transformer | 32 | 25.157028 | 50.651734 | 14.976379 | 18.969449 |
| timm_resnest | 32 | 24.105572 | 34.667665 | 14.541757 | 15.751461 |
| mnasnet1_0 | 32 | 22.941997 | 51.009472 | 13.985521 | 23.125186 |
| mobilenet_v3_large | 32 | 26.935175 | 58.649976 | 13.744624 | 27.768159 |
| phlippe_densenet | 128 | 26.885159 | 51.138504 | 12.783092 | 28.133978 |
| resnext50_32x4d | 8 | 20.321671 | 46.339256 | 11.907205 | 20.199126 |
| pytorch_stargan | 16 | 14.755791 | 28.485545 | 11.295711 | 11.662139 |
| nvidia_deeprecommender | 256 | 10.314474 | 10.311137 | 10.50392 | 10.110573 |
| alexnet | 128 | 9.749033 | 10.03852 | 8.526583 | 8.554891 |
| LearningToPaint | 96 | 11.514781 | 23.293387 | 8.199904 | 11.019652 |
| tts_angular | 64 | 6.470567 | 7.580891 | 6.19554 | 6.247491 |
| phlippe_resnet | 128 | 10.148995 | 19.853486 | 5.83463 | 9.243718 |
| basic_gnn_edgecnn | 1 | 7.576859 | 10.434382 | 5.638062 | 0.0 |
| resnet18 | 16 | 8.898828 | 18.6218 | 5.602462 | 8.931554 |
| squeezenet1_1 | 32 | 10.179315 | 17.106445 | 5.473075 | 7.992387 |
| pytorch_CycleGAN_and_pix2pix | 1 | 13.228025 | 26.220481 | 5.395211 | 7.201255 |
| basic_gnn_gcn | 1 | 4.916064 | 8.996784 | 3.458337 | 3.718402 |
| functorch_dp_cifar10 | 64 | 12.012928 | 18.743119 | 3.235116 | 7.399274 |
| basic_gnn_sage | 1 | 3.294526 | 6.12314 | 2.022108 | 2.454052 |
| dcgan | 32 | 2.334653 | 4.536521 | 1.536079 | 2.35901 |
| basic_gnn_gin | 1 | 4.050946 | 6.602456 | 1.400484 | 3.192411 |
| lennard_jones | 1000 | 1.574114 | 3.572559 | 0.873083 | 1.60309 |
| soft_actor_critic | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| pytorch_struct | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| detectron2_fcos_r_50_fpn | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| timm_efficientdet | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| drq | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| timm_vision_transformer_large | 32 | 416.854919 | 425.844473 | 0.0 | 420.331654 |
| DALLE2_pytorch | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| moco | 32 | 49.197174 | 0.0 | 0.0 | 0.0 |
| hf_Longformer | 2 | 116.920036 | 286.932827 | 0.0 | 0.0 |
| dlrm | 1024 | 6.501852 | 8.253714 | 0.0 | 3.466805 |
| torchrec_dlrm | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+------------+------------+------------+------------------------+
~~~

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
| MobileBertForQuestionAnswering | 128 | 1.013948 | 0.46647 | 2.780108 | 1.182392 |
| DebertaForQuestionAnswering | 16 | 0.999415 | 0.93414 | 2.748288 | 2.693413 |
| DebertaV2ForQuestionAnswering | 1 | 1.019968 | 0.534772 | 2.647552 | 1.959343 |
| MT5ForConditionalGeneration | 16 | 1.021138 | 0.484657 | 2.506785 | 1.919343 |
| OPTForCausalLM | 2 | 0.99936 | 0.912277 | 2.356594 | 2.349125 |
| GPT2ForSequenceClassification | 4 | 0.999497 | 0.86511 | 2.31787 | 2.272015 |
| ElectraForQuestionAnswering | 64 | 1.000032 | 0.951253 | 2.153292 | 2.091876 |
| MobileBertForMaskedLM | 128 | 1.014689 | 0.482264 | 1.994832 | 1.154881 |
| DebertaForMaskedLM | 8 | 0.999973 | 0.789217 | 1.949198 | 1.92653 |
| ElectraForCausalLM | 32 | 0.998817 | 0.859431 | 1.861379 | 1.883629 |
| LayoutLMForSequenceClassification | 16 | 0.999907 | 0.928003 | 1.825873 | 1.771729 |
| RobertaForQuestionAnswering | 16 | 1.000025 | 0.936475 | 1.802971 | 1.761879 |
| BertForQuestionAnswering | 16 | 1.00079 | 0.930211 | 1.797494 | 1.745037 |
| XGLMForCausalLM | 8 | 1.011627 | 0.483379 | 1.746136 | 1.54555 |
| RobertaForCausalLM | 16 | 1.000061 | 0.937462 | 1.679495 | 1.651481 |
| MegatronBertForQuestionAnswering | 8 | 0.998948 | 0.796862 | 1.668544 | 1.62954 |
| AlbertForQuestionAnswering | 4 | 1.000328 | 0.87335 | 1.655547 | 1.644761 |
| DistillGPT2 | 16 | 1.000361 | 0.945663 | 1.650823 | 1.675055 |
| M2M100ForConditionalGeneration | 16 | 1.007375 | 0.476325 | 1.649832 | 1.491729 |
| AlbertForMaskedLM | 4 | 1.000027 | 0.873116 | 1.645831 | 1.636325 |
| XLNetLMHeadModel | 8 | 0.99919 | 0.888554 | 1.626346 | 1.634082 |
| MegatronBertForCausalLM | 4 | 1.029205 | 0.484722 | 1.615127 | 1.552879 |
| BertForMaskedLM | 16 | 0.999089 | 0.937167 | 1.595655 | 1.584557 |
| PLBartForConditionalGeneration | 4 | 0.999868 | 0.869075 | 1.590527 | 1.588789 |
| DebertaV2ForMaskedLM | 2 | 1.020883 | 0.555165 | 1.567072 | 1.544834 |
| T5ForConditionalGeneration | 4 | 1.000079 | 0.692371 | 1.545598 | 1.585897 |
| T5Small | 4 | 0.999779 | 0.689581 | 1.536741 | 1.581885 |
| CamemBert | 16 | 1.000056 | 0.939538 | 1.53429 | 1.53267 |
| BartForConditionalGeneration | 2 | 1.002339 | 0.595178 | 1.505955 | 1.475818 |
| MBartForConditionalGeneration | 2 | 1.000198 | 0.568799 | 1.498238 | 1.467302 |
| BartForCausalLM | 4 | 0.99972 | 0.933782 | 1.494171 | 1.49175 |
| MBartForCausalLM | 4 | 0.999195 | 0.933679 | 1.488263 | 1.491064 |
| YituTechConvBert | 16 | 1.000015 | 0.859239 | 1.476193 | 1.456752 |
| PLBartForCausalLM | 8 | 1.000065 | 0.944641 | 1.461295 | 1.485481 |
| Speech2Text2ForCausalLM | 256 | 0.999598 | 0.884283 | 1.459613 | 1.483713 |
| DistilBertForQuestionAnswering | 256 | 0.99982 | 0.973493 | 1.445399 | 1.434791 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.006367 | 0.704126 | 1.362908 | 1.363553 |
| PegasusForConditionalGeneration | 32 | 1.005231 | 0.611184 | 1.319683 | 1.31133 |
| TrOCRForCausalLM | 32 | 1.000129 | 0.938249 | 1.253856 | 1.264002 |
| DistilBertForMaskedLM | 128 | 0.99951 | 0.934382 | 1.231039 | 1.239865 |
| BlenderbotSmallForCausalLM | 64 | 0.999792 | 0.802935 | 1.226609 | 1.263912 |
| PegasusForCausalLM | 32 | 1.000389 | 0.752019 | 1.204053 | 1.192735 |
| BlenderbotForCausalLM | 4 | 1.011696 | 0.49972 | 1.138141 | 1.119599 |
| LayoutLMForMaskedLM | 16 | 1.00001 | 0.938123 | 0.0 | 1.604682 |
| AllenaiLongformerBase | 4 | 0.999358 | 0.582473 | 0.0 | 0.0 |
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
| BlenderbotForCausalLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| DebertaV2ForMaskedLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| YituTechConvBert | 1 | pass | pass | pass | pass |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass |
| CamemBert | 1 | pass | pass | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | pass | pass |
| DebertaV2ForQuestionAnswering | 1 | pass | pass | infra_error | pass |
+-----------------------------------------+----+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+-----------+-----------+------------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+-----------+-----------+------------+------------------------+
| MobileBertForQuestionAnswering | 128 | 28.005908 | 50.090302 | 134.464177 | 134.520693 |
| MobileBertForMaskedLM | 128 | 26.541355 | 50.022521 | 132.344597 | 133.560424 |
| M2M100ForConditionalGeneration | 16 | 8.232879 | 22.093632 | 98.271694 | 98.993619 |
| XLNetLMHeadModel | 8 | 8.730799 | 23.367344 | 94.50377 | 98.369967 |
| XGLMForCausalLM | 8 | 7.002015 | 17.639691 | 85.759767 | 86.436074 |
| MT5ForConditionalGeneration | 16 | 6.961684 | 17.397204 | 84.680189 | 84.005012 |
| MBartForConditionalGeneration | 2 | 9.524429 | 22.614339 | 78.37576 | 77.586508 |
| BartForConditionalGeneration | 2 | 8.640034 | 21.831722 | 73.21505 | 73.113895 |
| DebertaV2ForMaskedLM | 2 | 15.014841 | 25.998448 | 71.168706 | 66.025264 |
| PegasusForConditionalGeneration | 32 | 4.152695 | 18.741433 | 70.392856 | 69.228935 |
| MegatronBertForCausalLM | 4 | 9.41297 | 20.29049 | 68.510422 | 65.352325 |
| DebertaV2ForQuestionAnswering | 1 | 13.291003 | 26.004463 | 68.001044 | 66.357613 |
| BlenderbotForCausalLM | 4 | 6.043639 | 18.061699 | 66.767457 | 64.225277 |
| MegatronBertForQuestionAnswering | 8 | 9.412642 | 19.553306 | 65.844739 | 65.134062 |
| YituTechConvBert | 16 | 5.986609 | 14.311981 | 62.389043 | 64.085156 |
| BlenderbotSmallForConditionalGeneration | 64 | 5.725183 | 14.854252 | 57.020904 | 53.715058 |
| T5Small | 4 | 4.829753 | 11.646779 | 47.015723 | 46.676919 |
| T5ForConditionalGeneration | 4 | 4.887451 | 11.57768 | 46.680816 | 46.210788 |
| PLBartForConditionalGeneration | 4 | 4.427963 | 11.516188 | 46.19498 | 46.879618 |
| ElectraForCausalLM | 32 | 4.277916 | 9.506654 | 43.828684 | 46.754753 |
| DebertaForMaskedLM | 8 | 6.988147 | 12.874999 | 42.348477 | 41.364712 |
| LayoutLMForSequenceClassification | 16 | 4.324636 | 9.89485 | 41.894954 | 42.14851 |
| DebertaForQuestionAnswering | 16 | 6.876514 | 12.653998 | 39.520491 | 38.947953 |
| RobertaForCausalLM | 16 | 4.909576 | 9.835368 | 38.496135 | 35.025488 |
| BertForMaskedLM | 16 | 4.817023 | 9.755045 | 38.025903 | 37.77552 |
| MBartForCausalLM | 4 | 3.601726 | 8.722706 | 37.898803 | 36.378014 |
| CamemBert | 16 | 4.252059 | 10.003216 | 37.556412 | 36.801517 |
| AlbertForMaskedLM | 4 | 2.140659 | 7.868742 | 37.496397 | 36.848402 |
| BertForQuestionAnswering | 16 | 4.104459 | 9.663629 | 37.093781 | 36.475531 |
| TrOCRForCausalLM | 32 | 3.671399 | 8.717741 | 36.849418 | 36.250642 |
| ElectraForQuestionAnswering | 64 | 4.231301 | 9.704069 | 36.808427 | 37.085661 |
| BartForCausalLM | 4 | 3.880073 | 8.798065 | 36.216865 | 35.372078 |
| PegasusForCausalLM | 32 | 4.530307 | 9.036373 | 36.065896 | 34.979598 |
| AlbertForQuestionAnswering | 4 | 3.143062 | 7.671309 | 35.832595 | 34.116697 |
| OPTForCausalLM | 2 | 3.334007 | 8.370733 | 34.41215 | 36.543114 |
| GPT2ForSequenceClassification | 4 | 4.23757 | 8.806861 | 34.021485 | 32.988405 |
| RobertaForQuestionAnswering | 16 | 5.321378 | 9.405191 | 33.953307 | 35.571662 |
| DistilBertForQuestionAnswering | 256 | 2.684745 | 4.576588 | 32.997891 | 33.474498 |
| DistilBertForMaskedLM | 128 | 1.906759 | 5.577871 | 32.012122 | 31.835771 |
| BlenderbotSmallForCausalLM | 64 | 2.390067 | 6.839191 | 30.847765 | 28.647454 |
| DistillGPT2 | 16 | 1.794796 | 4.343471 | 28.615149 | 27.169641 |
| Speech2Text2ForCausalLM | 256 | 2.636885 | 4.348747 | 26.361592 | 26.945285 |
| PLBartForCausalLM | 8 | 1.961152 | 4.522749 | 25.933666 | 25.806095 |
| LayoutLMForMaskedLM | 16 | 4.381036 | 9.961645 | 0.0 | 37.867065 |
| AllenaiLongformerBase | 4 | 7.66396 | 30.761964 | 0.0 | 0.0 |
+-----------------------------------------+-----+-----------+-----------+------------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
| DebertaForQuestionAnswering | 16 | 1.021271 | 1.147063 | 1.340031 | 1.337637 |
| AlbertForQuestionAnswering | 4 | 0.999999 | 0.791079 | 1.315569 | 1.314665 |
| AlbertForMaskedLM | 4 | 1.0 | 0.783779 | 1.257824 | 1.268002 |
| GPT2ForSequenceClassification | 4 | 1.000115 | 0.978078 | 1.097279 | 1.126616 |
| OPTForCausalLM | 2 | 1.000031 | 0.968122 | 1.090841 | 1.128266 |
| DistilBertForQuestionAnswering | 256 | 1.011432 | 1.024273 | 1.084986 | 1.082283 |
| ElectraForQuestionAnswering | 64 | 1.001377 | 1.00139 | 1.072642 | 1.07173 |
| BertForQuestionAnswering | 16 | 1.001687 | 1.001795 | 1.069941 | 1.065794 |
| RobertaForQuestionAnswering | 16 | 1.001249 | 1.001349 | 1.069558 | 1.065445 |
| DebertaForMaskedLM | 8 | 0.999756 | 1.026032 | 1.055076 | 1.108048 |
| LayoutLMForSequenceClassification | 16 | 1.001432 | 1.00144 | 1.039867 | 1.035639 |
| XLNetLMHeadModel | 8 | 1.0 | 0.984302 | 1.033141 | 1.033141 |
| MegatronBertForQuestionAnswering | 8 | 1.0 | 0.999999 | 1.029183 | 1.028523 |
| T5Small | 4 | 0.999923 | 0.987768 | 1.02166 | 1.067157 |
| T5ForConditionalGeneration | 4 | 0.999923 | 0.987768 | 1.02166 | 1.067157 |
| MegatronBertForCausalLM | 4 | 1.0 | 0.986938 | 1.020586 | 1.032361 |
| BlenderbotForCausalLM | 4 | 0.997825 | 0.998157 | 1.000304 | 0.99879 |
| DebertaV2ForQuestionAnswering | 1 | 1.000099 | 1.000099 | 0.999938 | 0.999281 |
| MBartForConditionalGeneration | 2 | 1.0 | 0.973796 | 0.995659 | 1.02194 |
| PegasusForConditionalGeneration | 32 | 0.999993 | 0.944958 | 0.989469 | 1.048692 |
| BartForConditionalGeneration | 2 | 1.0 | 0.973768 | 0.979934 | 1.005418 |
| DebertaV2ForMaskedLM | 2 | 0.999666 | 0.981407 | 0.974219 | 0.990623 |
| RobertaForCausalLM | 16 | 0.999899 | 0.958734 | 0.94639 | 0.983123 |
| DistillGPT2 | 16 | 0.999964 | 0.915694 | 0.940389 | 1.027408 |
| MBartForCausalLM | 4 | 1.0 | 0.95108 | 0.936612 | 0.982652 |
| MobileBertForMaskedLM | 128 | 0.999985 | 0.932688 | 0.935191 | 0.983458 |
| YituTechConvBert | 16 | 1.0 | 0.955142 | 0.930818 | 0.929368 |
| BertForMaskedLM | 16 | 0.999764 | 0.958621 | 0.927181 | 0.924076 |
| CamemBert | 16 | 0.999989 | 0.957465 | 0.924779 | 0.921675 |
| BartForCausalLM | 4 | 1.0 | 0.951014 | 0.921898 | 0.966594 |
| XGLMForCausalLM | 8 | 1.0 | 0.943529 | 0.91812 | 0.969992 |
| M2M100ForConditionalGeneration | 16 | 1.0 | 0.938979 | 0.910994 | 0.966946 |
| PLBartForConditionalGeneration | 4 | 1.00005 | 0.930006 | 0.908319 | 0.973149 |
| PegasusForCausalLM | 32 | 1.0 | 0.926025 | 0.904791 | 0.973312 |
| MT5ForConditionalGeneration | 16 | 0.999948 | 0.922812 | 0.903874 | 0.991231 |
| TrOCRForCausalLM | 32 | 1.0 | 0.919942 | 0.87395 | 0.881037 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.889539 | 0.864986 | 0.897783 |
| PLBartForCausalLM | 8 | 1.0 | 0.923832 | 0.863 | 0.860945 |
| ElectraForCausalLM | 32 | 0.999976 | 0.917402 | 0.861134 | 0.93223 |
| MobileBertForQuestionAnswering | 128 | 1.016118 | 1.024925 | 0.857907 | 0.857131 |
| DistilBertForMaskedLM | 128 | 1.000018 | 0.917034 | 0.851792 | 0.849938 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.890384 | 0.804903 | 0.803499 |
| Speech2Text2ForCausalLM | 256 | 1.0 | 0.888278 | 0.77739 | 0.775883 |
| LayoutLMForMaskedLM | 16 | 0.999888 | 0.958901 | 0.0 | 0.924424 |
| AllenaiLongformerBase | 4 | 0.998844 | 0.951083 | 0.0 | 0.0 |
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------------+-----+------------+------------+------------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+------------+------------+------------+------------------------+
| XLNetLMHeadModel | 8 | 277.016386 | 311.764992 | 170.93676 | 169.043056 |
| AlbertForMaskedLM | 4 | 266.852506 | 305.86614 | 162.205125 | 163.083869 |
| AlbertForQuestionAnswering | 4 | 265.070686 | 303.194028 | 160.052774 | 160.974258 |
| TrOCRForCausalLM | 32 | 135.298269 | 144.146378 | 108.454232 | 107.836772 |
| PegasusForConditionalGeneration | 32 | 136.739217 | 234.549796 | 107.324999 | 109.105772 |
| MobileBertForMaskedLM | 128 | 175.810528 | 397.080822 | 93.477214 | 162.804799 |
| YituTechConvBert | 16 | 133.302608 | 156.60748 | 90.92096 | 93.080105 |
| MBartForConditionalGeneration | 2 | 133.92787 | 238.272106 | 89.959816 | 92.407365 |
| BartForConditionalGeneration | 2 | 133.74041 | 224.192551 | 88.67486 | 91.007708 |
| MegatronBertForQuestionAnswering | 8 | 141.779694 | 178.000234 | 85.562515 | 87.803048 |
| BlenderbotForCausalLM | 4 | 89.909457 | 182.567356 | 79.563173 | 81.7712 |
| BlenderbotSmallForConditionalGeneration | 64 | 108.340561 | 155.596584 | 79.292134 | 79.038295 |
| CamemBert | 16 | 118.496928 | 126.03016 | 77.124599 | 78.028231 |
| DebertaV2ForMaskedLM | 2 | 118.481165 | 212.967894 | 74.15952 | 76.213999 |
| MBartForCausalLM | 4 | 108.309004 | 116.002761 | 73.098479 | 73.205261 |
| BartForCausalLM | 4 | 108.563164 | 115.7605 | 72.313173 | 72.48485 |
| PLBartForConditionalGeneration | 4 | 113.500478 | 130.329322 | 71.762294 | 72.370534 |
| DistilBertForQuestionAnswering | 256 | 102.873574 | 105.65135 | 71.370854 | 72.075112 |
| M2M100ForConditionalGeneration | 16 | 116.169399 | 244.385957 | 71.244798 | 87.374891 |
| PLBartForCausalLM | 8 | 102.536118 | 108.485397 | 70.476753 | 69.49154 |
| T5Small | 4 | 103.32242 | 150.514971 | 68.882033 | 66.842522 |
| T5ForConditionalGeneration | 4 | 103.362044 | 149.945978 | 68.863693 | 66.818725 |
| BertForMaskedLM | 16 | 109.965019 | 117.322671 | 68.813818 | 69.507316 |
| RobertaForCausalLM | 16 | 114.930009 | 122.707966 | 68.787922 | 69.598675 |
| DistilBertForMaskedLM | 128 | 84.142975 | 89.979572 | 68.647007 | 67.866745 |
| MobileBertForQuestionAnswering | 128 | 181.745989 | 400.246042 | 66.049968 | 156.101302 |
| OPTForCausalLM | 2 | 155.246568 | 169.915345 | 65.80678 | 66.600839 |
| DistillGPT2 | 16 | 105.696834 | 111.898442 | 64.033449 | 63.104804 |
| PegasusForCausalLM | 32 | 68.190021 | 90.585331 | 57.286371 | 57.135705 |
| MegatronBertForCausalLM | 4 | 86.17124 | 185.62622 | 54.927648 | 58.193645 |
| LayoutLMForSequenceClassification | 16 | 97.871068 | 105.438826 | 53.932518 | 55.204354 |
| BertForQuestionAnswering | 16 | 95.33042 | 102.560064 | 53.431936 | 54.651938 |
| ElectraForQuestionAnswering | 64 | 115.256001 | 120.816882 | 53.357514 | 55.019793 |
| RobertaForQuestionAnswering | 16 | 95.807973 | 102.049819 | 53.026195 | 54.84576 |
| DebertaForQuestionAnswering | 16 | 145.63176 | 155.702317 | 52.938425 | 54.0114 |
| XGLMForCausalLM | 8 | 92.257911 | 193.848468 | 50.147052 | 56.908814 |
| DebertaForMaskedLM | 8 | 93.849216 | 119.057684 | 48.558856 | 48.678837 |
| ElectraForCausalLM | 32 | 87.737753 | 101.708558 | 46.940744 | 47.323442 |
| BlenderbotSmallForCausalLM | 64 | 56.735862 | 70.750482 | 46.229695 | 46.02597 |
| DebertaV2ForQuestionAnswering | 1 | 107.643376 | 207.988094 | 41.279191 | 53.56479 |
| MT5ForConditionalGeneration | 16 | 96.460884 | 209.539665 | 39.911509 | 52.325371 |
| GPT2ForSequenceClassification | 4 | 90.507337 | 104.834707 | 39.596057 | 39.892535 |
| Speech2Text2ForCausalLM | 256 | 49.689272 | 55.855297 | 34.465984 | 33.965759 |
| LayoutLMForMaskedLM | 16 | 112.489355 | 119.939849 | 0.0 | 70.264157 |
| AllenaiLongformerBase | 4 | 197.24792 | 342.259095 | 0.0 | 0.0 |
+-----------------------------------------+-----+------------+------------+------------+------------------------+
~~~
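
For readers who want to relate the per-model speedups above to a suite-level geometric mean, here is a minimal sketch. It assumes that rows reported as `0.0` (no result for that backend) are excluded from the mean; that exclusion rule is an assumption for illustration, not something this comment confirms about the dashboard script. The helper name `geomean_speedup` is hypothetical.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of the positive entries.

    Entries equal to 0.0 are treated as failed runs and skipped
    (an assumption for this sketch, not confirmed by the source).
    """
    valid = [s for s in speedups if s > 0.0]
    if not valid:
        return 0.0
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# A handful of inductor speedups copied from the huggingface table above.
inductor = {
    "MobileBertForQuestionAnswering": 2.780108,
    "OPTForCausalLM": 2.356594,
    "BertForMaskedLM": 1.595655,
    "BlenderbotForCausalLM": 1.138141,
    "LayoutLMForMaskedLM": 0.0,    # no inductor result in this run
    "AllenaiLongformerBase": 0.0,  # no inductor result in this run
}

print(f"{geomean_speedup(inductor.values()):.2f}x over {len(inductor)} sampled rows")
```

Feeding a full `inductor` column through the same helper gives a number that can be compared against the summary table at the top of the comment; whether it matches exactly depends on how the dashboard treats failed models.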

timm_models suite with amp precision

Performance speedup
~~~
+---------------------------------+-----+----------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+----------+-----------+----------+------------------------+
| tnt_s_patch16_224 | 128 | 0.999937 | 0.979273 | 2.535318 | 2.488158 |
| levit_128 | 128 | 1.006095 | 0.479146 | 2.163689 | 1.462609 |
| xcit_large_24_p8_224 | 5 | 0.998343 | 0.478302 | 2.115537 | 1.591851 |
| cait_m36_384 | 2 | 0.942583 | 0.474984 | 2.051115 | 1.572233 |
| lcnet_050 | 128 | 0.998996 | 0.68977 | 2.045065 | 1.718694 |
| ghostnet_100 | 128 | 0.99974 | 0.657087 | 1.957761 | 1.731641 |
| twins_pcpvt_base | 64 | 1.002899 | 0.554792 | 1.937105 | 1.573457 |
| coat_lite_mini | 128 | 0.999538 | 0.90529 | 1.897422 | 1.862499 |
| regnety_002 | 128 | 1.018046 | 0.581444 | 1.856761 | 1.440385 |
| gmlp_s16_224 | 128 | 1.000266 | 0.973587 | 1.743779 | 1.707174 |
| gmixer_24_224 | 128 | 0.999746 | 0.81072 | 1.713867 | 1.679773 |
| mobilenetv3_large_100 | 128 | 0.999267 | 0.757196 | 1.641153 | 1.585806 |
| mnasnet_100 | 128 | 0.999342 | 0.783825 | 1.630094 | 1.619385 |
| crossvit_9_240 | 128 | 1.003288 | 0.578453 | 1.628457 | 1.310233 |
| mobilenetv2_100 | 128 | 0.999443 | 0.767241 | 1.621301 | 1.60987 |
| swin_base_patch4_window7_224 | 64 | 0.999643 | 0.735531 | 1.620062 | 1.578529 |
| volo_d1_224 | 64 | 0.999606 | 0.919902 | 1.603852 | 1.570803 |
| dm_nfnet_f0 | 128 | 0.999379 | 0.971582 | 1.603275 | 1.518219 |
| sebotnet33ts_256 | 64 | 0.999996 | 0.780509 | 1.600624 | 1.573112 |
| nfnet_l0 | 128 | 0.999007 | 0.780911 | 1.590227 | 1.501868 |
| spnasnet_100 | 128 | 0.999321 | 0.781063 | 1.570904 | 1.546223 |
| convit_base | 64 | 0.999734 | 0.97573 | 1.564137 | 1.536557 |
| dla102 | 128 | 0.999475 | 0.82009 | 1.558571 | 1.537289 |
| gluon_inception_v3 | 128 | 0.999921 | 0.860162 | 1.557212 | 1.528026 |
| fbnetc_100 | 128 | 0.999546 | 0.774868 | 1.555221 | 1.541267 |
| inception_v3 | 128 | 0.999767 | 0.860256 | 1.554518 | 1.524263 |
| adv_inception_v3 | 128 | 0.999568 | 0.85449 | 1.554092 | 1.525432 |
| convnext_base | 64 | 0.999662 | 0.916394 | 1.537376 | 1.505963 |
| tf_efficientnet_b0 | 128 | 0.999678 | 0.699331 | 1.512592 | 1.490584 |
| eca_botnext26ts_256 | 128 | 0.999655 | 0.740707 | 1.491524 | 1.48158 |
| mobilevit_s | 64 | 0.999561 | 0.640437 | 1.490321 | 1.37385 |
| fbnetv3_b | 128 | 0.99911 | 0.73631 | 1.473592 | 1.454377 |
| botnet26t_256 | 128 | 0.999733 | 0.874946 | 1.46725 | 1.457085 |
| SelecSls42b | 128 | 0.999507 | 0.81013 | 1.462733 | 1.433993 |
| resnest101e | 64 | 0.999128 | 0.739484 | 1.43997 | 1.344963 |
| ese_vovnet19b_dw | 128 | 0.999513 | 0.861125 | 1.437404 | 1.429039 |
| cspdarknet53 | 64 | 0.999553 | 0.829967 | 1.433597 | 1.414689 |
| rexnet_100 | 128 | 0.999486 | 0.713482 | 1.430253 | 1.404121 |
| tinynet_a | 128 | 0.999533 | 0.620371 | 1.407258 | 1.257235 |
| res2next50 | 128 | 0.999736 | 0.810016 | 1.396026 | 1.37432 |
| eca_halonext26ts | 128 | 0.999875 | 0.746769 | 1.383023 | 1.37401 |
| poolformer_m36 | 64 | 0.999077 | 0.96823 | 1.38185 | 1.348848 |
| res2net50_14w_8s | 128 | 0.999756 | 0.71351 | 1.373582 | 1.354588 |
| mixer_b16_224 | 128 | 0.999596 | 0.968242 | 1.327419 | 1.317113 |
| repvgg_a2 | 128 | 0.999542 | 0.789153 | 1.263607 | 1.251531 |
| dpn107 | 32 | 1.00005 | 0.736984 | 1.25536 | 1.231054 |
| tf_mixnet_l | 128 | 0.999972 | 0.834594 | 1.242394 | 1.227238 |
| mixnet_l | 128 | 0.999308 | 0.828089 | 1.237938 | 1.218866 |
| pit_b_224 | 64 | 0.999462 | 0.910021 | 1.227257 | 1.207004 |
| jx_nest_base | 32 | 0.999827 | 0.664991 | 1.220553 | 1.19083 |
| visformer_small | 128 | 0.999601 | 0.94578 | 1.188301 | 1.136696 |
| gernet_l | 128 | 0.999329 | 0.824337 | 1.164958 | 1.153659 |
| resmlp_12_224 | 128 | 1.000008 | 0.761694 | 1.152558 | 1.139756 |
| deit_base_distilled_patch16_224 | 64 | 0.99969 | 0.956208 | 1.141128 | 1.125946 |
| vit_base_patch16_224 | 64 | 0.999596 | 0.960421 | 1.129073 | 1.116839 |
| beit_base_patch16_224 | 64 | 0.999567 | 0.8994 | 1.126642 | 1.118745 |
| swsl_resnext101_32x16d | 32 | 0.999512 | 0.820644 | 1.088081 | 1.025927 |
| res2net101_26w_4s | 64 | 0.999153 | 0.564931 | 1.046018 | 1.084082 |
| pnasnet5large | 16 | 0.998644 | 0.736695 | 1.042094 | 1.119958 |
| convmixer_768_32 | 32 | 0.999638 | 0.947029 | 0.998149 | 0.995412 |
| hrnet_w18 | 128 | 0.968426 | 0.705496 | 0.972432 | 1.036891 |
+---------------------------------+-----+----------+-----------+----------+------------------------+
~~~

Accuracy
~~~
+---------------------------------+----+---------------+---------------+-------------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+---------------------------------+----+---------------+---------------+-------------+------------------------+
| SelecSls42b | 8 | pass | pass | pass | pass |
| adv_inception_v3 | 8 | pass | pass | pass | pass |
| mobilevit_s | 8 | pass | pass | pass | pass |
| nfnet_l0 | 8 | pass | pass | pass | pass |
| pit_b_224 | 8 | pass | pass | pass | pass |
| pnasnet5large | 8 | pass | pass | pass | pass |
| poolformer_m36 | 8 | pass | pass | pass | pass |
| regnety_002 | 8 | pass | pass | pass | pass |
| repvgg_a2 | 8 | pass | pass | pass | pass |
| res2net101_26w_4s | 8 | pass | pass | pass | pass |
| res2net50_14w_8s | 8 | pass | pass | pass | pass |
| res2next50 | 8 | pass | pass | pass | pass |
| resmlp_12_224 | 8 | pass | pass | pass | pass |
| resnest101e | 8 | pass | pass | pass | pass |
| rexnet_100 | 8 | pass | pass | pass | pass |
| sebotnet33ts_256 | 8 | pass | pass | pass | pass |
| spnasnet_100 | 8 | pass | pass | pass | pass |
| swin_base_patch4_window7_224 | 8 | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 8 | pass | pass | pass | pass |
| tf_efficientnet_b0 | 8 | pass | pass | pass | pass |
| tf_mixnet_l | 8 | pass | pass | pass | pass |
| tinynet_a | 8 | pass | pass | pass | pass |
| tnt_s_patch16_224 | 8 | pass | pass | pass | pass |
| twins_pcpvt_base | 8 | pass | pass | pass | pass |
| visformer_small | 8 | pass | pass | pass | pass |
| vit_base_patch16_224 | 8 | pass | pass | pass | pass |
| volo_d1_224 | 8 | pass | pass | pass | pass |
| xcit_large_24_p8_224 | 8 | pass | pass | pass | pass |
| lcnet_050 | 8 | pass | fail_accuracy | pass | pass |
| mobilenetv3_large_100 | 8 | pass | pass | pass | pass |
| mobilenetv2_100 | 8 | pass | pass | pass | pass |
| mnasnet_100 | 8 | pass | pass | pass | pass |
| mixnet_l | 8 | pass | pass | pass | pass |
| beit_base_patch16_224 | 8 | pass | pass | pass | pass |
| botnet26t_256 | 8 | pass | pass | pass | pass |
| coat_lite_mini | 8 | pass | pass | pass | pass |
| convmixer_768_32 | 8 | pass | pass | pass | pass |
| convnext_base | 8 | pass | pass | pass | pass |
| crossvit_9_240 | 8 | pass | pass | pass | pass |
| cspdarknet53 | 8 | pass | pass | pass | pass |
| deit_base_distilled_patch16_224 | 8 | pass | pass | pass | pass |
| dla102 | 8 | pass | pass | pass | pass |
| dm_nfnet_f0 | 8 | pass | pass | pass | pass |
| dpn107 | 8 | pass | pass | pass | pass |
| eca_botnext26ts_256 | 8 | pass | pass | pass | pass |
| eca_halonext26ts | 8 | pass | pass | pass | pass |
| ese_vovnet19b_dw | 8 | pass | pass | pass | pass |
| fbnetc_100 | 8 | pass | pass | pass | pass |
| fbnetv3_b | 8 | pass | pass | pass | pass |
| gernet_l | 8 | pass | pass | pass | pass |
| ghostnet_100 | 8 | pass | pass | pass | pass |
| gluon_inception_v3 | 8 | pass | pass | pass | pass |
| gmixer_24_224 | 8 | pass | pass | pass | pass |
| gmlp_s16_224 | 8 | pass | pass | pass | pass |
| hrnet_w18 | 8 | pass | pass | pass | pass |
| inception_v3 | 8 | pass | pass | pass | pass |
| jx_nest_base | 8 | pass | pass | pass | pass |
| levit_128 | 8 | pass | pass | pass | pass |
| mixer_b16_224 | 8 | pass | pass | pass | pass |
| convit_base | 8 | fail_accuracy | pass | fail_to_run | fail_to_run |
| cait_m36_384 | 8 | pass | pass | OOM | pass |
+---------------------------------+----+---------------+---------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+-----------+-----------+------------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+-----------+-----------+------------+------------------------+
| hrnet_w18 | 128 | 11.526093 | 34.704092 | 215.300548 | 235.439188 |
| rexnet_100 | 128 | 4.709586 | 9.921003 | 198.835597 | 287.845509 |
| ghostnet_100 | 128 | 4.28569 | 11.676922 | 182.636547 | 232.169208 |
| pnasnet5large | 16 | 9.84799 | 27.578502 | 147.549783 | 156.780616 |
| resnest101e | 64 | 9.160534 | 21.012353 | 138.563704 | 158.368072 |
| mobilevit_s | 64 | 4.339656 | 9.846744 | 135.482809 | 157.381025 |
| fbnetv3_b | 128 | 6.780701 | 14.103187 | 131.883844 | 162.084463 |
| gluon_inception_v3 | 128 | 6.771645 | 13.31762 | 131.07415 | 158.07422 |
| tf_mixnet_l | 128 | 7.631285 | 15.099839 | 128.487929 | 147.30566 |
| res2net101_26w_4s | 64 | 6.925115 | 22.037917 | 128.054408 | 138.097871 |
| adv_inception_v3 | 128 | 6.916839 | 13.522195 | 126.647641 | 150.583826 |
| inception_v3 | 128 | 6.908191 | 13.404646 | 126.359993 | 154.997186 |
| tinynet_a | 128 | 4.562391 | 10.756223 | 123.713257 | 143.520164 |
| mixnet_l | 128 | 8.114457 | 14.762465 | 122.645935 | 147.809105 |
| tf_efficientnet_b0 | 128 | 3.876222 | 8.95223 | 119.095017 | 135.472967 |
| mobilenetv3_large_100 | 128 | 3.268903 | 7.370348 | 116.468401 | 141.481082 |
| xcit_large_24_p8_224 | 5 | 9.060889 | 23.917738 | 114.981726 | 117.396615 |
| res2net50_14w_8s | 128 | 6.656659 | 20.358322 | 110.228272 | 114.338428 |
| levit_128 | 128 | 4.723784 | 11.611245 | 105.940064 | 118.947059 |
| fbnetc_100 | 128 | 4.740683 | 8.405713 | 105.643603 | 134.635195 |
| cait_m36_384 | 2 | 10.683412 | 26.892269 | 104.168617 | 101.645813 |
| spnasnet_100 | 128 | 3.801504 | 8.38845 | 104.156516 | 131.219595 |
| swin_base_patch4_window7_224 | 64 | 6.711992 | 16.408464 | 102.961338 | 103.763249 |
| twins_pcpvt_base | 64 | 8.097481 | 18.079822 | 101.116118 | 105.321891 |
| eca_halonext26ts | 128 | 2.803947 | 6.547626 | 95.464727 | 106.42396 |
| poolformer_m36 | 64 | 7.666118 | 13.079596 | 93.365106 | 100.51483 |
| mobilenetv2_100 | 128 | 3.233801 | 7.084594 | 88.824387 | 103.558762 |
| dpn107 | 32 | 9.221649 | 18.13529 | 87.67872 | 90.00538 |
| sebotnet33ts_256 | 64 | 3.449897 | 7.858807 | 84.98418 | 96.118227 |
| regnety_002 | 128 | 4.030516 | 8.01265 | 84.177373 | 87.946725 |
| cspdarknet53 | 64 | 5.331471 | 10.261594 | 82.409863 | 93.775798 |
| jx_nest_base | 32 | 6.090223 | 13.536968 | 81.942001 | 80.856023 |
| dla102 | 128 | 4.956245 | 13.010602 | 81.783964 | 90.879032 |
| mnasnet_100 | 128 | 3.710201 | 6.811075 | 79.663665 | 100.111465 |
| coat_lite_mini | 128 | 2.811215 | 7.210743 | 79.427025 | 82.83736 |
| eca_botnext26ts_256 | 128 | 2.733343 | 6.420717 | 76.05704 | 90.096494 |
| lcnet_050 | 128 | 2.341389 | 4.422994 | 73.907701 | 96.204906 |
| crossvit_9_240 | 128 | 4.295831 | 11.204699 | 73.348105 | 74.572179 |
| res2next50 | 128 | 4.780722 | 10.478523 | 73.289263 | 82.035364 |
| botnet26t_256 | 128 | 2.510238 | 5.418026 | 71.797055 | 86.853273 |
| volo_d1_224 | 64 | 3.63585 | 10.17847 | 68.454106 | 69.542334 |
| nfnet_l0 | 128 | 4.139216 | 9.284106 | 65.549119 | 71.427918 |
| dm_nfnet_f0 | 128 | 4.891267 | 9.810936 | 65.227689 | 69.27176 |
| tnt_s_patch16_224 | 128 | 4.937536 | 13.830504 | 65.109728 | 65.763122 |
| gernet_l | 128 | 4.505373 | 8.082768 | 64.167078 | 74.439501 |
| SelecSls42b | 128 | 1.928453 | 4.732315 | 61.956307 | 85.059221 |
| ese_vovnet19b_dw | 128 | 2.16477 | 4.153617 | 59.97885 | 74.747155 |
| swsl_resnext101_32x16d | 32 | 4.890121 | 12.505957 | 59.081047 | 59.323381 |
| visformer_small | 128 | 2.055817 | 5.208936 | 56.404626 | 60.933241 |
| convnext_base | 64 | 5.300881 | 11.968768 | 54.463518 | 55.232698 |
| gmlp_s16_224 | 128 | 4.926444 | 10.300078 | 54.267164 | 53.275909 |
| gmixer_24_224 | 128 | 4.541943 | 12.086148 | 48.574792 | 47.430169 |
| repvgg_a2 | 128 | 4.114284 | 7.906238 | 48.392955 | 56.701221 |
| convit_base | 64 | 2.941901 | 7.936277 | 43.926032 | 42.370222 |
| resmlp_12_224 | 128 | 2.033362 | 4.268435 | 36.677943 | 36.590531 |
| beit_base_patch16_224 | 64 | 3.389528 | 7.645599 | 36.5096 | 33.7776 |
| convmixer_768_32 | 32 | 1.667835 | 6.788494 | 35.26413 | 34.369227 |
| pit_b_224 | 64 | 2.808171 | 6.189776 | 35.009204 | 34.056854 |
| mixer_b16_224 | 128 | 3.228623 | 4.958011 | 33.479241 | 29.756883 |
| vit_base_patch16_224 | 64 | 2.341231 | 5.392029 | 33.37032 | 32.161311 |
| deit_base_distilled_patch16_224 | 64 | 2.321951 | 5.310861 | 32.587011 | 32.188822 |
+---------------------------------+-----+-----------+-----------+------------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+----------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+----------+-----------+----------+------------------------+
| pnasnet5large | 16 | 1.058353 | 1.058169 | 1.288607 | 1.284813 |
| gmlp_s16_224 | 128 | 1.00147 | 1.001336 | 1.205843 | 1.204895 |
| poolformer_m36 | 64 | 1.003335 | 1.003329 | 1.195425 | 1.192848 |
| gmixer_24_224 | 128 | 1.001673 | 1.001863 | 1.160637 | 1.159565 |
| convit_base | 64 | 0.999863 | 1.001607 | 1.158278 | 1.157029 |
| mobilenetv2_100 | 128 | 1.000388 | 0.951814 | 1.121887 | 1.118241 |
| sebotnet33ts_256 | 64 | 1.000271 | 1.000207 | 1.112976 | 1.111256 |
| resnest101e | 64 | 1.0 | 1.103821 | 1.084709 | 1.08367 |
| resmlp_12_224 | 128 | 0.999919 | 1.008706 | 1.076966 | 1.074078 |
| dm_nfnet_f0 | 128 | 0.91384 | 0.989346 | 1.07696 | 1.072741 |
| tf_efficientnet_b0 | 128 | 1.000043 | 0.956726 | 1.070345 | 1.067359 |
| tf_mixnet_l | 128 | 1.000146 | 0.982793 | 1.067061 | 1.064731 |
| tinynet_a |
128 | 0.999648 | 0.959928 | 1.064684 | 1.06071 | | twins_pcpvt_base | 64 | 1.000222 | 1.000677 | 1.061434 | 1.061039 | | tnt_s_patch16_224 | 128 | 1.000049 | 1.005661 | 1.051242 | 1.050604 | | rexnet_100 | 128 | 0.999839 | 0.95797 | 1.04929 | 1.046467 | | swin_base_patch4_window7_224 | 64 | 1.000137 | 1.001841 | 1.047412 | 1.046694 | | convnext_base | 64 | 1.005168 | 1.004929 | 1.038726 | 1.037884 | | dla102 | 128 | 0.975942 | 1.000439 | 1.026486 | 1.026963 | | coat_lite_mini | 128 | 1.044455 | 1.045678 | 1.020978 | 1.020201 | | visformer_small | 128 | 1.000821 | 1.000654 | 1.02056 | 1.019496 | | adv_inception_v3 | 128 | 1.000748 | 1.000689 | 1.019977 | 1.01808 | | gluon_inception_v3 | 128 | 1.000748 | 1.000689 | 1.019977 | 1.01808 | | inception_v3 | 128 | 1.000748 | 1.000689 | 1.019977 | 1.01808 | | cspdarknet53 | 64 | 1.0 | 0.99992 | 1.017356 | 1.014332 | | eca_botnext26ts_256 | 128 | 0.999972 | 0.977713 | 1.005351 | 1.004339 | | ghostnet_100 | 128 | 0.998638 | 0.997661 | 1.00505 | 1.00194 | | eca_halonext26ts | 128 | 0.999902 | 0.977751 | 1.001027 | 0.999981 | | dpn107 | 32 | 1.000838 | 1.001744 | 0.998565 | 0.999818 | | mixer_b16_224 | 128 | 0.999945 | 1.00009 | 0.995661 | 0.994736 | | hrnet_w18 | 128 | 1.000189 | 1.00005 | 0.992477 | 0.989726 | | mobilevit_s | 64 | 1.000096 | 0.9616 | 0.989891 | 0.98861 | | mixnet_l | 128 | 1.000338 | 0.980962 | 0.989015 | 0.986977 | | beit_base_patch16_224 | 64 | 0.999671 | 1.003491 | 0.988595 | 0.987106 | | convmixer_768_32 | 32 | 1.0 | 0.999873 | 0.987411 | 0.986378 | | cait_m36_384 | 2 | 1.000008 | 0.999611 | 0.9837 | 0.97734 | | swsl_resnext101_32x16d | 32 | 1.000515 | 1.00025 | 0.979647 | 0.978856 | | xcit_large_24_p8_224 | 5 | 0.998685 | 0.998616 | 0.977595 | 0.973401 | | botnet26t_256 | 128 | 1.000038 | 0.999927 | 0.975558 | 0.974401 | | ese_vovnet19b_dw | 128 | 1.000829 | 1.000484 | 0.975309 | 0.973402 | | gernet_l | 128 | 1.000219 | 0.999742 | 0.973937 | 0.970591 | | volo_d1_224 | 64 | 1.001172 | 1.002349 | 0.973156 
| 0.973038 | | nfnet_l0 | 128 | 1.000313 | 0.980272 | 0.973054 | 0.969296 | | fbnetv3_b | 128 | 1.000086 | 0.972896 | 0.972444 | 0.969908 | | SelecSls42b | 128 | 1.001155 | 1.000911 | 0.971568 | 0.967975 | | res2net101_26w_4s | 64 | 1.00123 | 1.001229 | 0.967101 | 0.962937 | | repvgg_a2 | 128 | 1.000425 | 1.000528 | 0.965269 | 0.960691 | | res2net50_14w_8s | 128 | 1.000212 | 1.000114 | 0.9641 | 0.961767 | | fbnetc_100 | 128 | 0.999813 | 1.000333 | 0.958404 | 0.953838 | | res2next50 | 128 | 1.000587 | 1.001572 | 0.957722 | 0.955826 | | spnasnet_100 | 128 | 1.0 | 1.001177 | 0.951742 | 0.946511 | | mnasnet_100 | 128 | 1.0 | 1.000843 | 0.946617 | 0.94138 | | mobilenetv3_large_100 | 128 | 1.0 | 0.993162 | 0.940769 | 0.93762 | | vit_base_patch16_224 | 64 | 0.999946 | 1.015251 | 0.938716 | 0.937614 | | deit_base_distilled_patch16_224 | 64 | 0.999166 | 1.010207 | 0.937285 | 0.935767 | | pit_b_224 | 64 | 0.999855 | 1.003211 | 0.932635 | 0.931235 | | levit_128 | 128 | 1.002678 | 1.002644 | 0.905741 | 0.902892 | | crossvit_9_240 | 128 | 0.999212 | 1.000282 | 0.871764 | 0.870197 | | regnety_002 | 128 | 1.0 | 0.99901 | 0.866936 | 0.862637 | | lcnet_050 | 128 | 1.000406 | 0.967801 | 0.843427 | 0.838246 | | jx_nest_base | 32 | 0.999875 | 1.000032 | 0.733958 | 0.732922 | +---------------------------------+-----+----------+-----------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+------------+------------+------------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+-----+------------+------------+------------+------------------------+ | hrnet_w18 | 128 | 446.878638 | 611.969061 | 443.782737 | 415.895688 | | convmixer_768_32 | 32 | 294.645387 | 311.184605 | 294.741448 | 295.729404 | | pnasnet5large | 16 | 232.732296 | 316.778479 | 222.770819 | 208.410549 | | tf_mixnet_l | 128 | 196.872753 | 235.871277 | 158.46994 | 160.293271 | | mixnet_l | 
128 | 188.98348 | 227.929676 | 152.45178 | 154.710412 | | tnt_s_patch16_224 | 128 | 363.341809 | 370.530738 | 143.092395 | 145.923881 | | resnest101e | 64 | 169.119573 | 228.064175 | 117.528888 | 126.475063 | | dla102 | 128 | 182.030465 | 221.841947 | 116.864859 | 118.460653 | | convit_base | 64 | 181.540373 | 186.158534 | 116.059079 | 118.158031 | | beit_base_patch16_224 | 64 | 124.679585 | 138.527407 | 110.360664 | 111.161983 | | res2net50_14w_8s | 128 | 150.324309 | 210.688859 | 109.422432 | 111.274625 | | swsl_resnext101_32x16d | 32 | 118.458758 | 144.639785 | 108.966196 | 116.039383 | | adv_inception_v3 | 128 | 165.980914 | 193.918667 | 106.649283 | 108.680916 | | inception_v3 | 128 | 165.997887 | 192.462692 | 106.587813 | 108.714911 | | gluon_inception_v3 | 128 | 165.857755 | 192.871784 | 106.553094 | 108.537034 | | poolformer_m36 | 64 | 145.918709 | 150.027273 | 104.905518 | 107.808836 | | res2net101_26w_4s | 64 | 102.486962 | 181.458481 | 97.619046 | 94.436208 | | visformer_small | 128 | 110.407552 | 116.551611 | 92.84181 | 97.240383 | | res2next50 | 128 | 127.266728 | 156.846056 | 91.100126 | 92.625477 | | mixer_b16_224 | 128 | 116.580148 | 120.330186 | 87.796646 | 88.465373 | | dpn107 | 32 | 110.562657 | 149.513341 | 87.788073 | 89.607133 | | swin_base_patch4_window7_224 | 64 | 142.119912 | 192.836912 | 87.491142 | 89.951344 | | jx_nest_base | 32 | 104.265713 | 156.002293 | 85.117311 | 87.159175 | | volo_d1_224 | 64 | 134.242745 | 146.037595 | 83.782373 | 85.412843 | | eca_halonext26ts | 128 | 114.145529 | 152.799788 | 82.533784 | 82.990996 | | fbnetv3_b | 128 | 119.365616 | 161.694348 | 80.882374 | 81.984004 | | convnext_base | 64 | 122.041312 | 133.30372 | 79.309284 | 80.94466 | | gmlp_s16_224 | 128 | 136.750852 | 140.324328 | 78.414999 | 79.949312 | | dm_nfnet_f0 | 128 | 120.83973 | 124.176647 | 75.11734 | 79.402598 | | eca_botnext26ts_256 | 128 | 110.273459 | 148.783847 | 73.941916 | 74.422772 | | botnet26t_256 | 128 | 106.74703 | 122.109975 | 
72.835181 | 73.21282 | | cait_m36_384 | 2 | 169.039562 | 302.431277 | 68.975127 | 83.133636 | | gmixer_24_224 | 128 | 117.516488 | 144.904971 | 68.639002 | 69.842059 | | nfnet_l0 | 128 | 104.999651 | 134.500987 | 66.132067 | 70.338611 | | gernet_l | 128 | 76.509915 | 92.917121 | 65.809415 | 66.360813 | | cspdarknet53 | 64 | 93.222172 | 112.320786 | 64.998876 | 65.869118 | | pit_b_224 | 64 | 78.724206 | 86.501541 | 64.041408 | 65.063404 | | rexnet_100 | 128 | 86.483862 | 121.371797 | 60.386357 | 61.641991 | | vit_base_patch16_224 | 64 | 68.160133 | 70.768295 | 60.357166 | 61.00665 | | deit_base_distilled_patch16_224 | 64 | 68.712661 | 71.701813 | 60.125983 | 60.883998 | | repvgg_a2 | 128 | 75.521186 | 95.544417 | 59.744078 | 60.204655 | | xcit_large_24_p8_224 | 5 | 125.378018 | 257.084322 | 59.59185 | 79.512499 | | coat_lite_mini | 128 | 112.387396 | 124.341592 | 59.210266 | 60.389376 | | mobilevit_s | 64 | 85.465966 | 133.228319 | 57.256435 | 62.152627 | | tf_efficientnet_b0 | 128 | 85.684352 | 122.393691 | 56.514047 | 57.474226 | | twins_pcpvt_base | 64 | 106.863494 | 189.787044 | 55.294221 | 67.516871 | | fbnetc_100 | 128 | 83.424177 | 107.685281 | 53.60418 | 54.139423 | | tinynet_a | 128 | 72.048109 | 115.805231 | 51.070798 | 58.165759 | | sebotnet33ts_256 | 64 | 81.416487 | 104.254351 | 50.799102 | 51.710679 | | ghostnet_100 | 128 | 96.153795 | 146.156668 | 48.912041 | 55.489097 | | spnasnet_100 | 128 | 73.10982 | 93.504571 | 46.414704 | 47.265541 | | resmlp_12_224 | 128 | 52.917786 | 69.49245 | 45.985379 | 46.419233 | | ese_vovnet19b_dw | 128 | 63.288247 | 73.521745 | 44.024072 | 44.288846 | | SelecSls42b | 128 | 62.412993 | 77.107054 | 42.637271 | 43.54646 | | mnasnet_100 | 128 | 67.29454 | 85.625227 | 41.141425 | 41.438283 | | mobilenetv2_100 | 128 | 64.984806 | 84.671564 | 40.120431 | 40.339911 | | crossvit_9_240 | 128 | 64.26486 | 113.953064 | 39.703625 | 48.413579 | | mobilenetv3_large_100 | 128 | 63.010654 | 83.221 | 38.300878 | 39.772848 | | levit_128 | 
128 | 53.758502 | 113.336888 | 25.693503 | 36.250379 | | regnety_002 | 128 | 40.779376 | 70.479183 | 23.091201 | 30.022773 | | lcnet_050 | 128 | 31.582245 | 45.758263 | 15.384557 | 18.366872 | +---------------------------------+-----+------------+------------+------------+------------------------+ ~~~
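For anyone who wants to sanity-check the absolute latency numbers by hand, here is a rough sketch that divides the `eager` column by the `inductor` column for three rows copied from the table above and takes a geometric mean of the ratios. This is only an illustration: the helper names `latency_ratio` and `geomean` are made up for this example, and the result is not necessarily how the dashboard derives its reported speedup figures, which come from a separate measurement.

```python
import math

# (eager_ms, inductor_ms) pairs copied from the "Absolute latency (ms)" table.
latency_ms = {
    "hrnet_w18": (446.878638, 443.782737),
    "levit_128": (53.758502, 25.693503),
    "lcnet_050": (31.582245, 15.384557),
}


def latency_ratio(eager_ms, compiled_ms):
    """Hypothetical helper: how many times faster the compiled column is."""
    return eager_ms / compiled_ms


def geomean(values):
    """Geometric mean, computed in log space for numerical stability."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = {name: latency_ratio(e, c) for name, (e, c) in latency_ms.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
print(f"geomean over these rows: {geomean(ratios.values()):.2f}x")
```

A geometric mean is used rather than an arithmetic mean because the inputs are ratios, so one model with a very large ratio does not dominate the summary.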

Performance graphs

/data/home/williamwen/cluster/oneoff_cron_logs/day_144_24_05_23_performance_amp_246/timm_models_amp.png : ![](https://i.imgur.com/UaHSXit.png)

/data/home/williamwen/cluster/oneoff_cron_logs/day_144_24_05_23_performance_amp_246/torchbench_amp.png : ![](https://i.imgur.com/tDfXWeX.png)

/data/home/williamwen/cluster/oneoff_cron_logs/day_144_24_05_23_performance_amp_246/huggingface_amp.png : ![](https://i.imgur.com/yNySJpu.png)

Build Summary

### Run name ###
day_144_24_05_23_performance_amp_246

### Commit hashes ###
pytorch commit: a370bca9a97aaf3c1bc36adbc2d68428fde8e74c
pytorch commit date: 2023-05-18 22:52:38+00:00
torchbench commit: 3f2a2a1583f5ec480e4882f632445807d1c4d487
torchbench commit date: 2023-05-24 10:26:51-07:00

### TorchDynamo config flags ###

### Torch version ###
torch: 2.1.0a0+git956bd03

### Environment variables ###
TORCH_CUDA_ARCH_LIST = 8.0
CUDA_HOME = /usr/local/cuda-11.6
USE_LLVM = /usr/lib/llvm-10

### GPU details ###
CUDNN VERSION: 8401
Number CUDA Devices: 1
Device Name: NVIDIA A100-SXM4-40GB
Device Memory [GB]: 42.481549312

anijain2305 / torchdynamo_dashboard

one-off runs #3
