Closed (fbusato closed 2 days ago)
Could you please show a benchmark diff of the three algorithms before #2756 and after this PR? We should see a net benefit then.
Instructions on how to benchmark, in case you need them: https://nvidia.github.io/cccl/cub/benchmarking.html
[0] NVIDIA H100 80GB HBM3

T{ct} | OffsetT{ct} | Elements{io} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
---|---|---|---|---|---|---|---|---|---|
I8 | I32 | 2^28 | 107.167 us | 0.39% | 122.152 us | 0.43% | 14.986 us | 13.98% | SLOW |
I8 | I64 | 2^28 | 108.946 us | 1.96% | 109.357 us | 2.03% | 0.411 us | 0.38% | SAME |
I16 | I32 | 2^28 | 188.094 us | 1.98% | 190.235 us | 2.02% | 2.142 us | 1.14% | SAME |
I16 | I64 | 2^28 | 188.058 us | 1.88% | 189.965 us | 1.94% | 1.907 us | 1.01% | SAME |
I32 | I32 | 2^28 | 351.776 us | 1.38% | 352.035 us | 1.45% | 0.259 us | 0.07% | SAME |
I32 | I64 | 2^28 | 352.221 us | 1.38% | 352.644 us | 1.41% | 0.424 us | 0.12% | SAME |
I64 | I32 | 2^28 | 688.110 us | 0.76% | 687.958 us | 0.81% | -0.152 us | -0.02% | SAME |
I64 | I64 | 2^28 | 688.580 us | 0.83% | 688.623 us | 0.85% | 0.043 us | 0.01% | SAME |
I128 | I32 | 2^28 | 1.400 ms | 0.27% | 1.403 ms | 0.28% | 2.806 us | 0.20% | SAME |
I128 | I64 | 2^28 | 1.404 ms | 1.43% | 1.397 ms | 1.56% | -6.640 us | -0.47% | SAME |
F32 | I32 | 2^28 | 359.793 us | 3.51% | 359.978 us | 3.54% | 0.185 us | 0.05% | SAME |
F32 | I64 | 2^28 | 352.525 us | 1.47% | 352.488 us | 1.44% | -0.037 us | -0.01% | SAME |
F64 | I32 | 2^28 | 688.145 us | 0.82% | 688.022 us | 0.78% | -0.123 us | -0.02% | SAME |
F64 | I64 | 2^28 | 688.338 us | 0.83% | 688.361 us | 0.90% | 0.023 us | 0.00% | SAME |
C64 | I32 | 2^28 | 1.479 ms | 0.06% | 1.468 ms | 0.07% | -11.115 us | -0.75% | FAST |
C64 | I64 | 2^28 | 1.550 ms | 0.07% | 1.524 ms | 0.07% | -25.813 us | -1.67% | FAST |

T{ct} | OffsetT{ct} | IsInPlace{ct} | Elements{io} | Entropy | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
---|---|---|---|---|---|---|---|---|---|---|---|
I8 | I32 | false | 2^28 | 1 | 666.930 us | 0.32% | 646.117 us | 0.29% | -20.812 us | -3.12% | FAST |
I8 | I32 | false | 2^28 | 0.544 | 646.175 us | 0.29% | 624.590 us | 0.24% | -21.586 us | -3.34% | FAST |
I8 | I32 | false | 2^28 | 0 | 539.994 us | 0.27% | 525.318 us | 0.22% | -14.675 us | -2.72% | FAST |
I8 | I32 | true | 2^28 | 1 | 766.635 us | 0.22% | 747.091 us | 0.21% | -19.544 us | -2.55% | FAST |
I8 | I32 | true | 2^28 | 0.544 | 752.011 us | 0.21% | 727.137 us | 0.20% | -24.873 us | -3.31% | FAST |
I8 | I32 | true | 2^28 | 0 | 642.715 us | 0.22% | 629.653 us | 0.20% | -13.062 us | -2.03% | FAST |
I8 | I64 | false | 2^28 | 1 | 687.225 us | 0.27% | 655.494 us | 0.23% | -31.731 us | -4.62% | FAST |
I8 | I64 | false | 2^28 | 0.544 | 669.616 us | 0.28% | 638.069 us | 0.21% | -31.547 us | -4.71% | FAST |
I8 | I64 | false | 2^28 | 0 | 550.467 us | 0.28% | 530.086 us | 0.26% | -20.381 us | -3.70% | FAST |
I8 | I64 | true | 2^28 | 1 | 791.146 us | 0.19% | 755.288 us | 0.23% | -35.858 us | -4.53% | FAST |
I8 | I64 | true | 2^28 | 0.544 | 773.481 us | 0.20% | 739.127 us | 0.21% | -34.354 us | -4.44% | FAST |
I8 | I64 | true | 2^28 | 0 | 653.690 us | 0.24% | 636.169 us | 0.22% | -17.521 us | -2.68% | FAST |
I16 | I32 | false | 2^28 | 1 | 751.677 us | 0.38% | 765.957 us | 0.36% | 14.280 us | 1.90% | SLOW |
I16 | I32 | false | 2^28 | 0.544 | 704.825 us | 0.30% | 723.212 us | 0.29% | 18.387 us | 2.61% | SLOW |
I16 | I32 | false | 2^28 | 0 | 595.444 us | 0.26% | 563.434 us | 0.24% | -32.010 us | -5.38% | FAST |
I16 | I32 | true | 2^28 | 1 | 850.084 us | 0.26% | 862.771 us | 0.25% | 12.687 us | 1.49% | SLOW |
I16 | I32 | true | 2^28 | 0.544 | 812.488 us | 0.22% | 824.056 us | 0.23% | 11.568 us | 1.42% | SLOW |
I16 | I32 | true | 2^28 | 0 | 711.684 us | 0.20% | 674.914 us | 0.21% | -36.770 us | -5.17% | FAST |
I16 | I64 | false | 2^28 | 1 | 784.931 us | 0.28% | 783.574 us | 0.31% | -1.357 us | -0.17% | SAME |
I16 | I64 | false | 2^28 | 0.544 | 733.121 us | 0.26% | 731.942 us | 0.28% | -1.179 us | -0.16% | SAME |
I16 | I64 | false | 2^28 | 0 | 570.107 us | 0.25% | 571.236 us | 0.26% | 1.129 us | 0.20% | SAME |
I16 | I64 | true | 2^28 | 1 | 876.321 us | 0.24% | 875.572 us | 0.24% | -0.749 us | -0.09% | SAME |
I16 | I64 | true | 2^28 | 0.544 | 831.914 us | 0.20% | 832.294 us | 0.20% | 0.380 us | 0.05% | SAME |
I16 | I64 | true | 2^28 | 0 | 685.799 us | 0.18% | 687.535 us | 0.18% | 1.736 us | 0.25% | SLOW |
I32 | I32 | false | 2^28 | 1 | 1.122 ms | 0.43% | 1.022 ms | 0.44% | -100.502 us | -8.96% | FAST |
I32 | I32 | false | 2^28 | 0.544 | 1.018 ms | 0.27% | 892.510 us | 0.33% | -125.439 us | -12.32% | FAST |
I32 | I32 | false | 2^28 | 0 | 799.558 us | 0.24% | 664.654 us | 0.27% | -134.904 us | -16.87% | FAST |
I32 | I32 | true | 2^28 | 1 | 1.253 ms | 0.23% | 1.118 ms | 0.34% | -134.791 us | -10.76% | FAST |
I32 | I32 | true | 2^28 | 0.544 | 1.178 ms | 0.19% | 1.012 ms | 0.30% | -165.656 us | -14.06% | FAST |
I32 | I32 | true | 2^28 | 0 | 984.614 us | 0.15% | 793.354 us | 0.23% | -191.260 us | -19.42% | FAST |
I32 | I64 | false | 2^28 | 1 | 1.062 ms | 0.58% | 1.026 ms | 0.43% | -36.118 us | -3.40% | FAST |
I32 | I64 | false | 2^28 | 0.544 | 913.881 us | 0.54% | 888.767 us | 0.36% | -25.114 us | -2.75% | FAST |
I32 | I64 | false | 2^28 | 0 | 690.710 us | 0.35% | 668.789 us | 0.29% | -21.921 us | -3.17% | FAST |
I32 | I64 | true | 2^28 | 1 | 1.124 ms | 0.43% | 1.121 ms | 0.35% | -3.175 us | -0.28% | SAME |
I32 | I64 | true | 2^28 | 0.544 | 1.006 ms | 0.31% | 1.006 ms | 0.30% | -0.527 us | -0.05% | SAME |
I32 | I64 | true | 2^28 | 0 | 805.761 us | 0.23% | 798.519 us | 0.22% | -7.242 us | -0.90% | FAST |
I64 | I32 | false | 2^28 | 1 | 1.821 ms | 0.43% | 1.823 ms | 0.45% | 1.098 us | 0.06% | SAME |
I64 | I32 | false | 2^28 | 0.544 | 1.496 ms | 0.61% | 1.496 ms | 0.59% | 0.517 us | 0.03% | SAME |
I64 | I32 | false | 2^28 | 0 | 1.010 ms | 0.39% | 1.009 ms | 0.40% | -1.132 us | -0.11% | SAME |
I64 | I32 | true | 2^28 | 1 | 1.936 ms | 0.33% | 1.935 ms | 0.31% | -1.034 us | -0.05% | SAME |
I64 | I32 | true | 2^28 | 0.544 | 1.639 ms | 0.40% | 1.639 ms | 0.43% | 0.101 us | 0.01% | SAME |
I64 | I32 | true | 2^28 | 0 | 1.192 ms | 0.26% | 1.191 ms | 0.26% | -0.858 us | -0.07% | SAME |
I64 | I64 | false | 2^28 | 1 | 1.819 ms | 0.41% | 1.816 ms | 0.41% | -3.146 us | -0.17% | SAME |
I64 | I64 | false | 2^28 | 0.544 | 1.496 ms | 0.60% | 1.493 ms | 0.60% | -3.054 us | -0.20% | SAME |
I64 | I64 | false | 2^28 | 0 | 1.021 ms | 0.43% | 1.019 ms | 0.46% | -2.154 us | -0.21% | SAME |
I64 | I64 | true | 2^28 | 1 | 1.936 ms | 0.33% | 1.932 ms | 0.32% | -4.479 us | -0.23% | SAME |
I64 | I64 | true | 2^28 | 0.544 | 1.638 ms | 0.41% | 1.633 ms | 0.43% | -4.543 us | -0.28% | SAME |
I64 | I64 | true | 2^28 | 0 | 1.200 ms | 0.28% | 1.202 ms | 0.26% | 2.752 us | 0.23% | SAME |
I128 | I32 | false | 2^28 | 1 | 3.603 ms | 0.46% | 3.604 ms | 0.45% | 1.313 us | 0.04% | SAME |
I128 | I32 | false | 2^28 | 0.544 | 2.859 ms | 0.84% | 2.858 ms | 0.82% | -0.687 us | -0.02% | SAME |
I128 | I32 | false | 2^28 | 0 | 1.943 ms | 0.69% | 1.944 ms | 0.68% | 0.080 us | 0.00% | SAME |
I128 | I32 | true | 2^28 | 1 | 3.820 ms | 0.44% | 3.820 ms | 0.45% | 0.541 us | 0.01% | SAME |
I128 | I32 | true | 2^28 | 0.544 | 3.192 ms | 0.59% | 3.192 ms | 0.56% | 0.359 us | 0.01% | SAME |
I128 | I32 | true | 2^28 | 0 | 2.421 ms | 0.40% | 2.421 ms | 0.43% | -0.185 us | -0.01% | SAME |
I128 | I64 | false | 2^28 | 1 | 3.609 ms | 0.59% | 3.609 ms | 0.59% | -0.008 us | -0.00% | SAME |
I128 | I64 | false | 2^28 | 0.544 | 2.864 ms | 0.82% | 2.864 ms | 0.84% | 0.521 us | 0.02% | SAME |
I128 | I64 | false | 2^28 | 0 | 1.953 ms | 0.69% | 1.954 ms | 0.72% | 0.404 us | 0.02% | SAME |
I128 | I64 | true | 2^28 | 1 | 3.832 ms | 0.44% | 3.831 ms | 0.43% | -0.698 us | -0.02% | SAME |
I128 | I64 | true | 2^28 | 0.544 | 3.203 ms | 0.57% | 3.203 ms | 0.56% | -0.236 us | -0.01% | SAME |
I128 | I64 | true | 2^28 | 0 | 2.435 ms | 0.40% | 2.435 ms | 0.39% | -0.436 us | -0.02% | SAME |
F32 | I32 | false | 2^28 | 1 | 1.123 ms | 0.85% | 1.024 ms | 1.03% | -99.082 us | -8.82% | FAST |
F32 | I32 | false | 2^28 | 0.544 | 1.018 ms | 0.27% | 892.420 us | 0.34% | -125.529 us | -12.33% | FAST |
F32 | I32 | false | 2^28 | 0 | 799.450 us | 0.22% | 664.718 us | 0.27% | -134.732 us | -16.85% | FAST |
F32 | I32 | true | 2^28 | 1 | 1.253 ms | 0.25% | 1.117 ms | 0.34% | -136.310 us | -10.88% | FAST |
F32 | I32 | true | 2^28 | 0.544 | 1.178 ms | 0.20% | 1.011 ms | 0.28% | -166.523 us | -14.14% | FAST |
F32 | I32 | true | 2^28 | 0 | 984.513 us | 0.15% | 793.035 us | 0.23% | -191.478 us | -19.45% | FAST |
F32 | I64 | false | 2^28 | 1 | 1.062 ms | 0.59% | 1.025 ms | 0.41% | -36.465 us | -3.43% | FAST |
F32 | I64 | false | 2^28 | 0.544 | 913.749 us | 0.53% | 888.043 us | 0.37% | -25.705 us | -2.81% | FAST |
F32 | I64 | false | 2^28 | 0 | 689.779 us | 0.36% | 668.831 us | 0.30% | -20.948 us | -3.04% | FAST |
F32 | I64 | true | 2^28 | 1 | 1.123 ms | 0.44% | 1.120 ms | 0.36% | -2.997 us | -0.27% | SAME |
F32 | I64 | true | 2^28 | 0.544 | 1.006 ms | 0.31% | 1.005 ms | 0.28% | -0.527 us | -0.05% | SAME |
F32 | I64 | true | 2^28 | 0 | 805.759 us | 0.24% | 798.182 us | 0.23% | -7.577 us | -0.94% | FAST |
F64 | I32 | false | 2^28 | 1 | 1.822 ms | 0.46% | 1.823 ms | 0.46% | 0.258 us | 0.01% | SAME |
F64 | I32 | false | 2^28 | 0.544 | 1.496 ms | 0.61% | 1.497 ms | 0.59% | 0.757 us | 0.05% | SAME |
F64 | I32 | false | 2^28 | 0 | 1.010 ms | 0.38% | 1.009 ms | 0.40% | -0.575 us | -0.06% | SAME |
F64 | I32 | true | 2^28 | 1 | 1.936 ms | 0.30% | 1.935 ms | 0.31% | -1.120 us | -0.06% | SAME |
F64 | I32 | true | 2^28 | 0.544 | 1.639 ms | 0.41% | 1.639 ms | 0.42% | -0.057 us | -0.00% | SAME |
F64 | I32 | true | 2^28 | 0 | 1.192 ms | 0.26% | 1.192 ms | 0.26% | -0.607 us | -0.05% | SAME |
F64 | I64 | false | 2^28 | 1 | 1.819 ms | 0.41% | 1.816 ms | 0.38% | -2.957 us | -0.16% | SAME |
F64 | I64 | false | 2^28 | 0.544 | 1.496 ms | 0.60% | 1.493 ms | 0.61% | -3.213 us | -0.21% | SAME |
F64 | I64 | false | 2^28 | 0 | 1.021 ms | 0.45% | 1.019 ms | 0.44% | -2.082 us | -0.20% | SAME |
F64 | I64 | true | 2^28 | 1 | 1.936 ms | 0.33% | 1.931 ms | 0.31% | -4.830 us | -0.25% | SAME |
F64 | I64 | true | 2^28 | 0.544 | 1.638 ms | 0.41% | 1.633 ms | 0.42% | -4.501 us | -0.27% | SAME |
F64 | I64 | true | 2^28 | 0 | 1.200 ms | 0.25% | 1.203 ms | 0.27% | 2.579 us | 0.21% | SAME |

KeyT{ct} | ValueT{ct} | OffsetT{ct} | Elements{io} | MaxSegSize | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
---|---|---|---|---|---|---|---|---|---|---|---|
I8 | I8 | I32 | 2^28 | 2^1 | 1.042 ms | 0.48% | 1.070 ms | 0.47% | 27.182 us | 2.61% | SLOW |
I8 | I8 | I32 | 2^28 | 2^4 | 912.502 us | 0.45% | 936.005 us | 0.43% | 23.503 us | 2.58% | SLOW |
I8 | I8 | I32 | 2^28 | 2^8 | 905.978 us | 0.29% | 928.088 us | 0.30% | 22.110 us | 2.44% | SLOW |
I8 | I16 | I32 | 2^28 | 2^1 | 1.595 ms | 0.28% | 1.592 ms | 0.29% | -2.567 us | -0.16% | SAME |
I8 | I16 | I32 | 2^28 | 2^4 | 1.262 ms | 0.38% | 1.240 ms | 0.43% | -22.581 us | -1.79% | FAST |
I8 | I16 | I32 | 2^28 | 2^8 | 1.194 ms | 0.29% | 1.172 ms | 0.28% | -22.151 us | -1.86% | FAST |
I8 | I32 | I32 | 2^28 | 2^1 | 1.332 ms | 0.59% | 1.316 ms | 0.60% | -16.514 us | -1.24% | FAST |
I8 | I32 | I32 | 2^28 | 2^4 | 1.015 ms | 0.49% | 1.002 ms | 0.51% | -13.342 us | -1.31% | FAST |
I8 | I32 | I32 | 2^28 | 2^8 | 946.310 us | 0.39% | 938.284 us | 0.42% | -8.026 us | -0.85% | FAST |
I8 | I64 | I32 | 2^28 | 2^1 | 2.028 ms | 0.46% | 2.032 ms | 0.47% | 4.866 us | 0.24% | SAME |
I8 | I64 | I32 | 2^28 | 2^4 | 1.442 ms | 0.36% | 1.450 ms | 0.34% | 8.921 us | 0.62% | SLOW |
I8 | I64 | I32 | 2^28 | 2^8 | 1.308 ms | 0.21% | 1.320 ms | 0.21% | 11.956 us | 0.91% | SLOW |
I8 | I128 | I32 | 2^28 | 2^1 | 5.109 ms | 0.17% | 5.127 ms | 0.18% | 17.669 us | 0.35% | SLOW |
I8 | I128 | I32 | 2^28 | 2^4 | 4.173 ms | 0.27% | 4.195 ms | 0.27% | 21.329 us | 0.51% | SLOW |
I8 | I128 | I32 | 2^28 | 2^8 | 4.015 ms | 0.31% | 4.034 ms | 0.32% | 18.978 us | 0.47% | SLOW |
I8 | F32 | I32 | 2^28 | 2^1 | 1.337 ms | 0.75% | 1.322 ms | 0.78% | -14.694 us | -1.10% | FAST |
I8 | F32 | I32 | 2^28 | 2^4 | 1.018 ms | 0.52% | 1.007 ms | 0.49% | -10.626 us | -1.04% | FAST |
I8 | F32 | I32 | 2^28 | 2^8 | 952.242 us | 0.41% | 947.759 us | 0.42% | -4.483 us | -0.47% | FAST |
I8 | F64 | I32 | 2^28 | 2^1 | 2.043 ms | 0.50% | 2.046 ms | 0.47% | 2.364 us | 0.12% | SAME |
I8 | F64 | I32 | 2^28 | 2^4 | 1.459 ms | 0.35% | 1.465 ms | 0.33% | 6.160 us | 0.42% | SLOW |
I8 | F64 | I32 | 2^28 | 2^8 | 1.325 ms | 0.19% | 1.332 ms | 0.19% | 7.488 us | 0.57% | SLOW |
I8 | C64 | I32 | 2^28 | 2^1 | 5.539 ms | 0.18% | 5.340 ms | 0.18% | -199.691 us | -3.60% | FAST |
I8 | C64 | I32 | 2^28 | 2^4 | 4.998 ms | 0.10% | 4.802 ms | 0.10% | -195.590 us | -3.91% | FAST |
I8 | C64 | I32 | 2^28 | 2^8 | 4.848 ms | 0.09% | 4.655 ms | 0.09% | -193.759 us | -4.00% | FAST |
I16 | I8 | I32 | 2^28 | 2^1 | 1.228 ms | 0.54% | 1.382 ms | 0.44% | 153.882 us | 12.53% | SLOW |
I16 | I8 | I32 | 2^28 | 2^4 | 961.367 us | 0.50% | 1.107 ms | 0.47% | 145.262 us | 15.11% | SLOW |
I16 | I8 | I32 | 2^28 | 2^8 | 928.405 us | 0.38% | 1.070 ms | 0.28% | 141.610 us | 15.25% | SLOW |
I16 | I16 | I32 | 2^28 | 2^1 | 1.136 ms | 0.48% | 1.139 ms | 0.48% | 2.700 us | 0.24% | SAME |
I16 | I16 | I32 | 2^28 | 2^4 | 988.966 us | 0.26% | 990.565 us | 0.31% | 1.599 us | 0.16% | SAME |
I16 | I16 | I32 | 2^28 | 2^8 | 961.984 us | 0.16% | 963.443 us | 0.20% | 1.459 us | 0.15% | SAME |
I16 | I32 | I32 | 2^28 | 2^1 | 1.395 ms | 0.54% | 1.391 ms | 0.60% | -4.513 us | -0.32% | SAME |
I16 | I32 | I32 | 2^28 | 2^4 | 994.327 us | 0.44% | 994.688 us | 0.48% | 0.361 us | 0.04% | SAME |
I16 | I32 | I32 | 2^28 | 2^8 | 930.997 us | 0.28% | 931.975 us | 0.29% | 0.978 us | 0.11% | SAME |
I16 | I64 | I32 | 2^28 | 2^1 | 1.954 ms | 0.61% | 1.956 ms | 0.60% | 1.804 us | 0.09% | SAME |
I16 | I64 | I32 | 2^28 | 2^4 | 1.367 ms | 0.35% | 1.378 ms | 0.35% | 10.904 us | 0.80% | SLOW |
I16 | I64 | I32 | 2^28 | 2^8 | 1.241 ms | 0.25% | 1.254 ms | 0.23% | 12.715 us | 1.02% | SLOW |
I16 | I128 | I32 | 2^28 | 2^1 | 5.191 ms | 0.17% | 5.202 ms | 0.17% | 10.509 us | 0.20% | SLOW |
I16 | I128 | I32 | 2^28 | 2^4 | 4.221 ms | 0.29% | 4.239 ms | 0.30% | 17.871 us | 0.42% | SLOW |
I16 | I128 | I32 | 2^28 | 2^8 | 4.044 ms | 0.32% | 4.061 ms | 0.32% | 16.941 us | 0.42% | SLOW |
I16 | F32 | I32 | 2^28 | 2^1 | 1.394 ms | 0.73% | 1.257 ms | 0.94% | -136.863 us | -9.82% | FAST |
I16 | F32 | I32 | 2^28 | 2^4 | 993.771 us | 0.46% | 861.820 us | 0.55% | -131.951 us | -13.28% | FAST |
I16 | F32 | I32 | 2^28 | 2^8 | 938.546 us | 0.25% | 791.470 us | 0.41% | -147.076 us | -15.67% | FAST |
I16 | F64 | I32 | 2^28 | 2^1 | 1.967 ms | 0.55% | 1.962 ms | 0.63% | -4.929 us | -0.25% | SAME |
I16 | F64 | I32 | 2^28 | 2^4 | 1.397 ms | 0.33% | 1.382 ms | 0.34% | -14.949 us | -1.07% | FAST |
I16 | F64 | I32 | 2^28 | 2^8 | 1.274 ms | 0.20% | 1.259 ms | 0.23% | -14.515 us | -1.14% | FAST |
I16 | C64 | I32 | 2^28 | 2^1 | 5.203 ms | 0.18% | 4.865 ms | 0.20% | -337.605 us | -6.49% | FAST |
I16 | C64 | I32 | 2^28 | 2^4 | 4.583 ms | 0.11% | 4.219 ms | 0.12% | -364.973 us | -7.96% | FAST |
I16 | C64 | I32 | 2^28 | 2^8 | 4.399 ms | 0.10% | 4.021 ms | 0.11% | -377.759 us | -8.59% | FAST |
I32 | I8 | I32 | 2^28 | 2^1 | 1.267 ms | 0.53% | 1.275 ms | 0.52% | 8.074 us | 0.64% | SLOW |
I32 | I8 | I32 | 2^28 | 2^4 | 921.451 us | 0.41% | 934.302 us | 0.40% | 12.851 us | 1.39% | SLOW |
I32 | I8 | I32 | 2^28 | 2^8 | 868.209 us | 0.31% | 876.769 us | 0.33% | 8.560 us | 0.99% | SLOW |
I32 | I16 | I32 | 2^28 | 2^1 | 1.667 ms | 0.53% | 1.423 ms | 0.75% | -244.123 us | -14.65% | FAST |
I32 | I16 | I32 | 2^28 | 2^4 | 1.285 ms | 0.33% | 975.869 us | 0.61% | -308.814 us | -24.04% | FAST |
I32 | I16 | I32 | 2^28 | 2^8 | 1.178 ms | 0.32% | 902.246 us | 0.56% | -275.914 us | -23.42% | FAST |
I32 | I32 | I32 | 2^28 | 2^1 | 1.436 ms | 0.86% | 1.445 ms | 0.93% | 8.957 us | 0.62% | SAME |
I32 | I32 | I32 | 2^28 | 2^4 | 965.283 us | 0.63% | 966.743 us | 0.64% | 1.460 us | 0.15% | SAME |
I32 | I32 | I32 | 2^28 | 2^8 | 887.400 us | 0.46% | 911.902 us | 0.40% | 24.502 us | 2.76% | SLOW |
I32 | I64 | I32 | 2^28 | 2^1 | 2.147 ms | 0.64% | 2.150 ms | 0.67% | 3.266 us | 0.15% | SAME |
I32 | I64 | I32 | 2^28 | 2^4 | 1.453 ms | 0.43% | 1.460 ms | 0.42% | 6.317 us | 0.43% | SLOW |
I32 | I64 | I32 | 2^28 | 2^8 | 1.294 ms | 0.27% | 1.303 ms | 0.25% | 8.674 us | 0.67% | SLOW |
I32 | I128 | I32 | 2^28 | 2^1 | 5.310 ms | 0.20% | 5.322 ms | 0.20% | 11.899 us | 0.22% | SLOW |
I32 | I128 | I32 | 2^28 | 2^4 | 4.277 ms | 0.28% | 4.293 ms | 0.28% | 15.860 us | 0.37% | SLOW |
I32 | I128 | I32 | 2^28 | 2^8 | 4.080 ms | 0.33% | 4.096 ms | 0.35% | 15.467 us | 0.38% | SLOW |
I32 | F32 | I32 | 2^28 | 2^1 | 1.628 ms | 0.78% | 1.625 ms | 0.78% | -3.238 us | -0.20% | SAME |
I32 | F32 | I32 | 2^28 | 2^4 | 1.239 ms | 0.53% | 1.230 ms | 0.56% | -8.616 us | -0.70% | FAST |
I32 | F32 | I32 | 2^28 | 2^8 | 1.135 ms | 0.89% | 1.118 ms | 0.62% | -16.887 us | -1.49% | FAST |
I32 | F64 | I32 | 2^28 | 2^1 | 2.246 ms | 0.57% | 2.247 ms | 0.58% | 0.285 us | 0.01% | SAME |
I32 | F64 | I32 | 2^28 | 2^4 | 1.570 ms | 0.34% | 1.573 ms | 0.32% | 3.117 us | 0.20% | SAME |
I32 | F64 | I32 | 2^28 | 2^8 | 1.418 ms | 0.18% | 1.422 ms | 0.18% | 3.940 us | 0.28% | SLOW |
I32 | C64 | I32 | 2^28 | 2^1 | 5.787 ms | 0.19% | 5.586 ms | 0.20% | -201.635 us | -3.48% | FAST |
I32 | C64 | I32 | 2^28 | 2^4 | 5.119 ms | 0.10% | 4.926 ms | 0.10% | -192.607 us | -3.76% | FAST |
I32 | C64 | I32 | 2^28 | 2^8 | 4.917 ms | 0.10% | 4.724 ms | 0.10% | -193.088 us | -3.93% | FAST |
I64 | I8 | I32 | 2^28 | 2^1 | 2.092 ms | 0.58% | 2.059 ms | 0.62% | -32.968 us | -1.58% | FAST |
I64 | I8 | I32 | 2^28 | 2^4 | 1.534 ms | 0.35% | 1.520 ms | 0.37% | -13.774 us | -0.90% | FAST |
I64 | I8 | I32 | 2^28 | 2^8 | 1.404 ms | 0.26% | 1.387 ms | 0.28% | -16.891 us | -1.20% | FAST |
I64 | I16 | I32 | 2^28 | 2^1 | 2.022 ms | 0.66% | 2.017 ms | 0.68% | -5.418 us | -0.27% | SAME |
I64 | I16 | I32 | 2^28 | 2^4 | 1.422 ms | 0.38% | 1.423 ms | 0.40% | 0.243 us | 0.02% | SAME |
I64 | I16 | I32 | 2^28 | 2^8 | 1.299 ms | 0.22% | 1.293 ms | 0.23% | -5.245 us | -0.40% | FAST |
I64 | I32 | I32 | 2^28 | 2^1 | 2.166 ms | 0.68% | 2.166 ms | 0.69% | 0.117 us | 0.01% | SAME |
I64 | I32 | I32 | 2^28 | 2^4 | 1.406 ms | 0.46% | 1.407 ms | 0.47% | 0.556 us | 0.04% | SAME |
I64 | I32 | I32 | 2^28 | 2^8 | 1.235 ms | 0.19% | 1.236 ms | 0.19% | 0.649 us | 0.05% | SAME |
I64 | I64 | I32 | 2^28 | 2^1 | 2.670 ms | 0.62% | 2.688 ms | 0.63% | 18.454 us | 0.69% | SLOW |
I64 | I64 | I32 | 2^28 | 2^4 | 1.751 ms | 0.42% | 1.763 ms | 0.49% | 12.425 us | 0.71% | SLOW |
I64 | I64 | I32 | 2^28 | 2^8 | 1.555 ms | 0.29% | 1.570 ms | 0.56% | 15.361 us | 0.99% | SLOW |
I64 | I128 | I32 | 2^28 | 2^1 | 6.180 ms | 0.16% | 6.196 ms | 0.16% | 16.385 us | 0.27% | SLOW |
I64 | I128 | I32 | 2^28 | 2^4 | 4.997 ms | 0.26% | 5.018 ms | 0.25% | 21.122 us | 0.42% | SLOW |
I64 | I128 | I32 | 2^28 | 2^8 | 4.747 ms | 0.32% | 4.772 ms | 0.32% | 25.223 us | 0.53% | SLOW |
I64 | F32 | I32 | 2^28 | 2^1 | 2.169 ms | 0.92% | 2.171 ms | 0.91% | 1.402 us | 0.06% | SAME |
I64 | F32 | I32 | 2^28 | 2^4 | 1.407 ms | 0.48% | 1.408 ms | 0.48% | 0.848 us | 0.06% | SAME |
I64 | F32 | I32 | 2^28 | 2^8 | 1.235 ms | 0.19% | 1.238 ms | 0.18% | 3.321 us | 0.27% | SLOW |
I64 | F64 | I32 | 2^28 | 2^1 | 2.681 ms | 0.65% | 2.700 ms | 0.66% | 18.499 us | 0.69% | SLOW |
I64 | F64 | I32 | 2^28 | 2^4 | 1.778 ms | 0.39% | 1.763 ms | 0.51% | -15.560 us | -0.88% | FAST |
I64 | F64 | I32 | 2^28 | 2^8 | 1.577 ms | 0.28% | 1.564 ms | 0.56% | -13.492 us | -0.86% | FAST |
I64 | C64 | I32 | 2^28 | 2^1 | 5.736 ms | 0.20% | 5.633 ms | 0.22% | -103.435 us | -1.80% | FAST |
I64 | C64 | I32 | 2^28 | 2^4 | 4.897 ms | 0.12% | 4.777 ms | 0.13% | -120.014 us | -2.45% | FAST |
I64 | C64 | I32 | 2^28 | 2^8 | 4.579 ms | 0.12% | 4.445 ms | 0.12% | -133.984 us | -2.93% | FAST |
I128 | I8 | I32 | 2^28 | 2^1 | 3.566 ms | 0.59% | 3.560 ms | 0.55% | -5.921 us | -0.17% | SAME |
I128 | I8 | I32 | 2^28 | 2^4 | 2.428 ms | 0.78% | 2.445 ms | 0.78% | 16.930 us | 0.70% | SAME |
I128 | I8 | I32 | 2^28 | 2^8 | 2.194 ms | 1.10% | 2.195 ms | 1.15% | 0.625 us | 0.03% | SAME |
I128 | I16 | I32 | 2^28 | 2^1 | 3.649 ms | 0.78% | 3.641 ms | 0.74% | -7.779 us | -0.21% | SAME |
I128 | I16 | I32 | 2^28 | 2^4 | 2.362 ms | 0.88% | 2.361 ms | 0.86% | -0.855 us | -0.04% | SAME |
I128 | I16 | I32 | 2^28 | 2^8 | 2.134 ms | 1.08% | 2.120 ms | 1.15% | -14.144 us | -0.66% | SAME |
I128 | I32 | I32 | 2^28 | 2^1 | 3.539 ms | 0.64% | 3.537 ms | 0.71% | -1.413 us | -0.04% | SAME |
I128 | I32 | I32 | 2^28 | 2^4 | 2.379 ms | 0.91% | 2.381 ms | 0.86% | 1.736 us | 0.07% | SAME |
I128 | I32 | I32 | 2^28 | 2^8 | 2.149 ms | 1.20% | 2.151 ms | 1.22% | 2.119 us | 0.10% | SAME |
I128 | I64 | I32 | 2^28 | 2^1 | 4.331 ms | 0.65% | 4.325 ms | 0.64% | -6.174 us | -0.14% | SAME |
I128 | I64 | I32 | 2^28 | 2^4 | 3.016 ms | 0.67% | 3.016 ms | 0.71% | -0.017 us | -0.00% | SAME |
I128 | I64 | I32 | 2^28 | 2^8 | 2.733 ms | 1.06% | 2.731 ms | 1.03% | -1.767 us | -0.06% | SAME |
I128 | I128 | I32 | 2^28 | 2^1 | 6.869 ms | 0.35% | 6.895 ms | 0.38% | 26.568 us | 0.39% | SLOW |
I128 | I128 | I32 | 2^28 | 2^4 | 5.371 ms | 0.32% | 5.399 ms | 0.31% | 27.652 us | 0.51% | SLOW |
I128 | I128 | I32 | 2^28 | 2^8 | 5.069 ms | 0.39% | 5.088 ms | 0.39% | 18.667 us | 0.37% | SAME |
I128 | F32 | I32 | 2^28 | 2^1 | 3.543 ms | 0.64% | 3.544 ms | 0.68% | 1.352 us | 0.04% | SAME |
I128 | F32 | I32 | 2^28 | 2^4 | 2.381 ms | 0.87% | 2.385 ms | 0.84% | 3.690 us | 0.15% | SAME |
I128 | F32 | I32 | 2^28 | 2^8 | 2.157 ms | 1.23% | 2.160 ms | 1.20% | 2.648 us | 0.12% | SAME |
I128 | F64 | I32 | 2^28 | 2^1 | 4.089 ms | 0.73% | 4.097 ms | 0.71% | 8.141 us | 0.20% | SAME |
I128 | F64 | I32 | 2^28 | 2^4 | 2.718 ms | 1.00% | 2.730 ms | 1.06% | 11.515 us | 0.42% | SAME |
I128 | F64 | I32 | 2^28 | 2^8 | 2.407 ms | 1.50% | 2.415 ms | 1.42% | 7.752 us | 0.32% | SAME |
I128 | C64 | I32 | 2^28 | 2^1 | 12.900 ms | 0.17% | 12.900 ms | 0.17% | 0.389 us | 0.00% | SAME |
I128 | C64 | I32 | 2^28 | 2^4 | 11.663 ms | 0.24% | 11.663 ms | 0.24% | 0.153 us | 0.00% | SAME |
I128 | C64 | I32 | 2^28 | 2^8 | 11.350 ms | 0.28% | 11.350 ms | 0.28% | -0.591 us | -0.01% | SAME |

Thanks for reporting the benchmarks. They look good except for Reduce Max on I8, I32, 2^28. A 14% slowdown unfortunately violates @gevtushenko's rule of "no regressions of more than 2% compared to previous implementation on 2^24+ problem sizes". Could you please investigate the cause of this regression? We should try to fix it.
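For reference, the `%Diff` arithmetic and the 2% rule quoted above can be sketched as follows. The helper names and the default threshold parameter are hypothetical and not part of CUB or nvbench. The sketch only restates the rule from this comment; it does not reproduce how the `Status` column is computed.

```cpp
#include <cassert>

// Relative change of the compared time against the reference time, in percent.
// Positive values mean the compared build is slower than the reference.
inline double percent_diff(double ref_time_us, double cmp_time_us)
{
  assert(ref_time_us > 0.0); // a reference time of zero has no relative change
  return (cmp_time_us - ref_time_us) / ref_time_us * 100.0;
}

// Hypothetical helper: true when the slowdown exceeds the allowed threshold
// (2% by default, following the rule quoted in the comment above).
inline bool violates_regression_rule(double ref_time_us,
                                     double cmp_time_us,
                                     double max_regression_percent = 2.0)
{
  return percent_diff(ref_time_us, cmp_time_us) > max_regression_percent;
}
```

With the I8/I32 row of the first table (107.167 us reference, 122.152 us compared), `percent_diff` gives roughly 13.98%, so `violates_regression_rule` returns `true`.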
A 14% slowdown is too large. Let me see if I can fix it.
@bernhardmgruber (and @gevtushenko) All routines that show regressions here had previously been "artificially" sped up by the following problem: non-standard binary operators were recognized as operators that can be optimized as a binary tree reduction. In fact, the code can't optimize these operators because it has no knowledge of their structure. In summary, these regressions cannot be avoided for user-provided binary operators.
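To illustrate the point about operator recognition, here is a minimal host-side sketch, not the actual CUB code. The trait `is_known_reduction_op`, the functor `custom_max`, and `can_use_specialized_path` are hypothetical names. The assumption is that a library can only select a specialized path for operator types it recognizes, while a user-provided functor or lambda is opaque even when it happens to compute the same thing.

```cpp
#include <cassert>
#include <functional>
#include <type_traits>

// Hypothetical trait: only operator types the library explicitly recognizes
// are treated as having a known structure.
template <typename Op>
struct is_known_reduction_op : std::false_type
{};

template <typename T>
struct is_known_reduction_op<std::plus<T>> : std::true_type
{};

// A user-provided operator: the library cannot see what it computes.
struct custom_max
{
  int operator()(int a, int b) const
  {
    return a < b ? b : a;
  }
};

// Decides at compile time whether a specialized path may be selected.
template <typename Op>
constexpr bool can_use_specialized_path(const Op&)
{
  return is_known_reduction_op<Op>::value;
}

inline void usage_example()
{
  // A lambda that adds is semantically "plus", but its type is opaque.
  auto opaque_plus = [](int a, int b) { return a + b; };

  assert(can_use_specialized_path(std::plus<int>{}));
  assert(!can_use_specialized_path(custom_max{}));
  assert(!can_use_specialized_path(opaque_plus));
}
```

Under this assumption, treating `custom_max` or `opaque_plus` as recognized would be the kind of misclassification described above.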
Fix nvbug: 4965585
The following routines showed performance regressions after PR #2756:
The PR includes the following changes:
- `plus` operator and `int`/`unsigned` data types
- `int64_t`/`uint64_t` use a binary-level reduction instead of a ternary reduction
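As a rough illustration of the difference between a binary-level and a ternary reduction, the following host-side sketch reduces a sequence level by level with a configurable fan-in. It is an assumption-laden model, not the CUB implementation: `tree_reduce` and its `levels` output parameter are hypothetical, and the sketch says nothing about why one shape is faster for a given data type on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Reduces `values` level by level, combining up to `fan_in` neighboring
// elements per step: fan_in == 2 models a binary tree, fan_in == 3 a ternary one.
// If `levels` is non-null, it receives the number of tree levels that were needed.
template <typename T, typename Op>
T tree_reduce(std::vector<T> values, Op op, std::size_t fan_in, std::size_t* levels = nullptr)
{
  assert(!values.empty() && fan_in >= 2);
  std::size_t depth = 0;
  while (values.size() > 1)
  {
    std::vector<T> next;
    for (std::size_t i = 0; i < values.size(); i += fan_in)
    {
      // Combine one group of up to `fan_in` elements into a single partial result.
      T acc = values[i];
      for (std::size_t j = i + 1; j < i + fan_in && j < values.size(); ++j)
      {
        acc = op(acc, values[j]);
      }
      next.push_back(acc);
    }
    values = std::move(next);
    ++depth;
  }
  if (levels != nullptr)
  {
    *levels = depth;
  }
  return values.front();
}
```

For nine elements, a fan-in of 2 takes four levels (9, 5, 3, 2, 1 partial results) and a fan-in of 3 takes two levels (9, 3, 1); both produce the same result for an associative operator such as addition.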