Open novakovicdj opened 1 month ago
Please provide a comparison of the average only CPU time (new solver vs old api) measurements for 100 calls with same problem and the costs associated with the first call of the unique problem configuration.
Here is a comparison of average host time between the old and new structure:

| Kernel | Run | New structure [ms] | Old structure [ms] | diff [ms] |
|---|---|---|---|---|
| Op1dTensorGeneric | first run | 279.3786 | 291.3806 | -12.002 |
| | other 100 runs | 0.2908 | 0.2549 | 0.0359 |
| Op2dTensorGeneric | first run | 281.8186 | 283.4622 | -1.6436 |
| | other 100 runs | 0.356 | 0.2432 | 0.1128 |
| Op2dTensorLite | first run | 634.2228 | 662.2278 | -28.005 |
| | other 100 runs | 0.335 | 0.2308 | 0.1042 |
| Op2dTensorSquash | first run | 668.978 | 699.9932 | -31.0152 |
| | other 100 runs | 0.3481 | 0.2548 | 0.0933 |
| Op3dTensorGeneric | first run | 642.1512 | 656.3394 | -14.1882 |
| | other 100 runs | 0.2659 | 0.2485 | 0.0174 |
| OpTensorFwdBias | first run | 636.6204 | 654.8222 | -18.2018 |
| | other 100 runs | 0.3351 | 0.2321 | 0.103 |
| OpTensorFwdBiasGeneric | first run | 636.4756 | 662.4915 | -26.0159 |
| | other 100 runs | 0.3498 | 0.2434 | 0.1064 |
| OpTensorLeadingOnes | first run | 644.8348 | 666.8713 | -22.0365 |
| | other 100 runs | 0.3466 | 0.2755 | 0.0711 |
| OpTensorLeadingOnesGeneric | first run | 648.6535 | 669.6379 | -20.9844 |
| | other 100 runs | 0.3552 | 0.2569 | 0.0983 |
| Op4dTensorLite | first run | 641.4747 | 664.4976 | -23.0229 |
| | other 100 runs | 0.33 | 0.2206 | 0.1094 |
| Op4dTensorGeneric | first run | 650.7638 | 670.8961 | -20.1323 |
| | other 100 runs | 0.3563 | 0.2456 | 0.1107 |
| Op5dTensorGeneric | first run | 655.6774 | 685.431 | -29.7536 |
| | other 100 runs | 0.3745 | 0.2437 | 0.1308 |
On average, the new structure is about 20 ms faster for first runs and about 0.1 ms slower over the other 100 calls, i.e. roughly 0.001 ms slower per call.
The results are very strange; we need the experiment protocol. How was the program executed, and what was used for measurement? So far, according to the table, each subsequent launch is on average 30% slower.
I have talked with @CAHEK7, and he suggested removing compilation time from the first-run time and enabling kernel profiling. Here are the new results. Note: the diff for the first run is `(new_total_time - new_comp_time) - (old_total_time - old_comp_time)`.
| Kernel | Run | New structure, total [ms] | New structure, compilation [ms] | Old structure, total [ms] | Old structure, compilation [ms] | diff [ms] |
|---|---|---|---|---|---|---|
| Op1dTensorGeneric | First | 279.0176 | 278.813 | 289.2538 | 289.0891 | 0.0399 |
| | Other 100 | 2.3191 | | 2.5519 | | -0.2328 |
| Op2dTensorGeneric | First | 282.0158 | 281.8159 | 285.6689 | 285.4783 | 0.0093 |
| | Other 100 | 2.318 | | 2.3533 | | -0.0353 |
| Op2dTensorLite | First | 645.8057 | 645.6093 | 662.8484 | 662.6658 | 0.0138 |
| | Other 100 | 2.3686 | | 2.275 | | 0.0936 |
| Op2dTensorSquash | First | 680.7002 | 680.5031 | 709.6122 | 709.4446 | 0.0295 |
| | Other 100 | 2.5681 | | 2.2783 | | 0.2898 |
| Op3dTensorGeneric | First | 641.8974 | 641.701 | 695.5571 | 695.3875 | 0.0268 |
| | Other 100 | 2.471 | | 2.1212 | | 0.3498 |
| OpTensorFwdBias | First | 640.256 | 640.062 | 680.8951 | 680.709 | 0.0079 |
| | Other 100 | 2.3217 | | 2.4506 | | -0.1289 |
| OpTensorFwdBiasGeneric | First | 671.2975 | 671.0999 | 686.9962 | 686.8118 | 0.0132 |
| | Other 100 | 2.5107 | | 2.3875 | | 0.1232 |
| OpTensorLeadingOnes | First | 676.697 | 676.5015 | 660.4297 | 660.2519 | 0.0177 |
| | Other 100 | 2.5548 | | 2.4725 | | 0.0823 |
| OpTensorLeadingOnesGen | First | 656.037 | 655.834 | 677.3558 | 677.18 | 0.0272 |
| | Other 100 | 2.7894 | | 2.4013 | | 0.3881 |
| Op4dTensorLite | First | 662.9597 | 662.7599 | 667.7071 | 667.5309 | 0.0236 |
| | Other 100 | 2.5528 | | 2.2425 | | 0.3103 |
| Op4dTensorGeneric | First | 659.8318 | 659.6297 | 673.9867 | 673.8067 | 0.0221 |
| | Other 100 | 2.3756 | | 2.3735 | | 0.0021 |
| Op5dTensorGeneric | First | 654.421 | 654.2273 | 693.2979 | 693.1147 | 0.0105 |
| | Other 100 | 2.2762 | | 2.187 | | 0.0892 |
This is a draft PR for refactoring the tensor ops kernels to the solver structure; so far only the Op1dTensorGeneric kernel has been switched.