ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

TensorOps kernels refactoring #3346

Open novakovicdj opened 1 month ago

novakovicdj commented 1 month ago

This is a draft PR for refactoring the tensor ops kernels to the solver structure; so far only the Op1dTensorGeneric kernel has been switched.
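For context, the "solver structure" targeted by this refactoring follows a dispatch pattern along these lines. The classes and names below are an illustrative sketch under assumed simplifications, not MIOpen's actual interfaces: each solver reports whether it can handle a problem, and the first applicable solver in the registry provides the solution.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified problem description for a tensor op.
struct TensorOpProblem
{
    int ndims;
};

// Minimal solver interface in the spirit of a solver structure: each solver
// says whether it applies to a problem and, if so, returns a "solution"
// (here reduced to a kernel-name string for illustration).
struct Solver
{
    virtual ~Solver()                                            = default;
    virtual bool IsApplicable(const TensorOpProblem& p) const    = 0;
    virtual const char* GetSolution(const TensorOpProblem&) const = 0;
};

struct Op1dTensorGenericSolver : Solver
{
    bool IsApplicable(const TensorOpProblem& p) const override { return p.ndims == 1; }
    const char* GetSolution(const TensorOpProblem&) const override
    {
        return "Op1dTensorGeneric";
    }
};

struct OpNdFallbackSolver : Solver
{
    bool IsApplicable(const TensorOpProblem&) const override { return true; }
    const char* GetSolution(const TensorOpProblem&) const override
    {
        return "OpNdTensorGeneric";
    }
};

// The dispatcher walks the registry and picks the first applicable solver.
const char* Solve(const std::vector<std::unique_ptr<Solver>>& registry,
                  const TensorOpProblem& p)
{
    for(const auto& s : registry)
        if(s->IsApplicable(p))
            return s->GetSolution(p);
    return nullptr;
}
```

The appeal of this pattern is that adding a new specialized kernel only means registering one more solver; the dispatch logic stays untouched.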

shurale-nkn commented 3 weeks ago

Please provide a comparison of the average CPU-only time (new solver vs. old API), measured over 100 calls with the same problem, along with the cost of the first call for a unique problem configuration.

novakovicdj commented 2 weeks ago

> Please provide a comparison of the average CPU-only time (new solver vs. old API), measured over 100 calls with the same problem, along with the cost of the first call for a unique problem configuration.

Here is a comparison of average host time between the old and new structure:

| Kernel | Run | New structure [ms] | Old structure [ms] | diff [ms] |
| --- | --- | --- | --- | --- |
| Op1dTensorGeneric | first run | 279.3786 | 291.3806 | -12.002 |
| | other 100 runs | 0.2908 | 0.2549 | 0.0359 |
| Op2dTensorGeneric | first run | 281.8186 | 283.4622 | -1.6436 |
| | other 100 runs | 0.356 | 0.2432 | 0.1128 |
| Op2dTensorLite | first run | 634.2228 | 662.2278 | -28.005 |
| | other 100 runs | 0.335 | 0.2308 | 0.1042 |
| Op2dTensorSquash | first run | 668.978 | 699.9932 | -31.0152 |
| | other 100 runs | 0.3481 | 0.2548 | 0.0933 |
| Op3dTensorGeneric | first run | 642.1512 | 656.3394 | -14.1882 |
| | other 100 runs | 0.2659 | 0.2485 | 0.0174 |
| OpTensorFwdBias | first run | 636.6204 | 654.8222 | -18.2018 |
| | other 100 runs | 0.3351 | 0.2321 | 0.103 |
| OpTensorFwdBiasGeneric | first run | 636.4756 | 662.4915 | -26.0159 |
| | other 100 runs | 0.3498 | 0.2434 | 0.1064 |
| OpTensorLeadingOnes | first run | 644.8348 | 666.8713 | -22.0365 |
| | other 100 runs | 0.3466 | 0.2755 | 0.0711 |
| OpTensorLeadingOnesGeneric | first run | 648.6535 | 669.6379 | -20.9844 |
| | other 100 runs | 0.3552 | 0.2569 | 0.0983 |
| Op4dTensorLite | first run | 641.4747 | 664.4976 | -23.0229 |
| | other 100 runs | 0.33 | 0.2206 | 0.1094 |
| Op4dTensorGeneric | first run | 650.7638 | 670.8961 | -20.1323 |
| | other 100 runs | 0.3563 | 0.2456 | 0.1107 |
| Op5dTensorGeneric | first run | 655.6774 | 685.431 | -29.7536 |
| | other 100 runs | 0.3745 | 0.2437 | 0.1308 |

On average, the new structure is about 20 ms faster for first runs and about 0.1 ms slower over the other 100 calls, i.e. about 0.001 ms per call.
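As a rough sanity check on how these two effects trade off (plain arithmetic on the averages above, not measured data): a ~20 ms first-call saving against a ~0.001 ms per-call overhead only flips in the old structure's favour after roughly 20,000 repeat calls of the same problem configuration.

```cpp
// Break-even estimate from the averages in the table above: how many repeat
// calls of the same problem configuration it takes before the steady-state
// per-call overhead of the new structure outweighs its first-call saving.
constexpr double first_call_saving_ms = 20.0;   // average first-run diff
constexpr double per_call_overhead_ms = 0.001;  // average steady-state diff

constexpr double break_even_calls = first_call_saving_ms / per_call_overhead_ms;

// The saving amortizes over roughly 20,000 calls (range check to tolerate
// floating-point rounding).
static_assert(break_even_calls > 19999.0 && break_even_calls < 20001.0,
              "first-call saving dominates until ~20k repeat calls");
```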

shurale-nkn commented 2 weeks ago

> Please provide a comparison of the average CPU-only time (new solver vs. old API), measured over 100 calls with the same problem, along with the cost of the first call for a unique problem configuration.

> Here is a comparison of average host time between the old and new structure […]
>
> On average, the new structure is about 20 ms faster for first runs and about 0.1 ms slower over the other 100 calls, i.e. about 0.001 ms per call.

The results are very strange; we need the experiment protocol. How was the program executed, and what was used for measurement? So far, according to the table, each subsequent launch is on average 30% slower.

novakovicdj commented 2 weeks ago

> Please provide a comparison of the average CPU-only time (new solver vs. old API), measured over 100 calls with the same problem, along with the cost of the first call for a unique problem configuration.

> Here is a comparison of average host time between the old and new structure […]

> The results are very strange; we need the experiment protocol. How was the program executed, and what was used for measurement? So far, according to the table, each subsequent launch is on average 30% slower.

I talked with @CAHEK7, and he suggested removing the compilation time from the first-run time and enabling kernel profiling, so here are the new results I got. Note: the diff for a first run is (new_total_time - new_comp_time) - (old_total_time - old_comp_time).

| Kernel | Run | New structure total [ms] | New structure compilation [ms] | Old structure total [ms] | Old structure compilation [ms] | diff [ms] |
| --- | --- | --- | --- | --- | --- | --- |
| Op1dTensorGeneric | First | 279.0176 | 278.813 | 289.2538 | 289.0891 | 0.0399 |
| | Other 100 | 2.3191 | | 2.5519 | | -0.2328 |
| Op2dTensorGeneric | First | 282.0158 | 281.8159 | 285.6689 | 285.4783 | 0.0093 |
| | Other 100 | 2.318 | | 2.3533 | | -0.0353 |
| Op2dTensorLite | First | 645.8057 | 645.6093 | 662.8484 | 662.6658 | 0.0138 |
| | Other 100 | 2.3686 | | 2.275 | | 0.0936 |
| Op2dTensorSquash | First | 680.7002 | 680.5031 | 709.6122 | 709.4446 | 0.0295 |
| | Other 100 | 2.5681 | | 2.2783 | | 0.2898 |
| Op3dTensorGeneric | First | 641.8974 | 641.701 | 695.5571 | 695.3875 | 0.0268 |
| | Other 100 | 2.471 | | 2.1212 | | 0.3498 |
| OpTensorFwdBias | First | 640.256 | 640.062 | 680.8951 | 680.709 | 0.0079 |
| | Other 100 | 2.3217 | | 2.4506 | | -0.1289 |
| OpTensorFwdBiasGeneric | First | 671.2975 | 671.0999 | 686.9962 | 686.8118 | 0.0132 |
| | Other 100 | 2.5107 | | 2.3875 | | 0.1232 |
| OpTensorLeadingOnes | First | 676.697 | 676.5015 | 660.4297 | 660.2519 | 0.0177 |
| | Other 100 | 2.5548 | | 2.4725 | | 0.0823 |
| OpTensorLeadingOnesGeneric | First | 656.037 | 655.834 | 677.3558 | 677.18 | 0.0272 |
| | Other 100 | 2.7894 | | 2.4013 | | 0.3881 |
| Op4dTensorLite | First | 662.9597 | 662.7599 | 667.7071 | 667.5309 | 0.0236 |
| | Other 100 | 2.5528 | | 2.2425 | | 0.3103 |
| Op4dTensorGeneric | First | 659.8318 | 659.6297 | 673.9867 | 673.8067 | 0.0221 |
| | Other 100 | 2.3756 | | 2.3735 | | 0.0021 |
| Op5dTensorGeneric | First | 654.421 | 654.2273 | 693.2979 | 693.1147 | 0.0105 |
| | Other 100 | 2.2762 | | 2.187 | | 0.0892 |
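The stated diff formula can be checked against any row of the table; the values below are the Op1dTensorGeneric first-run row:

```cpp
// Verify diff = (new_total - new_comp) - (old_total - old_comp)
// for the Op1dTensorGeneric first-run row of the table above.
constexpr double new_total = 279.0176;
constexpr double new_comp  = 278.813;
constexpr double old_total = 289.2538;
constexpr double old_comp  = 289.0891;

constexpr double diff = (new_total - new_comp) - (old_total - old_comp);

// Expected 0.0399 ms; range check to tolerate floating-point rounding.
static_assert(diff > 0.0398 && diff < 0.0400, "matches the tabulated diff");
```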

This is the part of the code that I changed in the tensor_ops test for this performance testing:

```cpp
handle.EnableProfiling(true);

auto start = std::chrono::steady_clock::now();
miopen::OpTensor2(handle,
                  // miopenTensorOpAdd,
                  // miopenTensorOpMax,
                  // miopenTensorOpMin,
                  miopenTensorOpMul,
                  &alpha0,
                  a.desc,
                  a_dev.get(),
                  &alpha1,
                  b.desc,
                  b_dev.get(),
                  &beta,
                  c.desc,
                  c_dev.get(),
                  Aoffset,
                  Boffset,
                  Coffset,
                  false); // it does not verify non-standard behaviour
auto end = std::chrono::steady_clock::now();
// Subtract the profiled kernel time so only host-side overhead remains.
auto elapsed_init =
    std::chrono::duration_cast<std::chrono::duration<float, std::milli>>(end - start)
        .count() -
    handle.GetKernelTime();

printf("First time: %f ms\n", elapsed_init);

float elapsed_other_full = 0;

for(int i = 0; i < 100; i++)
{
    start = std::chrono::steady_clock::now();
    miopen::OpTensor2(handle,
                      // miopenTensorOpAdd,
                      // miopenTensorOpMax,
                      // miopenTensorOpMin,
                      miopenTensorOpMul,
                      &alpha0,
                      a.desc,
                      a_dev.get(),
                      &alpha1,
                      b.desc,
                      b_dev.get(),
                      &beta,
                      c.desc,
                      c_dev.get(),
                      Aoffset,
                      Boffset,
                      Coffset,
                      false); // it does not verify non-standard behaviour
    end = std::chrono::steady_clock::now();
    auto elapsed_other =
        std::chrono::duration_cast<std::chrono::duration<float, std::milli>>(end - start)
            .count() -
        handle.GetKernelTime();
    elapsed_other_full += elapsed_other;
}

printf("Other times total: %f ms\nOther times average: %f ms\n",
       elapsed_other_full,
       elapsed_other_full / 100);
```

I switched between OpTensor and OpTensor2 to run the tests for the old and new structure, and only the test case for one kernel was run each time. I will run more tests and try to figure out why host times for some kernels are slower than for others.
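As a general note on this kind of host-side timing (a generic sketch, not MIOpen code): averaging 100 raw wall-clock samples is sensitive to scheduler noise on individual iterations, and reporting the median instead is one cheap way to make runs more comparable.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Run a callable `iters` times and report the median host time in ms.
// The median is less sensitive than the mean to occasional outlier
// iterations (page faults, scheduler preemption, etc.).
template <typename F>
double MedianHostMs(F&& f, int iters)
{
    std::vector<double> samples;
    samples.reserve(iters);
    for(int i = 0; i < iters; ++i)
    {
        auto start = std::chrono::steady_clock::now();
        f();
        auto end = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::duration<double, std::milli>>(end - start)
                .count());
    }
    // Partially sort just enough to place the median element.
    std::nth_element(samples.begin(), samples.begin() + iters / 2, samples.end());
    return samples[iters / 2];
}
```

In the snippet above, the call to measure would go in place of the callable, with `handle.GetKernelTime()` subtracted per iteration as before.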

novakovicdj commented 1 week ago

I did some profiling to compare the old and new structure and saw that creation of the network_config is slower than before; this is more visible for higher-dimensional tensors and is a consequence of the new network_config format. Network_config creation for 5d tensors is around 4 times slower than in the old structure, and around 3 times slower than 1d tensor network_config creation within the new structure. Because of that, I switched from a stream to a string and got a speed-up of around 2.2x.
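The stream-to-string switch described here can be illustrated with a generic sketch (the key format below is invented for illustration, not MIOpen's actual network_config layout):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Building a config key through std::ostringstream goes through the
// locale-aware stream formatting machinery and extra allocations.
std::string ConfigViaStream(int dims, int len)
{
    std::ostringstream ss;
    ss << "TensorOp-" << dims << "d-" << len;
    return ss.str();
}

// Direct std::string concatenation with a reserve() avoids most of that
// overhead, which is the kind of saving reported above.
std::string ConfigViaString(int dims, int len)
{
    std::string s;
    s.reserve(32); // typical key fits without reallocation
    s += "TensorOp-";
    s += std::to_string(dims);
    s += "d-";
    s += std::to_string(len);
    return s;
}
```

Both functions produce identical keys, so the swap is purely a performance change.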

After that I ran 500 iterations of the old and new structure for all tensor kernels; the new version is faster by 0.0005 ms per call on average, which is around 20% faster than the old structure.

BrianHarrisonAMD commented 1 week ago

I am not seeing a lot of test coverage for OpTensor before these changes. The only tests I can see that specifically run OpTensor seem to be these, and it looks like they aren't covering all ops.

Would it be possible to add new tests to the gtest suite to ensure correctness for the new solvers being added?

novakovicdj commented 1 week ago

> I am not seeing a lot of test coverage for OpTensor before these changes. The only tests I can see that specifically run OpTensor seem to be these, and it looks like they aren't covering all ops.
>
> Would it be possible to add new tests to the gtest suite to ensure correctness for the new solvers being added?

The current test for tensorOp covers all solvers except Op2dTensorSquash, but I made some changes, tested it locally, and it worked fine. There is a plan to switch this test to gtest, and those tensorOp testing improvements will be implemented then.

As part of this PR I will add some unit tests for the problem descriptor, so please do not merge this yet.