daniellowell opened this issue 4 years ago
> If the remaining compile-time parameters do not significantly reduce the number of compiles, then it may be the case that the kernel should be converted to assembly code.
Because assembling is much faster than OCL/HIP compilation?
Basically speaking, if the source code is a .s assembly file, only the assembly phase is needed. If the source code is HIP/OCL, it must go through front-end -> IR -> back-end, so the compile time is much longer. So converting a static kernel to a dynamic kernel can already save a great deal of time, whether it is dynamic HIP or dynamic OCL.
The decision to choose [1] ASM-dynamic or [2] HIP/OCL-dynamic should, I think, be based on the following factors:
So from my humble experience, we can have the following preference:
@carlushuang Thanks for explanations. Just in case: AFAICS the assembly builds are ~100 times faster than HIP builds and ~15 times faster than OCL builds (you can try auto-tuning and see how many kernels fit into 3 second logging intervals). Therefore even linear transformation from HIP/OCL to ASM (without adding any "dynamism") would yield substantial acceleration. Of course, extending the coverage of a kernel (making it more "dynamic" than before) is the preferred way because it also saves space in the binary cache.
> AFAICS the assembly builds are ~100 times faster than HIP builds and ~15 times faster than OCL builds (you can try auto-tuning and see how many kernels fit into 3 second logging intervals).
May I know the average compilation time for a HIP/OCL/ASM kernel? I saw HIP take 4 s or more and ASM take 100-200 ms. Given that much of the kernel run time is below 100 ms, even if each config runs only once, a dynamic kernel still beats the static ones. @daniellowell how do you plan to support Mask R-CNN/RetinaNet? Can we run these two with an env var such as MIOPEN_FIND_MODE=Fast to pick up only dynamic kernels?
@sabreshao The initial push for this is to support Mask R-CNN and RetinaNet type networks.
Purpose
This project exists to minimize our reliance on compile-time parameterization in MIOpen's source kernels. The goal isn't to sacrifice performance, but rather to determine ways of reducing the compile-time overhead of the first iteration of neural networks using MIOpen.
Strategy
For some of these kernels the task is fairly straightforward: take the compile-time parameters and move them into runtime parameters. In some cases this can be done without affecting performance. Often, however, not all compile-time parameters can be moved to runtime without seriously affecting performance. In those cases we should identify the parameters that networks change least frequently, so that compiles are minimized. If the remaining compile-time parameters do not significantly reduce the number of compiles, then it may be the case that the kernel should be converted to assembly code.
Priority Tasks
Data collection
Structural
Convolution Changes
Priority: HIGH
[ ] Convolutions
[ ] iGEMM HIP source refactor (@asroy) 7/24/2020
[x] ASM-iGEMM (@carlushuang , @shaojiewang, @jane-zxy, @fronteer)
[ ] Make ASM iGEMM kernels only in hybrid MIOPEN_FIND_MODE (ISSUE: #299) (@zjing14 ) ROCm 3.8
[ ] Data collection for solver usage via Tuna (@ce1adon)
Non-Convolution Changes
Priority: HIGH
Priority: MEDIUM
[ ] copyTensor / castTensor / setTensor / scaleTensor ROCm 3.8
[ ] subSample / upSample (@alexandraBara) ROCm 3.8
[x] TensorOps (@ce1adon) ROCm 3.8
[ ] Activations (@cderb) ROCm 3.8
[ ] transpose_NCHW2CNHW / transpose_CNHW2NCHW ROCm 3.8
[x] RNN / RNN Update (@ce1adon) ROCm 3.8
[ ] Pooling