daniellowell opened this issue 4 years ago
> If the remaining compile-time parameters do not significantly reduce the number of compiles, then it may be the case that the kernel should be converted to assembly code.
Because assembling is much faster than OCL/HIP compilation?
Basically speaking, if the source code is a .s assembly file, only the assembly phase is needed. If the source code is HIP/OCL, it must go through front-end -> IR -> back-end, so the compile time is much longer. So converting a static kernel to a dynamic kernel can already save a great deal of time, whether it is dynamic HIP or dynamic OCL.
The decision to choose [1] ASM-dynamic or [2] HIP/OCL-dynamic should, I think, be based on the following factors:
So from my humble experience, we can have the following preference:
@carlushuang Thanks for explanations. Just in case: AFAICS the assembly builds are ~100 times faster than HIP builds and ~15 times faster than OCL builds (you can try auto-tuning and see how many kernels fit into 3 second logging intervals). Therefore even linear transformation from HIP/OCL to ASM (without adding any "dynamism") would yield substantial acceleration. Of course, extending the coverage of a kernel (making it more "dynamic" than before) is the preferred way because it also saves space in the binary cache.
> AFAICS the assembly builds are ~100 times faster than HIP builds and ~15 times faster than OCL builds (you can try auto-tuning and see how many kernels fit into 3 second logging intervals).
May I know the average compilation time for a HIP/OCL/ASM kernel? I saw HIP take 4 s or more and ASM take 100-200 ms. Given that much of the kernel run time is below 100 ms, even if each config runs only once, a dynamic kernel still beats the static ones. @daniellowell how do you plan to support Mask R-CNN/RetinaNet? Can we run these two with an env var such as MIOPEN_FIND_MODE=Fast to pick up only dynamic kernels?
@sabreshao The initial push for this is to support Mask R-CNN and RetinaNet type networks.
Purpose
This project exists to minimize our reliance on compile-time parameterization in MIOpen's source kernels. The goal isn't to sacrifice performance, but rather to determine ways of reducing the compile-time overhead of the first iteration of neural networks using MIOpen.
Strategy
For some of these kernels the task is fairly straightforward: take the compile-time parameters and move them into runtime parameters. In some cases this can be done without affecting performance. Often, however, not all compile-time parameters can be moved to runtime without seriously affecting performance. In those cases we should identify the parameters that networks change least frequently, so that compiles are minimized. If the remaining compile-time parameters do not significantly reduce the number of compiles, then it may be the case that the kernel should be converted to assembly code.
Priority Tasks
Data collection
Structural
Convolution Changes
Priority: HIGH
[ ] Convolutions
[ ] iGEMM HIP source refactor (@asroy) 7/24/2020
[x] ASM-iGEMM (@carlushuang , @shaojiewang, @jane-zxy, @fronteer)
[ ] Make ASM iGEMM kernels only in hybrid MIOPEN_FIND_MODE (ISSUE: #299) (@zjing14 ) ROCm 3.8
[ ] Data collection for solver usage via Tuna (@ce1adon)
Non-Convolution Changes
Priority: HIGH
Priority: MEDIUM
[ ] copyTensor / castTensor / setTensor / scaleTensor ROCm 3.8
[ ] subSample / upSample (@alexandraBara) ROCm 3.8
[x] TensorOps (@ce1adon) ROCm 3.8
[ ] Activations (@cderb) ROCm 3.8
[ ] transpose_NCHW2CNHW / transpose_CNHW2NCHW ROCm 3.8
[x] RNN / RNN Update (@ce1adon) ROCm 3.8
[ ] Pooling