Initial Iteration Time: Overall Plan

JehandadKhan commented 4 years ago

This issue describes the initial ideas for the overall plan and opens it for discussion. While issue #337 explains some of the details, however, it is more focused on assigning dynamic kernels to developers and lacks the overall picture.

Primary Challenge

The primary issue facing MIOpen right now is that the cost of compiling and searching for the optimum algorithm far outweighs the benefits in performance we get by doing such an exhaustive search. While such an approach does make some sense for classical deep networks where the same back bone network is compiled once and then launched many times. However, if the parameters in the network are data dependent many kernels might be compiled while the realized performance advantage is very little in comparison to the amount of time spent on compiling and benchmarking kernels.

The primary thesis presented here is that for a convolution config that MIOpen does not know we should not compile any kernel at all and must be served by a precompiled, parameter-less kernel ( such as Winograd, Dynamic iGEMM or a rocBlas call ). This ensures that no time is spent on compiling a kernel and the GPU may be kept busy all the time.

It may be argued that an assembly kernel has a much shorter assembly/finalizing time and therefore, it may be acceptable to add asm kernels to the mix above. However, even an asm kernel takes more than 100ms to finalize, even which may be too long a time to benefit from the improved performance. I am planning an experiment to prove/disprove this theory.

Even though the above issues have been brought to the forefront by Mask-RCNN and retina-net, MIOpen faces the same issues on any other network which may not be tracked by us, therefore, it is imperative that we develop a solution that is general purpose and not a quick fix.

Proposal for Convolution Operations

The above goal may be achieved by relying on the find-db and the binary cache.

Ensuring that all winning solvers in the find-db are present in the binary cahche package. This would ensure that we maintain performance parity for classic network and special workloads that are critical.
All dynamic kernels ( in their various incarnations) must be part must be part of the binary cache. This is to ensure that when there is a find-db miss the associated binary cache is available and ready to be launched.
Col2Im and Im2Col must be fully dynamic and placed in the binary cache, to ensure launching gemm solution without any overhead. @asroy
A (yet another) find mode which operates as follows (@atamazov):
- Check in find-db, if there is a hit, load the kernel from the binary cache and serve the convolution.
- If there is a miss in find-db check if a dynamic kernel ( Winograd, Dynamic iGEMM) can serve the kernel, if so load them ( again from binary cache) and serve the convolution.
- If the dynamic kernels are unable to provide the convolution solution, fall back to GEMM.
  - Please note, the above decision tree ensures that no kernels are compiled for launching a convolution.
- Do not write to the find-db, since such solution are not perf optimal which violates the definition of the find-db
- Provide accurate Workspace size for the above cases, No over estimation of workspace size
- Only the original find mode (aka MIOPEN_FIND_MODE=1) should be allowed to write to the find-db, since everything else is a heuristic and should not be serialized.
- Implement hooks in the find-mode to override based on heuristic functions. Similar in function to the isWinogradApplicableAndFast but formalized to be extendable with any logic that dependens on the ProblemDescription. This is important for extensibility and future proofing our find-machinery. This goal is lower priority as compared to the other items in the list above but must be kept in mind while designing code paths.

What is the difference of Dynamic convolution kernel vs. "non-dynamic"?

When converting a static kernel (with compile time args) to a dynamic one, it must be ensured that batchsize, image_height and image_width are always kept dynamic, other parameters may be made static to improve performance. However, it must be kept in mind that for such kernels all the permutations of all the static arguments would need to be hosted in the binary cache resulting in increased disk usage for the library.

Proposal for Non-Convolution Operations

While convolution operations form the bulk of expensive operations, non-convolution operations also contribute considerable amount of delay, especially if kernels follow the same design such that any change in the input parameters require a re-compile. Based on the usage of non-Conv operations in the frameworks Batchnorm is the first priority for conversion to a dynamic kernel since most other ops are implemented internally by higher level frameworks.

For batch-norm we would require three kernels for the following cases ( based on memory access patterns and parameter variations)

for batchsize = 1
for batchsize = 2
for everything else

These kernels should not have any other compile time specified parameters so we can stash them in the binary cache.

The ASM version of Batch Norm is expected in a few weeks and should hopefully mitigate some of the compilation issues, however, as discussed earlier even the assembly time might not be acceptable in the over all runtime.

It is also possible that we add Batch Norm to the find-db so that we can implement similar mechanisms for non-conv ops as we have for convolution. However, this may be kept as a long term goal due to the amount of work and additional discussion/thought required.

Following Batchnorm, tensor ops would be converted to dynamic kernels ( most of them are already converted as mentioned in issue #337 ) and finally followed by pooling.

Task List:

In addition to the task assignment in dynamic kernels #<> the following tasks have been identified and assigned so far:

[x] Identify dynamic solvers as fall back set | @JehandadKhan | Aug-28-2020|
[x] Implemented Find Mode described above | @atamazov, @DrizztDoUrden | #417 Sep-11-2020 |
[ ] Implement API based override for the Find Mode for programmatic access | @JehandadKhan, @atamazov | #489 ready for review since Oct-6-2020 |
[x] ~~Dynamic Im2Col2d / Col2Im2D~~
[ ] Dynamic batchnorm | @daniellowell, @muralinr | <Please fill in the expected completion date |
[ ] Identify and add more Tasks | @JehandadKhan | ongoing |

Guidelines for Reviewers/Comments

Please refrain from referencing comments on other threads, which makes it difficult to follow the conversation. I will try to update the description with discussion below.

@atamazov @daniellowell

asroy commented 4 years ago

Col2Im and Im2Col must be fully dynamic and placed in the binary cache, to ensure launching gemm solution without any overhead. @asroy

We can implement Col2Im and Im2Col using dynamic composable kernels, once it is implemented. Converting the OCL kernel will be throw-away work.

Non-convolution operations can also be written using dynamic composable kernels.

We can have dynamic composable kernels for following operations in ROCm 3.10 time frame: iGEMM, Im2Col, Col2Im. Pooling is like a simplified version of convolution, so it's also straightforward to rewritten using dynamic composable kernel, but since it's already fully dynamic, so we don't need to do it for ROCm 3.10.

Batch-norm is performance critical, so it may need some optimization work, and can be rewritten using composable kernel after 3.10.

atamazov commented 4 years ago

Thank you Jehandad, for very detailed and informative analysis. Some feedback to "Proposal for Convolution Operations":

Ensuring that all winning solvers in the find-db are present in the binary cahche package...

[Informative] Exactly. And IIRC this is already one of our current goals for Embedded. It is very good that we reuse existing machinery for the new features.

All dynamic kernels ( in their various incarnations) must be part must be part of the binary cache...

Yes. Unfortunately, that would require some modification of the mechanisms that collect pre-compiled binary cache. We have winners, and we have to collect also all dynamic kernels. In case they are not winners, which is quite possible.

However, I do not see substantial harm if some dynamic kernel is missing. It would be automatically compiled and put into binary cache only once (per MIOpen user), see below.

A (yet another) find mode which operates as follows:

This is very like the existing Fast Find mode, that re-uses Immmediate mode machinery. We only need to improve the Immediate mode fallback, which was planned long ago. The desired functionality will be ensured automatically (except that if will be capable to build dynamic kernels missing from the pre-compiled binary cache).

atamazov commented 4 years ago

When converting a static kernel (with compile time args) to a dynamic one, it must be ensured that batchsize, image_height and image_width are always kept dynamic, other parameters may be made static to improve performance. However, it must be kept in mind that for such kernels all the permutations of all the static arguments would need to be hosted in the binary cache resulting in increased disk usage for the library.

This is a bit questionable. It is so that currently known networks (Mask-RCNN, retina-net...) vary only N, H and W? Even if this is so right now, the future networks may vary more parameters.

JehandadKhan commented 4 years ago

This is a bit questionable. It is so that currently known networks (Mask-RCNN, retina-net...) vary only N, H and W? Even if this is so right now, the future networks may vary more parameters.

Your assessment is correct, the comment was only meant to highlight the fact that N, H and W have a higher priority when converting parameters from static to dynamic.

t-vi commented 4 years ago

Thank you for working on this!

I wonder if for BatchNorm a relatively simple HIP kernel might work (i.e. one that use shuffling for the reduction). A long time ago I improved and benchmarked the PyTorch BN kernel and it seemed to have good performance for FP32 on CUDA.

Another architecture that suffers tremendously from this problem is FastSpeech (e.g. https://github.com/ming024/FastSpeech2/), a TTS system.

In my experience, I would tend to think that the input dimensions are the most important thing to get working dynamically, kernel size, strides etc. much less so.

atamazov commented 4 years ago

On behalf of @daniellowell

Phased implementation for convolutions

Since the heuristic implementation is a bit more experimental, and we have a couple of different approaches, I propose we implement this in two phases.

First get the PR merge that implements the reduced set of solvers so that we have the Dynamic Hybrid Find. In case of find-db miss, it would benchmark the remaining solvers, or implement a crude selection process (rather the former than the latter).
- [x] Implemented in PR #417
Then we can have two different (competing) PR to implement the fallback heuristic.
- [x] The first being the WTI approach, ticket #410, base PR #424
- [ ] The other being our MLP classifier, ticket #428

hgaspar commented 4 years ago

Regarding the question about how dynamic the number of channels is, the answer is definitely less than the rest (batch size, width, height): The network specifies the number of channels per layer, and while there is variability (128,256, etc,etc) , it is known beforehand, and most importantly, dataset independent. That is not the case for the remaining params, which are dataset and context dependent, which is the main point! For example, in the RPN part of masked RCNN, you see the batch size depending on the number of proposal regions (and the rest of the dimensions fixed before hand for a given run, but mutable via configuration parameters),

ROCm / MIOpen