Closed atamazov closed 7 months ago
(10.30.30) [Future] Tunable and able to auto-tune quickly as per https://github.com/ROCmSoftwarePlatform/MIOpen/issues/378#issuecomment-682203795.
AFAIK from our data the tolerance for even 1 compile is pretty low. The kernel should already be loadable and our heuristics / model should just be a selector.
Keep in mind that what we do with Find is, hopefully, just a bridge to a better API implementation. There are still a lot of good features in Immediate Mode, so we still need a plan for making all of this work with Immediate Mode. Additionally, Find 2.0 is still on the books to be implemented.
> AFAIK from our data the tolerance for even 1 compile is pretty low. The kernel should already be loadable and our heuristics / model should just be a selector.
We already discussed that (in the meeting) and IIRC agreed that it is not physically possible to pre-compile all the dynamic kernels that comply with the minimal requirements (dynamic N, H, W; see (10.10)). On the other hand, spending a minute compiling a dozen or so kernels per network (which then runs for hours) should be okay.
Similarly, spending ~5 sec tuning a dynamic kernel (which would run in a network for a long time) looks like a wise investment. Anyway, the design is flexible enough to enable or disable this later, after some experiments.
> Similarly, spending ~5 sec for tuning a dynamic kernel (that would work in a network for a long time) looks like a wise investment.
If by this you mean that it happens online, then I completely disagree. End users will not know how to do this, and MIOpen will never know the difference between ResNet50 and Mask R-CNN.
Each developer must create `GetWti(ConvolutionContext&)` so that, given any input, we can calculate the solver's relative score on that input? I have my doubts... but let's see.
Personally, I would still rather have a data-driven classifier.
We need `GetWti()` for dynamic kernels only. Also, we do not need a WTI for every input right away. I would advise developers to begin with something that is easy to calculate (and return `wti_unknown` (-2.0) for the rest). Later we can extend the applicable domain of `GetWti()`.
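To make the "start small, return `wti_unknown` for the rest" advice concrete, a first implementation might look like the sketch below. The `ConvolutionContext` struct and the estimation logic are simplified stand-ins, not the actual MIOpen types; only the `wti_unknown` (-2.0) convention comes from the discussion above.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for MIOpen's ConvolutionContext.
struct ConvolutionContext
{
    int n, c, h, w; // batch size and tensor dimensions
};

// Sentinel meaning "WTI cannot be estimated for this input".
constexpr float wti_unknown = -2.0f;

// Start with an easy-to-calculate estimate for a narrow, well-understood
// domain and return wti_unknown for everything else; the applicable
// domain can be extended later.
float GetWti(const ConvolutionContext& ctx)
{
    const bool easy_case = (ctx.c % 8 == 0) && (ctx.h >= 16) && (ctx.w >= 16);
    if (!easy_case)
        return wti_unknown;
    // Toy estimate: larger batches utilize the GPU better (illustrative only).
    return ctx.n >= 32 ? 0.9f : 0.5f;
}
```

The point of the sentinel is that the fallback path can simply skip any solver whose estimate is `wti_unknown`, so a partial implementation is still useful.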
I'm concerned, or rather confused, that without a WTI based on inputs it will be functionally useless.
Me too -- that is why I am asking kernel developers to help.
This is a blocker for ROCm 4.0
The proof of concept for the Winograd kernels is looking good. @atamazov to look into integrating this feature with the broader library.
Status update:

- `GetWti` for the WinogradRxS2x3 solver (done in the `getwti-wino2111` branch).
- `GetWti` for GEMM (done in the `getwti-wino2111` branch).

@aserio Status updated at https://github.com/ROCmSoftwarePlatform/MIOpen/issues/410#issuecomment-683985700
@atamazov What is left in this task list?
Basically nothing. All the necessary redesign (required for GetWti) is done and the minimal implementation is ready. So this specific ticket can be closed, I think.
I believe we can further improve performance by adding `GetWti()` implementations to some other dynamic solvers. To figure this out, I would recommend identifying the relevant dynamic solvers and implementing `GetWti()` in them.
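To illustrate why more `GetWti()` implementations help: the fallback can only rank solvers whose WTI it can estimate, so each new implementation enlarges the pool of candidates. A toy selection sketch follows; the names (`SolverEstimate`, `PickBestSolver`) are hypothetical and not the actual MIOpen fallback code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sentinel meaning "WTI cannot be estimated" (per the convention above).
constexpr float wti_unknown = -2.0f;

struct SolverEstimate
{
    std::string name;
    float wti; // estimated Weighted Time Improvement
};

// Pick the solver with the highest known WTI. Solvers reporting
// wti_unknown never win, so a solver without GetWti() is effectively
// invisible to this kind of fallback ranking.
std::string PickBestSolver(const std::vector<SolverEstimate>& candidates)
{
    std::string best;
    float best_wti = wti_unknown;
    for (const auto& c : candidates)
        if (c.wti != wti_unknown && c.wti > best_wti)
        {
            best     = c.name;
            best_wti = c.wti;
        }
    return best; // empty string when no candidate could be estimated
}
```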
According to the analysis at https://github.com/ROCmSoftwarePlatform/MIOpen/issues/398#issuecomment-680970315, we can use the existing Fast Find mode. However, the Immediate Mode Fallback path needs to be improved for this purpose:

- (`group_count == 1`)
- `bool IsDynamic() const` that should return `true`.
- `float GetWti(ConvolutionContext&)` that should compute and return an approximated value of the expected WTI, or -2.0 when this value can't be computed. Tips:
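Putting the listed requirements together, a new dynamic solver would sketch out roughly as below. This is a simplified illustration under stated assumptions: the real MIOpen solver base classes and `ConvolutionContext` are much richer, and `MyDynamicSolver`, `IsApplicable`, and the placeholder estimate are invented for the example.

```cpp
#include <cassert>

// Simplified stand-in for MIOpen's ConvolutionContext.
struct ConvolutionContext
{
    int group_count;
    int n, h, w; // the dynamic dimensions
};

// Sentinel meaning "WTI cannot be estimated for this input".
constexpr float wti_unknown = -2.0f;

// Skeleton of a dynamic solver following the listed requirements.
struct MyDynamicSolver
{
    // Requirement: dynamic solvers must advertise themselves as such.
    bool IsDynamic() const { return true; }

    // Only non-grouped convolutions (group_count == 1) in this sketch.
    bool IsApplicable(const ConvolutionContext& ctx) const
    {
        return ctx.group_count == 1;
    }

    // Requirement: return an approximation of the expected WTI,
    // or wti_unknown (-2.0) when it cannot be computed.
    float GetWti(const ConvolutionContext& ctx) const
    {
        if (!IsApplicable(ctx))
            return wti_unknown;
        return 0.5f; // placeholder estimate for the easy domain
    }
};
```

Following the advice earlier in the thread, `GetWti` here covers only an easy-to-calculate domain and falls back to `wti_unknown` elsewhere; the domain can be widened incrementally.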