ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

Initial Iteration Time: Convolutions / Immediate Mode: Fallback improvements using GetWti() #410

Closed atamazov closed 7 months ago

atamazov commented 4 years ago

According to the analysis at https://github.com/ROCmSoftwarePlatform/MIOpen/issues/398#issuecomment-680970315, we can use the existing Fast Find mode. However, the Immediate Mode Fallback path needs to be improved for this purpose:

atamazov commented 4 years ago

Status

daniellowell commented 4 years ago

(10.30.30) [Future] Tunable and able to auto-tune quickly as per https://github.com/ROCmSoftwarePlatform/MIOpen/issues/378#issuecomment-682203795.

AFAIK from our data the tolerance for even 1 compile is pretty low. The kernel should already be loadable and our heuristics / model should just be a selector.

daniellowell commented 4 years ago

Keep in mind, what we do with Find, is "hopefully" just a bridge to a better API implementation. There are still a lot of good features in immediate mode, so we still need a plan to have this all work with immediate mode. Additionally, Find 2.0 is still on the books to be implemented.

atamazov commented 4 years ago

> AFAIK from our data the tolerance for even 1 compile is pretty low. The kernel should already be loadable and our heuristics / model should just be a selector.

We already discussed that (in the meeting) and IIRC agreed that it is not physically possible to pre-compile all the dynamic kernels that comply with the minimal requirements (dynamic N, H, W; see (10.10)). On the other hand, spending a minute compiling a few dozen kernels per network (that runs for hours) should be okay.

Similarly, spending ~5 sec for tuning a dynamic kernel (that would work in a network for a long time) looks like a wise investment. Anyway, the design is flexible enough to enable/disable this stuff later, after some experiments.

daniellowell commented 4 years ago

> Similarly, spending ~5 sec for tuning a dynamic kernel (that would work in a network for a long time) looks like a wise investment.

If by this you mean that this happens online, then I completely disagree. End users will not know how to do this and MIOpen will never know the difference between ResNet50 versus Mask-RCNN.

daniellowell commented 4 years ago

Each developer must create GetWti(ConvolutionContext&) so that, given any input, we can calculate their relative score on that input? I have my doubts... but let's see. Personally, I would still rather have a data-driven classifier.

atamazov commented 4 years ago

We need GetWti() for Dynamic kernels only. Also, we do not need WTI for every input right away. I would advise developers to begin with something that is easy to calculate (and return wti_unknown (-2.0) for the rest). Later we can extend the applicable domain of GetWti().
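The incremental approach described above could look roughly like the C++ sketch below. Only GetWti() and the wti_unknown value of -2.0 come from this thread. Everything else is a hypothetical stand-in for illustration: ConvProblem (a simplified substitute for ConvolutionContext), the DynamicSolver interface, both example solvers, SelectSolver(), and the 0.9 score are made up, and treating every negative WTI as "unknown" is an assumption, not confirmed MIOpen behavior.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for ConvolutionContext.
struct ConvProblem {
    int n, c, h, w, k;        // batch, input channels, height, width, output channels
    int filter_h, filter_w;
    int stride_h, stride_w;
};

// Sentinel mentioned in the thread: no estimate is available for this input.
constexpr float wti_unknown = -2.0f;

// Hypothetical interface for a dynamic solver.
struct DynamicSolver {
    virtual ~DynamicSolver() = default;
    virtual std::string Name() const = 0;
    virtual bool IsApplicable(const ConvProblem&) const = 0;
    // Default: the developer has not provided any estimate yet.
    virtual float GetWti(const ConvProblem&) const { return wti_unknown; }
};

// Example solver that estimates WTI only for an easy sub-domain.
struct Winograd3x3Sketch : DynamicSolver {
    std::string Name() const override { return "Winograd3x3Sketch"; }
    bool IsApplicable(const ConvProblem& p) const override {
        return p.filter_h == 3 && p.filter_w == 3;
    }
    float GetWti(const ConvProblem& p) const override {
        // Easy case: stride 1 with enough channels. The 0.9 is made up.
        if (p.stride_h == 1 && p.stride_w == 1 && p.c >= 16 && p.k >= 16)
            return 0.9f;
        return wti_unknown; // domain can be extended later
    }
};

// Example solver that keeps the default (unknown) estimate.
struct GenericSketch : DynamicSolver {
    std::string Name() const override { return "GenericSketch"; }
    bool IsApplicable(const ConvProblem&) const override { return true; }
};

// Picks the applicable solver with the highest known WTI. If no applicable
// solver reports a known WTI, falls back to the first applicable one.
// Assumption: any negative WTI means "unknown".
const DynamicSolver* SelectSolver(const std::vector<const DynamicSolver*>& solvers,
                                  const ConvProblem& problem) {
    const DynamicSolver* best = nullptr;
    const DynamicSolver* first_applicable = nullptr;
    float best_wti = -1.0f;
    for (const DynamicSolver* s : solvers) {
        if (!s->IsApplicable(problem))
            continue;
        if (first_applicable == nullptr)
            first_applicable = s;
        const float wti = s->GetWti(problem);
        if (wti >= 0.0f && wti > best_wti) {
            best_wti = wti;
            best = s;
        }
    }
    return best != nullptr ? best : first_applicable;
}
```

With this shape, a solver that reports wti_unknown is never preferred over one with a real estimate, but it still remains available as a fallback, so GetWti() coverage can grow one solver at a time.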

daniellowell commented 4 years ago

I'm concerned, or rather confused, that without a WTI based on inputs it will be functionally useless.

atamazov commented 4 years ago

Me too -- that is why I am asking kernel developers to help.

aserio commented 4 years ago

This is a blocker for ROCm 4.0

aserio commented 4 years ago

The proof of concept for Winograd kernels is looking good. @atamazov to look into integrating this feature with the broader library.

atamazov commented 4 years ago

424 should be merged in first.

atamazov commented 3 years ago

Status update:

atamazov commented 3 years ago

@aserio Status updated at https://github.com/ROCmSoftwarePlatform/MIOpen/issues/410#issuecomment-683985700

daniellowell commented 3 years ago

@atamazov What is left in this task list?

atamazov commented 3 years ago

Basically nothing. All the necessary redesign (required for GetWti) is done and the minimal implementation is ready. So this specific ticket can be closed, I think.

I believe we can further improve performance by adding GetWti() implementations to some other dynamic solvers. To figure this out, I would recommend: