Closed atamazov closed 7 months ago
(10.30.30) [Future] Tunable and able to auto-tune quickly as per https://github.com/ROCmSoftwarePlatform/MIOpen/issues/378#issuecomment-682203795.
AFAIK from our data the tolerance for even 1 compile is pretty low. The kernel should already be loadable and our heuristics / model should just be a selector.
Keep in mind that what we do with Find is, hopefully, just a bridge to a better API implementation. There are still a lot of good features in Immediate Mode, so we still need a plan for making all of this work with Immediate Mode. Additionally, Find 2.0 is still on the books to be implemented.
> AFAIK from our data the tolerance for even 1 compile is pretty low. The kernel should already be loadable and our heuristics / model should just be a selector.
We already discussed that (in the meeting) and IIRC agreed that it is not physically possible to pre-compile all the dynamic kernels that comply with the minimal requirements (dynamic N, H, W; see (10.10)). On the other hand, spending a minute compiling a dozen or so kernels per network (which then runs for hours) should be okay.
Similarly, spending ~5 sec tuning a dynamic kernel (which would run in a network for a long time) looks like a wise investment. Anyway, the design is flexible enough to enable or disable this later, after some experiments.
> Similarly, spending ~5 sec for tuning a dynamic kernel (that would work in a network for a long time) looks like a wise investment.
If by this you mean that it happens online, then I completely disagree. End users will not know how to do this, and MIOpen will never know the difference between ResNet50 and Mask R-CNN.
Each developer must create `GetWti(ConvolutionContext&)` so that, given any input, we can calculate the solver's relative score on that input? I have my doubts... but let's see.
Personally, I would still rather have a data-driven classifier.
We need `GetWti()` for dynamic kernels only. Also, we do not need a WTI for every input right away. I would advise developers to begin with something that is easy to calculate (and return `wti_unknown` (-2.0) for the rest). Later we can extend the applicable domain of `GetWti()`.
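To make the "start small, return `wti_unknown` for the rest" advice concrete, a first implementation might look like the sketch below. The `ConvolutionContext` struct and the estimation logic are simplified stand-ins, not the actual MIOpen types; only the `wti_unknown` (-2.0) convention comes from the discussion above.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for MIOpen's ConvolutionContext.
struct ConvolutionContext
{
    int n, c, h, w; // batch size and tensor dimensions
};

// Sentinel meaning "WTI cannot be estimated for this input".
constexpr float wti_unknown = -2.0f;

// Start with an easy-to-calculate estimate for a narrow, well-understood
// domain and return wti_unknown for everything else; the applicable
// domain can be extended later.
float GetWti(const ConvolutionContext& ctx)
{
    const bool easy_case = (ctx.c % 8 == 0) && (ctx.h >= 16) && (ctx.w >= 16);
    if (!easy_case)
        return wti_unknown;
    // Toy estimate: larger batches utilize the GPU better (illustrative only).
    return ctx.n >= 32 ? 0.9f : 0.5f;
}
```

The point of the sentinel is that the fallback path can simply skip any solver whose estimate is `wti_unknown`, so a partial implementation is still useful.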
I'm concerned, or rather confused, that without a WTI based on inputs it will be functionally useless.
Me too -- that is why I am asking kernel developers to help.
This is a blocker for ROCm 4.0
The proof of concept for the Winograd kernels is looking good. @atamazov to look into integrating this feature with the broader library.
Status update:

- `GetWti` for the WinogradRxS2x3 solver (done in the `getwti-wino2111` branch).
- `GetWti` for GEMM (done in the `getwti-wino2111` branch).

@aserio Status updated at https://github.com/ROCmSoftwarePlatform/MIOpen/issues/410#issuecomment-683985700
@atamazov What is left in this task list?
Basically nothing. All the necessary redesign (required for GetWti) is done and the minimal implementation is ready. So this specific ticket can be closed, I think.
I believe we can further improve performance by adding `GetWti()` implementations to some other dynamic solvers. To figure this out, I would recommend identifying the relevant dynamic solvers and implementing `GetWti()` in them.
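To illustrate why more `GetWti()` implementations help: the fallback can only rank solvers whose WTI it can estimate, so each new implementation enlarges the pool of candidates. A toy selection sketch follows; the names (`SolverEstimate`, `PickBestSolver`) are hypothetical and not the actual MIOpen fallback code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sentinel meaning "WTI cannot be estimated" (per the convention above).
constexpr float wti_unknown = -2.0f;

struct SolverEstimate
{
    std::string name;
    float wti; // estimated Weighted Time Improvement
};

// Pick the solver with the highest known WTI. Solvers reporting
// wti_unknown never win, so a solver without GetWti() is effectively
// invisible to this kind of fallback ranking.
std::string PickBestSolver(const std::vector<SolverEstimate>& candidates)
{
    std::string best;
    float best_wti = wti_unknown;
    for (const auto& c : candidates)
        if (c.wti != wti_unknown && c.wti > best_wti)
        {
            best     = c.name;
            best_wti = c.wti;
        }
    return best; // empty string when no candidate could be estimated
}
```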
According to the analysis at https://github.com/ROCmSoftwarePlatform/MIOpen/issues/398#issuecomment-680970315, we can use the existing Fast Find mode. However, the Immediate Mode Fallback path needs to be improved for this purpose:

- (`group_count == 1`)
- `bool IsDynamic() const` that should return `true`.
- `float GetWti(ConvolutionContext&)` that should compute and return an approximated value of the expected WTI, or -2.0 when this value can't be computed. Tips:
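Putting the listed requirements together, a new dynamic solver would sketch out roughly as below. This is a simplified illustration under stated assumptions: the real MIOpen solver base classes and `ConvolutionContext` are much richer, and `MyDynamicSolver`, `IsApplicable`, and the placeholder estimate are invented for the example.

```cpp
#include <cassert>

// Simplified stand-in for MIOpen's ConvolutionContext.
struct ConvolutionContext
{
    int group_count;
    int n, h, w; // the dynamic dimensions
};

// Sentinel meaning "WTI cannot be estimated for this input".
constexpr float wti_unknown = -2.0f;

// Skeleton of a dynamic solver following the listed requirements.
struct MyDynamicSolver
{
    // Requirement: dynamic solvers must advertise themselves as such.
    bool IsDynamic() const { return true; }

    // Only non-grouped convolutions (group_count == 1) in this sketch.
    bool IsApplicable(const ConvolutionContext& ctx) const
    {
        return ctx.group_count == 1;
    }

    // Requirement: return an approximation of the expected WTI,
    // or wti_unknown (-2.0) when it cannot be computed.
    float GetWti(const ConvolutionContext& ctx) const
    {
        if (!IsApplicable(ctx))
            return wti_unknown;
        return 0.5f; // placeholder estimate for the easy domain
    }
};
```

Following the advice earlier in the thread, `GetWti` here covers only an easy-to-calculate domain and falls back to `wti_unknown` elsewhere; the domain can be widened incrementally.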