ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

ConvHipImplicitGemmForwardV4R4Xdlops applicability with output channel number #2284

Closed junliume closed 1 year ago

junliume commented 1 year ago

[Problem Observations] On gfx90a nodes:

$$$:/opt/rocm# MIOPEN_FIND_MODE=1 /opt/rocm/bin/MIOpenDriver convfp16 -n 4 -c 232 -H 20 -W 20 -k 336 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver convfp16 -n 4 -c 232 -H 20 -W 20 -k 336 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 64/ConvHipImplicitGemmForwardV4R4Xdlops
GPU Kernel Time Forward Conv. Elapsed: 0.024693 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 4, 232, 20, 20, 1, 1, 336,  249446400, 898304, 1075200, 10102, 80, 0.024693
Forward Convolution FAILED: 0.257789 > 0.082

$$$:/opt/rocm# MIOPEN_FIND_MODE=1 /opt/rocm/bin/MIOpenDriver convfp16 -n 4 -c 88 -H 20 -W 20 -k 336 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver convfp16 -n 4 -c 88 -H 20 -W 20 -k 336 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 64/ConvHipImplicitGemmForwardV4R4Xdlops
GPU Kernel Time Forward Conv. Elapsed: 0.013244 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 4, 88, 20, 20, 1, 1, 336,  94617600, 340736, 1075200, 7144, 107, 0.013244
Forward Convolution FAILED: 0.183936 > 0.082

$$$:/opt/rocm# MIOPEN_FIND_MODE=1 /opt/rocm/bin/MIOpenDriver convfp16 -n 1 -c 88 -H 24 -W 24 -k 336 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver convfp16 -n 1 -c 88 -H 24 -W 24 -k 336 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 64/ConvHipImplicitGemmForwardV4R4Xdlops
GPU Kernel Time Forward Conv. Elapsed: 0.010862 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 1, 88, 24, 24, 1, 1, 336,  34062336, 160512, 387072, 3136, 50, 0.010862
Forward Convolution FAILED: 0.39445 > 0.082

What these failing cases have in common is an unusual output channel count: -k 336.
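For readers cross-checking the `stats:` lines, the counters in the logs above are consistent with the usual direct-convolution arithmetic. The sketch below is an assumption inferred from the printed numbers, not taken from the MIOpenDriver source; `ConvShape`, `FlopCount`, `BytesRead`, and `BytesWritten` are hypothetical names, and 2 bytes per element is assumed for fp16.

```cpp
#include <cstdint>

// Shape fields named after the columns of the MIOpenDriver "stats:" line.
// For these 1x1, stride-1, unpadded cases the input H/W equal ho/wo.
struct ConvShape
{
    std::int64_t n, c, ho, wo, x, y, k;
};

// Assumption: one multiply-accumulate is counted as 2 FLOPs.
constexpr std::int64_t FlopCount(const ConvShape& s)
{
    return 2 * s.n * s.c * s.ho * s.wo * s.k * s.x * s.y;
}

// Assumption: the input tensor and the weights are each read once.
constexpr std::int64_t BytesRead(const ConvShape& s, std::int64_t elem_bytes)
{
    return (s.n * s.c * s.ho * s.wo + s.k * s.c * s.x * s.y) * elem_bytes;
}

// Assumption: the output tensor is written once.
constexpr std::int64_t BytesWritten(const ConvShape& s, std::int64_t elem_bytes)
{
    return s.n * s.k * s.ho * s.wo * elem_bytes;
}

// First failing case above: -n 4 -c 232 -H 20 -W 20 -k 336, fp16.
static_assert(FlopCount(ConvShape{4, 232, 20, 20, 1, 1, 336}) == 249446400, "flopCnt");
static_assert(BytesRead(ConvShape{4, 232, 20, 20, 1, 1, 336}, 2) == 898304, "bytesRead");
static_assert(BytesWritten(ConvShape{4, 232, 20, 20, 1, 1, 336}, 2) == 1075200, "bytesWritten");
```

Since the counters match the problem sizes, the logs themselves look self-consistent, and the failure is in the numerical result rather than in how the problem was described to the driver.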

[Experiments] k=128

$$$:/opt/rocm# /opt/rocm/bin/MIOpenDriver convfp16 -n 4 -c 232 -H 20 -W 20 -k 128 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -S 64
MIOpenDriver convfp16 -n 4 -c 232 -H 20 -W 20 -k 128 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -S 64
Forward Conv solutions available: 2
- id: 88 algo: 0, time: 10 ms, ws: 0, name: GemmFwd1x1_0_1
- id: 84 algo: 3, time: 20 ms, ws: 0, name: ConvBinWinogradRxSf2x3g1
Warning: Solution id (64) is not reported by the library. Trying it anyway...
MIOpen Forward Conv. Algorithm: -1, Solution: 64/ConvHipImplicitGemmForwardV4R4Xdlops
GPU Kernel Time Forward Conv. Elapsed: 0.030702 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 4, 232, 20, 20, 1, 1, 128,  95027200, 801792, 409600, 3095, 39, 0.030702
Forward Convolution Verifies OK on GPU reference (0.000249472)

k=64

$$$:/opt/rocm# /opt/rocm/bin/MIOpenDriver convfp16 -n 4 -c 232 -H 20 -W 20 -k 64 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -S 64
MIOpenDriver convfp16 -n 4 -c 232 -H 20 -W 20 -k 64 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -S 64
Forward Conv solutions available: 2
- id: 88 algo: 0, time: 10 ms, ws: 0, name: GemmFwd1x1_0_1
- id: 84 algo: 3, time: 20 ms, ws: 0, name: ConvBinWinogradRxSf2x3g1
Warning: Solution id (64) is not reported by the library. Trying it anyway...
MIOpen Forward Conv. Algorithm: -1, Solution: 64/ConvHipImplicitGemmForwardV4R4Xdlops
GPU Kernel Time Forward Conv. Elapsed: 0.023467 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 4, 232, 20, 20, 1, 1, 64,  47513600, 772096, 204800, 2025, 42, 0.023467
Forward Convolution Verifies OK on GPU reference (0.000271501)

k=32

$$$:/opt/rocm# /opt/rocm/bin/MIOpenDriver convfp16 -n 4 -c 232 -H 20 -W 20 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -S 64
MIOpenDriver convfp16 -n 4 -c 232 -H 20 -W 20 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -S 64
Forward Conv solutions available: 2
- id: 88 algo: 0, time: 10 ms, ws: 0, name: GemmFwd1x1_0_1
- id: 84 algo: 3, time: 20 ms, ws: 0, name: ConvBinWinogradRxSf2x3g1
Warning: Solution id (64) is not reported by the library. Trying it anyway...
MIOpen Forward Conv. Algorithm: -1, Solution: 64/ConvHipImplicitGemmForwardV4R4Xdlops
GPU Kernel Time Forward Conv. Elapsed: 0.022631 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 4, 232, 20, 20, 1, 1, 32,  23756800, 757248, 102400, 1050, 38, 0.022631
Forward Convolution Verifies OK on GPU reference (0.00029627)

k=16

$$$:/opt/rocm# /opt/rocm/bin/MIOpenDriver convfp16 -n 4 -c 232 -H 20 -W 20 -k 16 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -S 64
MIOpenDriver convfp16 -n 4 -c 232 -H 20 -W 20 -k 16 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -S 64
Forward Conv solutions available: 2
- id: 88 algo: 0, time: 10 ms, ws: 0, name: GemmFwd1x1_0_1
- id: 84 algo: 3, time: 20 ms, ws: 0, name: ConvBinWinogradRxSf2x3g1
Warning: Solution id (64) is not reported by the library. Trying it anyway...
MIOpen Forward Conv. Algorithm: -1, Solution: 64/ConvHipImplicitGemmForwardV4R4Xdlops
GPU Kernel Time Forward Conv. Elapsed: 0.019520 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 4, 232, 20, 20, 1, 1, 16,  11878400, 749824, 51200, 609, 41, 0.019520
Forward Convolution FAILED: 0.261389 > 0.0082

Unless k is very small (e.g. 16 in the case above), the solver usually passes verification when k is a power of two (k == 2^n).

@zjing14 @asroy @atamazov @JehandadKhan : should we explicitly examine ConvHipImplicitGemmForwardV4R4Xdlops applicability with regard to what format of k is required?

CC: @averinevg @DrizztDoUrden

junliume commented 1 year ago

Proposal is something similar to the following?

    const auto k = ProblemInterpreter::GetOutputChannelK(problem);
    if(k % GetEPackLength(ctx, problem, false) != 0)
        return false;

atamazov commented 1 year ago

@junliume @zjing14 @JehandadKhan A couple of ideas:

Unfortunately, I don't have an MI200/MI100 on hand and am thus unable to check these hypotheses myself.

atamazov commented 1 year ago

@junliume

should we explicitly examine ConvHipImplicitGemmForwardV4R4Xdlops applicability with regard to what format of k is required?

Of course, bugs in the solver are also possible, and this is in fact the first hypothesis that comes to mind. The best assignee for this work is an engineer who is fully aware of the kernel design. The problem is that this is difficult and time-consuming work.

/cc @zjing14 @asroy @JehandadKhan

junliume commented 1 year ago

@atamazov a correction: in the observations above, numerical verification passes when k % 32 == 0, and the root cause might be related to the basic tile sizes and shapes that CK uses with xdlops. @zjing14 will propose a patch soon.
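As a quick sanity check, the two candidate rules (k is a power of two, versus k is a multiple of 32) can be compared against the five results reported in the first comment. This is only a sketch over the data in this thread; `Observation`, `kObserved`, and `Mismatches` are hypothetical names, and neither predicate is the actual `IsApplicable()` logic. It needs C++14 or later for the loop inside a `constexpr` function.

```cpp
// Verification outcomes reported in this issue (gfx90a, fp16, 1x1 convolution).
struct Observation
{
    int k;         // output channel count passed as -k
    bool verified; // true if the driver printed "Verifies OK"
};

constexpr Observation kObserved[] = {
    {336, false}, {128, true}, {64, true}, {32, true}, {16, false}};

constexpr bool IsPowerOfTwo(int k) { return k > 0 && (k & (k - 1)) == 0; }
constexpr bool IsMultipleOf32(int k) { return k % 32 == 0; }

// Counts observations whose outcome the candidate rule predicts wrongly.
constexpr int Mismatches(bool (*rule)(int))
{
    int wrong = 0;
    for(int i = 0; i < 5; ++i)
        if(rule(kObserved[i].k) != kObserved[i].verified)
            ++wrong;
    return wrong;
}

// k = 16 is a power of two but failed verification.
static_assert(Mismatches(IsPowerOfTwo) == 1, "power-of-two rule misses k = 16");
// 336 % 32 == 16 and 16 % 32 == 16, so both failing cases are rejected.
static_assert(Mismatches(IsMultipleOf32) == 0, "k % 32 == 0 fits all five runs");
```

Five data points cannot establish the rule, but they do show that k % 32 == 0 explains the k = 16 failure that the power-of-two reading had to treat as an exception.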

atamazov commented 1 year ago

@junliume Thanks, I see this K % 32 == 0 thing in the topmost comment. If the solver developers confirm that IsApplicable() must be fixed for FP16 (and possibly for BF16), then we are fine. If not, then I would recommend the experiments listed at https://github.com/ROCmSoftwarePlatform/MIOpen/issues/2284#issuecomment-1659214998 (these are not expected to be time-consuming).

atamazov commented 1 year ago

@zjing14 🚀 Thanks for #2297! Do you have time to continue the investigation? Or would it be better to assign another engineer?

@junliume @JehandadKhan I recommend lowering urgency (maybe to https://github.com/ROCmSoftwarePlatform/MIOpen/labels/urgency_normal).