TianZerL / Anime4KCPP

A high performance anime upscaler
MIT License

Better AMD GPUs support through ROCm/HIP #115

Open GZGavinZhao opened 6 months ago

GZGavinZhao commented 6 months ago

Thanks for the review! I'll address them once I fix the performance issue. I have some very bad benchmark results here:

> ./build/bin/Anime4KCPP_CLI -B
Benchmark test under 8-bit integer input and serial processing...

CPU score:
 DVD(480P->960P): 71.4286 FPS
 HD(720P->1440P): 41.0959 FPS
 FHD(1080P->2160P): 18.2927 FPS

OpenCL score: (pID = 0, dID = 0)
 DVD(480P->960P): 1000 FPS
 HD(720P->1440P): 333.333 FPS
 FHD(1080P->2160P): 166.667 FPS

CUDA score: (dID = 0)
 DVD(480P->960P): 62.5 FPS
 HD(720P->1440P): 24.3902 FPS
 FHD(1080P->2160P): 11.1111 FPS

This benchmark was run on an AMD Radeon RX Vega 64 (gfx900). I reproduced a similar result on an AMD Radeon RX6600M (gfx1032). The build command I used was cmake -GNinja -B build -S . -DCMAKE_BUILD_TYPE=Release -DEnable_HIP=ON -DEnable_OpenCL=ON -DMaximum_Optimization=ON. The ROCm version is 5.5.1.

There's no way that ROCm should run this much slower than OpenCL. I'll continue to investigate. The HIP code is an automatic translation from CUDA using the hipify-perl tool, so I don't know whether the translation itself could be the problem.

GZGavinZhao commented 6 months ago

Fortunately, I think the benchmark results are misleading. I did a real-world test by upscaling a 4-minute 1080P episode (One Room Season 3 Episode 1). The flags used were -q -C avc1 -t 16 -T 16 -x -X -M <cuda|opencl>. Total processing time was 4.39018 minutes with the OpenCL backend and 3.26867 minutes with the ROCm backend.

I profiled the benchmark and saw that the majority of the time is spent in hipStreamCreate and hipStreamSynchronize. I think what happened is that, for a single image, ROCm performed badly because of stream overhead (does this also show up when comparing the CUDA and OpenCL backends?), whereas for video processing the streams become a benefit, perhaps due to better parallelization.

TianZerL commented 6 months ago

The creation and destruction of streams on CUDA should be low cost. I am using a dynamic "stream" on CUDA, which creates and destroys a "stream" for each processing call and keeps the code simpler. Maybe it is better to use a static "stream" on ROCm.
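To illustrate the difference between the two approaches, here is a minimal sketch, not the project's actual code. MockStream, streamCreations, processDynamic and processStatic are hypothetical names; MockStream only stands in for a real stream handle so the sketch builds without ROCm. In real HIP code its constructor would call hipStreamCreate, its destructor hipStreamDestroy, and synchronize() would call hipStreamSynchronize.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical counter of simulated stream creations (illustration only).
inline std::size_t& streamCreations() {
    static std::size_t count = 0;
    return count;
}

// Hypothetical stand-in for a GPU stream handle.
// Real HIP code would wrap hipStreamCreate / hipStreamDestroy here.
struct MockStream {
    MockStream() { ++streamCreations(); }
    void synchronize() const {}  // stands in for hipStreamSynchronize
};

// "Dynamic" variant: a new stream is created and destroyed on every call.
inline void processDynamic() {
    MockStream stream;
    // ... enqueue copies and kernels on `stream` ...
    stream.synchronize();
}

// "Static" variant: one stream per host thread, created on first use
// and reused by every later call on that thread.
inline void processStatic() {
    thread_local MockStream stream;
    // ... enqueue copies and kernels on `stream` ...
    stream.synchronize();
}
```

With this sketch, five calls to processDynamic() create five streams, while five calls to processStatic() on the same thread create only one. Using thread_local rather than a single global stream is an assumption made here so that each worker thread keeps its own stream.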

There is actually some "warm up" before benchmarking, which makes the CUDA result look normal.
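For readers unfamiliar with the idea, a warm-up phase can be sketched roughly as follows. This is a generic illustration, not Anime4KCPP's benchmark code; benchmarkFps and its parameters are hypothetical names. The untimed warm-up calls absorb one-time costs (for example runtime initialization or first-use allocation), so that only steady-state calls are measured.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: runs `work` `warmup` times without timing it,
// then `iterations` times under a timer, and returns the average
// number of timed calls per second.
inline double benchmarkFps(const std::function<void()>& work,
                           int warmup, int iterations) {
    // Untimed warm-up: one-time setup costs land here.
    for (int i = 0; i < warmup; ++i) work();

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    return iterations / elapsed.count();
}
```

A warm-up only hides costs that are paid once. If a stream is created and destroyed inside every call to `work`, that cost is paid again in each timed iteration and still shows up in the result.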