AmusementClub / vs-mlrt

Efficient CPU/GPU ML Runtimes for VapourSynth (with built-in support for waifu2x, DPIR, RealESRGANv2/v3, Real-CUGAN, RIFE, SCUNet and more!)
GNU General Public License v3.0
294 stars · 20 forks

fp16 i/o #26

Closed Ironclad17 closed 1 year ago

Ironclad17 commented 1 year ago

I just wanted to make sure no one else had this issue, but the example code in release v13 caused me some problems. First, the input for RIFE is `src`, while the previous line's output is `padded`. Second, the output of RIFE is `flt`, but `src_width` & `src_height` are set to `src.width` & `src.height` — probably fine for RIFE, but with ESRGAN or other upscalers this leads to a cropped image. I think you want them equal to `flt.width` & `flt.height`. I don't really understand why the resize lines have 3 inputs. For ESRGAN, would this provide similar savings on memory bandwidth?

tw = (src.width + 31) // 32 * 32
th = (src.height + 31) // 32 * 32
padded = core.resize.Point(src, format=vs.RGBH, matrix_in=1, src_width=tw, src_height=th)
flt = RealESRGAN(padded, model=RealESRGANModel.animevideov3, backend=Backend.TRT(fp16=True, device_id=0, num_streams=2))
res = core.resize.Point(flt, format=vs.YUV420P8, matrix=1, src_width=flt.width, src_height=flt.height)
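(For what it's worth, the bandwidth saving from fp16 i/o is independent of the network: halving the bytes per sample halves the per-frame PCIe payload for any filter. A rough back-of-the-envelope sketch — plain arithmetic, not vs-mlrt API:)

```python
# Approximate host<->GPU payload per frame for a planar RGB clip.
# Illustrative only; actual transfers may include padding/alignment overhead.
def frame_bytes(width, height, planes=3, bytes_per_sample=4):
    return width * height * planes * bytes_per_sample

fp32 = frame_bytes(1920, 1080, bytes_per_sample=4)  # RGBS: ~23.7 MiB/frame
fp16 = frame_bytes(1920, 1080, bytes_per_sample=2)  # RGBH: ~11.9 MiB/frame
print(fp32, fp16)  # fp16 i/o moves exactly half the data
```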

AkarinVS commented 1 year ago

That example was meant for RIFE. For super-resolution filters, you need to use these:

th = (src.height + 31) // 32 * 32  # adjust 32 and 31 to match the specific AI network's input resolution requirements.
tw = (src.width  + 31) // 32 * 32  # same.
padded = src.resize.Bicubic(tw, th, format=vs.RGBS if WANT_FP32 else vs.RGBH, matrix_in_s="709", src_width=tw, src_height=th)
flt = vsmlrt.RIFE(padded, model=RIFEModel.v4_6, backend=backend, output_format=1)  # fp16 output; note the input is the padded clip, not src
oh = src.height * (flt.height // th)  # not necessary for RIFE (i.e. oh = src.height), but required for super-resolution upscalers.
ow = src.width  * (flt.width  // tw)
res = flt.resize.Bicubic(ow, oh, format=vs.YUV420P8, matrix_s="709", src_width=ow, src_height=oh)
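(To make the `oh`/`ow` arithmetic concrete: dividing the filtered size by the padded size recovers the model's integer scale factor, and multiplying the *unpadded* source size by that factor gives the output size with the padding cropped back out. A pure-Python sketch — helper names like `pad_to_multiple` are mine, not part of vs-mlrt:)

```python
def pad_to_multiple(n, m=32):
    """Round n up to the next multiple of m (the network's input granularity)."""
    return (n + m - 1) // m * m

def output_size(src_w, src_h, scale, m=32):
    # Pad the input so both dimensions are multiples of m.
    tw, th = pad_to_multiple(src_w, m), pad_to_multiple(src_h, m)
    # The model upscales the padded frame; flt.width // tw recovers the scale,
    # and multiplying by the unpadded source size drops the padded border.
    flt_w, flt_h = tw * scale, th * scale
    return src_w * (flt_w // tw), src_h * (flt_h // th)

# 1920x1080 through a 4x upscaler: padded to 1920x1088 on input,
# cropped back to 7680x4320 on output.
print(output_size(1920, 1080, 4))  # (7680, 4320)
```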

Release note updated.

I believe fp16 i/o is only useful if your application satisfies one or (ideally) more of these conditions:

  1. the network is not very computationally intensive (RIFE is a prime example; see https://github.com/AmusementClub/vs-mlrt/wiki/NVIDIA-GeForce-RTX-4090 for some concrete performance numbers).
  2. your GPU is very high-end, so the computation-to-communication (time) ratio is even lower.
  3. the data transfer is significant. This can be due to the way the network is designed (again, RIFE is a prime example: it takes 11 input planes for every single inference, whereas usual filters like waifu2x and esrgan only take 3) or due to very high resolution (generally 4K or above).
  4. your application framework is not able to issue concurrent requests. (Usual VS won't be affected by this; it mainly affects mpv, which is fundamentally a sequential processing pipeline w.r.t. VS, so there are almost no concurrent requests and thus no way to hide the PCIe transfer delays. Profiling showed that mpv can barely keep the GPU 50% loaded, and that includes fp32 data transfer time! If it were able to issue 2 or 3 concurrent requests (num_streams), the computation should be able to almost fully hide the PCIe transfer time.)
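(A toy model of point 4, to show why `num_streams >= 2` hides the transfer: with one stream, each frame pays transfer + compute in sequence; with two or more, the PCIe transfer of one frame overlaps the GPU compute of another, so the steady-state cost per frame is only the slower of the two stages. Numbers below are illustrative, not measured:)

```python
# Steady-state per-frame cost under a simple two-stage pipeline model.
def frame_time(transfer_ms, compute_ms, num_streams):
    if num_streams <= 1:
        # Sequential: the GPU idles while data crosses PCIe.
        return transfer_ms + compute_ms
    # Pipelined: transfer and compute overlap; the slower stage dominates.
    return max(transfer_ms, compute_ms)

print(frame_time(4.0, 4.0, 1))  # 8.0 ms/frame -> GPU only ~50% busy
print(frame_time(4.0, 4.0, 2))  # 4.0 ms/frame -> transfer fully hidden
```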