HolyWu / vs-rife

RIFE function for VapourSynth

TRT optimization pass will fail in UHD at high scales and Ensemble=True #29

Closed: Samhayne closed this issue 11 months ago

Samhayne commented 11 months ago

As we know, RIFE doesn't handle patterns well. I'm using high scales, even in UHD, to fight the distorted/disordered patterns that you would otherwise get in some interpolated frames.

But I can't get the TRT optimization pass to run successfully with scale=4 and ensemble=True in UHD.

Will succeed:

RIFE(clip, model='4.4', scale=4, num_streams=1, sc=True, sc_threshold=0.12, ensemble=False, trt=True)

Will fail:

RIFE(clip, model='4.4', scale=4, num_streams=1, sc=True, sc_threshold=0.12, ensemble=True, trt=True)

I can only guess that GPU memory is running out? On my old Nvidia 1070 (8 GB), TRT optimization already failed in UHD with scale=1. Now with my Nvidia 4090 (24 GB), TRT optimization fails with scale=4 and ensemble=True (for all RIFE models).
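To confirm that suspicion, free VRAM can be checked right before the RIFE call. A minimal diagnostic sketch using PyTorch's torch.cuda.mem_get_info (not part of the repro below, just something one could add):

import torch

# (free_bytes, total_bytes) for the current CUDA device
free, total = torch.cuda.mem_get_info()
print(f"VRAM free: {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB")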

The script...

import os, sys

# CUDA_MODULE_LOADING must be set before torch initializes CUDA to have any effect
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

import vapoursynth as vs
import torch
from vsrife import RIFE

core = vs.core

sys.path.append(r"P:\_Apps\StaxRip\StaxRip 2.29.0-x64 (VapourSynth)\Apps\Plugins\VS\Scripts")
core.std.LoadPlugin(r"P:\_Apps\StaxRip\StaxRip 2.29.0-x64 (VapourSynth)\Apps\Plugins\Dual\L-SMASH-Works\LSMASHSource.dll", altsearchpath=True)
clip = core.lsmas.LibavSMASHSource(r"T:\Test\tst.mp4")

# RIFE expects RGB input; convert to half-precision RGB
clip = core.resize.Bicubic(clip, format=vs.RGBH, matrix_in_s="709")

# RIFE calls with different trt_max_workspace_size values that didn't help:
#clip = RIFE(clip, model='4.4', scale=4, num_streams=1, sc=True, sc_threshold=0.12, ensemble=True, trt=True, trt_max_workspace_size=536870912)
#clip = RIFE(clip, model='4.4', scale=4, num_streams=1, sc=True, sc_threshold=0.12, ensemble=True, trt=True, trt_max_workspace_size=4294967296)

clip = RIFE(clip, model='4.4', scale=4, num_streams=1, sc=True, sc_threshold=0.12, ensemble=True, trt=True)
clip = core.resize.Bicubic(clip, format=vs.YUV420P8, matrix_s="709")
clip.set_output()

...will produce lots of these warnings while processing...

...
[10/01/2023-11:53:31] [TRT] [E] 2: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[10/01/2023-11:53:31] [TRT] [E] 2: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[10/01/2023-11:53:31] [TRT] [W] Requested amount of GPU memory (8589934592 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
...

...and eventually end with the error output:

Python exception:

Traceback (most recent call last):
  File "src\cython\vapoursynth.pyx", line 2866, in vapoursynth._vpy_evaluate
  File "src\cython\vapoursynth.pyx", line 2867, in vapoursynth._vpy_evaluate
  File "T:\Test\tst.vpy", line 12, in <module>
    clip = RIFE(clip, model='4.4', scale=4, num_streams=1, sc=True, sc_threshold=0.12, ensemble=True, trt=True)
  File "D:\Python\Python310\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\Python\Python310\lib\site-packages\vsrife\__init__.py", line 219, in RIFE
    flownet = lowerer(
  File "D:\Python\Python310\lib\site-packages\torch_tensorrt\fx\lower.py", line 316, in __call__
    return do_lower(module, inputs)
  File "D:\Python\Python310\lib\site-packages\torch_tensorrt\fx\passes\pass_utils.py", line 118, in pass_with_validation
    processed_module = pass_(module, input, *args, **kwargs)
  File "D:\Python\Python310\lib\site-packages\torch_tensorrt\fx\lower.py", line 313, in do_lower
    lower_result = pm(module)
  File "D:\Python\Python310\lib\site-packages\torch\fx\passes\pass_manager.py", line 238, in __call__
    out = _pass(out)
  File "D:\Python\Python310\lib\site-packages\torch\fx\passes\pass_manager.py", line 238, in __call__
    out = _pass(out)
  File "D:\Python\Python310\lib\site-packages\torch_tensorrt\fx\passes\lower_pass_manager_builder.py", line 202, in lower_func
    lowered_module = self._lower_func(
  File "D:\Python\Python310\lib\site-packages\torch_tensorrt\fx\lower.py", line 178, in lower_pass
    interp_res: TRTInterpreterResult = interpreter(mod, input, module_name)
  File "D:\Python\Python310\lib\site-packages\torch_tensorrt\fx\lower.py", line 130, in __call__
    interp_result: TRTInterpreterResult = interpreter.run(
  File "D:\Python\Python310\lib\site-packages\torch_tensorrt\fx\fx2trt.py", line 252, in run
    assert engine
AssertionError

I tried 0.25x/0.5x/2x/4x the default workspace size, but it didn't help.

HolyWu commented 11 months ago

The engine cannot be built because it runs out of VRAM, as the warnings you already see while processing indicate. scale=4 makes the processing resolution 4x larger than with scale=1 and requires more VRAM for obvious reasons.
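A rough back-of-the-envelope, assuming scale multiplies each spatial dimension of the processing resolution (my reading of the parameter, not verified against the source):

w, h = 3840, 2160          # UHD input
scale = 4                  # assumed to multiply each spatial dimension
pixels = (w * scale) * (h * scale)   # 15360 * 8640 = 132,710,400 px, 16x the pixels of scale=1
fp16_rgb = pixels * 3 * 2            # one FP16 RGB tensor at that size ≈ 0.74 GiB
print(f"{pixels:,} px, {fp16_rgb / 2**30:.2f} GiB per tensor")

Multiply that by the many intermediate activations the flow network keeps around, and the 8589934592-byte (exactly 8 GiB) allocation that fails in the warnings above becomes plausible.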

Samhayne commented 11 months ago

@HolyWu Thanks for your reply! Shouldn't changing the workspace size reduce the VRAM requirement? On the other hand, I read somewhere that the setting is deprecated?

HolyWu commented 11 months ago

max_workspace_size probably restricts VRAM usage at execution time, not during engine building. I'm not sure.
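For reference, newer TensorRT versions do deprecate max_workspace_size in favour of memory pool limits on the builder config. As far as I can tell, that pool only caps the scratch memory that tactics may use, while activation memory for the network's tensors is budgeted separately, which would explain why changing trt_max_workspace_size doesn't help here. A minimal sketch against the standalone TensorRT Python API (TensorRT >= 8.4; note this is not the torch_tensorrt FX path that vsrife goes through):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Caps only the scratch pool available to tactics during engine building;
# activation memory for the network's tensors is allocated on top of this.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)  # 4 GiB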