HolyWu / vs-grlir

GRLIR function for VapourSynth
MIT License
15 stars 0 forks

Running out of GPU memory #2

Closed AIisCool closed 1 year ago

AIisCool commented 1 year ago

First off thank you for porting this!

I've tried setting the tiling as low as possible but find that my 3090 with 24GB VRAM is running out of memory on a short clip of only 50 frames.

Is there anything else I can do or is this just how it is? Also will you be adding the other models in the future?

HolyWu commented 1 year ago

What's the exact tile size you tried? On my 3050 with 8GB VRAM I have to use tile_w=480, tile_h=270 on a 1920x1080 clip.

The other models are much less useful/attractive to me, though, especially the denoising/JPEG models, which are only trained for specific noise sigmas or quality factors and cannot take arbitrary values.

AIisCool commented 1 year ago

I've tried tile_w=180, tile_h=144 for a 720x576 clip, and lower at tile_w=90, tile_h=72, both with tile_pad=64. Btw, what's a good value for the tile padding? I can see the tile seams in the video.

I also notice high CPU usage (the CPU is an AMD Ryzen 5950X). It took about 45 seconds to load the first frame and, for some reason, nearly 2 minutes to load the second frame while advancing the preview.

Is it my script perhaps?

import vapoursynth as vs

core = vs.core

clip = core.ffms2.Source(r'C:/Users/User/Desktop/test/input.mp4')

# upsample chroma to 4:4:4 first (point resize), then convert to RGB
clip = core.resize.Point(clip=clip, format=vs.YUV444PS, matrix_in_s="470bg")

clip = core.resize.Spline64(clip=clip, format=vs.RGBS, matrix_in_s="470bg")

from vsgrlir import grlir
clip = grlir(clip=clip, num_streams=30, device_index=0, tile_w=180, tile_h=144, tile_pad=64)

clip = core.resize.Spline64(clip=clip, format=vs.YUV444P16, matrix_s="470bg")

clip.set_output()

As for the denoising/JPEG models, that's too bad. I had hoped they were good enough to help clear away some of the artifacts in my input videos.

HolyWu commented 1 year ago

It's probably insane to specify num_streams=30. You should begin with num_streams=1 and increase it one step at a time until the FPS stops improving or you hit an OOM.

AIisCool commented 1 year ago

I guess I assumed more streams would mean faster processing. So far I've tried num_streams=1 and worked my way up, but it still appears extremely slow, and when it does manage to load, I hit the memory error.

Error on frame 2 request:
CUDA out of memory. Tried to allocate 588.00 MiB (GPU 0; 24.00 GiB total capacity; 21.20 GiB already allocated; 0 bytes free; 22.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
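As the message itself hints, the caching allocator can be tuned via `PYTORCH_CUDA_ALLOC_CONF`. A minimal sketch, which must run before `torch` is imported; the 512 value is an assumed starting point, and this only mitigates fragmentation, it cannot create more VRAM:

```python
import os

# Limits the size of blocks the caching allocator will split, which can
# reduce fragmentation-induced OOMs. Must be set before `import torch`.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"  # 512 MiB is an assumed starting point
```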

I'm really unsure what the issue could be on my end; if you can use it without any issues on 8GB of VRAM, surely it should run faster and without issues for me.

Thank you for your assistance. It's quite frustrating not to be able to get this working as smoothly as it apparently should.

styler00dollar commented 1 year ago

Transformers require a lot of memory and are slow. This port also uses fp32 and no TensorRT, which makes it extra slow. Use a low tiling resolution and a low number of streams, and accept that it is slow. Your padding is also big, which means many tiles to process.
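The padding cost is easy to quantify: each tile is processed at roughly (tile_w + 2·tile_pad) × (tile_h + 2·tile_pad), so a large pad multiplies the work and memory per tile. A back-of-envelope sketch, assuming the filter walks the frame in tile_w × tile_h steps and pads each tile on every side (vsgrlir's exact tiling logic may differ):

```python
import math

# Rough tile accounting: number of tiles, and how much larger the padded
# area per tile is compared to the tile's nominal area.
def tile_stats(width, height, tile_w, tile_h, tile_pad):
    cols = math.ceil(width / tile_w)
    rows = math.ceil(height / tile_h)
    padded_area = (tile_w + 2 * tile_pad) * (tile_h + 2 * tile_pad)
    overhead = padded_area / (tile_w * tile_h)
    return cols * rows, overhead

# The 720x576 source with tile_w=180, tile_h=144, tile_pad=64 from above:
tiles, overhead = tile_stats(720, 576, 180, 144, 64)  # 16 tiles, ~3.2x area each
```

With those settings each of the 16 tiles processes about 3.2× its nominal area, which is why shrinking the padding (or the tiles) helps both speed and memory.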

AIisCool commented 1 year ago

@styler00dollar Alright, I guess it is what it is. A shame it's not really usable as-is, at least for me, since the results look really nice. Should I update PyTorch and CUDA? Would that help at all? I'm using 1.13.1+cu117.

Selur commented 1 year ago

When using:

# Imports
import vapoursynth as vs
# getting Vapoursynth core
core = vs.core
import site
import os
import ctypes
# Adding torch dependencies to PATH
path = site.getsitepackages()[0]+'/torch_dependencies/bin/'
ctypes.windll.kernel32.SetDllDirectoryW(path)
path = path.replace('\\', '/')
os.environ["PATH"] = path + os.pathsep + os.environ["PATH"]
os.environ["CUDA_MODULE_LOADING"] = "LAZY"
# Loading Plugins
core.std.LoadPlugin(path="i:/Hybrid/64bit/vsfilters/Support/fmtconv.dll")
core.std.LoadPlugin(path="i:/Hybrid/64bit/vsfilters/DeinterlaceFilter/TIVTC/libtivtc.dll")
core.std.LoadPlugin(path="i:/Hybrid/64bit/vsfilters/SourceFilter/DGDecNV/DGDecodeNV.dll")
# source: 'C:\Users\Selur\Desktop\VTS_03_2_scene_horses.ts'
# current color space: YUV420P8, bit depth: 8, resolution: 720x576, fps: 25, color matrix: 470bg, yuv luminance scale: limited, scanorder: top field first
# Loading C:\Users\Selur\Desktop\VTS_03_2_scene_horses.ts using DGSource
clip = core.dgdecodenv.DGSource("G:/Output/2023-04-21@15_23_35_6110.dgi")# 25 fps, scanorder: top field first
# Setting detected color matrix (470bg).
clip = core.std.SetFrameProps(clip, _Matrix=5)
# Setting color transfer info (470bg), when it is not set
clip = clip if not core.text.FrameProps(clip,'_Transfer') else core.std.SetFrameProps(clip, _Transfer=5)
# Setting color primaries info (BT.601 NTSC), when it is not set
clip = clip if not core.text.FrameProps(clip,'_Primaries') else core.std.SetFrameProps(clip, _Primaries=5)
# Setting color range to TV (limited) range.
clip = core.std.SetFrameProp(clip=clip, prop="_ColorRange", intval=1)
# making sure frame rate is set to 25
clip = core.std.AssumeFPS(clip=clip, fpsnum=25, fpsden=1)
clip = core.std.SetFrameProp(clip=clip, prop="_FieldBased", intval=2) # tff
clip = core.tivtc.TFM(clip=clip)
# cropping the video to 692x572
clip = core.std.CropRel(clip=clip, left=12, right=16, top=4, bottom=0)
from vsgrlir import grlir
# adjusting color space from YUV420P8 to RGBS for VsGRLIR
clip = core.resize.Bicubic(clip=clip, format=vs.RGBS, matrix_in_s="470bg", range_s="limited")
# resizing using GRLIR
clip = grlir(clip=clip) # 2768x2288
# resizing 2768x2288 to 1920x1488
# adjusting resizing
clip = core.fmtc.resample(clip=clip, w=1920, h=1488, kernel="lanczos", interlaced=False, interlacedd=False)
# adjusting output color from: RGBS to YUV420P10 for x265Model
clip = core.resize.Bicubic(clip=clip, format=vs.YUV420P10, matrix_s="470bg", range_s="limited", dither_type="error_diffusion")
# set output frame rate to 25fps (progressive)
clip = core.std.AssumeFPS(clip=clip, fpsnum=25, fpsden=1)
# Output
clip.set_output()

with a GeForce RTX 4080 (16GB VRAM) and calling:

VSPipe.exe "C:\Users\Selur\Desktop\test_2.vpy" -c y4m --progress g:\test.y4m

I get:

Script evaluation done in 1.62 seconds
Error: Failed to retrieve frame 29 with error: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
Output 61 frames in 344.72 seconds (0.18 fps)

When using:

tile_w=144, tile_h=128 => max. 5.5 GB VRAM usage, 0.12 fps (no issues)
tile_w=288, tile_h=128 => max. 6.7 GB VRAM usage, 0.13 fps (no issues)
tile_w=288, tile_h=572 => max. 9.3 GB VRAM usage, 0.18 fps (no issues)
tile_w=346, tile_h=572 => max. 10.3 GB VRAM usage, 0.20 fps (no issues)
tile_w=692, tile_h=572 => max. 15.7 GB VRAM usage, 0.20 fps

As expected, the last one crashed, though earlier than when no tiling is used, with:

Error: Failed to retrieve frame 25 with error: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
Output 57 frames in 281.97 seconds (0.20 fps)

=> GRLIR really is a memory hog

Selur commented 1 year ago

Okay, when I pad the source to mod32 before applying GRLIR:

# Imports
import vapoursynth as vs
# getting Vapoursynth core
core = vs.core
import site
import os
import ctypes
# Adding torch dependencies to PATH
path = site.getsitepackages()[0]+'/torch_dependencies/bin/'
ctypes.windll.kernel32.SetDllDirectoryW(path)
path = path.replace('\\', '/')
os.environ["PATH"] = path + os.pathsep + os.environ["PATH"]
os.environ["CUDA_MODULE_LOADING"] = "LAZY"
# Loading Plugins
core.std.LoadPlugin(path="i:/Hybrid/64bit/vsfilters/Support/fmtconv.dll")
core.std.LoadPlugin(path="i:/Hybrid/64bit/vsfilters/DeinterlaceFilter/TIVTC/libtivtc.dll")
core.std.LoadPlugin(path="i:/Hybrid/64bit/vsfilters/SourceFilter/DGDecNV/DGDecodeNV.dll")
# source: 'C:\Users\Selur\Desktop\VTS_03_2_scene_horses.ts'
# current color space: YUV420P8, bit depth: 8, resolution: 720x576, fps: 25, color matrix: 470bg, yuv luminance scale: limited, scanorder: top field first
# Loading C:\Users\Selur\Desktop\VTS_03_2_scene_horses.ts using DGSource
clip = core.dgdecodenv.DGSource("G:/Output/2023-04-21@16_48_20_4610.dgi")# 25 fps, scanorder: top field first
# Setting detected color matrix (470bg).
clip = core.std.SetFrameProps(clip, _Matrix=5)
# Setting color transfer info (470bg), when it is not set
clip = clip if not core.text.FrameProps(clip,'_Transfer') else core.std.SetFrameProps(clip, _Transfer=5)
# Setting color primaries info (BT.601 NTSC), when it is not set
clip = clip if not core.text.FrameProps(clip,'_Primaries') else core.std.SetFrameProps(clip, _Primaries=5)
# Setting color range to TV (limited) range.
clip = core.std.SetFrameProp(clip=clip, prop="_ColorRange", intval=1)
# making sure frame rate is set to 25
clip = core.std.AssumeFPS(clip=clip, fpsnum=25, fpsden=1)
clip = core.std.SetFrameProp(clip=clip, prop="_FieldBased", intval=2) # tff
clip = core.tivtc.TFM(clip=clip)
# cropping the video to 694x574
clip = core.std.CropRel(clip=clip, left=10, right=16, top=2, bottom=0)
from vsgrlir import grlir
clip = core.std.AddBorders(clip=clip, left=10, right=12, top=14, bottom=16) # add borders to achieve mod 32 (VsGRLIR) - 716x604
# adjusting color space from YUV420P8 to RGBS for VsGRLIR
clip = core.resize.Bicubic(clip=clip, format=vs.RGBS, matrix_in_s="470bg", range_s="limited")
# resizing using GRLIR
clip = grlir(clip=clip) # 2864x2416
# resizing 2864x2416 to 1920x1490
clip = core.std.CropRel(clip=clip, left=40, right=48, top=56, bottom=64) # removing borders (VsGRLIR) -  2776x2296
# adjusting resizing
clip = core.fmtc.resample(clip=clip, w=1920, h=1490, kernel="lanczos", interlaced=False, interlacedd=False)
# adjusting output color from: RGBS to YUV420P10 for x265Model
clip = core.resize.Bicubic(clip=clip, format=vs.YUV420P10, matrix_s="470bg", range_s="limited", dither_type="error_diffusion")
# set output frame rate to 25fps (progressive)
clip = core.std.AssumeFPS(clip=clip, fpsnum=25, fpsden=1)
# Output
clip.set_output()

encoding works fine and only 15.1GB of VRAM is used. => Any idea why increasing the size to mod32 lowers VRAM usage and stops the encoding from crashing? (mod16 seems to be enough and uses only 14.0GB of VRAM.)
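The manual border arithmetic above can be wrapped in a small helper that computes the minimal padding up to the next multiple of mod. Names here are illustrative, not part of vsgrlir; note that the script above pads beyond this minimum, and that for subsampled YUV formats the borders must additionally be even:

```python
# Compute (left, right, top, bottom) borders that pad width x height up to
# the next multiples of `mod`, split as evenly as parity allows.
def mod_borders(width, height, mod=32):
    pad_w = (-width) % mod
    pad_h = (-height) % mod
    left, top = pad_w // 2, pad_h // 2
    return left, pad_w - left, top, pad_h - top

# The cropped 694x574 frame above needs only (5, 5, 1, 1) to reach
# 704x576 at mod 32.
borders = mod_borders(694, 574, 32)
```

After filtering, the same numbers (scaled by the upscale factor) tell you how much to crop off again with std.CropRel.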