HolyWu / vs-rife

RIFE function for VapourSynth
MIT License

Potential Speedup with number of threads #37

Closed jensdraht1999 closed 10 months ago

jensdraht1999 commented 10 months ago

@HolyWu

I commented out:

#if num_streams > vs.core.num_threads:
#    raise vs.Error("rife: setting num_streams greater than `core.num_threads` is useless")

and then set the number of streams to 24. That gave me 125-135 frames per second, instead of 115 frames per second with 12 streams. If I set it to 25 or 26 streams, it gets very slow, around 55 fps, because of the new CUDA fallback policy; without that policy the program would literally end instead of running slowly.

My hardware: i5-10500H laptop with 6 cores / 12 threads. RTX 3060 with 6144 MB.

The video I tested (just the first minute): a 720p anime video.

Suggestion:

I will try to find out, with the help of AI coding tools, whether we can slowly increase the number of streams up to the point where the VRAM is full with a 10% margin. That way we might get the biggest boost. I do not think we should set a default num_streams value, or we should at least set it to 1, because the best num_streams value may change with the resolution of the video.
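A rough sketch of how such a VRAM-budgeted choice could look, in plain Python. This is not part of vs-rife: the function name `pick_num_streams` and the per-stream memory figure are hypothetical, and the numbers in the example are placeholders, not measurements. In practice the free and total byte counts could come from something like `torch.cuda.mem_get_info()`, and the per-stream cost would have to be measured for the actual model and resolution.

```python
def pick_num_streams(free_bytes, total_bytes, per_stream_bytes,
                     margin=0.10, max_streams=None):
    """Return the largest stream count that keeps `margin` of total VRAM unused.

    per_stream_bytes is a hypothetical, separately measured estimate of how
    much VRAM one additional stream needs at the current resolution.
    """
    if per_stream_bytes <= 0:
        raise ValueError("per_stream_bytes must be positive")

    reserve = int(total_bytes * margin)   # the 10% safety margin
    usable = free_bytes - reserve         # what the streams may consume
    streams = max(1, usable // per_stream_bytes)  # never go below 1 stream

    if max_streams is not None:
        streams = min(streams, max_streams)
    return int(streams)


MIB = 1024 ** 2

# Placeholder values for a 6144 MB card; none of these are measured.
example_streams = pick_num_streams(
    free_bytes=5600 * MIB,
    total_bytes=6144 * MIB,
    per_stream_bytes=256 * MIB,
)
print(example_streams)
```

Falling back to 1 stream when the budget is already exhausted matches the suggestion above of using 1 as the safe default.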

A 4K video has 9 times as many pixels as a 720p video, which means you cannot use num_streams=24 there; it would not fit into memory. A dynamic approach would be better.
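The factor of 9 is just the pixel-count ratio, and it can be turned into a first guess for a resolution-dependent stream count. The sketch below assumes that memory per stream grows linearly with the number of pixels, which is an assumption and not something verified against vs-rife; the helper names are made up for illustration.

```python
def pixel_ratio(width_a, height_a, width_b, height_b):
    """How many times more pixels frame A has than frame B."""
    return (width_a * height_a) / (width_b * height_b)


def scale_streams(base_streams, base_res, target_res):
    """Scale a stream count that worked at base_res down (or up) to target_res.

    Assumes per-stream memory is proportional to pixel count (unverified).
    """
    ratio = pixel_ratio(*target_res, *base_res)
    return max(1, int(base_streams / ratio))


# 3840x2160 (4K) versus 1280x720 (720p)
ratio_4k_vs_720p = pixel_ratio(3840, 2160, 1280, 720)

# 24 streams worked at 720p in the test above; first guess for 4K
streams_4k = scale_streams(24, (1280, 720), (3840, 2160))

print(ratio_4k_vs_720p, streams_4k)
```

Under that linear assumption, 24 streams at 720p would correspond to only about 2 streams at 4K.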

I will close this issue, since this is just documentation for me, and for you in case you think it is important.

jensdraht1999 commented 9 months ago

Something is not working the way it did on the day I posted. Setting the number of streams higher than the number of available threads really does not improve performance. Must look at: Spectre, Meltdown, Downfall.

jensdraht1999 commented 9 months ago

@HolyWu

Another test:

18:32 with script3.py/script4.py/script5.py and 3 bat files. The video is first split into three parts, then each part is interpolated, then the parts are merged. This was a naive approach, by the way; if the parts had all been split equally, it might have been a little faster. num_streams was 5 for each script.

18:43 with script2.py with num_streams 12. Just scaling and merging audio and video together.

18:35 with script3.py/script4.py/script5.py and 3 bat files. The video is first split into three parts, then each part is interpolated, then the parts are merged. This time the parts were split equally. num_streams was 5 for each script.

So this means that even if CUDA is utilized 100%, it does not get faster in any meaningful way.
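To put a number on "not in any meaningful way", the three timings can be compared directly. This small calculation assumes the times above are elapsed minutes:seconds; the labels are just shorthand for the three runs.

```python
def to_seconds(mmss):
    """Convert an 'MM:SS' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def relative_saving(baseline, other):
    """Percentage of time saved by `other` compared to `baseline`."""
    base = to_seconds(baseline)
    return (base - to_seconds(other)) / base * 100


runs = {
    "three scripts, naive split": "18:32",
    "single script, 12 streams": "18:43",
    "three scripts, equal split": "18:35",
}

baseline = runs["single script, 12 streams"]
for label, elapsed in runs.items():
    print(f"{label}: {to_seconds(elapsed)} s, "
          f"{relative_saving(baseline, elapsed):.2f}% saved")
```

The best run saves 11 seconds out of 1123, so all three are within about 1% of each other.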

The video: 720p, 23:42 runtime, 23.974 FPS, upscaled x3. The hardware: i5-10500H laptop with 6 cores / 12 threads. Nvidia RTX 3060 with 6144 MB.

So the good news is that it does not get any faster. This is pretty much the limit of how fast it goes.