C0untFloyd / roop-unleashed

Evolved Fork of roop with Web Server and lots of additions
GNU Affero General Public License v3.0

Drop of performance since Gradio update #103

Closed Senooy closed 1 year ago

Senooy commented 1 year ago

Describe the bug: I can see a clear drop in performance since the gradio update. My GPU (RTX 3070 Ti) is hitting 100% but processing takes much longer. Is there any way to downgrade my version?

Thanks

Details What OS are you using?

Are you using a GPU?

Which version of roop unleashed are you using?

Latest

eniora commented 1 year ago

Since you have 8GB VRAM, set "Max. Number of Threads" to 5 or 6 in UI settings, apply and completely restart roop. Should fix the issue.

Senooy commented 1 year ago

Doesn't seem to make a difference!

eniora commented 1 year ago

IDK man, tested earlier today on 8GB card and it was fine (after the latest update from yesterday, it significantly improved the performance for me), I could do multiple swaps and no slowdowns at all using 6 threads. However there was a discussion I started 2 days ago about this very same issue here: https://github.com/C0untFloyd/roop-unleashed/issues/89 C0untFloyd made an update that improved the performance.

Count also said in the future he may implement a feature to unload the models to CPU which should be beneficial for low VRAM GPUs. roop is and has always been VRAM hungry. Coincidentally I was reading the official roop discord server today and there was a small discussion about this very problem, latest roop (main and the gradio beta one) doesn't release the memory at all, unleashed is still better at it with the latest commit from yesterday regarding the VRAM usage.

Darknessssenkrad commented 1 year ago

Same here. I was able to run 1 thread on CUDA (3050, 4GB VRAM) and was getting 7-8 it/s; now I hardly get 1-1.5 it/s, and a fresh reinstall didn't help. If I try to use more threads I get OOM errors.

faiqraedaya commented 1 year ago

Same issue. RTX 3070, went from XX fps to 0.0X fps. ~100% GPU and VRAM utilisation. Tinkering with the settings doesn't seem to change anything.

eniora commented 1 year ago

As a test, can you guys try with tensorrt provider in settings? set it, apply settings with 4 threads and restart roop completely. tensorrt uses less VRAM with almost same speed as cuda but it doesn't support the enhancers from what I tried. So try without any enhancer and see if you can get good FPS.

This is with tensorrt: [screenshot: tensor]

This is with CUDA: [screenshot: cuda]

Same video and settings, both on 1070 8GB using 5 threads (no enhancer). 8 FPS is good considering the video I am testing on has a face covering a large portion of the screen.

C0untFloyd commented 1 year ago

TensorRT isn't used as a device, it will use your CPU instead (that's why some of the enhancers don't work with it): See https://github.com/C0untFloyd/roop-unleashed/blob/main/roop/utilities.py#L185

For plain swapping CPU might be faster however. I'm reading in other roop forks forums too from time to time and there are at least two (including the official SD Extension by the original guy) where the GPU Support was completely removed as it's slower for swapping, see here

But while you're at it testing, if you check the keep frames option it will use almost the same code/threading as the old roop unleashed. My gut feeling is that it's running quite a bit faster, although the frame processing is not chained (meaning face detection is running on the whole image and previous results aren't re-used). In theory it should be a lot slower than the standard but it isn't. I'm not satisfied with the current frame processing myself, I think I will split the plain image processing from the writing to file as currently threads need to wait for each other to maintain frame order.

Siraj-HM commented 1 year ago

facing the same issue with 3070 the older version is miles ahead in speed but love the new improvements. @C0untFloyd please help us :(

Senooy commented 1 year ago

Could anyone downgrade the version?

C0untFloyd commented 1 year ago

Rewrote the chained threading last night and separated it in parts with buffered reading/writing, speeding it up about 4 times. I think it's even faster than the previous v2.0.3! As a test video I'm using a 90 frames clip from the TV News, it's 1920x1080 resolution.

Pure swapping on my not state-of-the-art NVIDIA 2060 Super:

Processing: 100%|███████████████████████████████████████████████████████████████████████| 90/90 [00:21<00:00, 4.17frame/s, memory_usage=12.33GB, execution_threads=8]

Previously 1.4frame/s

With GFPGAN on: Processing: 100%|███████████████████████████████████████████████████████████████████████| 90/90 [03:11<00:00, 2.13s/frame, memory_usage=13.30GB, execution_threads=8]

Still twice as fast as previously without enhancement!

I need to do some more tweaking and will commit this after my regular work day.
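A note on reading the two progress lines above: tqdm prints the rate as `frame/s` when it is at least one frame per second and flips to `s/frame` when it is lower, so the two lines are in different units. The sketch below recomputes both rates from the frame count and the elapsed time shown in the bars; the helper names are made up for illustration, and the recomputed values differ slightly from the displayed ones because tqdm smooths its rate and rounds the elapsed time.

```python
def parse_elapsed(text: str) -> int:
    """Convert a tqdm elapsed string such as '03:11' or '1:02:03' to seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def frames_per_second(frames: int, elapsed: str) -> float:
    """Average rate over the whole run, in frames per second."""
    return frames / parse_elapsed(elapsed)


# Values copied from the two progress bars above (90 frames each).
swap_only = frames_per_second(90, "00:21")
with_gfpgan = frames_per_second(90, "03:11")

print(f"swap only:   {swap_only:.2f} frame/s")
print(f"with GFPGAN: {with_gfpgan:.2f} frame/s ({1 / with_gfpgan:.2f} s/frame)")
```

Read this way, the swap-only run averages a little over 4 frame/s, and the GFPGAN run averages roughly 2.1 seconds per frame, which is a little under 0.5 frame/s.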

Siraj-HM commented 1 year ago

Thank you will give it a try and let you know

lysxelapsed commented 1 year ago

Interesting. With chained threading and implementation of nvenc, swapping is roughly twice as fast (swapping only) compared to 2.0.3 on my system (i7-4770K, rtx 3060 using only 8 pci-express lanes), while it takes almost twice the time with GFPGAN compared to the "old" approach. I'm curious about the difference my new system will make - parts arrived today.

But: did all of you take into account the time 2.0.3 needs to write the frames onto the hard drive first? Especially with very long videos, that took up to 20 minutes (older 2.5" SATA SSD), and it is not included in the processing time shown. So of course you won't reach the same frame rates with gradio, which doesn't write all the frames first. To really compare, you have to time the total run with 2.0.3. Also, the HDDs and SSDs got very warm while writing all the frames first, and occasionally froze the whole system.
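One way to make such a comparison fair is to time every stage separately and then add them up, so frame extraction and video encoding are counted alongside the swapping itself. This is only a sketch: the stage names are hypothetical and the `time.sleep` calls stand in for the real work, which is not shown here.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def stage(name):
    """Accumulate the wall-clock time spent inside the with-block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start


# Placeholder workloads; replace each sleep with the real step being measured.
with stage("extract_frames"):
    time.sleep(0.02)
with stage("swap"):
    time.sleep(0.05)
with stage("encode_video"):
    time.sleep(0.01)

total = sum(timings.values())
for name, seconds in timings.items():
    print(f"{name:>15}: {seconds:.3f} s")
print(f"{'total':>15}: {total:.3f} s")
```

Comparing the `total` values of two builds, rather than only the `swap` stage, avoids crediting one build for work it simply does in a different stage.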

eniora commented 1 year ago

I just ran a test on a 3090 24GB and a 1070 8GB comparing the current 2.7.3 gradio build to the old 2.0.3; they are the exact same speed on the 1070 (total time including the frame extraction for 2.0.3)! What people are reporting here, that the gradio build is several times slower than 2.0.3, is weird. My test was on a 1 min 720p 30FPS (1727 frames) video with one face. I chose many faces/all faces on both tests with the keep fps option and ran it multiple times to get an average. Didn't use an enhancer. 7 threads used on the 1070 and 20 threads on the 3090. My tests were mainly focused on the 1070 since it seems only people with low VRAM cards are reporting these gradio slowdowns. I also tested on another 40-second video with the same results: same speed on both.

On 1070:

2.0.3: 2 mins 14 seconds (total time including frame extraction, swapping time was 2 mins 3 seconds)

gradio 2.7.3: 2 mins 13 seconds (total time including frame extraction, swapping time was 2 mins 5 seconds)

Here are screenshots: [screenshot: Gradio] [screenshot: 2.0.3]

On 3090 2.0.3 was ~10% faster. I have to say though that 2.0.3 is still better at VRAM usage and clears the VRAM better than the gradio build after swapping. But what's important is that they are both the same speed for me.

Anyway looking forward to test the new thread splitting to parts update count talked about above!

lysxelapsed commented 1 year ago

regarding my comment that GFPGAN takes twice as long: scratch that! thanks to @phineas-pta's comment here #116, I now know why that was. Threads should be significantly lower when using GFPGAN. Maybe the VRAM handling with enhancers on can still be improved?

Senooy commented 1 year ago

Thank you for the tests

On the same video I went from 5-8 frames per second to 7 seconds just to process one frame, oof

lysxelapsed commented 1 year ago

how's the VRAM usage? are you using an enhancer? try lowering threads further and don't forget to apply settings (happened to me a few times). or even better: start with one thread and increase one by one (apply settings and restart in between to be sure), until performance drops again.

C0untFloyd commented 1 year ago

Man, this stuff is really weird! First, as other people pointed out too, v2.0.3 wasn't running faster for me than the gradio version. The frames per second display in the output isn't very accurate and just measures the timespan between progress updates. The gradio version should in theory even run faster (when using enhancers etc.) because it doesn't do any post-processing if there is no face detected previously AND it only does post-processing in the bounding boxes of the target faces. BUT v2.0.3 always extracts all of the movie frames at the beginning and just loads them in each thread, which probably negates the speed improvement again. In theory the gradio version with 'keep frames' checked should be the fastest.

Now back to my optimizing adventures: The screenshot I posted this morning was running with every movie frame loaded into memory, which obviously won't work with bigger movie clips or less memory. So I revised the code again & again and it seemed to get even slower. Then I changed to CPU-only and it was running about 3x as fast as with CUDA enabled. That means my improvements were working; there must be some sort of race condition in the CUDA code, or the face recognition blocks the other models. As I wrote previously, some of the roop forks went back to CPU-only.

What I'm doing now is mixing it and adding preloading for slow drives. There's now a checkbox to force the face swapping to be done on CPU and the rest will use whatever provider you selected. In addition, a preload buffer size can be set to the number of images to be preloaded for each thread. My sweet spots are 8 threads, 16 images buffer size and force cpu turned on. This is about 4 times faster than previously. The processing time will also be output after finishing so you people don't have to guess anymore:

Processing took 30.049463987350464 secs

eniora commented 1 year ago

Thanks a lot count, can't wait to test this new update! You said CPU processing is 3x times faster than CUDA and some forks are now using CPU only, but I just tested and tried using the CPU as provider and it's like 20x times slower than CUDA (a mere GTX 1070) at face swapping (no enhancer), not sure if I am understanding you correctly here.

C0untFloyd commented 1 year ago

Don't worry Force CPU only uses the CPU for face analysis, swapping/enhancement still uses GPU (and it's toggleable).
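The split described here can be pictured as choosing a different ONNX Runtime provider list per model. The sketch below is not the project's code: `providers_for` and the model names are hypothetical, and only the provider strings (`CUDAExecutionProvider`, `CPUExecutionProvider`) are real ONNX Runtime identifiers. ONNX Runtime tries the providers in list order, so the CPU entry acts as a fallback.

```python
CPU = "CPUExecutionProvider"


def providers_for(model: str, selected: str, force_cpu_analyser: bool) -> list:
    """Return the provider priority list for one model (hypothetical helper).

    model: illustrative name such as 'face_analyser', 'swapper' or 'enhancer'.
    selected: the provider chosen in the settings, e.g. 'CUDAExecutionProvider'.
    """
    if model == "face_analyser" and force_cpu_analyser:
        return [CPU]              # face analysis pinned to the CPU
    if selected == CPU:
        return [CPU]
    return [selected, CPU]        # selected provider first, CPU as fallback


for name in ("face_analyser", "swapper", "enhancer"):
    print(name, providers_for(name, "CUDAExecutionProvider", force_cpu_analyser=True))
```

With the toggle off, all three models would get the same list, which matches the statement that the option only affects face analysis.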

eniora commented 1 year ago

Ahhh OK I see, so it's only for face analysis. btw is it possible to implement the same mechanism 2.0.3 used to release VRAM after swapping? for example on 2.0.3 when I am swapping it uses around 6.5GB VRAM and 3.5 GB after it finishes (you can see on my screenshots above), on gradio it's using 7.8GB and 7.5GB after swapping, same settings on both. I just like how VRAM efficient 2.0.3 was, probably that's one reason why some people are having speed issues with the gradio build?

eniora commented 1 year ago

hmm it's much slower now after the update, tried with and without Force CPU for Face Analyzer, tried with 16 and 8 threads, tried with frame buffer size set to 0, 8, 16 and 32 and I can't get it to process more than 4 FPS. I made sure to click apply settings and restarting roop completely every time. Was 14 FPS on average before update on the same video. Now it feels like the GPU isn't being used fully and the temperature is low too.

Here's a screenshot while CPU analyzer is running, I am getting the same speed with any other settings.

Edit: Processing took 456.1616461277008 secs vs 133 seconds which was right before the update (see my comparison post above)

C0untFloyd commented 1 year ago

Wow, that's a huge difference indeed! That's with keep frames unchecked, isn't it? As I wrote, it's weird because in my case it's the exact opposite: I constantly get triple the speed I had before. Perhaps the 2 methods could be selectable so everybody wins... VRAM: the only difference between those versions is the usage of classes and gradio itself. I suspect it is partly because gradio is running in the browser, which allocates RAM just when starting. If you want to test every bit, you could have the browser use your internal GPU and see if that makes a VRAM difference. One more thing: frame buffer size must be at least 1, and there's no need to be conservative with threads or that size; it's using system RAM.

eniora commented 1 year ago

Yah that's with keep frames unchecked. Following tests were done with 20 threads, frame buffer tested on both 16-32 with the new update which didn't make a difference for me. What's weird is that VRAM usage is so low with the new update, some runs only use 8GB and it's very slow.

On 3090 (keep frames unchecked):

before update: 40 FPS; after update: 3 FPS

with keep frames checked:

before update: 40 FPS; after update: 10-15 FPS

Edit: tested again with some other settings while keep frames is checked: (20 buffer and 16 threads which are the best settings for the new update for me on 3090 so far)

before update: 40 FPS (33 seconds swapping time); after update: 33 FPS (41 seconds swapping time)

Still, 2.7.3 always wins regardless of whether keep frames is checked or not, both on 1070 and 3090.

C0untFloyd commented 1 year ago

before update: 40 FPS (33 seconds swapping time) after update: 33 FPS (41 seconds swapping time)

This, I don't understand even remotely. There was no change to the keep frames code and with force cpu unchecked it should be identical. I need a vacation...

eniora commented 1 year ago

I need a vacation...

You deserve it after all this work 😂 I mean every PC, configuration etc. is different, it's hard to please everyone haha. But yah like you said, I think the 2 methods being selectable could be a good solution with the old method as default? (let's see what others here think about the new update and how it runs for them). Also if you want to test/debug anytime you want on my PC you can DM me on discord enio0292 and I can give you teamviewer/anydesk access.

eniora commented 1 year ago

@C0untFloyd I think I discovered something. Sec I am still testing, will edit this post when I am done.

OK so.. While Force CPU for Face Analyser is unchecked, if I set the Frame Buffer Size to anything more than 34 it's so fast (35 and above same fast speed, 34 and anything less it doesn't make a difference, will be slow), it's even faster than before the update at 35+ frame buffer but unfortunately it takes so long to create the video after it's done processing (after Processing: 100%|███| step), it takes ~110 seconds for video creation alone and it's only a 15 seconds video, no real CPU or GPU usage I can see when it's doing that)

Force CPU for Face Analyser doesn't make a difference speed-wise, it's all in the Frame Buffer Size for me. With both scenarios while Face Analyser is checked or unchecked and 34 or less Buffer Size, it's very slow and video creation also takes long.

When keep frames is checked it's fast both at swapping and at creating the video with 35+ buffer size. Now if only at 35 Buffer Size it can create the video normally without this slow down while keep frames is unchecked. So what I think is without keep frames checked ffmpeg or w/e is struggling to take the buffered frames from memory? that's my guess. I wish I knew coding so I could help you with this 😄

I think I confused you even more now haha. Sorry man!

Edit: just discovered that the number of Frame Buffer Size is related to the GPU threads count, so at 10 threads for example you need higher than 35 buffer size for it to be fast...

Senooy commented 1 year ago

For my part, performance increased a lot since the last commit, thank you.

@C0untFloyd I think I discovered something. Sec I am still testing, will edit this post when I am done.

Nice

eniora commented 1 year ago

@Senooy good to know you have better performance now. However the plot thickens :D can you please post the settings you used? keep frames checked? Frame Buffer Size, is Face Analyser checked? how many GPU threads etc.

I edited my post above for my discovery.

C0untFloyd commented 1 year ago

I did some more coding, simplified the threading and I believe it's now really testing your GPU capabilities. And drum roll it's really fast on my machine even with everything being CUDA. The trick is to find your sweet spot, to have your GPU churning through without filling your VRAM too much. And sometimes less is more! As @eniora wrote, frame buffer size is tied to number of threads, so in my case my sweet spot is 4 Threads, 4 Frame Buffer Size. This means an image buffer of 4x4 = 16 Images. With that I have a constant speed of 8 fps which is awesome for my mediocre card. Be aware that this needs to be tweaked when using enhancers to avoid overfilling your VRAM. With GFPGAN my sweet spot is 2*4. I just committed the new code and need to sleep. Only 4 hours sleep for me tonite then 😭 Have fun!
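For a rough feel of what "4 threads x 4 frame buffer = 16 images" costs in memory, the arithmetic can be sketched as below. It assumes uncompressed 8-bit, 3-channel frames and that threads times buffer size is the total number of buffered images, as described above; the function names are made up, and model weights or any other VRAM use are not counted.

```python
def frame_bytes(width: int, height: int, channels: int = 3, bytes_per_channel: int = 1) -> int:
    """Size of one uncompressed frame in bytes (assumption: 8-bit, 3 channels)."""
    return width * height * channels * bytes_per_channel


def buffer_mib(threads: int, buffer_size: int, width: int, height: int) -> float:
    """Total size in MiB of threads * buffer_size buffered frames."""
    return threads * buffer_size * frame_bytes(width, height) / 2**20


# 1920x1080 is the resolution of the test clip mentioned earlier in the thread.
for threads, size in ((4, 4), (2, 4), (18, 18)):
    print(f"{threads} x {size}: {buffer_mib(threads, size, 1920, 1080):.1f} MiB")
```

Under these assumptions a 4 x 4 buffer of 1080p frames is on the order of 100 MiB, so the frame buffer by itself is small compared with an 8GB card; the per-thread model inference is the more likely source of VRAM pressure.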

Senooy commented 1 year ago

Awesome I'm testing it rn thanks for your dedication !

eniora commented 1 year ago

Yah thanks a lot for your dedication! I have bad news though😢 With the new commit faces flicker so bad, I posted tests videos below.

With latest update (the one you just committed): download (after latest commit)

before latest commit: download (before latest commit)

The good news is that it's much better now for me in terms of processing speed, but still slower than 2.7.3 overall. With the right settings and thread count/frame buffer it can almost catch up with 2.7.3, but it's kind of a pain trying to find the best settings for every video/resolution and whether GFPGAN is activated or not, etc. I think I can live with it (after face flickering is fixed) but this isn't ideal IMO, especially for new users that are testing some forks and came here to test unleashed. I am tired as well and will only have 5 hours of sleep 😄 Cheers!

Edit: Flickering is gone by either checking the "Keep Frames" option or lowering the GPU threads to less than 4. On 4 threads it starts to flicker and the more I increase it the flickering increases drastically, Setting Frame Buffer Size to any value doesn't affect the flickering, only the GPU threads option does. Think it should be an easy fix for you.

BTW I do believe that this whole new update from earlier today (Setting Frame Buffer Size and Force CPU for Face Analyzer etc.) is beneficial for low VRAM cards or certain GPUs/PC combinations (though it was slower for me on 1070 as well) but for fast cards such as the 3090 it can be slower/less beneficial. For that reason sadly I think it's not a universal enhancement update, at least with the current code.

C0untFloyd commented 1 year ago

Used my morning break to debug some more; there was a huge threading bug which corrupted the frame order. Fixed it real quick, losing ~1 frame per second due to race conditions though. As I said before, 'Keep Frames' doesn't use the latest changes, it's the same as always (excluding the force cpu thing). I have a new suspect for why the slowdowns are happening: https://github.com/C0untFloyd/roop-unleashed/blob/old_deprecated/roop/core.py#L109 ^ commented out in the old tkinter branch

https://github.com/C0untFloyd/roop-unleashed/blob/main/roop/core.py#L112 ^ activated

And @eniora is the one to blame because of his VRAM complaints! 😄 Before that activation there was no tensorflow dependency and I didn't have these wildly varying slowdowns. The code was copied straight from the original roop; I need to test later whether this is part of the solution.

AlonDan commented 1 year ago

Just an idea to make things faster: maybe process the way the good old Windows GUI roop unleashed did?

First extract the whole sequence, and only then do the rest (preview or render), so it always reads from the temporary folder. When a different video/sequence is loaded, remove the first one so it doesn't keep leftover junk and use lots of space with unused images.

eniora commented 1 year ago

All good now 😁 thank you! With the latest commit from an hour ago it's fast and good, no flickers. Testing on 1070 with 5 threads and 5 buffer size, it's as fast as 2.7.3 but with Force CPU for Face Analyzer disabled (I think it should be disabled by default after installing unleashed), when I enable it it takes twice the time to swap and process, same with my home PC which has 3090 and 5950x. Yah sorry about the VRAM complaints, I was just comparing with the old tkinter build and thought it would be an easy fix hehe. I am waiting to see the others who commented here and complained about the speed and how the performance is for them now after all these commits.

Now with the enhancer enabled, I am getting a ~30 seconds pause at the end of frame processing every time, like this:

[screenshot: Screenshot 2023-08-17 125607]

It pauses for 30 seconds shortly before processing the last few frames, never had this with 2.7.3 No GPU or CPU usage I can see while it's stuck, I guess the enhanced buffered frames are being released from memory that causes the slowdown? I am using 5 threads and 5 buffer for the above screenshot, using 2 threads and 4 buffer fixed the issue, so like you said it's 2x4 for enhancers with your 8GB card and seems like it applies to me. It's weird because with 5x5 my VRAM wasn't full, max was 7.3GB out of 8GB with no shared memory usage. Anyway I think I will be using 3x5 for everything on 1070 which is a good balance with or without enhancers and no slowdown at the end of processing.

C0untFloyd commented 1 year ago

Just an idea to make things faster: maybe process the way the good old Windows GUI roop unleashed did?

First extract the whole sequence, and only then do the rest (preview or render), so it always reads from the temporary folder. When a different video/sequence is loaded, remove the first one so it doesn't keep leftover junk and use lots of space with unused images.

That's exactly what happens when checking keep frames 😛

No GPU or CPU usage I can see while it's stuck, I guess the enhanced buffered frames are being released from memory that causes the slowdown?

Sort of. I couldn't debug it yet, but I assume it's because at the end there is a race condition with the remaining frames needing to be written to the resulting file in the correct order. The more threads, the more images are left, and if there's a bunch of them the saving thread is too slow (also depending on your media speed of course). See, that's the cool thing with the new revised code. Previously, while processing a video, one thread was waiting for the previous one to finish to keep the frame order intact. Now they just process as much and as fast as they can, and the extra saving thread takes care of sorting and saving. I for one don't want to go back; even on my old laptop the whole process is much faster than previously, even compared to the tkinter version. Try processing a video with parts without faces and see the framerate fly into the 2 digits 🤣
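The design described here (workers finishing frames in any order, one extra thread restoring the order before saving) can be sketched with the standard library alone. This is an illustration of the general pattern, not the project's implementation: `process` is a stand-in for swapping, the list `output` stands in for the video file, and all names are hypothetical.

```python
import heapq
import queue
import random
import threading
import time


def process(frame):
    """Stand-in for face swapping; sleeps briefly so frames finish out of order."""
    time.sleep(random.uniform(0, 0.002))
    return frame * 2


def run_pipeline(frames, num_workers=4):
    tasks = queue.Queue()
    results = queue.Queue()
    output = []  # stands in for the video writer

    def worker():
        while True:
            item = tasks.get()
            if item is None:          # sentinel: no more work
                break
            index, frame = item
            results.put((index, process(frame)))

    def writer(total):
        pending = []                  # min-heap of (index, frame) waiting their turn
        next_index = 0
        while next_index < total:
            heapq.heappush(pending, results.get())
            # Flush every frame that is now next in line.
            while pending and pending[0][0] == next_index:
                output.append(heapq.heappop(pending)[1])
                next_index += 1

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    saver = threading.Thread(target=writer, args=(len(frames),))
    for thread in workers + [saver]:
        thread.start()
    for item in enumerate(frames):
        tasks.put(item)
    for _ in workers:
        tasks.put(None)
    for thread in workers + [saver]:
        thread.join()
    return output


print(run_pipeline(list(range(10))))
```

In this pattern the `pending` heap grows whenever one slow frame holds back many finished ones, which is consistent with the guess above that more threads leave more images waiting for the saving thread at the end.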

eniora commented 1 year ago

The more threads, the more images left and if there's a bunch of them the saving thread is too slow

Sorry to say this but can you PLEASE make a toggle in settings to disable the buffering system/recent optimizations and make it like 2.7.3? Like I said before, people have all kinds of rigs and configs so what works for me or you may not work best for them so the more options the better IMO. I am saying this because when the enhancer is enabled anything more than 3 threads causes that long pause to happen at the end and with 3 threads it's slower than on 2.7.3 where I was able to set it to 5-6 threads and processing time was ~30% faster. Now if you manage to fix that long pause with more than 3 threads it would be awesome and probably I can see increased performance compared to 2.7.3. To be honest with you all the updates since yesterday didn't give me any speed boost compared to 2.7.3 with or without enhancer and with two computers one with 1070 and one with 3090.

Edit: Just tested on a 3090 with the enhancer and I can't set GPU threads to more than 3 without that long pause happening at the end of processing. So this issue applies even on my 3090, even though the VRAM usage is no more than 12GB out of 24GB.

That's exactly what happens when checking keep frames

Can you also make a toggle that simulates the keep frames option and deletes them automatically after processing? (exactly like old roop) so people who have issues or slowdowns with the new gradio update for whatever reason can have flexibility with choosing what works best for them.

C0untFloyd commented 1 year ago

Can you also make a toggle that simulates the keep frames option and deletes them automatically after processing? (exactly like old roop)

Done. Also I removed the supposed memory leak fix for tensorflow I'm curious if this changes anything, perhaps this makes the new method fast for you too? Committed as v2.7.6

While testing the old method again I noticed totally different speeds between runs. The first run had me swapping at a mediocre ~ 3 fps. I started over again and it was very slow at about 1.45 fps. After coding some more and doing another run I had ~ 8 fps almost constantly! This is really weird, my best guess is that VRAM is garbage collected in a lazy fashion and if it's full there is some memory swapping going on. I doubt this can be solved by code.

eniora commented 1 year ago

Very nice implementation of the In-Memory and Extracting frames option! Simple, easy and placed right.

Anyway tested the new commit on both 1070 and 3090, Force CPU for Face Analyser is still slow for me (which I have it disabled now for all my tests) and that pause at the end still happens when I am using the enhancer and even on a 3090 I have to set threads to 3 max for the pause to not happen which is very slow compared to when I set it to 16 (which causes a 20-30 seconds pause at the end 10-15 frames before the total, and that's regardless of how many buffer size I set). Looks like how early the pause happens, which is always near the end in any case, is relevant to what the Frame Buffer Size is set to.

I assume it's because at the end there is a race condition with the remaining frames needing to be written to the resulting file in the correct order

But if that's true, why doesn't it happen when the face enhancer isn't enabled? that pause only happens if GFPGAN is enabled. Can you please make it that if I set threads to 0 it means it's disabled completely (like 2.7.3)? If only that pause doesn't happen everything would have been perfect. 😢 Can you reproduce the pause I am talking about? on your 8GB card set threads to 5 and buffer to 10, enable GFPGAN and test on a short video (10-15 seconds video will do).

C0untFloyd commented 1 year ago

Committed the next iteration of it; the pause should be gone. I'm curious if this makes a difference with your 3090. I noticed both methods degrade after the first run and seem to run slower. But it doesn't always happen, and after the first run the degradation never continues and the speed stays roughly the same. For me there's hardly a difference between both methods in terms of speed now. Both run best with 4-5 threads in my tests and a smaller buffer size of 2-3. The buffer size isn't used with the extract frames method anyway.

eniora commented 1 year ago

Awesome! everything working great now, no pauses or anything. My sweet spot is 18x18 for threads and buffer on 3090 and it's now sometimes faster than 2.7.3! or at least the same speed. With extract frame method it's the same speed as well but I prefer in-memory processing anyway (my suggestion before to add the proper extract frames feature was for people who have issues with in-memory). Force CPU for Face Analyzer still much slower for me at swapping and processing (only faster at detecting the source face and target image/video).

Small suggestion: extract frames method doesn't show the fancy new total time feature :P Finally an update capable to replace 2.7.3 for me 😄will test tomorrow on 1070 but it will most likely be fine because the issues I had were both with 1070 and 3090. Thanks a lot.

Edit: I forgot to reply to this: No degradation of speed after the first run for me and sometimes it's even faster after the first run but let's say it's the same speed as long as my VRAM doesn't get full, and it doesn't with 18 threads on 3090. 4-5 threads would be the best with 1070/8GB cards and I think 6-9 would be best with 12GB cards.

eniora commented 1 year ago

@C0untFloyd with one swap out of many I did I had an error with 2.7.7

2023-08-19 03:16:25.7321029 [E:onnxruntime:Default, cuda_call.cc:116 onnxruntime::CudaCall] CUDA failure 719: unspecified launch failure ; GPU=0 ; hostname=DESKTOP-owner ; file=D:\a_work\1\s\onnxruntime\core\providers\cuda\gpu_data_transfer.cc ; line=63 ; expr=cudaMemcpyAsync(dst_data, src_data, bytes, cudaMemcpyHostToDevice, static_cast<cudaStream_t>(stream.GetHandle()));

It happened in the middle of swapping (around 50% in), GFPGAN enhancer was enabled, it kept throwing this error hundreds of times in the console and I had to close it, I forgot to do control + C so I could go up and see the full error and how it started. Anyway it only happened once and I was testing on a 9k frames video, tested again twice and it didn't happen. If it happens again I will let you know.

Here is a benchmark with 2.7.7 against 2.7.3 with a 9k frames video (with GFPGAN):

2.7.7 (16 threads x 16 buffer): [screenshot: 2 7 7]

2.7.3 (20 threads): [screenshot: 2 7 3]

I would say the speed difference is in the margin of error at this point and I may need to tweak my settings further for 2.7.7 to get the best performance.

Siraj-HM commented 1 year ago

I tested the latest build on my 3070 with eight threads and frame buffer 4. It starts at 12fps but drops to 11s per frame after 25% processing. I tried the keep in memory option too same thing happened :( what am I doing wrong folks?

eniora commented 1 year ago

@Siraj-HM try with 3-4 threads and 3-4 buffer size, you're filling your VRAM with 8 threads on 3070. I noticed it uses a bit more VRAM since the buffer size update but the speed should be the same as before with a lower thread number. Don't use Force CPU for Face Analyser in settings and use In-Memory processing method. But the most important thing is to set to 3 or 4 threads with 3 buffer size.

eniora commented 1 year ago

I just tested on a 1070 8GB: everything is working well and it's actually ~10% faster than 2.7.3, averaged over a benchmark of 5 runs, especially with the enhancer. I did multiple runs in the same session and no slowdowns occurred at all. My settings were:

- In-Memory processing method
- Force CPU for Face Analyser disabled
- 4 threads with 4 buffer size

Like I said, the new 2.7.7 update (like most updates after 2.7.3) uses more VRAM at the same thread count, so if you used, for example, 5 or 6 threads on 2.7.3, you have to use 4 threads on 2.7.7 (with 4 buffer size). Don't worry, the speed should be the same or even slightly faster with 4 threads on 2.7.7 compared to 6 threads on 2.7.3. The idea is that you don't want to fill your VRAM, because if that happens roop will start using shared memory, which is hundreds of times slower than GPU memory, hence the significant slowdowns. That's why it's a good idea to start with a low thread count and keep increasing it until your VRAM fills, then undo the last increase.
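
The tuning loop described above can be sketched roughly like this. It is only a sketch: `vram_used_gb` is a hypothetical callback standing in for however you read VRAM usage at a given thread count (Task Manager, nvidia-smi, etc.), and the usage model at the bottom is made up for illustration, not measured.

```python
from typing import Callable


def pick_thread_count(
    vram_used_gb: Callable[[int], float],  # hypothetical: VRAM use observed at N threads
    vram_total_gb: float,
    max_threads: int = 32,
) -> int:
    """Raise the thread count until VRAM would fill, then back off by one."""
    best = 1
    for threads in range(1, max_threads + 1):
        if vram_used_gb(threads) >= vram_total_gb:
            break  # this count would spill into shared memory; keep the previous one
        best = threads
    return best


# Made-up model: 2.5 GB baseline plus 1.2 GB per thread (not measured values).
def fake_usage(threads: int) -> float:
    return 2.5 + 1.2 * threads


print(pick_thread_count(fake_usage, vram_total_gb=8.0))
```

In practice each "measurement" is a short test swap at that thread count, so it makes sense to test with a short clip.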

AlonDan commented 1 year ago

I just did some tests with the default 2.7.7 settings and I get much slower results compared to what I had before (not that it was super fast). So I tried the new feature, "Extract Frames to media", hoping it would do what the good old Roop Unleashed GUI for Windows did: extract super fast, then start MUCH FASTER than anything I get in Gradio on my old machine.

Unfortunately, it is still exactly as slow on my machine. I did the exact same thing with the old Roop Unleashed GUI and it was so much faster, just like I remembered, so I guess I'll have to stick with the old version (I know it's much more limited), but I have no choice as I only have:

which is not good for the latest Gradio version; it's just extremely slow to render a few seconds of one video.

I guess it works great on more modern machines, so at least it's going in a good direction there. Maybe some "magic" could make it work more like the GUI version with its super fast export, or maybe the process itself is just much slower for other reasons, I have no idea. If some cool change happens in that area, I'll be sure to give it a try and let you know if it improved.

In the meantime, please keep up the good work! ❤

eniora commented 1 year ago

@AlonDan can you please try on the new 2.7.7 with the following settings:

- Threads 1, Buffer 1
- In-Memory processing
- Force CPU for Face Analyzer disabled

Apply them and completely restart roop. Make sure threads and buffer are both set to 1 in the config.yaml file; adjusting them in the UI settings sometimes sets a random value for buffer, especially if you previously had a high buffer value and then set it to a very low value like 1 (no idea why).

The reason it's slow for you is that your 4GB VRAM is getting full with the default setting of 4 threads.
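
A quick way to double-check the values without opening the UI is to read them straight from the file. This is a minimal sketch that assumes config.yaml stores flat `key: value` lines and that the keys are called `max_threads` and `frame_buffer_size`; both key names are assumptions, so look in your own config.yaml for the real ones.

```python
from pathlib import Path


def read_flat_yaml(text: str) -> dict[str, str]:
    """Parse simple top-level 'key: value' lines (not a full YAML parser)."""
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" in line:
            key, value = line.split(":", 1)
            values[key.strip()] = value.strip()
    return values


def check_settings(text: str, expected: dict[str, str]) -> list[str]:
    """Return a message for every key whose value differs from expected."""
    found = read_flat_yaml(text)
    return [
        f"{key} is {found.get(key)!r}, expected {want!r}"
        for key, want in expected.items()
        if found.get(key) != want
    ]


if __name__ == "__main__":
    config = Path("config.yaml")  # assumed to sit in the current folder
    if config.exists():
        # Key names below are hypothetical; use the ones in your file.
        problems = check_settings(
            config.read_text(encoding="utf-8"),
            {"max_threads": "1", "frame_buffer_size": "1"},
        )
        print("\n".join(problems) or "threads and buffer are both 1")
```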

With only 1 thread and 1 buffer size it's still really fast for me and only slightly slower than with 4 threads, though my VRAM usage is 4.2GB so hopefully it would work for you, it's worth a try!

Screenshot 2023-08-19 174053

I do believe the Gradio build handles VRAM differently and is a bit more VRAM hungry, which is why I use a much lower thread count than with the old unleashed builds, but it should reach a similar speed, so it's fine.

Darknessssenkrad commented 1 year ago

Everything seems to be fixed. I can use 1 thread and 14 frames on my 3050 (laptop), BUT on the Colab notebook it's crazy: I'm using 12 threads and 14 frames there and I get 10-12 frames/second with no enhancements.

2023-08-19 16:11:50.324872: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-19 16:11:52.009750: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Using provider ['CUDAExecutionProvider'] - Device:cuda
Running on local URL: http://127.0.0.1:7860/
Running on public URL: https://45e46d4e3119fd686a.gradio.live/

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run gradio deploy from Terminal to deploy to Spaces (https://huggingface.co/spaces)
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'device_id': '0', 'gpu_mem_limit': '18446744073709551615', 'gpu_external_alloc': '0', 'gpu_external_free': '0', 'gpu_external_empty_cache': '0', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'cudnn_conv1d_pad_to_nc1d': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'do_copy_in_default_stream': '1', 'enable_cuda_graph': '0', 'cudnn_conv_use_max_workspace': '1', 'tunable_op_enable': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_tuning_enable': '0'}}
find model: /root/.insightface/models/buffalo_l/1k3d68.onnx landmark_3d_68 ['None', 3, 192, 192] 0.0 1.0
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'device_id': '0', 'gpu_mem_limit': '18446744073709551615', 'gpu_external_alloc': '0', 'gpu_external_free': '0', 'gpu_external_empty_cache': '0', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'cudnn_conv1d_pad_to_nc1d': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'do_copy_in_default_stream': '1', 'enable_cuda_graph': '0', 'cudnn_conv_use_max_workspace': '1', 'tunable_op_enable': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_tuning_enable': '0'}}
find model: /root/.insightface/models/buffalo_l/2d106det.onnx landmark_2d_106 ['None', 3, 192, 192] 0.0 1.0
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'device_id': '0', 'gpu_mem_limit': '18446744073709551615', 'gpu_external_alloc': '0', 'gpu_external_free': '0', 'gpu_external_empty_cache': '0', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'cudnn_conv1d_pad_to_nc1d': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'do_copy_in_default_stream': '1', 'enable_cuda_graph': '0', 'cudnn_conv_use_max_workspace': '1', 'tunable_op_enable': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_tuning_enable': '0'}}
find model: /root/.insightface/models/buffalo_l/det_10g.onnx detection [1, 3, '?', '?'] 127.5 128.0
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'device_id': '0', 'gpu_mem_limit': '18446744073709551615', 'gpu_external_alloc': '0', 'gpu_external_free': '0', 'gpu_external_empty_cache': '0', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'cudnn_conv1d_pad_to_nc1d': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'do_copy_in_default_stream': '1', 'enable_cuda_graph': '0', 'cudnn_conv_use_max_workspace': '1', 'tunable_op_enable': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_tuning_enable': '0'}}
find model: /root/.insightface/models/buffalo_l/genderage.onnx genderage ['None', 3, 96, 96] 0.0 1.0
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'device_id': '0', 'gpu_mem_limit': '18446744073709551615', 'gpu_external_alloc': '0', 'gpu_external_free': '0', 'gpu_external_empty_cache': '0', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'cudnn_conv1d_pad_to_nc1d': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'do_copy_in_default_stream': '1', 'enable_cuda_graph': '0', 'cudnn_conv_use_max_workspace': '1', 'tunable_op_enable': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_tuning_enable': '0'}}
find model: /root/.insightface/models/buffalo_l/w600k_r50.onnx recognition ['None', 3, 112, 112] 127.5 127.5
set det-size: (640, 640)
[ROOP.CORE] Processing video /content/roop-unleashed/temp/7d5ae92c96a09c7a03a537eb421e92c32220c31c/test.mkv
[ROOP.CORE] Creating video with 22.0 FPS...
['ffmpeg', '-hide_banner', '-hwaccel', 'auto', '-y', '-loglevel', 'error', '-f', 'rawvideo', '-vcodec', 'rawvideo', '-s', '854x480', '-pix_fmt', 'bgr24', '-r', '22.0', '-an', '-i', '-', '-vcodec', 'libx264', '-crf', '14', '-vf', 'colorspace=bt709:iall=bt601-6-625:fast=1', '-pix_fmt', 'yuv420p', '/content/roop-unleashed/output/test_fake.mp4']
Processing: 0% 0/3875 [00:00<?, ?frame/s]
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'device_id': '0', 'gpu_mem_limit': '18446744073709551615', 'gpu_external_alloc': '0', 'gpu_external_free': '0', 'gpu_external_empty_cache': '0', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'cudnn_conv1d_pad_to_nc1d': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'do_copy_in_default_stream': '1', 'enable_cuda_graph': '0', 'cudnn_conv_use_max_workspace': '1', 'tunable_op_enable': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_tuning_enable': '0'}}
inswapper-shape: [1, 3, 128, 128]
Processing: 11% 424/3875 [00:59<05:25, 10.61frame/s, memory_usage=06.44GB, execution_threads=12]
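
One detail in that log: `gpu_mem_limit` is '18446744073709551615', which is 2^64 - 1, so ONNX Runtime is not capping its CUDA memory arena at all. Nothing in this thread says whether roop exposes this as a setting, so the following is only a sketch of how a cap could be expressed in ONNX Runtime's provider-options format. The 6 GB figure and the `capped_cuda_providers` helper are made up for illustration.

```python
UNLIMITED = 2**64 - 1  # the value printed as gpu_mem_limit in the log


def capped_cuda_providers(limit_gb: float, device_id: int = 0) -> list:
    """Build a providers list with a VRAM cap for the CUDA execution provider.

    Hypothetical helper: the (name, options) tuple format and the option
    names come from ONNX Runtime; the helper itself is not part of roop.
    """
    cuda_options = {
        "device_id": device_id,
        "gpu_mem_limit": int(limit_gb * 1024**3),  # limit is given in bytes
        "arena_extend_strategy": "kSameAsRequested",  # grow only as needed
    }
    return [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]


providers = capped_cuda_providers(6)
print(UNLIMITED == 18446744073709551615, providers[0][1]["gpu_mem_limit"])
# Would be passed as: onnxruntime.InferenceSession(model_path, providers=providers)
```

Note that this cap only covers the provider's memory arena, so total VRAM use can still end up higher than the limit.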

AlonDan commented 1 year ago

Thanks for the advice @eniora. I just tried a few tests based on your suggestion and I don't see much difference. Actually, when I use 1 thread and 1 buffer it's slower than when I use 4 threads: with 4 I could at least get almost 3.x fps, now I get 1.x.

I also did some tests with different numbers such as: Buffer: 1-2 and same with Thread: 1, 2 and 3 (4 was the default)

I don't think it likes old hardware such as my GPU, while I get something like 8-9 fps in the good old Roop Unleashed GUI... I know, I'm losing the cool new features, but it's so much faster. 1.x - 3.x fps is very slow.

I hope it will be improved in the future, but I'm not counting on it because my hardware is ancient tech.

Anyway, thanks for trying to help I appreciate it ❤

cmd_2023-08-19_19-44-25

eniora commented 1 year ago

@AlonDan You're welcome! Yah I think your GPU doesn't work well with Gradio for whatever reason. Do you remember your VRAM usage while it was swapping using 1 thread? I wonder if it was more than 4GB and if it was using the shared memory at this point.🤔
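
If it helps, VRAM usage can also be logged from a second terminal while roop is swapping. A minimal sketch, assuming an NVIDIA card with `nvidia-smi` on the PATH; the `parse_memory` and `vram_usage` helpers and the sample line are just for illustration.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(line: str) -> tuple[int, int]:
    """Parse one 'used, total' line (values in MiB) from nvidia-smi."""
    used, total = (int(part) for part in line.split(","))
    return used, total


def vram_usage(gpu_index: int = 0) -> tuple[int, int]:
    """Return (used_mib, total_mib) for the given GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory(out.stdout.strip().splitlines()[gpu_index])


# Illustrative sample line, not a real measurement:
used, total = parse_memory("3890, 4096")
print(f"{used / total:.0%} of dedicated VRAM in use")
```

Calling `vram_usage()` every few seconds during a swap shows whether dedicated memory is sitting right at the card's limit, which is the point where spilling into shared memory becomes likely.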

AlonDan commented 1 year ago

VRAM usage was around 3.4 - 3.9 GB, sometimes 4.

I'll keep my eyes on Roop Unleashed Gradio, the community is sure awesome. I wish the GUI version could also continue but oh well... Gradio is more efficient probably