C0untFloyd / roop-unleashed

Evolved Fork of roop with Web Server and lots of additions
GNU Affero General Public License v3.0

Won't get the GPU to get utilized on MacBook with M3 Max and 128 GB RAM. #946

Open Gabbelgu opened 3 weeks ago

Gabbelgu commented 3 weeks ago

Describe the bug
I can't get the GPU to be utilized on my MacBook. Other apps, such as local LLMs, can use up to 70 GB of RAM for the graphics processor.

To Reproduce
Steps to reproduce the behavior: I've enabled CoreML with Max. Number of Threads = 18, GFPGAN and the other processors. The same problem occurs with Max. Number of Threads = 3 and with Max. Number of Threads = 8.

My configuration is:

MacBook Pro 16" 2023, M3 Max, 128 GB RAM, Python 3.11

The rate is quite low, around 1 to 2 s per frame, and it often hangs, making no progress for 3 to 5 s, then recovering to 1 to 2 s per frame.

Details
What OS are you using?

Are you using a GPU?

Which version of roop unleashed are you using? 4.3.1


BrZHub commented 2 weeks ago

I had the same issue on a MacBook Air M2 24 GB; the frame rate was about 2 s per frame. I upgraded onnxruntime to 1.19.2 and now it does about 20 frames per second.

Just remove these two lines in requirements.txt:

onnxruntime==1.17.1; sys_platform == 'darwin' and platform_machine != 'arm64'
onnxruntime-silicon==1.16.3; sys_platform == 'darwin' and platform_machine == 'arm64'

And add this one:

onnxruntime==1.19.2; sys_platform == 'darwin'

And performance should be a lot better
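One way to sanity-check the upgrade (a sketch, not from the thread): after reinstalling, import onnxruntime and list the execution providers the installed wheel exposes. On Apple Silicon, a recent unified onnxruntime wheel is expected to list CoreMLExecutionProvider alongside CPUExecutionProvider.

```python
# Sketch: report which onnxruntime build is active and which execution
# providers it exposes. Guarded so it also runs where onnxruntime is absent.
try:
    import onnxruntime as ort
    version = ort.__version__
    providers = ort.get_available_providers()
except ImportError:
    version, providers = None, []  # onnxruntime not installed here

print(version, providers)
```

If 'CoreMLExecutionProvider' does not appear in that list, the installed wheel cannot use CoreML no matter what the settings page says.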

Gabbelgu commented 2 weeks ago

Thank you. I tried removing the two lines and adding the one line in requirements.txt, but it is not working for me.

[Screenshots attached: Bildschirmfoto 2024-10-10 um 23 50 58, Bildschirmfoto 2024-10-10 um 23 42 12]
codecowboy commented 23 hours ago

@BrZHub
> I upgraded the onnxruntime to 1.19.2 and now it does about 20 frames per second

Can you explain how you upgraded the runtime? python -m pip install onnxruntime==1.19.2? I'm on an M1 Pro with 16 GB which is also doing about 2 frames per second. It also seems like platform_machine == 'arm64' would be fairly important?

codecowboy commented 6 hours ago

@C0untFloyd Any chance you could provide some guidance here? I'm happy to do some testing and add to the wiki; I've got lots of time on my hands.

BrZHub commented 5 hours ago

> @BrZHub
> > I upgraded the onnxruntime to 1.19.2 and now it does about 20 frames per second
>
> Can you explain how you upgraded the runtime? python -m pip install onnxruntime==1.19.2? I'm on an M1 Pro with 16 GB which is also doing about 2 frames per second. It also seems like platform_machine == 'arm64' would be fairly important?

My requirements.txt file looks like this:

--extra-index-url https://download.pytorch.org/whl/cu118

numpy==1.26.4
gradio==4.44.0
fastapi<0.113.0
opencv-python-headless==4.9.0.80
onnx==1.17.0
insightface==0.7.3
albucore==0.0.16
psutil==5.9.6
torch==2.1.2+cu118; sys_platform != 'darwin'
torch==2.1.2; sys_platform == 'darwin'
torchvision==0.16.2+cu118; sys_platform != 'darwin'
torchvision==0.16.2; sys_platform == 'darwin'
onnxruntime==1.19.2; sys_platform == 'darwin'
onnxruntime-gpu==1.17.1; sys_platform != 'darwin'
tqdm==4.66.4
ftfy
regex
pyvirtualcam

I changed onnx and onnxruntime. runMacOS.sh installs the dependencies listed in this file on startup, so it probably overrides anything you install manually with "pip install".
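For anyone unsure why the "; sys_platform == 'darwin'" suffixes matter: those are PEP 508 environment markers, and pip evaluates them against values you can inspect from the standard library. A quick stdlib-only check:

```python
import sys
import platform

# These are the values pip compares against the 'sys_platform' and
# 'platform_machine' markers in requirements.txt. On an M-series Mac they
# are 'darwin' / 'arm64', so a requirement line ending in
# "; sys_platform == 'darwin'" is the one that gets installed there.
print("sys_platform:", sys.platform)
print("platform_machine:", platform.machine())
```

That explains why the single "onnxruntime==1.19.2; sys_platform == 'darwin'" line covers both Intel and Apple Silicon Macs.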

On the settings page I set the provider to "coreml"

[Screenshot of the settings page with Provider set to "coreml"]

If I run this test clip and swap all faces without adding any additional filters, it averages 11.5 FPS:

Processing clip.trim_12-39-03.mp4 took 55.71 secs, 11.52 frames/s

https://github.com/user-attachments/assets/9c7412ba-9ea3-44bb-b40d-77962e9e7005

After looking at this further and checking CPU/GPU usage, I'm not actually sure it's using CoreML, and there is no chart to see whether it is using the NPU. But upgrading the ONNX libraries did increase performance by 5x on my machine (15" MacBook Air M2), so there might be more gains to make.

codecowboy commented 5 hours ago

Many thanks. What do you have your number of execution threads set to in settings? I'm not sure whether that refers to the CPU or the GPU. I've now tried editing requirements.txt as per yours but don't see a performance increase.

I also wondered if we could make use of https://pypi.org/project/onnxruntime-coreml/ somehow.

See also https://onnxruntime.ai/docs/execution-providers/CoreML-ExecutionProvider.html

My Python is pretty rusty, but I'm happy to collaborate with someone on this.

codecowboy commented 58 minutes ago

Have done a bit of digging, and the following appears in a number of the files which load the models:


# replace Mac mps with cpu for the moment
self.devicename = self.plugin_options["devicename"].replace('mps', 'cpu')

My guess is that no use is being made of the GPU, or at least of the Metal layer. I don't have a deep enough understanding of how CoreML works to know how it all fits together.
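A quick way to test that guess (a sketch, guarded in case torch isn't installed): PyTorch exposes the Metal GPU through its 'mps' backend, so if the check below reports True while the code above rewrites 'mps' to 'cpu', any torch work is being forced onto the CPU even though the GPU is available.

```python
# Sketch: check whether the PyTorch MPS (Metal) backend is available.
# If this is True but the device name is rewritten to 'cpu', torch never
# touches the GPU.
try:
    import torch
    mps_available = torch.backends.mps.is_available()
except (ImportError, AttributeError):
    mps_available = False  # torch absent or too old to have the mps backend

print("MPS available:", mps_available)
```

Note this only covers the torch side; the ONNX Runtime sessions choose their device via execution providers, not via this device name, so CoreML usage has to be checked separately.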