davefojtik / RunPod-Fooocus-API

RunPod serverless worker for Fooocus-API. Standalone or with network volume
GNU General Public License v3.0

Random CUDA errors #36

Open mingekko opened 2 weeks ago

mingekko commented 2 weeks ago

Hello!

About once every 2 weeks, the following error appears for a few hours and then it fixes itself:

CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

I can't figure out what this phenomenon is, especially since the error disappears after a few hours and then reappears after 1-2 weeks. Can anyone help me find out what is causing it?
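For reference, PyTorch's standard advice for this particular message is to rerun with synchronous kernel launches so the stack trace points at the call that actually failed. A minimal sketch, assuming the worker's Python entrypoint can be edited (the variable could equally be set in the endpoint's environment settings):

```python
import os

# CUDA_LAUNCH_BLOCKING makes kernel launches synchronous, so the Python stack
# trace points at the failing call instead of a later, unrelated API call.
# It must be set before the first CUDA call, i.e. at the very top of the
# worker entrypoint. Expect slower inference; use it for debugging only.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the variable is set so it takes effect
```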

This is my configuration: [screenshot of the endpoint configuration]

davefojtik commented 2 weeks ago

That error message is often caused by an incompatible GPU driver on the machine and is usually solved by disabling CUDA malloc. But as far as I know, that is disabled in Fooocus by default.
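For reference, "CUDA malloc" here usually maps to PyTorch's cudaMallocAsync allocator backend, which can be forced one way or the other through PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of pinning the classic allocator, assuming the Fooocus launch scripts in this image don't already set or override the variable (that part is an assumption, not something confirmed here):

```python
import os

# "backend:native" selects PyTorch's classic caching allocator instead of
# cudaMallocAsync. The variable is read when the CUDA context is created,
# so it must be set before the first CUDA call.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "backend:native")

import torch

if torch.cuda.is_available():
    # Confirms which allocator actually ended up active on this worker.
    print("allocator backend:", torch.cuda.get_allocator_backend())
```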

The fact that it appears randomly and "fixes itself after a few hours" suggests a problem with specific workers that are spawned, for example, when your normally used GPUs get low in availability.

I would suggest writing down which GPUs your workers normally use (you can see the GPU model by hovering over the rectangles representing individual workers in your endpoint details), and then checking which GPUs are in use when you encounter the error. Alternatively, you could go through your list of secondary GPU selections and try to find the problematic one right away. But that could be time-consuming if you have many GPU models selected, since you need to change the endpoint settings, purge all the active workers to spawn new ones, and test them.
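If hover-checking individual workers gets tedious, a few lines of logging at worker start-up can record which card each run landed on, so later CUDA errors can be matched against GPU models. A rough sketch, assuming a Python entrypoint where torch is importable (the exact handler wiring in this repo isn't shown here):

```python
import logging
import torch

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("worker-gpu")

def log_gpu_info():
    """Log which GPU model(s) this worker landed on."""
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            log.info("GPU %d: %s", i, torch.cuda.get_device_name(i))
    else:
        log.warning("CUDA not available on this worker")

# Call once at worker start-up, before the request handler loop begins.
log_gpu_info()
```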

I basically use just 4090s at this point (which are also the most cost-effective ones for this task) and have never had such an error yet. So if you find a way to reproduce it frequently, or identify the GPU model that is causing this, definitely let us know.