ArtVentureX / sd-webui-agent-scheduler


[urgent] Agent scheduler error - not working #202

Open Rakna123 opened 9 months ago

Rakna123 commented 9 months ago

Automatic1111 stops working when I use Agent Scheduler, and I get this error:

Status: failed
Error: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

With the extension enabled, the error occurs after a few minutes. With it disabled, I do not get the error.

This error is happening in production. I tried a fresh installation on a new server and still got the error. I am monitoring the log and restarting the server whenever the error occurs, after which it works for the next few minutes. I have been doing this for the past 48 hours without sleep. Please, someone tell me how to fix it.

@artventuredev @ArtVentureX

artventuredev commented 8 months ago

Apologies for the late reply; I've been quite busy with my full-time job recently.

This issue has come up before, but I have been unable to identify the root cause. Since I develop the extension on an M1 MacBook, I cannot reproduce it on my system. Someone managed to fix it by upgrading to CUDA 12, as mentioned here. Please try the method described there to see if it resolves the issue.

Rakna123 commented 8 months ago

After trying for many days, we were also unable to find the root cause. Initially I thought this error was specific to Agent Scheduler, so we tried to create our own scheduler. The error occurs in the Automatic1111 API and appears randomly, so we can't identify when or why it happens.

So the error may come from CUDA, because we get the same error with the ComfyUI API as well. We wrote a temporary fix that restarts the service when the error occurs. For the last 2 days there have been no errors (previously 100+ restarts per day). It is working correctly without us having changed anything.
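The restart workaround described above could look roughly like this. It is only a sketch, not the code we actually run: the systemd unit name `webui.service` and the function names are assumptions for illustration, and the marker string is taken from the error message at the top of this issue.

```python
import subprocess

# Substring taken from the reported error message.
CUDA_ERROR_MARKER = "CUDA error: unspecified launch failure"


def needs_restart(line):
    """True if a log line contains the CUDA failure marker."""
    return CUDA_ERROR_MARKER in line


def watch(lines, restart):
    """Call restart() once per log line containing the marker; return the count."""
    restarts = 0
    for line in lines:
        if needs_restart(line):
            restart()
            restarts += 1
    return restarts


def restart_service():
    # Hypothetical: assumes the web UI runs as a systemd unit named "webui.service".
    subprocess.run(["systemctl", "restart", "webui.service"], check=False)
```

`watch` accepts any iterable of lines, so it can be fed from `sys.stdin` with the service log piped in, passing `restart_service` as the callback. Keeping the restart action as a parameter makes it easy to swap in whatever restart mechanism the server actually uses.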

artventuredev commented 8 months ago

Thank you for the update. The extension primarily integrates several internal A1111 APIs and does not involve anything related to CUDA, model loading, or inference. Therefore, identifying the cause of the issue was challenging for me.

I'm glad that, for reasons yet unclear, the issue has been resolved on your end.