TheBlokeAI / dockerLLM

TheBloke's Dockerfiles
MIT License
299 stars, 59 forks

Unexpected error from cudaGetDeviceCount #19

Closed: noisefloordev closed this 7 months ago

noisefloordev commented 8 months ago

When I start 3x 3090 cloud instances, I keep getting this error:

RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW

This seems to happen almost all of the time now (at least on "cloud" instances), so getting a working GPU instance is a nightmare. All I've found from searching is "I rebooted and the problem went away", which isn't much help.

Just thought I'd see if anyone knew of a workaround. It's frustrating, since 3x 3090 on their cloud instances seems to be the most economical way to get Goliath running on RunPod (about $0.75/hr). I keep having to switch to an A100 instance, which costs twice as much...
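For anyone hitting the same thing, here is a minimal sketch of a startup check that detects the failure before any model download begins, so a bad instance can be stopped early. It is not a fix for the error itself. It assumes PyTorch is installed in the container, and `probe_cuda` and `extract_cuda_error_code` are hypothetical helper names, not part of dockerLLM.

```python
import re
import subprocess
import sys

# One-liner run in a fresh interpreter, so a failed CUDA initialisation
# cannot leave the calling process in a bad state.
# Assumption: PyTorch is importable by the chosen interpreter.
_PROBE = "import torch; torch.cuda.init(); print(torch.cuda.device_count())"


def extract_cuda_error_code(message):
    """Return the number after 'Error ' in a CUDA runtime message, or None."""
    match = re.search(r"Error (\d+):", message)
    return int(match.group(1)) if match else None


def probe_cuda(python=sys.executable, timeout=120):
    """Try to initialise CUDA in a subprocess.

    Returns (device_count, None) on success, or (None, error_code) on
    failure, where error_code is None if no 'Error N:' text was found.
    """
    result = subprocess.run(
        [python, "-c", _PROBE],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    lines = result.stdout.strip().splitlines()
    if result.returncode == 0 and lines:
        # The device count is the last line printed by the probe.
        return int(lines[-1]), None
    return None, extract_cuda_error_code(result.stderr)
```

Calling `probe_cuda()` once at container start and checking whether the second element is 804 would flag this specific failure. What to do next (terminate the pod, retry on another host) is left to the caller.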