FoldingAtHome / fah-issues

FAHClient's response to a missing OpenCL is inadequate #1246

Open bb30994 opened 6 years ago

bb30994 commented 6 years ago

The same problem exists in both Windows and Linux.

If the OpenCL driver is not present, FAHClient correctly identifies the problem but does nothing to prevent WUs from being repeatedly dumped because they cannot be processed.

For example:

00:45:57: GPUs: 2
00:45:57: GPU 0: Bus:3 Slot:0 Func:0 NVIDIA:4 GM107 [GeForce GTX 750 Ti] 1306
00:45:57: GPU 1: Bus:4 Slot:0 Func:0 NVIDIA:5 GM206 [GeForce GTX 960] 2308
00:45:57:CUDA Device 0: Platform:0 Device:0 Bus:4 Slot:0 Compute:5.2 Driver:9.0
00:45:57:CUDA Device 1: Platform:0 Device:1 Bus:3 Slot:0 Compute:5.0 Driver:9.0
00:45:57: OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.so':
00:45:57: libOpenCL.so: cannot open shared object file: No such file or directory

In this case, two CUDA devices were identified and zero OpenCL devices were identified. The client nevertheless downloaded WUs that can only run on OpenCL, repeatedly issued messages suggesting that I set the index manually, and then dumped the WUs.

02:04:05:WU01:FS01:Starting
02:04:05:ERROR:WU01:FS01:Failed to start core: OpenCL device matching slot 1 not found, try setting 'opencl-index' manually

(Restart attempted and message reissued 6 times.)

NOTE: Valid settings for opencl-index DO NOT EXIST because zero OpenCL devices have been identified.

This must be treated as a critical Error, disabling the download of future WUs that require OpenCL. In my case, it downloaded and dumped 7 perfectly good WUs before I stopped it.

Inasmuch as we do not currently differentiate between GPU projects that require OpenCL and those that do not, creating that kind of identification process would be a useful enhancement. In the meantime, the best temporary option is probably to fix the error message shown when there are zero choices for opencl-index and to pause the slot promptly until a better solution can be found.
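The interim behavior proposed above can be sketched as a small decision function. This is only an illustration of the requested policy, not FAHClient's actual logic: the names `SlotAction`, `Decision`, and `decide_gpu_slot`, and the message wording, are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class SlotAction(Enum):
    START = "start"
    PAUSE = "pause"


@dataclass
class Decision:
    action: SlotAction
    message: str


def decide_gpu_slot(opencl_device_count: int, wu_needs_opencl: bool) -> Decision:
    """Hypothetical policy: pause instead of retrying when OpenCL is absent."""
    if wu_needs_opencl and opencl_device_count == 0:
        # Do not suggest a manual index here: with zero devices,
        # no valid index value exists.
        return Decision(
            SlotAction.PAUSE,
            "No OpenCL devices detected. Install an OpenCL driver. "
            "Slot paused; no further OpenCL WUs will be requested.",
        )
    return Decision(SlotAction.START, "OK")


if __name__ == "__main__":
    d = decide_gpu_slot(opencl_device_count=0, wu_needs_opencl=True)
    print(d.action.value, "-", d.message)
```

The point of the sketch is that the zero-device case is handled before any WU is started, so nothing is downloaded only to be dumped.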

Here are the first two of 7 WUs which were dumped as FAILED before I stopped it. OpenCL missing.txt

bb30994 commented 6 years ago

Fundamentally, we need an enhancement to both FAHClient's hardware detection and the Assignment Server (AS) logic. If, as in the example above, I have CUDA devices and no OpenCL devices, or, conversely, I have OpenCL devices but no CUDA devices, the AS logic needs to be smart enough either to give me a GPU assignment that my hardware can process or to issue a message saying there are no assignments for this configuration and temporarily stop assigning WUs.
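A minimal sketch of the matching rule described above, assuming the client reports the set of GPU APIs it detected and each project declares the APIs it requires. Neither field exists in this form in the real Assignment Server as far as this thread shows; `pick_assignment` and the project names are made up for illustration.

```python
from typing import Dict, FrozenSet, Optional


def pick_assignment(
    client_apis: FrozenSet[str],
    projects: Dict[str, FrozenSet[str]],
) -> Optional[str]:
    """Return the first project whose required APIs the client has, else None.

    None means "no assignments for this configuration": the caller should
    report that and stop assigning WUs for a while, not send an unusable WU.
    """
    for name, required in projects.items():
        if required <= client_apis:  # subset test: every requirement is met
            return name
    return None


if __name__ == "__main__":
    # Hypothetical projects and their requirements.
    projects = {
        "project-A": frozenset({"opencl"}),
        "project-B": frozenset({"cuda"}),
    }
    # The reporter's machine: CUDA detected, OpenCL missing.
    print(pick_assignment(frozenset({"cuda"}), projects))
```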

informatorius commented 5 years ago

I don't know if it is related, but it looks like FAHClient tries to open libOpenCL.so to find OpenCL devices, and that file belongs to the developer SDK. FAHClient should instead use the OpenCL runtime to query devices. If an OpenCL driver is installed, then you have an OpenCL runtime. Currently, FAHClient also requires you to install the developer library libOpenCL.so, which should be unnecessary. For example, the clinfo command works without libOpenCL.so.

bb30994 commented 5 years ago

@informatorius In Windows 7, clinfo does not exist.

C:\Users\bruce>clinfo
'clinfo' is not recognized as an internal or external command, operable program or batch file.

Should the installation of FAH add something similar that can work on all platforms (Windows/Linux/MacOS) or are we unable to find something that will always work?
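One possible shape for such a cross-platform check is sketched below with Python's standard `ctypes` module. It is not how FAHClient detects OpenCL. The candidate library names are assumptions based on common loader names (`OpenCL.dll` on Windows, the OpenCL framework path on macOS, `libOpenCL.so.1` then `libOpenCL.so` on Linux), and `candidate_names` and `find_opencl` are hypothetical helpers.

```python
import ctypes
import sys
from typing import Callable, List, Optional


def candidate_names(platform: str = sys.platform) -> List[str]:
    """Assumed OpenCL loader names per platform (not taken from FAHClient)."""
    if platform.startswith("win"):
        return ["OpenCL.dll"]
    if platform == "darwin":
        return ["/System/Library/Frameworks/OpenCL.framework/OpenCL"]
    # Try the versioned runtime name first; the unversioned name is
    # typically provided only by development packages.
    return ["libOpenCL.so.1", "libOpenCL.so"]


def find_opencl(
    names: Optional[List[str]] = None,
    loader: Callable[[str], object] = ctypes.CDLL,
) -> Optional[str]:
    """Return the first library name that loads, or None if none do."""
    for name in names if names is not None else candidate_names():
        try:
            loader(name)
        except OSError:
            continue  # not present under this name; try the next one
        return name
    return None


if __name__ == "__main__":
    found = find_opencl()
    print("OpenCL loader:", found if found else "not found")
```

Finding the loader only shows that the library can be opened. It does not show that any OpenCL platform or device is available, which would need a further query through the OpenCL API.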

codeman101 commented 5 years ago

It's not just Windows; I have the same problem on Linux. I have opencl-headers installed, but I still get the error:

04:09:32:ERROR:WU01:FS01:Failed to start core: OpenCL device matching slot 1 not found, try setting 'opencl-index' manually.

ffissore commented 4 years ago

I've worked around the issue of "libOpenCL.so: cannot open shared object file: No such file or directory" by running:

sudo ln -s /usr/lib/x86_64-linux-gnu/libOpenCL.so.1 /usr/lib/x86_64-linux-gnu/libOpenCL.so
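Before creating a link like this, it may help to confirm that the versioned library is present and only the unversioned name is missing. The sketch below checks a single directory. The function name `diagnose` and its return strings are hypothetical, and the default directory is simply the one used in the command above, which will differ on other distributions.

```python
import os


def diagnose(lib_dir: str = "/usr/lib/x86_64-linux-gnu") -> str:
    """Classify the OpenCL loader state in one directory (illustrative only)."""
    versioned = os.path.join(lib_dir, "libOpenCL.so.1")
    unversioned = os.path.join(lib_dir, "libOpenCL.so")
    if os.path.exists(unversioned):
        return "ok"  # the name from the error message resolves
    if os.path.exists(versioned):
        return "missing-link"  # runtime present, unversioned name absent
    return "not-installed"  # no loader in this directory at all


if __name__ == "__main__":
    print(diagnose())
```

A result of "missing-link" matches the situation the symlink works around. As later comments in this thread show, the link does not help on every system.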

crmason2 commented 4 years ago

@ffissore I tried the symbolic link trick, but fahclient would segfault on startup until I removed the link. So that may only be an option on some systems.

informatorius commented 4 years ago

Did you try this?

sudo apt-get install ocl-icd-libopencl1
sudo apt-get install ocl-icd-opencl-dev

crmason2 commented 4 years ago

I'm not running a deb flavor, so I don't have access to those packages. I did try installing ocl-icd-devel, which added a libOpenCL.so link, but that again gave a segfault. I'm running an old video card whose drivers I had to fight to get working, so most likely my configuration is just busted.

But the real issue is the same as the original. I keep getting work units that I can't process because I don't have OpenCL.

shorttack commented 4 years ago

Label: installer enhancement

This is a well-known issue. The Linux FAH installer should install OpenCL. It's the one "apt-get install" that keeps FAH from being "just an app" on Linux (Mint at least).

On Windows, a download and install from the NVIDIA website works reliably for me.

ffissore commented 4 years ago

@shorttack FYI, on Ubuntu the package to install is ocl-icd-opencl-dev, as suggested by @informatorius. It created the same symlink that I originally created manually (see comment).

codeman101 commented 4 years ago

For Arch and Arch-based users, this issue is resolved thanks to the Arch wiki page on FAH. When I followed that page as a guide, all went well.