Fix duplicate GPU problem

davidpanderson commented 5 years ago

The GPU detection logic sometimes decides that 1 GPU is actually 2. Apparently this can happen because of buggy drivers, or because there are multiple OpenCL "platforms" (e.g. POCL is present).

Note: gpu_opencl.cpp contains the comment //TODO: Must we check if multiple platforms found the same GPU and merge the records?
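
One way to picture the merge that the TODO asks about is the minimal sketch below. It is not BOINC's code: `DeviceRecord`, `MergedDevice` and `merge_duplicates` are hypothetical names, and it assumes each record carries a PCI bus/slot location, which the OpenCL device record is not guaranteed to provide. Records without a location are left unmerged, because two identical cards would otherwise be indistinguishable from one card reported twice.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record of one device as reported by one OpenCL platform.
struct DeviceRecord {
    std::string platform;      // name of the platform that reported the device
    std::string name;          // device name string
    std::uint64_t global_mem;  // global memory size in bytes
    int pci_bus;               // assumed hardware location; -1 if unknown
    int pci_slot;
};

// One physical device plus every platform that reported it.
struct MergedDevice {
    DeviceRecord first;                  // record from the first platform seen
    std::vector<std::string> platforms;  // all platforms that exposed the device
};

// Merge records that describe the same physical device.
std::vector<MergedDevice> merge_duplicates(const std::vector<DeviceRecord>& records) {
    using Key = std::tuple<std::string, std::uint64_t, int, int>;
    std::map<Key, std::size_t> index;  // key -> position in 'merged'
    std::vector<MergedDevice> merged;
    for (const auto& r : records) {
        if (r.pci_bus < 0) {
            // No location available: keep the record separate rather than guess.
            merged.push_back(MergedDevice{r, {r.platform}});
            continue;
        }
        Key key{r.name, r.global_mem, r.pci_bus, r.pci_slot};
        auto it = index.find(key);
        if (it == index.end()) {
            index.emplace(key, merged.size());
            merged.push_back(MergedDevice{r, {r.platform}});
        } else {
            // Same device seen through another platform: remember the platform only.
            merged[it->second].platforms.push_back(r.platform);
        }
    }
    return merged;
}
```

Keeping the list of platforms on the merged record, rather than discarding the duplicates, leaves room for a later decision about which platform to hand to an application.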

bema-aei commented 5 years ago

As far as I understand, a "buggy driver" (as reported on boinc_alpha mailing list) also creates a new OpenCL platform with the same device(s), so both cases are actually the same issue.

Note that (so far) this problem is limited to OpenCL on NVidia.

Einstein@Home asked The Khronos Group to add a property to the device record that would allow a device to be uniquely identified regardless of the interface (OpenCL platforms or even CUDA). I remember that this request was originally ignored or declined, but I don't know what the current status is. @brevilo might know more.

Finally, merging the device records from different platforms would help with the device scheduling on the client side, but wouldn't solve the problem that a possibly unsuitable platform is passed to the application. This is complicated by the fact that different applications may have different requirements regarding the platform. IMHO we either need

a way for the app to tell the client which platform to use (or not to use) or
to pass the app a list of platforms and a list of devices (one for each platform), so that the app can pick a platform according to its own criteria and then knows which device to use.
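
The second option could look roughly like the sketch below, written from the application's side. Everything here is hypothetical: `PlatformOffer`, `Choice` and `choose_platform` are illustrative names, not an existing BOINC API, and the way the client would actually hand this list to the app is left open.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical entry passed by the client: one platform and the index,
// within that platform, of the device assigned to this task.
struct PlatformOffer {
    std::string platform_name;
    std::string platform_version;
    int device_index;
};

// What the app ends up using.
struct Choice {
    std::string platform_name;
    int device_index;
};

// The app applies its own criteria and takes the first acceptable platform.
std::optional<Choice> choose_platform(
        const std::vector<PlatformOffer>& offers,
        const std::function<bool(const PlatformOffer&)>& acceptable) {
    for (const auto& offer : offers) {
        if (acceptable(offer)) {
            return Choice{offer.platform_name, offer.device_index};
        }
    }
    return std::nullopt;  // no platform meets the app's requirements
}
```

An app that cannot run on a particular platform would pass a predicate that rejects it; an empty result would let the app report the mismatch instead of running on an unsuitable platform.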

RichardHaselgrove commented 5 years ago

@bema-aei: I'm sorry, but I don't think that's a complete analysis of the problems we're seeing. Over the first 10 years of GPU computing, we've become accustomed to referring to "the driver" as a single entity, most commonly provided by the hardware manufacturer. But in reality, the downloaded driver file is a multi-component delivery system, and as with all software installation packages, it is responsible for both installing new components, and uninstalling old components.

OpenCL is just one of the installed/uninstalled packages, and it probably wasn't originally created by the hardware manufacturer: OpenCL is supposed to be a cross-platform language, after all.

In recent years, there has been a move away from hardware manufacturers supplying driver packages direct to end users. I ran a controlled experiment on a Windows 10 machine some time ago, with the result that:

With an NVidia GPU installed, Microsoft supplied a driver with CUDA included, but without OpenCL
With an Intel iGPU active, Microsoft supplied an Intel driver with OpenCL capability
Reverting to the NVidia GPU, OpenCL computing was possible using the Intel OpenCL stack supplied by Microsoft

all without any direct intervention by either hardware provider.

So, in these cases, I don't think that 'buggy driver' is quite the right description: I'd perhaps call it 'poor shared component management', and we ought to be able to detect and mitigate that.

There's a helpful and informative copy of BOINC's own coproc detection output online at http://stateson.net/images/coproc_info_10_nfg.xml: this comes from a machine with 5 AMD devices (to complete the hardware set). BOINC has detected them as opencl_device_index 0 thru 4, and 0 thru 4 again. But BOINC has identified them as device_num 0 thru 9. Context and discussion at SETI message 1989298: a similar analysis was performed by @JuhaSointusalo in BOINC message 90061. In the second case (but not the first), there is evidence that two different opencl_driver_versions were installed.
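
The numbering pattern in that file can be reproduced with a toy model. This is not BOINC's detection code; `Detected` and `enumerate_naively` are hypothetical names, and the model simply assumes that every platform is walked in turn and each reported device receives the next free device_num.

```cpp
#include <string>
#include <vector>

// Hypothetical result row: which platform reported the device, its index
// within that platform, and the client-wide number it was given.
struct Detected {
    std::string platform;
    int opencl_device_index;
    int device_num;
};

// Toy model: per-platform indices restart at 0, device_num never does.
std::vector<Detected> enumerate_naively(const std::vector<std::string>& platforms,
                                        int devices_per_platform) {
    std::vector<Detected> out;
    int next_device_num = 0;
    for (const auto& platform : platforms) {
        for (int i = 0; i < devices_per_platform; ++i) {
            out.push_back(Detected{platform, i, next_device_num++});
        }
    }
    return out;
}
```

With two platforms that each expose the same five cards, this model yields opencl_device_index 0 to 4 twice and device_num 0 to 9, which matches the shape of the output described above.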

I think the issues reported by Jacob Klein on the alpha mailing list are more properly described as 'buggy' (whether drivers or deployments, we wait to see), associated with the Windows insider builds he was testing.

smoe commented 5 years ago

When adding an AMD RX 580 to an Nvidia GTX 1660 under Windows 10, the Nvidia OpenCL is disabled. Instead, the ATI OpenCL workunits are retrieved from the project (Einstein for me), while the Nvidia OpenCL platform is no longer found. It is possible to run NVidia CUDA and ATI OpenCL jobs in parallel, though, as with SETI. I admit I don't know whether this is the expected behaviour. Ping me if you want me to run anything on that machine.

Another observation of mine, under Ubuntu 18.04 on a former GPU mining rig, is that access via the single PCI3x16 port and the many PCI2x1 ports is apparently distinguished. Having only one card in the PCI3 port works fine. Adding one card to PCI2 still shows only a single card in BOINC, even though there are two in the system. Adding two cards to PCI2 shows two cards, even though there are now three in total. Anyone else with similar observations?

EDIT: I have revisited that machine a bit more systematically and identified a non-functional USB riser. That explains the "missing card" phenotype.

RichardHaselgrove commented 5 years ago

Sticking to the Windows 10 machine for the time being, and depending how deep you're prepared to dig, I'd be interested in seeing:

The startup lines from BOINC's Event Log, showing the outcomes of GPU detection
The coproc_info.xml file from BOINC's data directory
The output from Oblomov's CLinfo (download from https://github.com/Oblomov/clinfo, at bottom: run at command prompt)

bema-aei commented 5 years ago

@RichardHaselgrove Actually I didn't mean my analysis to be complete.

To narrow down this problem: I would guess that all the "duplicate device" problems arise from the OpenCL device handling. Is there any case of "double device" which involves only CUDA or ATI (CAL, Stream?) Apps?

Fix duplicate GPU problem #3200