Detect wrong device_num for multiple cards (0 for all device) on OSX

mancausoft commented 8 years ago

I have a problem that seems to lie with the BOINC (7.6.22 and 7.6.33) hardware detector on OSX (or with the OpenCL library). I have two cards and one platform: Platform 0 Card 0 is the Intel card, and Platform 0 Card 1 is the Nvidia card.

When I start BOINC it says:

Mer 7 Set 02:11:47 2016 | | OpenCL: NVIDIA GPU 0: GeForce GT 750M (driver version 10.10.13 310.42.25f01, device version OpenCL 1.2, 2048MB, 2048MB available, 178 GFLOPS peak)
Mer 7 Set 02:11:47 2016 | | OpenCL: Intel GPU 0: Iris Pro (driver version 1.2(Aug 29 2016 22:20:39), device version OpenCL 1.2, 1536MB, 1536MB available, 384 GFLOPS peak)

And in the coproc_info.xml file I have device_num 0 for both cards (I pasted only the important lines of this file):

<nvidia_opencl>
<name>GeForce GT 750M</name>
<vendor>NVIDIA</vendor>
<device_num>0</device_num>
</nvidia_opencl>

<intel_gpu_opencl>
<name>Iris Pro</name>
<vendor>Intel</vendor>
<device_num>0</device_num>
</intel_gpu_opencl>

Both cards are detected as device 0, and when BOINC starts a work unit for the NVIDIA card it passes the wrong device id (0 instead of 1) and the work fails.

I can see the parameters passed by using ps:

boinc_project 42690 0.3 0.0 2447260 3588 ?? SN 1:01AM 0:00.02 milkyway_1.37_x86_64-apple-darwin__opencl_nvidia_101 -f -np 20 -p 0.507773611546175 5.314882185745 -1.78457837505283 192.629590982596 39.017990164971 1.87224701958271 3.15095839718822 4.18477040237846 -0.533261040271202 200.315243139419 22.88 1.901 2.99 24.2836812450552 -0.4582270374547 194.265010250273 11.7345663011072 2.75052293940004 0.026636379053808 6.0638655173807 --device 0

I tried starting apple-darwin__opencl_nvidia_101 manually, passing --device 1, and it works.

ChristianBeer commented 7 years ago

Is this still present or did an OS upgrade somehow fix that?

mancausoft commented 7 years ago

It is still present (macOS Sierra 10.12.4):

Mer 12 Apr 12:15:52 2017 | | CUDA: NVIDIA GPU 0: GeForce GT 750M (driver version 8.0.71, CUDA version 8.0, compute capability 3.0, 2048MB, 824MB available, 711 GFLOPS peak)
Mer 12 Apr 12:15:52 2017 | | OpenCL: NVIDIA GPU 0: GeForce GT 750M (driver version 10.16.34 355.10.05.35f05, device version OpenCL 1.2, 2048MB, 824MB available, 711 GFLOPS peak)
Mer 12 Apr 12:15:52 2017 | | OpenCL: Intel GPU 0: Iris Pro (driver version 1.2(Mar 16 2017 22:07:31), device version OpenCL 1.2, 1536MB, 1536MB available, 384 GFLOPS peak)
Mer 12 Apr 12:15:52 2017 | | OpenCL CPU: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz (OpenCL driver vendor: Apple, driver version 1.1, device version OpenCL 1.2)

Ageless93 commented 7 years ago

This is correct. You have one Nvidia GPU and one Intel GPU. Both are device zero in their brand groups. If you have multiple Nvidia GPUs, you'd have device 0, device 1, device 2 etc. Same for AMD GPUs. Thus far Intel GPUs are only device zero because there aren't any CPUs yet with more than one GPU on board.

So, a hypothetical system with three Nvidia GPUs, two AMD GPUs and one Intel GPU will have: Nvidia device 0, device 1, device 2; AMD device 0, device 1 and Intel device 0.
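
To illustrate the per-vendor numbering described above, here is a minimal sketch in C. The `struct gpu` record and the `number_per_vendor` helper are hypothetical names made up for this example; they are not BOINC's actual data structures or detection code.

```c
#include <string.h>

/* Hypothetical device record for illustration; BOINC's real structures differ. */
struct gpu {
    const char *vendor;
    int device_num;   /* filled in by number_per_vendor() */
};

/* Give each GPU a number that counts only the earlier GPUs of the same vendor,
   so the first GPU of every vendor gets 0. */
static void number_per_vendor(struct gpu *gpus, int n)
{
    for (int i = 0; i < n; i++) {
        int count = 0;
        for (int j = 0; j < i; j++) {
            if (strcmp(gpus[j].vendor, gpus[i].vendor) == 0)
                count++;
        }
        gpus[i].device_num = count;
    }
}
```

Applied to a list holding one Intel GPU followed by one NVIDIA GPU, both entries end up with number 0, which matches the two `<device_num>0</device_num>` values reported in this thread.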

-- Jord van der Elst.

mancausoft commented 7 years ago

@Ageless93 and where should I find the number to pass as the device? As you can see in the original report, the problem arises when BOINC starts an OpenCL Intel job: here you need to pass device 1 to use the Intel card. Unfortunately I don't have any Intel jobs on my BOINC for now, so I can't check whether it fails or not. Let me know a way to check if the bug is still present. I have the same content inside the file coproc_info.xml:

<nvidia_opencl>
<name>GeForce GT 750M</name>
<vendor>NVIDIA</vendor>
<device_num>0</device_num>
</nvidia_opencl>

<intel_gpu_opencl>
<name>Iris Pro</name>
<vendor>Intel</vendor>
<device_num>0</device_num>
</intel_gpu_opencl>

ChristianBeer commented 7 years ago

The device_num that is given by BOINC is unique on the host. If there are two different GPUs they need different device_nums. See log output from multi GPU systems: e.g. https://setiathome.berkeley.edu/forum_thread.php?id=80130#1809180

CharlieFenton commented 7 years ago

It is not clear what the poster meant when she wrote that the application fails. @mancausoft, What do you see that tells you it fails?

The --device argument is used only for very old BOINC clients (earlier than version 6.13.3.) Unless the Milkyway project application is very old, it should get the information on which GPU to use from the init_data.xml file provided by the BOINC client. Details can be found at http://boinc.berkeley.edu/trac/wiki/OpenclApps.

Because BOINC support for OpenCL has evolved over time, the method for determining the correct GPU to use is fairly complicated. Modern versions of the BOINC client use a value called gpu_opencl_dev_index, which is unique for each GPU.

@ChristianBeer, please see api/boinc_opencl.cpp for the full details.

ChristianBeer commented 7 years ago

To me it seems the problem is that two instances of app --device 0 are started and both try to use the same device, which is not possible. This problem seems to have been present since the advent of OpenCL and the use of OpenCL-capable GPUs from different vendors. So if init_data.xml via the API indeed solves the problem by ignoring the argument, the problem should go away when MilkyWay builds an app that uses the new API. All apps are from late 2016, but that doesn't tell us what BOINC version was used to build the science apps.

@mancausoft You should maybe test with one of those newer apps and ask MilkyWay directly what version of the API they used to build their science apps.

mancausoft commented 7 years ago

The problem was related to the wrong card being selected. The binary was for NVIDIA hardware only, and with --device 0 (according to the debug output) OpenCL used the Intel card and the program exited with an error. When I manually passed the --device 1 option, it chose the NVIDIA card and the program produced some results.

Before opening this bug I spent some time debugging this problem, but it has been a long time now and I do not remember the details.

Btw, I just received a job (from Einstein@home) for NVIDIA OpenCL. It starts with the parameter --device 0, but this job selects the correct GPU:

./stderr.txt:Using OpenCL device "GeForce GT 750M" by: NVIDIA

So I think it's a problem in the MilkyWay jobs and how they use the device parameter.

The code used inside MilkyWay to choose the OpenCL device (clr->devNum is the --device parameter):

    ci->plat = ids[platformChoice];
    devs = mwGetAllDevices(ci->plat, &nDev);
    if (!devs)
    {
        free(ids);
        return MW_CL_ERROR;
    }

    err = mwSelectDevice(ci, devs, clr, nDev);
    free(ids);
    free(devs);
    if (err != CL_SUCCESS)
    {
        mwPerrorCL(err, "Failed to select a device");
        return err;
    }

cl_int mwSelectDevice(CLInfo* ci, const cl_device_id* devs, const CLRequest* clr, const cl_uint nDev)
{
    cl_int err = CL_SUCCESS;

    if (clr->devNum >= nDev)
    {
        mw_printf("Requested device is out of range of number found devices\n");
        return MW_CL_ERROR;
    }

    ci->dev = devs[clr->devNum];
    err = mwGetDeviceType(ci->dev, &ci->devType);
    if (err != CL_SUCCESS)
        mw_printf("Failed to find type of device %u\n", clr->devNum);

    return err;
}
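
The code above indexes the platform's flat device list directly with the --device value, so on a platform that lists the Intel card first, --device 0 lands on the Intel card. One possible way around that, sketched here under assumptions, is to count only the devices of the wanted vendor. `struct fake_device` and `select_by_vendor` are hypothetical names invented for this example and are not part of MilkyWay or BOINC; a real app would read the vendor string with clGetDeviceInfo and CL_DEVICE_VENDOR.

```c
#include <string.h>

/* Hypothetical stand-in for an OpenCL device. Real code would query
   CL_DEVICE_VENDOR with clGetDeviceInfo instead of storing strings. */
struct fake_device {
    const char *vendor;
    const char *name;
};

/* Return the flat index of the dev_num-th device whose vendor matches,
   or -1 if there is no such device. */
static int select_by_vendor(const struct fake_device *devs, int n_dev,
                            const char *vendor, int dev_num)
{
    int seen = 0;   /* matching devices passed so far */

    if (dev_num < 0)
        return -1;

    for (int i = 0; i < n_dev; i++) {
        if (strcmp(devs[i].vendor, vendor) != 0)
            continue;          /* skip devices from other vendors */
        if (seen == dev_num)
            return i;          /* flat index to use with the device array */
        seen++;
    }
    return -1;
}
```

With a list of { Intel "Iris Pro", NVIDIA "GeForce GT 750M" }, asking for NVIDIA device 0 yields flat index 1, the card that only worked here when --device 1 was passed by hand.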

Ageless93 commented 7 years ago

Is this still an issue?

CharlieFenton commented 7 years ago

This was an issue with MilkyWay, not with BOINC.

From what I can see reading the thread, the problem was that MilkyWay's app was using obsolete methodology for selecting the GPU. As I wrote in this thread, the issue was corrected a long time before that:

Modern versions of the BOINC client use a value called gpu_opencl_dev_index, which is unique for each GPU.

Apparently MilkyWay had not updated their app as described 4 years ago at http://boinc.berkeley.edu/trac/wiki/OpenclApps

BOINC / boinc

Detect wrong device_num for multiple cards (0 for all device) on OSX #1641