arrayfire / arrayfire

ArrayFire: a general purpose GPU library.
https://arrayfire.com
BSD 3-Clause "New" or "Revised" License

Errors when CUDA-MPS is active #1459

Open traktofon opened 8 years ago

traktofon commented 8 years ago

Hi, when the NVIDIA CUDA Multi Process Service (MPS) is active, I encounter two problems:
a) the opencl backend doesn't work
b) the unified backend cannot invoke the cuda backend

To reproduce, run "nvidia-cuda-mps-control -d" as root, then test with the "examples/helloworld":

~/af/build/helloworld> AF_PRINT_ERRORS=1 ./helloworld_opencl 
In function opencl::DeviceManager::DeviceManager()
In file src/backend/opencl/platform.cpp:329
OpenCL Error (-30): Invalid Value when calling clCreateContext

ArrayFire Exception (Internal error:998):
In function opencl::DeviceManager::DeviceManager()
In file src/backend/opencl/platform.cpp:329
OpenCL Error (-30): Invalid Value when calling clCreateContext

In function void af::setDevice(int)
In file src/api/cpp/device.cpp:91
terminate called after throwing an instance of 'af::exception'
  what():  ArrayFire Exception (Internal error:998):
In function opencl::DeviceManager::DeviceManager()
In file src/backend/opencl/platform.cpp:329
OpenCL Error (-30): Invalid Value when calling clCreateContext

In function void af::setDevice(int)
In file src/api/cpp/device.cpp:91
Aborted
~/af/build/helloworld> AF_PRINT_ERRORS=1 ./helloworld_unified 
In function cuda::DeviceManager::DeviceManager()
In file src/backend/cuda/platform.cpp:359
CUDA Error (2): out of memory

ArrayFire Exception (Device out of memory:101):
In function cuda::DeviceManager::DeviceManager()
In file src/backend/cuda/platform.cpp:359
CUDA Error (2): out of memory

In function void af::setDevice(int)
In file src/api/cpp/device.cpp:91
terminate called after throwing an instance of 'af::exception'
  what():  ArrayFire Exception (Device out of memory:101):
In function cuda::DeviceManager::DeviceManager()
In file src/backend/cuda/platform.cpp:359
CUDA Error (2): out of memory

In function void af::setDevice(int)
In file src/api/cpp/device.cpp:91
Aborted

If the cuda-mps service is not running, then all four backends work properly.

I also encountered problems with ArrayFire.jl, where even the cpu and cuda backends don't work if cuda-mps is running. Without cuda-mps, the backends work fine.

In theory, whether cuda-mps is running or not should be completely transparent to CUDA applications. On multi-user systems, and for MPI-parallelized programs, cuda-mps is beneficial, so it would be nice if ArrayFire could work properly with MPS.

Tested with ArrayFire-3.3.2, both binary distribution and compiled from source. CUDA version is 7.5. Nvidia driver is 352.63.

Regards, Frank

pavanky commented 8 years ago

@frank-otto I have been able to reproduce the OpenCL issue but not the unified backend one. Can you double-check whether unified still fails? Could someone else have been using the GPU at the same time?

pavanky commented 8 years ago

Verified the following stand alone code fails with nvidia-cuda-mps-control -d running

#define __CL_ENABLE_EXCEPTIONS
#include "cl.hpp"
#include <vector>
#include <iostream>
#include <iterator>
#include <algorithm>

using namespace cl;
using namespace std;

int main(int argc, char* argv[]) {
    Context(CL_DEVICE_TYPE_GPU);
    static const unsigned elements = 1000;
    vector<float> data(elements, 5);
    Buffer a(begin(data), end(data), true, false);
    Buffer b(begin(data), end(data), true, false);
    Buffer c(CL_MEM_READ_WRITE, elements * sizeof(float));

    Program addProg(R"d(
        kernel
        void add(   global const float * restrict const a,
                    global const float * restrict const b,
                    global       float * restrict const c) {
            unsigned idx = get_global_id(0);
            c[idx] = a[idx] + b[idx];
        }
    )d", true);

    auto add = make_kernel<Buffer, Buffer, Buffer>(addProg, "add");
    add(EnqueueArgs(elements), a, b, c);

    vector<float> result(elements);
    cl::copy(c, begin(result), end(result));

    std::copy(begin(result), end(result), ostream_iterator<float>(cout, ", "));
}

pavanky commented 8 years ago

Able to reproduce the same problem using this stand alone code.

#define __CL_ENABLE_EXCEPTIONS
#include "cl.hpp"
#include <vector>
#include <iostream>
#include <iterator>
#include <algorithm>

using namespace cl;
using namespace std;

int main(int argc, char* argv[])
{
    std::vector<cl::Platform>   platforms;
    cl::Platform::get(&platforms);

    for (auto platform : platforms) {
        cl_context_properties cps[3] = {CL_CONTEXT_PLATFORM,
                                        (cl_context_properties)(platform()),
                                        0};
        std::vector<cl::Device> devices;
        try {
            platform.getDevices(CL_DEVICE_TYPE_GPU, &devices);
        } catch(...) {
            continue;
        }

        std::cout << platform.getInfo<CL_PLATFORM_NAME>() << std::endl;
        cl::Context Context = cl::Context(devices[0], cps);
        static const unsigned elements = 1000;
        vector<float> data(elements, 5);
        Buffer a(begin(data), end(data), true, false);
        Buffer b(begin(data), end(data), true, false);
        Buffer c(CL_MEM_READ_WRITE, elements * sizeof(float));

        Program addProg(R"d(
        kernel
        void add(   global const float * restrict const a,
                    global const float * restrict const b,
                    global       float * restrict const c) {
            unsigned idx = get_global_id(0);
            c[idx] = a[idx] + b[idx];
        }
    )d", true);

        auto add = make_kernel<Buffer, Buffer, Buffer>(addProg, "add");
        add(EnqueueArgs(elements), a, b, c);

        vector<float> result(elements);
        cl::copy(c, begin(result), end(result));

        std::copy(begin(result), end(result), ostream_iterator<float>(cout, ", "));
        std::cout << std::endl;
    }
}

pavanky commented 8 years ago

After looking into cl.hpp and cl2.hpp, the problem seems to be the use of clCreateContext (in the failing case) vs clCreateContextFromType (in the working case). This is most likely an NVIDIA bug, and I am skeptical that they will fix it.

We could change our device manager to use clCreateContextFromType, but that would create OpenCL contexts with more than one device, which differs from the one-to-one mapping between devices and contexts we have right now.

Changing the context creation definitely causes problems on OSX, so we could optionally do it for Linux only, but I am a bit wary of doing this.

@arrayfire/core-devel thoughts?

Not relevant anymore

pavanky commented 8 years ago

Nevermind, I was testing this incorrectly. Even the first C++ code segment I posted is failing now.

pavanky commented 8 years ago

@frank-otto can you test the first stand alone code snippet on your machine? This could potentially be NVIDIA blocking all non-CUDA applications from running when this daemon is enabled.

traktofon commented 8 years ago

@pavanky, thanks for looking into this issue.

Sorry for the delay, the GPU machine was busy and I couldn't run tests over the past few days. Now I have had a chance to test the snippet you posted. I compiled with:

g++ -std=c++11 -Wall -o test.x -I/usr/include/CL test.cc -lOpenCL

The results with MPS running are:

$ ./test.x
terminate called after throwing an instance of 'cl::Error'
  what():  clCreateContextFromType
Aborted

And without MPS:

$ ./test.x
10, 10, 10, 10, 10, 10, 10, [...]

As for the unified backend, it still fails for me with the "out of memory" error on trying to use the CUDA backend dynamically. I am certain that the GPU was otherwise idle.

Some additional notes on the system:

pavanky commented 8 years ago

@frank-otto Ok, it looks like there's nothing we can do about the OpenCL backend, because it seems to be failing in a stand alone application independent of arrayfire.

As for the unified backend, this is weird. Can I get an output of af::info() with MPS disabled?

pavanky commented 8 years ago

Also an output of nvidia-smi and nvidia-smi -a when MPS is failing would be good.

traktofon commented 8 years ago

@pavanky, this is the code I use to get af::info output:

#include <arrayfire.h>
#include <cstdio>
#include <cstdlib>

using namespace af;

int main(int argc, char *argv[])
{
    try {
        // Select a device and display arrayfire info
        int device = argc > 1 ? atoi(argv[1]) : 0;
        af::setDevice(device);
        af::info();
    } catch (af::exception& e) {
        fprintf(stderr, "%s\n", e.what());
        throw;
    }
    return 0;
}

And I compile it with:

g++ -Wall -std=c++11 -o afinfo.x -I$AF_PATH/include afinfo.cc -L$AF_PATH/lib -laf

With MPS running, the output is:

$ ./afinfo.x
ArrayFire Exception (Device out of memory:101):
In function cuda::DeviceManager::DeviceManager()
In file src/backend/cuda/platform.cpp:359
CUDA Error (2): out of memory

In function void af::setDevice(int)
In file src/api/cpp/device.cpp:91
terminate called after throwing an instance of 'af::exception'
  what():  ArrayFire Exception (Device out of memory:101):
In function cuda::DeviceManager::DeviceManager()
In file src/backend/cuda/platform.cpp:359
CUDA Error (2): out of memory

In function void af::setDevice(int)
In file src/api/cpp/device.cpp:91
Aborted

Without MPS running, the output is:

$ ./afinfo.x
ArrayFire v3.3.2 (CUDA, 64-bit Linux, build default)
Platform: CUDA Toolkit 7.5, Driver: 352.63
[0] Tesla K20c, 4800 MB, CUDA Compute 3.5
-1- Tesla K20c, 4800 MB, CUDA Compute 3.5
-2- Tesla K20c, 4800 MB, CUDA Compute 3.5

Without MPS, the output of nvidia-smi is:

Mon Jun 20 18:53:30 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.63     Driver Version: 352.63         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20c          Off  | 0000:02:00.0     Off |                    0 |
| 30%   41C    P0    48W / 225W |     12MiB /  4799MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          Off  | 0000:42:00.0     Off |                    0 |
| 32%   44C    P0    50W / 225W |     12MiB /  4799MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20c          Off  | 0000:43:00.0     Off |                    0 |
| 30%   41C    P0    53W / 225W |     12MiB /  4799MiB |     87%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

And output of nvidia-smi -a:


==============NVSMI LOG==============

Timestamp                           : Mon Jun 20 18:54:08 2016
Driver Version                      : 352.63

Attached GPUs                       : 3
GPU 0000:02:00.0
    Product Name                    : Tesla K20c
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0322714059579
    GPU UUID                        : GPU-3efe11b6-d282-11c7-daad-5773e5b99064
    Minor Number                    : 0
    VBIOS Version                   : 80.10.39.00.06
    MultiGPU Board                  : No
    Board ID                        : 0x200
    Inforom Version
        Image Version               : 2081.0204.00.07
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x102210DE
        Bus Id                      : 0000:02:00.0
        Sub System Id               : 0x098210DE
        GPU Link Info
            PCIe Generation
                Max                 : 2
                Current             : 2
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : 30 %
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 4799 MiB
        Used                        : 12 MiB
        Free                        : 4787 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 2 MiB
        Free                        : 254 MiB
    Compute Mode                    : Exclusive_Process
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 41 C
        GPU Shutdown Temp           : 95 C
        GPU Slowdown Temp           : 90 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 48.82 W
        Power Limit                 : 225.00 W
        Default Power Limit         : 225.00 W
        Enforced Power Limit        : 225.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 225.00 W
    Clocks
        Graphics                    : 705 MHz
        SM                          : 705 MHz
        Memory                      : 2600 MHz
    Applications Clocks
        Graphics                    : 705 MHz
        Memory                      : 2600 MHz
    Default Applications Clocks
        Graphics                    : 705 MHz
        Memory                      : 2600 MHz
    Max Clocks
        Graphics                    : 758 MHz
        SM                          : 758 MHz
        Memory                      : 2600 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

GPU 0000:42:00.0
    Product Name                    : Tesla K20c
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0322714059051
    GPU UUID                        : GPU-54fd78b1-6231-81d7-7524-0ff0138c8f14
    Minor Number                    : 1
    VBIOS Version                   : 80.10.39.00.06
    MultiGPU Board                  : No
    Board ID                        : 0x4200
    Inforom Version
        Image Version               : 2081.0204.00.07
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x42
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x102210DE
        Bus Id                      : 0000:42:00.0
        Sub System Id               : 0x098210DE
        GPU Link Info
            PCIe Generation
                Max                 : 2
                Current             : 2
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : 33 %
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 4799 MiB
        Used                        : 12 MiB
        Free                        : 4787 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 2 MiB
        Free                        : 254 MiB
    Compute Mode                    : Exclusive_Process
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 44 C
        GPU Shutdown Temp           : 95 C
        GPU Slowdown Temp           : 90 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 50.28 W
        Power Limit                 : 225.00 W
        Default Power Limit         : 225.00 W
        Enforced Power Limit        : 225.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 225.00 W
    Clocks
        Graphics                    : 705 MHz
        SM                          : 705 MHz
        Memory                      : 2600 MHz
    Applications Clocks
        Graphics                    : 705 MHz
        Memory                      : 2600 MHz
    Default Applications Clocks
        Graphics                    : 705 MHz
        Memory                      : 2600 MHz
    Max Clocks
        Graphics                    : 758 MHz
        SM                          : 758 MHz
        Memory                      : 2600 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

GPU 0000:43:00.0
    Product Name                    : Tesla K20c
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0322714059533
    GPU UUID                        : GPU-913c0a7d-59c2-bee7-cc36-b1e11aeab7be
    Minor Number                    : 2
    VBIOS Version                   : 80.10.39.00.06
    MultiGPU Board                  : No
    Board ID                        : 0x4300
    Inforom Version
        Image Version               : 2081.0204.00.07
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x43
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x102210DE
        Bus Id                      : 0000:43:00.0
        Sub System Id               : 0x098210DE
        GPU Link Info
            PCIe Generation
                Max                 : 2
                Current             : 2
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : 31 %
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 4799 MiB
        Used                        : 12 MiB
        Free                        : 4787 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 2 MiB
        Free                        : 254 MiB
    Compute Mode                    : Exclusive_Process
    Utilization
        Gpu                         : 97 %
        Memory                      : 6 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 42 C
        GPU Shutdown Temp           : 95 C
        GPU Slowdown Temp           : 90 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 51.43 W
        Power Limit                 : 225.00 W
        Default Power Limit         : 225.00 W
        Enforced Power Limit        : 225.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 225.00 W
    Clocks
        Graphics                    : 705 MHz
        SM                          : 705 MHz
        Memory                      : 2600 MHz
    Applications Clocks
        Graphics                    : 705 MHz
        Memory                      : 2600 MHz
    Default Applications Clocks
        Graphics                    : 705 MHz
        Memory                      : 2600 MHz
    Max Clocks
        Graphics                    : 758 MHz
        SM                          : 758 MHz
        Memory                      : 2600 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

With MPS running, nvidia-smi outputs:

Mon Jun 20 18:55:07 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.63     Driver Version: 352.63         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20c          Off  | 0000:02:00.0     Off |                    0 |
| 31%   42C    P8    16W / 225W |    105MiB /  4799MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          Off  | 0000:42:00.0     Off |                    0 |
| 34%   45C    P8    16W / 225W |    105MiB /  4799MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20c          Off  | 0000:43:00.0     Off |                    0 |
| 31%   43C    P0    48W / 225W |    105MiB /  4799MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     15639    C   nvidia-cuda-mps-server                          89MiB |
|    1     15639    C   nvidia-cuda-mps-server                          89MiB |
|    2     15639    C   nvidia-cuda-mps-server                          89MiB |
+-----------------------------------------------------------------------------+

Note: the nvidia-cuda-mps-server only shows up after a CUDA application was run (the failing afinfo.x is sufficient). And nvidia-smi -a outputs:


==============NVSMI LOG==============

Timestamp                           : Mon Jun 20 18:57:02 2016
Driver Version                      : 352.63

Attached GPUs                       : 3
GPU 0000:02:00.0
    Product Name                    : Tesla K20c
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0322714059579
    GPU UUID                        : GPU-3efe11b6-d282-11c7-daad-5773e5b99064
    Minor Number                    : 0
    VBIOS Version                   : 80.10.39.00.06
    MultiGPU Board                  : No
    Board ID                        : 0x200
    Inforom Version
        Image Version               : 2081.0204.00.07
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x102210DE
        Bus Id                      : 0000:02:00.0
        Sub System Id               : 0x098210DE
        GPU Link Info
            PCIe Generation
                Max                 : 2
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : 30 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 4799 MiB
        Used                        : 105 MiB
        Free                        : 4694 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 14 MiB
        Free                        : 242 MiB
    Compute Mode                    : Exclusive_Process
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 39 C
        GPU Shutdown Temp           : 95 C
        GPU Slowdown Temp           : 90 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 16.41 W
        Power Limit                 : 225.00 W
        Default Power Limit         : 225.00 W
        Enforced Power Limit        : 225.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 225.00 W
    Clocks
        Graphics                    : 324 MHz
        SM                          : 324 MHz
        Memory                      : 324 MHz
    Applications Clocks
        Graphics                    : 705 MHz
        Memory                      : 2600 MHz
    Default Applications Clocks
        Graphics                    : 705 MHz
        Memory                      : 2600 MHz
    Max Clocks
        Graphics                    : 758 MHz
        SM                          : 758 MHz
        Memory                      : 2600 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        Process ID                  : 15639
            Type                    : C
            Name                    : nvidia-cuda-mps-server
            Used GPU Memory         : 89 MiB

GPU 0000:42:00.0
    Product Name                    : Tesla K20c
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0322714059051
    GPU UUID                        : GPU-54fd78b1-6231-81d7-7524-0ff0138c8f14
    Minor Number                    : 1
    VBIOS Version                   : 80.10.39.00.06
    MultiGPU Board                  : No
    Board ID                        : 0x4200
    Inforom Version
        Image Version               : 2081.0204.00.07
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x42
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x102210DE
        Bus Id                      : 0000:42:00.0
        Sub System Id               : 0x098210DE
        GPU Link Info
            PCIe Generation
                Max                 : 2
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : 30 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 4799 MiB
        Used                        : 105 MiB
        Free                        : 4694 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 14 MiB
        Free                        : 242 MiB
    Compute Mode                    : Exclusive_Process
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 41 C
        GPU Shutdown Temp           : 95 C
        GPU Slowdown Temp           : 90 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 16.61 W
        Power Limit                 : 225.00 W
        Default Power Limit         : 225.00 W
        Enforced Power Limit        : 225.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 225.00 W
    Clocks
        Graphics                    : 324 MHz
        SM                          : 324 MHz
        Memory                      : 324 MHz
    Applications Clocks
        Graphics                    : 705 MHz
        Memory                      : 2600 MHz
    Default Applications Clocks
        Graphics                    : 705 MHz
        Memory                      : 2600 MHz
    Max Clocks
        Graphics                    : 758 MHz
        SM                          : 758 MHz
        Memory                      : 2600 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        Process ID                  : 15639
            Type                    : C
            Name                    : nvidia-cuda-mps-server
            Used GPU Memory         : 89 MiB

GPU 0000:43:00.0
    Product Name                    : Tesla K20c
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0322714059533
    GPU UUID                        : GPU-913c0a7d-59c2-bee7-cc36-b1e11aeab7be
    Minor Number                    : 2
    VBIOS Version                   : 80.10.39.00.06
    MultiGPU Board                  : No
    Board ID                        : 0x4300
    Inforom Version
        Image Version               : 2081.0204.00.07
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x43
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x102210DE
        Bus Id                      : 0000:43:00.0
        Sub System Id               : 0x098210DE
        GPU Link Info
            PCIe Generation
                Max                 : 2
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : 30 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 4799 MiB
        Used                        : 105 MiB
        Free                        : 4694 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 14 MiB
        Free                        : 242 MiB
    Compute Mode                    : Exclusive_Process
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 39 C
        GPU Shutdown Temp           : 95 C
        GPU Slowdown Temp           : 90 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 16.80 W
        Power Limit                 : 225.00 W
        Default Power Limit         : 225.00 W
        Enforced Power Limit        : 225.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 225.00 W
    Clocks
        Graphics                    : 324 MHz
        SM                          : 324 MHz
        Memory                      : 324 MHz
    Applications Clocks
        Graphics                    : 705 MHz
        Memory                      : 2600 MHz
    Default Applications Clocks
        Graphics                    : 705 MHz
        Memory                      : 2600 MHz
    Max Clocks
        Graphics                    : 758 MHz
        SM                          : 758 MHz
        Memory                      : 2600 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        Process ID                  : 15639
            Type                    : C
            Name                    : nvidia-cuda-mps-server
            Used GPU Memory         : 89 MiB

NOTE: The GPUs are set to compute mode EXCLUSIVE_PROCESS, as NVIDIA recommends when MPS is in use. I also tested with the DEFAULT compute mode, and the results are the same (i.e. both the unified backend and the opencl backend fail as reported above).
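
In case it helps anyone compare their setup against the dump above, here is a minimal sketch that pulls the compute mode and the names of the attached processes for each GPU out of `nvidia-smi -q` text. It assumes the output layout is the same as shown above (an unindented `GPU <bus id>` header per device, then indented `key : value` lines). The function name `summarize_smi` and the `SAMPLE` excerpt are just for illustration and are not part of ArrayFire or any NVIDIA tool.

```python
import re

# Short excerpt in the same layout as the `nvidia-smi -q` dump above.
SAMPLE = """\
GPU 0000:42:00.0
    Product Name                    : Tesla K20c
    Compute Mode                    : Exclusive_Process
    Processes
        Process ID                  : 15639
            Type                    : C
            Name                    : nvidia-cuda-mps-server
            Used GPU Memory         : 89 MiB
"""


def summarize_smi(text):
    """Map each GPU bus id to its compute mode and attached process names.

    Hypothetical helper: it only understands the plain-text layout shown
    in this issue and ignores every other field.
    """
    gpus = {}
    current = None
    for line in text.splitlines():
        # Device headers are unindented, e.g. "GPU 0000:42:00.0".
        header = re.match(r"^GPU\s+([0-9A-Fa-f:.]+)\s*$", line)
        if header:
            current = gpus.setdefault(
                header.group(1), {"compute_mode": None, "processes": []}
            )
            continue
        if current is None or ":" not in line:
            continue
        # Split on the first colon only, since values may contain colons.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Compute Mode":
            current["compute_mode"] = value
        elif key == "Name":
            # "Name" appears only under "Processes"; the device name
            # is reported as "Product Name".
            current["processes"].append(value)
    return gpus


if __name__ == "__main__":
    for bus_id, info in summarize_smi(SAMPLE).items():
        print(bus_id, info["compute_mode"], info["processes"])
```

To run it against a real machine, the saved output of `nvidia-smi -q` can be passed to `summarize_smi` in place of `SAMPLE`. A GPU that lists `nvidia-cuda-mps-server` as a process has the MPS server attached to it.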

Thanks!