Errors when CUDA-MPS is active #1459

Open traktofon opened 8 years ago

traktofon commented 8 years ago

Hi, when the NVIDIA CUDA Multi Process Service (MPS) is active, I encounter two problems:

a) the opencl backend doesn't work
b) the unified backend cannot invoke the cuda backend

To reproduce, run "nvidia-cuda-mps-control -d" as root, then test with the "examples/helloworld":

~/af/build/helloworld> AF_PRINT_ERRORS=1 ./helloworld_opencl 
In function opencl::DeviceManager::DeviceManager()
In file src/backend/opencl/platform.cpp:329
OpenCL Error (-30): Invalid Value when calling clCreateContext

ArrayFire Exception (Internal error:998):
In function opencl::DeviceManager::DeviceManager()
In file src/backend/opencl/platform.cpp:329
OpenCL Error (-30): Invalid Value when calling clCreateContext

In function void af::setDevice(int)
In file src/api/cpp/device.cpp:91
terminate called after throwing an instance of 'af::exception'
  what():  ArrayFire Exception (Internal error:998):
In function opencl::DeviceManager::DeviceManager()
In file src/backend/opencl/platform.cpp:329
OpenCL Error (-30): Invalid Value when calling clCreateContext

In function void af::setDevice(int)
In file src/api/cpp/device.cpp:91
~/af/build/helloworld> AF_PRINT_ERRORS=1 ./helloworld_unified 
In function cuda::DeviceManager::DeviceManager()
In file src/backend/cuda/platform.cpp:359
CUDA Error (2): out of memory

ArrayFire Exception (Device out of memory:101):
In function cuda::DeviceManager::DeviceManager()
In file src/backend/cuda/platform.cpp:359
CUDA Error (2): out of memory

In function void af::setDevice(int)
In file src/api/cpp/device.cpp:91
terminate called after throwing an instance of 'af::exception'
  what():  ArrayFire Exception (Device out of memory:101):
In function cuda::DeviceManager::DeviceManager()
In file src/backend/cuda/platform.cpp:359
CUDA Error (2): out of memory

In function void af::setDevice(int)
In file src/api/cpp/device.cpp:91

If the cuda-mps service is not running, then all four backends work properly.
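
Because both failures depend on whether the MPS daemon is up, it can help to record that state alongside each test run. The following is a minimal sketch, assuming a Linux system with /proc mounted. The helper names isMpsComm and mpsProcessRunning are made up for illustration and are not part of ArrayFire or CUDA. It relies on the kernel truncating /proc/<pid>/comm to 15 characters, so nvidia-cuda-mps-control and nvidia-cuda-mps-server both show up as nvidia-cuda-mps.

```cpp
#include <dirent.h>   // opendir, readdir, closedir (POSIX)
#include <cctype>
#include <fstream>
#include <string>

// Hypothetical helper: true if a /proc/<pid>/comm value belongs to an MPS
// process. The kernel keeps only the first 15 characters of the name, so
// the control daemon and the server both read "nvidia-cuda-mps".
bool isMpsComm(const std::string& comm) {
    return comm.compare(0, 15, "nvidia-cuda-mps") == 0;
}

// Hypothetical helper: scan /proc for any process whose name matches.
// Returns false when /proc cannot be opened (e.g. on non-Linux systems),
// so a false result there means "unknown", not "definitely off".
bool mpsProcessRunning() {
    DIR* proc = opendir("/proc");
    if (!proc) return false;

    bool found = false;
    while (dirent* entry = readdir(proc)) {
        const std::string name = entry->d_name;
        // Process directories are named by PID; skip entries that do not
        // start with a digit.
        if (name.empty() || !std::isdigit(static_cast<unsigned char>(name[0])))
            continue;

        std::ifstream in("/proc/" + name + "/comm");
        std::string comm;
        if (std::getline(in, comm) && isMpsComm(comm)) {
            found = true;
            break;
        }
    }
    closedir(proc);
    return found;
}
```

A test driver could call mpsProcessRunning() and print the result before launching helloworld_opencl or helloworld_unified, so that each log states whether MPS was active at the time.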

I also encountered problems with ArrayFire.jl, where even the cpu and cuda backends don't work if cuda-mps is running. Without cuda-mps, the backends work fine.

In theory, whether cuda-mps is running or not should be completely transparent to CUDA applications. On multi-user systems, and for MPI-parallelized programs, cuda-mps is beneficial, so it would be nice if ArrayFire could work properly with MPS.

Tested with ArrayFire-3.3.2, both binary distribution and compiled from source. CUDA version is 7.5. Nvidia driver is 352.63.

Regards, Frank

pavanky commented 8 years ago

@frank-otto I have been able to reproduce the OpenCL issue but not the unified backend failure. Can you double-check whether unified still fails? Could someone else have been using the GPU at the same time?

pavanky commented 8 years ago

Verified that the following standalone code fails with nvidia-cuda-mps-control -d running:

// cl.hpp only throws cl::Error when exceptions are enabled (assumed here,
// since the failure reported for this snippet is an uncaught cl::Error).
#define __CL_ENABLE_EXCEPTIONS
#include "cl.hpp"
#include <vector>
#include <iostream>
#include <iterator>
#include <algorithm>

using namespace cl;
using namespace std;

int main(int argc, char* argv[]) {
    static const unsigned elements = 1000;
    vector<float> data(elements, 5);
    // The buffers use the default context, which cl.hpp creates on first use.
    Buffer a(begin(data), end(data), true, false);
    Buffer b(begin(data), end(data), true, false);
    Buffer c(CL_MEM_READ_WRITE, elements * sizeof(float));

    Program addProg(R"d(
        kernel
        void add(   global const float * restrict const a,
                    global const float * restrict const b,
                    global       float * restrict const c) {
            unsigned idx = get_global_id(0);
            c[idx] = a[idx] + b[idx];
        }
    )d", true);

    auto add = make_kernel<Buffer, Buffer, Buffer>(addProg, "add");
    add(EnqueueArgs(elements), a, b, c);

    vector<float> result(elements);
    cl::copy(c, begin(result), end(result));

    std::copy(begin(result), end(result), ostream_iterator<float>(cout, ", "));
    std::cout << std::endl;
    return 0;
}

pavanky commented 8 years ago

Able to reproduce the same problem using this stand alone code.

// cl.hpp only throws cl::Error when exceptions are enabled (assumed here).
#define __CL_ENABLE_EXCEPTIONS
#include "cl.hpp"
#include <vector>
#include <iostream>
#include <iterator>
#include <algorithm>

using namespace cl;
using namespace std;

int main(int argc, char* argv[])
{
    std::vector<cl::Platform>   platforms;
    cl::Platform::get(&platforms);

    for (auto platform : platforms) {
        cl_context_properties cps[3] = {CL_CONTEXT_PLATFORM,
                                        (cl_context_properties)(platform)(),
                                        0};
        std::vector<cl::Device> devices;
        try {
            platform.getDevices(CL_DEVICE_TYPE_GPU, &devices);
        } catch(...) {
            continue;  // this platform exposes no GPU devices
        }
        if (devices.empty()) continue;

        std::cout << platform.getInfo<CL_PLATFORM_NAME>() << std::endl;
        // Explicit single-device context: this goes through clCreateContext.
        cl::Context Context = cl::Context(devices[0], cps);
        static const unsigned elements = 1000;
        vector<float> data(elements, 5);
        Buffer a(begin(data), end(data), true, false);
        Buffer b(begin(data), end(data), true, false);
        Buffer c(CL_MEM_READ_WRITE, elements * sizeof(float));

        Program addProg(R"d(
        kernel
        void add(   global const float * restrict const a,
                    global const float * restrict const b,
                    global       float * restrict const c) {
            unsigned idx = get_global_id(0);
            c[idx] = a[idx] + b[idx];
        }
    )d", true);

        auto add = make_kernel<Buffer, Buffer, Buffer>(addProg, "add");
        add(EnqueueArgs(elements), a, b, c);

        vector<float> result(elements);
        cl::copy(c, begin(result), end(result));

        std::copy(begin(result), end(result), ostream_iterator<float>(cout, ", "));
        std::cout << std::endl;
    }
    return 0;
}

pavanky commented 8 years ago

After looking into cl.hpp and cl2.hpp, the problem seems to be the use of clCreateContext (in the failing case) versus clCreateContextFromType (in the working case). This is most likely an NVIDIA bug, and I am skeptical that they will fix it.

We could change our device manager to use clCreateContextFromType, but that would create OpenCL contexts with more than one device, which differs from the one-to-one mapping between devices and contexts that we have right now.

Changing the context creation definitely causes problems on OSX, so we could optionally do it for Linux only, but I am a bit wary of doing this.

@arrayfire/core-devel thoughts?

Not relevant anymore

pavanky commented 8 years ago

Never mind, I was testing this incorrectly. Even the first C++ code segment I posted is failing now.

pavanky commented 8 years ago

@frank-otto can you test the first standalone code snippet on your machine? It is possible that NVIDIA blocks all non-CUDA applications from running while this daemon is enabled.

traktofon commented 8 years ago

@pavanky, thanks for looking into this issue.

Sorry for the delay, the GPU machine was busy and I couldn't run tests over the past few days. I have now had a chance to test the snippet you posted. I compiled with:

g++ -std=c++11 -Wall -o test.x -I/usr/include/CL -lOpenCL

The results with MPS running are:

$ ./test.x
terminate called after throwing an instance of 'cl::Error'
  what():  clCreateContextFromType
Aborted

And without MPS:

$ ./test.x
10, 10, 10, 10, 10, 10, 10, [...]

As for the unified backend, it still fails for me with the "out of memory" error on trying to use the CUDA backend dynamically. I am certain that the GPU was otherwise idle.

Some additional notes on the system:

pavanky commented 8 years ago

@frank-otto Ok, it looks like there's nothing we can do about the OpenCL backend, because it fails in a standalone application independent of ArrayFire.

As for the unified backend, this is weird. Can I get the output of af::info() with MPS disabled?

pavanky commented 8 years ago

Also, the output of nvidia-smi and nvidia-smi -a while the failure occurs under MPS would be good.

traktofon commented 8 years ago

@pavanky, this is the code I use to get af::info output:

#include <arrayfire.h>
#include <cstdio>
#include <cstdlib>

using namespace af;

int main(int argc, char *argv[])
{
    try {
        // Select a device and display arrayfire info
        int device = argc > 1 ? atoi(argv[1]) : 0;
        af::setDevice(device);
        af::info();
    } catch (af::exception& e) {
        fprintf(stderr, "%s\n", e.what());
        throw;  // rethrow, which produces the "terminate called" output
    }
    return 0;
}

And I compile it with:

g++ -Wall -std=c++11 -o afinfo.x -I$AF_PATH/include -L$AF_PATH/lib -laf

With MPS running, the output is:

$ ./afinfo.x
ArrayFire Exception (Device out of memory:101):
In function cuda::DeviceManager::DeviceManager()
In file src/backend/cuda/platform.cpp:359
CUDA Error (2): out of memory

In function void af::setDevice(int)
In file src/api/cpp/device.cpp:91
terminate called after throwing an instance of 'af::exception'
  what():  ArrayFire Exception (Device out of memory:101):
In function cuda::DeviceManager::DeviceManager()
In file src/backend/cuda/platform.cpp:359
CUDA Error (2): out of memory

In function void af::setDevice(int)
In file src/api/cpp/device.cpp:91
Aborted

Without MPS running, the output is:

$ ./afinfo.x
ArrayFire v3.3.2 (CUDA, 64-bit Linux, build default)
Platform: CUDA Toolkit 7.5, Driver: 352.63
[0] Tesla K20c, 4800 MB, CUDA Compute 3.5
-1- Tesla K20c, 4800 MB, CUDA Compute 3.5
-2- Tesla K20c, 4800 MB, CUDA Compute 3.5

NOTE: The GPUs are set to compute mode EXCLUSIVE_PROCESS as that is recommended by Nvidia for when MPS is used. I did also test with the DEFAULT compute mode, but the results are the same (i.e. unified backend and opencl backend failing as reported above).
