ddemidov / vexcl

VexCL is a C++ vector expression template library for OpenCL/CUDA/OpenMP
http://vexcl.readthedocs.org
MIT License

CUDA backend: API error "700 - CUDA_ERROR_LAUNCH_FAILED" during resize of multivector #87

Closed ds283 closed 10 years ago

ds283 commented 10 years ago

I am trying to implement some custom CUDA kernels to speed up integration of a system of ODEs using VexCL and odeint-v2. With some kernels this works very well, but with my largest system of equations I am encountering problems when odeint-v2 asks for the state vector to be resized following a step.

I have extracted the corresponding kernel in standalone form: https://gist.github.com/ds283/8016216. However, as far as I can determine, it's not the kernel which causes the problem here – although it takes a long time to compile, it executes OK – but rather the state vector. This is a vex::multivector<double, 164>.

When https://gist.github.com/ds283/8016216 is compiled and run, I get

1. GeForce GTX 680MX
time t = 0
libc++abi.dylib: terminating with uncaught exception of type vex::backend::cuda::error: /usr/local/include/vexcl/backend/cuda/device_vector.hpp:100
    CUDA Driver API Error (700 - CUDA_ERROR_LAUNCH_FAILED)

This happens both on a GeForce GTX 680MX in an iMac and a GeForce GTX 650M in a MacBook Pro. Both are running OS X 10.9.1 and CUDA 5.5.28. On both cards, this kernel runs in blocks of 8 threads with 25792 bytes of shared memory per block; the maximum shared memory per block on these cards is 48 KB. On the 680MX the grid size is 32 blocks, and on the 650M it is 8 blocks.

Running in the debugger shows that this exception is raised from the calling sequence

bool boost::numeric::odeint::controlled_runge_kutta<boost::numeric::odeint::runge_kutta_dopri5<vex::multivector<double, 164ul>, double, vex::multivector<double, 164ul>, double, boost::numeric::odeint::vector_space_algebra, boost::numeric::odeint::default_operations, boost::numeric::odeint::initially_resizer>, boost::numeric::odeint::default_error_checker<double, boost::numeric::odeint::vector_space_algebra, boost::numeric::odeint::default_operations>, boost::numeric::odeint::initially_resizer, boost::numeric::odeint::explicit_error_stepper_fsal_tag>::resize_m_xnew_impl<vex::multivector<double, 164ul> >(vex::multivector<double, 164ul> const&) at /opt/local/include/boost/numeric/odeint/stepper/controlled_runge_kutta.hpp:848
boost::numeric::odeint::adjust_size_by_resizeability<boost::numeric::odeint::state_wrapper<vex::multivector<double, 164ul>, void>, vex::multivector<double, 164ul> >(boost::numeric::odeint::state_wrapper<vex::multivector<double, 164ul>, void>&, vex::multivector<double, 164ul> const&, boost::integral_constant<bool, true>) at /opt/local/include/boost/numeric/odeint/util/resizer.hpp:35
void boost::numeric::odeint::resize<vex::multivector<double, 164ul>, vex::multivector<double, 164ul> >(vex::multivector<double, 164ul>&, vex::multivector<double, 164ul> const&) at /opt/local/include/boost/numeric/odeint/util/resize.hpp:53
boost::numeric::odeint::resize_impl<vex::multivector<double, 164ul>, vex::multivector<double, 164ul>, void>::resize(vex::multivector<double, 164ul>&, vex::multivector<double, 164ul> const&) at /opt/local/include/boost/numeric/odeint/external/vexcl/vexcl_resize.hpp:73
vex::multivector<double, 164ul>::resize(std::__1::vector<vex::backend::cuda::command_queue, std::__1::allocator<vex::backend::cuda::command_queue> > const&, unsigned long) at /usr/local/include/vexcl/multivector.hpp:287
vex::vector<double>::resize(std::__1::vector<vex::backend::cuda::command_queue, std::__1::allocator<vex::backend::cuda::command_queue> > const&, unsigned long, double const*, unsigned int) at /usr/local/include/vexcl/vector.hpp:490
vex::vector<double>::vector(std::__1::vector<vex::backend::cuda::command_queue, std::__1::allocator<vex::backend::cuda::command_queue> > const&, unsigned long, double const*, unsigned int) at /usr/local/include/vexcl/vector.hpp:384
vex::vector<double>::vector(std::__1::vector<vex::backend::cuda::command_queue, std::__1::allocator<vex::backend::cuda::command_queue> > const&, unsigned long, double const*, unsigned int) at /usr/local/include/vexcl/vector.hpp:383
vex::vector<double>::allocate_buffers(unsigned int, double const*) at /usr/local/include/vexcl/vector.hpp:802
vex::backend::cuda::device_vector<double>::device_vector<double>(vex::backend::cuda::command_queue const&, unsigned long, double const*, unsigned int) at /usr/local/include/vexcl/backend/cuda/device_vector.hpp:111
vex::backend::cuda::device_vector<double>::device_vector<double>(vex::backend::cuda::command_queue const&, unsigned long, double const*, unsigned int) at /usr/local/include/vexcl/backend/cuda/device_vector.hpp:100
vex::backend::cuda::check(cudaError_enum, char const*, int) at /usr/local/include/vexcl/backend/cuda/error.hpp:135

The failing CUDA API call is apparently the invocation of cuMemAlloc() in device_vector.hpp

        /// Allocates memory buffer on the device associated with the given queue.
        template <typename H>
        device_vector(const command_queue &q, size_t n,
                const H *host = 0, mem_flags flags = MEM_READ_WRITE)
            : n(n)
        {
            (void)flags;

            if (n) {
                q.context().set_current();

                CUdeviceptr ptr;
                cuda_check( cuMemAlloc(&ptr, n * sizeof(T)) );

                buffer.reset(reinterpret_cast<char*>(static_cast<size_t>(ptr)), detail::deleter() );

                if (host) {
                    if (std::is_same<T, H>::value)
                        write(q, 0, n, reinterpret_cast<const T*>(host), true);
                    else
                        write(q, 0, n, std::vector<T>(host, host + n).data(), true);
                }
            }
        }

I'm not quite clear what this means, because CUDA_ERROR_LAUNCH_FAILED doesn't seem to be an error code that cuMemAlloc() should return.

I have no problems with a closely analogous kernel which uses a vex::multivector<double, 20> as the state. Is there any reason to think that a large multivector should run into resizing problems of this type?

ddemidov commented 10 years ago

Usually this kind of error comes from the previous kernel launch. Can you check if this is true (e.g. by inserting ctx.finish() before resizing)?

If that is true, does the kernel come from vexcl, or is it your own?

Error 700 could mean e.g. that incorrect parameters are passed to a kernel.

ddemidov commented 10 years ago

I've just noticed you have a complete example in that gist. I'll try to run it later today.

Btw, you should be able to use the CUDA runtime API to launch your kernels, which could be more convenient in some cases (see the thrust-sort example). Although your kernel looks generated.

ddemidov commented 10 years ago

I had to make the following change for the kernel in rhs_functor::operator() to compile:

diff --git a/launch_failure_kernel.cpp b/launch_failure_kernel.cpp
index 00235e2..44e29e5 100644
--- a/launch_failure_kernel.cpp
+++ b/launch_failure_kernel.cpp
@@ -110,10 +110,12 @@ void rhs_functor::operator()(const state& x, state& dxdt, double t)
     for(unsigned int d = 0; d < this->ctx.size(); d++)
       {
         kernel.emplace_back(this->ctx.queue(d),
+#if defined(_MSC_VER) || defined(__APPLE__)
         "typedef unsigned char       uchar;\n"
         "typedef unsigned int        uint;\n"
         "typedef unsigned short      ushort;\n"
         "typedef unsigned long long  ulong;\n"
+#endif
         "extern \"C\" __global__ void threepffused( ulong n,\n"
         "                                         double Mp,\n"
         "                                         double M_phi, double M_chi,\n"

Then I inserted this->ctx.finish() here:

diff --git a/launch_failure_kernel.cpp b/launch_failure_kernel.cpp
index 44e29e5..3fb695e 100644
--- a/launch_failure_kernel.cpp
+++ b/launch_failure_kernel.cpp
@@ -2268,4 +2268,5 @@ void rhs_functor::operator()(const state& x, state& dxdt, double t)

         kernel[d](this->ctx.queue(d));
       }
+    this->ctx.finish();
   }

and I got the 700 error at this point. So the culprit is the kernel. cuda-memcheck shows that there are invalid shared memory writes:

$ cuda-memcheck ./launch_failure_kernel
========= CUDA-MEMCHECK
1. Tesla K20c
time t = 0
========= Invalid __shared__ write of size 8
=========     at 0x000025a8 in threepffused
=========     by thread (4,0,0) in block (0,0,0)
=========     Address 0x00013800 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x331) [0x138071]
=========     Host Frame:./launch_failure_kernel [0x10c81]
=========     Host Frame:./launch_failure_kernel [0xc063]
=========     Host Frame:./launch_failure_kernel [0x2cea7]
=========     Host Frame:./launch_failure_kernel [0x27567]
=========     Host Frame:./launch_failure_kernel [0x21477]
=========     Host Frame:./launch_failure_kernel [0x1c0d8]
=========     Host Frame:./launch_failure_kernel [0x15527]
=========     Host Frame:./launch_failure_kernel [0x41fa]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x24c05]
=========     Host Frame:./launch_failure_kernel [0x3b69]
=========
========= Invalid __shared__ write of size 8
=========     at 0x000025a8 in threepffused
=========     by thread (3,0,0) in block (0,0,0)
=========     Address 0x00013200 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x331) [0x138071]
=========     Host Frame:./launch_failure_kernel [0x10c81]
=========     Host Frame:./launch_failure_kernel [0xc063]
=========     Host Frame:./launch_failure_kernel [0x2cea7]
=========     Host Frame:./launch_failure_kernel [0x27567]
=========     Host Frame:./launch_failure_kernel [0x21477]
=========     Host Frame:./launch_failure_kernel [0x1c0d8]
=========     Host Frame:./launch_failure_kernel [0x15527]
=========     Host Frame:./launch_failure_kernel [0x41fa]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x24c05]
=========     Host Frame:./launch_failure_kernel [0x3b69]
=========
========= Invalid __shared__ write of size 8
=========     at 0x000025a8 in threepffused
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x00012c00 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x331) [0x138071]
=========     Host Frame:./launch_failure_kernel [0x10c81]
=========     Host Frame:./launch_failure_kernel [0xc063]
=========     Host Frame:./launch_failure_kernel [0x2cea7]
=========     Host Frame:./launch_failure_kernel [0x27567]
=========     Host Frame:./launch_failure_kernel [0x21477]
=========     Host Frame:./launch_failure_kernel [0x1c0d8]
=========     Host Frame:./launch_failure_kernel [0x15527]
=========     Host Frame:./launch_failure_kernel [0x41fa]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x24c05]
=========     Host Frame:./launch_failure_kernel [0x3b69]
=========
========= Invalid __shared__ write of size 8
=========     at 0x000025a8 in threepffused
=========     by thread (1,0,0) in block (0,0,0)
=========     Address 0x00012600 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x331) [0x138071]
=========     Host Frame:./launch_failure_kernel [0x10c81]
=========     Host Frame:./launch_failure_kernel [0xc063]
=========     Host Frame:./launch_failure_kernel [0x2cea7]
=========     Host Frame:./launch_failure_kernel [0x27567]
=========     Host Frame:./launch_failure_kernel [0x21477]
=========     Host Frame:./launch_failure_kernel [0x1c0d8]
=========     Host Frame:./launch_failure_kernel [0x15527]
=========     Host Frame:./launch_failure_kernel [0x41fa]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x24c05]
=========     Host Frame:./launch_failure_kernel [0x3b69]
=========
========= Invalid __shared__ write of size 8
=========     at 0x000025a8 in threepffused
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x00012000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x331) [0x138071]
=========     Host Frame:./launch_failure_kernel [0x10c81]
=========     Host Frame:./launch_failure_kernel [0xc063]
=========     Host Frame:./launch_failure_kernel [0x2cea7]
=========     Host Frame:./launch_failure_kernel [0x27567]
=========     Host Frame:./launch_failure_kernel [0x21477]
=========     Host Frame:./launch_failure_kernel [0x1c0d8]
=========     Host Frame:./launch_failure_kernel [0x15527]
=========     Host Frame:./launch_failure_kernel [0x41fa]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x24c05]
=========     Host Frame:./launch_failure_kernel [0x3b69]
=========
========= Program hit error 719 on CUDA API call to cuStreamSynchronize 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuStreamSynchronize + 0x17a) [0x13966a]
=========     Host Frame:./launch_failure_kernel [0xded0]
=========     Host Frame:./launch_failure_kernel [0x11214]
=========     Host Frame:./launch_failure_kernel [0xc09c]
=========     Host Frame:./launch_failure_kernel [0x2cea7]
=========     Host Frame:./launch_failure_kernel [0x27567]
=========     Host Frame:./launch_failure_kernel [0x21477]
=========     Host Frame:./launch_failure_kernel [0x1c0d8]
=========     Host Frame:./launch_failure_kernel [0x15527]
=========     Host Frame:./launch_failure_kernel [0x41fa]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x24c05]
=========     Host Frame:./launch_failure_kernel [0x3b69]
=========
terminate called after throwing an instance of 'vex::backend::cuda::error'
  what():  /home/demidov/work/vexcl/vexcl/backend/cuda/context.hpp:196
    CUDA Driver API Error (Unknown error)
========= Error: process didn't terminate successfully
========= Internal error (20)
========= No CUDA-MEMCHECK results found
ddemidov commented 10 years ago

Another suggestion: I notice that you use multivectors with a lot of components, which leads to kernels with a lot of parameters. What do you think of replacing a vex::multivector<T,N> of size n with a vex::vector<T> of size N * n, where the individual components are placed in contiguous chunks one after another?

I have two alternatives for convenient access to the components of such a vector (you will need commit ac826465a7e043a20a5914143b85e2b517a956f8 for both to work). The first uses slices:

    vex::vector<double> x(ctx, 3 * n);
    auto X = vex::tag<0>(x);

    vex::slicer<2> slice( vex::extents[3][n] );

    // Alias individual components for convenience.
    auto x0 = slice[0](X);
    auto x1 = slice[1](X);
    auto x2 = slice[2](X);

    // write individual components:
    x0 = 1;
    x1 = 2;
    x2 = 3;

    // Do the fused call:
    vex::tie(x0, x1, x2) = std::tie(sin(x0), cos(x1), x1 - x0);

Another alternative uses permutations:

    vex::vector<double> x(ctx, 3 * n);
    auto X = vex::tag<0>(x);

    // Give second parameter to element_index so it knows its size.
    auto idx = vex::tag<1>( vex::element_index(0, n) );
    auto N   = vex::tag<2>( n );

    // Alias individual components for convenience.
    auto x0 = vex::permutation(idx        )(X);
    auto x1 = vex::permutation(idx + N    )(X);
    auto x2 = vex::permutation(idx + N * 2)(X);

    // write individual components:
    x0 = 1;
    x1 = 2;
    x2 = 3;

    // Do the fused call:
    vex::tie(x0, x1, x2) = std::tie(sin(x0), cos(x1), x1 - x0);

Gist https://gist.github.com/ddemidov/8018055 shows both approaches in a complete example.

The variant with permutations is more efficient, because it is less general, uses fewer arithmetic operations, and uses fewer kernel arguments. Compare the fused kernel for the sliced expressions:

extern "C" __global__ void vexcl_multivector_kernel
(
  ulong n,
  double * prm_tag_0_1,
  ulong lhs_1_slice_start,
  ulong lhs_1_slice_length0,
  long lhs_1_slice_stride0,
  ulong lhs_1_slice_length1,
  long lhs_1_slice_stride1,
  ulong rhs_1_slice_start,
  ulong rhs_1_slice_length0,
  long rhs_1_slice_stride0,
  ulong rhs_1_slice_length1,
  long rhs_1_slice_stride1,
  ulong lhs_2_slice_start,
  ulong lhs_2_slice_length0,
  long lhs_2_slice_stride0,
  ulong lhs_2_slice_length1,
  long lhs_2_slice_stride1,
  ulong rhs_2_slice_start,
  ulong rhs_2_slice_length0,
  long rhs_2_slice_stride0,
  ulong rhs_2_slice_length1,
  long rhs_2_slice_stride1,
  ulong lhs_3_slice_start,
  ulong lhs_3_slice_length0,
  long lhs_3_slice_stride0,
  ulong lhs_3_slice_length1,
  long lhs_3_slice_stride1,
  ulong rhs_3_slice_start,
  ulong rhs_3_slice_length0,
  long rhs_3_slice_stride0,
  ulong rhs_3_slice_length1,
  long rhs_3_slice_stride1,
  ulong rhs_4_slice_start,
  ulong rhs_4_slice_length0,
  long rhs_4_slice_stride0,
  ulong rhs_4_slice_length1,
  long rhs_4_slice_stride1
)
{
  for
  (
    size_t idx = blockDim.x * blockIdx.x + threadIdx.x, grid_size = blockDim.x * gridDim.x;
    idx < n;
    idx += grid_size
  )
  {
    double buf_1 = sin( prm_tag_0_1[rhs_1_slice_func(rhs_1_slice_start, rhs_1_slice_length0, rhs_1_slice_stride0, rhs_1_slice_length1, rhs_1_slice_stride1, idx)] );
    double buf_2 = cos( prm_tag_0_1[rhs_2_slice_func(rhs_2_slice_start, rhs_2_slice_length0, rhs_2_slice_stride0, rhs_2_slice_length1, rhs_2_slice_stride1, idx)] );
    double buf_3 = ( prm_tag_0_1[rhs_3_slice_func(rhs_3_slice_start, rhs_3_slice_length0, rhs_3_slice_stride0, rhs_3_slice_length1, rhs_3_slice_stride1, idx)] - prm_tag_0_1[rhs_4_slice_func(rhs_4_slice_start, rhs_4_slice_length0, rhs_4_slice_stride0, rhs_4_slice_length1, rhs_4_slice_stride1, idx)] );
    prm_tag_0_1[lhs_1_slice_func(lhs_1_slice_start, lhs_1_slice_length0, lhs_1_slice_stride0, lhs_1_slice_length1, lhs_1_slice_stride1, idx)] = buf_1;
    prm_tag_0_1[lhs_2_slice_func(lhs_2_slice_start, lhs_2_slice_length0, lhs_2_slice_stride0, lhs_2_slice_length1, lhs_2_slice_stride1, idx)] = buf_2;
    prm_tag_0_1[lhs_3_slice_func(lhs_3_slice_start, lhs_3_slice_length0, lhs_3_slice_stride0, lhs_3_slice_length1, lhs_3_slice_stride1, idx)] = buf_3;
  }
}

with the same kernel for the permuted expressions:

extern "C" __global__ void vexcl_multivector_kernel
(
  ulong n,
  double * prm_tag_0_1,
  ulong prm_tag_1_1,
  ulong prm_tag_2_1,
  int lhs_3_slice_3
)
{
  for
  (
    size_t idx = blockDim.x * blockIdx.x + threadIdx.x, grid_size = blockDim.x * gridDim.x;
    idx < n;
    idx += grid_size
  )
  {
    double buf_1 = sin( prm_tag_0_1[(prm_tag_1_1 + idx)] );
    double buf_2 = cos( prm_tag_0_1[( (prm_tag_1_1 + idx) + prm_tag_2_1 )] );
    double buf_3 = ( prm_tag_0_1[( (prm_tag_1_1 + idx) + prm_tag_2_1 )] - prm_tag_0_1[(prm_tag_1_1 + idx)] );
    prm_tag_0_1[(prm_tag_1_1 + idx)] = buf_1;
    prm_tag_0_1[( (prm_tag_1_1 + idx) + prm_tag_2_1 )] = buf_2;
    prm_tag_0_1[( (prm_tag_1_1 + idx) + ( prm_tag_2_1 * lhs_3_slice_3 ) )] = buf_3;
  }
}
ds283 commented 10 years ago

Thanks for this. I was confused by the error being caught on return from cuMemAlloc(). It was an address calculation error in the kernel.

I agree it would probably be preferable to package the state as a single vex::vector rather than a high-dimensional vex::multivector – the number of kernel parameters already causes a problem with the OpenCL backend. I will look into the slicer and permutation options. (Unfortunately, for actual calculations I think I will be stuck with writing custom kernels because my system of ODEs is complex enough to cause enormous resource usage in the compiler.)

ddemidov commented 10 years ago

The same approach should work for the custom kernels.

By the way, I remember that the first CUDA versions had a limit of 256 bytes for the total size of kernel parameters. I am not sure whether this limit has been raised or removed since. But in theory you could hit the same problems with the number of parameters on CUDA as well as on OpenCL.