JuliaAttic / CURAND.jl

Wrapper for NVidia's cuRAND library

Conflict with 'do' loop from CUDArt #3

Open joaquimg opened 9 years ago

joaquimg commented 9 years ago

I found the following problem while using CURAND combined with CUDArt's 'do' loop.

if I load:

using CUDArt
using CURAND

and then I run either:

d_a = curand(Float64, 1000)
a = to_host(d_a)

OR:

result = devices(dev->capability(dev)[1]>=2) do devlist
end

I can repeat either block as many times as I want (as long as it's just one of them).

However, if I run both combined (or alternately, in any order):

d_a = curand(Float64, 1000)
a = to_host(d_a)
result = devices(dev->capability(dev)[1]>=2) do devlist
end

I get the error in IJulia:

WARNING: CUDA error triggered from:

LoadError: "unspecified launch failure"
while loading In[3], in expression starting on line 2
 in checkerror at C:\Users\joaquimgarcia.julia\v0.4\CUDArt\src\libcudart-6.5.jl:16
 in checkerror at C:\Users\joaquimgarcia.julia\v0.4\CUDArt\src\libcudart-6.5.jl:15
 in copy! at C:\Users\joaquimgarcia.julia\v0.4\CUDArt\src\arrays.jl:152
 in to_host at C:\Users\joaquimgarcia.julia\v0.4\CUDArt\src\arrays.jl:87
 in include_string at loading.jl:266
 in execute_request_0x535c5df2 at C:\Users\joaquimgarcia.julia\v0.4\IJulia\src\execute_request.jl:177
 in eventloop at C:\Users\joaquimgarcia.julia\v0.4\IJulia\src\IJulia.jl:141
 in anonymous at task.jl:447

and in cmd:

WARNING: CUDA error triggered from:

 in checkerror at C:\Users\joaquimgarcia.julia\v0.4\CUDArt\src\libcudart-6.5.jl:15
 in copy! at C:\Users\joaquimgarcia.julia\v0.4\CUDArt\src\arrays.jl:152
 in to_host at C:\Users\joaquimgarcia.julia\v0.4\CUDArt\src\arrays.jl:87
ERROR: "unspecified launch failure"
 in checkerror at C:\Users\joaquimgarcia.julia\v0.4\CUDArt\src\libcudart-6.5.jl:16

This issue is also discussed in the following julia-users post: https://groups.google.com/forum/#!topic/julia-users/mJjjTyU7cQ0

dfdx commented 9 years ago

Thanks for reporting this issue; tomorrow I'll take a look at possible causes. Note, though, that my GPU-enabled computer is currently being repaired, so I'll only be able to test a solution in 4-5 days. I'll try to generate some useful ideas earlier.

joaquimg commented 9 years ago

I have a Windows machine and a Mac, both with GPUs enabled, so you can count on me to test whatever you want.

dfdx commented 9 years ago

Can you please try the following two snippets:

# 1st
d_a = curand(Float64, 1000)
a = to_host(d_a)
sleep(5)
result = devices(dev->capability(dev)[1]>=2) do devlist
end

and

# 2nd
d_a = curand(Float64, 1000)
a = to_host(d_a)
device_reset()
result = devices(dev->capability(dev)[1]>=2) do devlist
end

My current guess is that curand uses the default device initialized by a call to cudaSetDevice (device() in Julia code), while the devices() do block either resets it before curand finishes or simply overrides the default device while CURAND.jl is still using it.

dfdx commented 9 years ago

In fact, the problem may be in the default random number generator, which may reference a device that has already been reset. Can you replace:

d_a = curand(Float64, 1000)

with

rng = create_generator()
d_a = curand(rng, Float64, 1000)
destroy_generator(rng)

?
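
For what it's worth, here is a hedged sketch of the suspected pattern behind the default generator. The names _default_rng and default_rng are assumptions made for illustration only, not CURAND.jl's actual internals:

using CURAND

# Hypothetical lazily-created, globally cached default generator
# (assumed names, not the package's real internals).
const _default_rng = Ref{Any}(nothing)

function default_rng()
    if _default_rng[] === nothing
        _default_rng[] = create_generator()   # bound to the current device/context
    end
    return _default_rng[]    # after a device_reset() this handle would be stale
end

# A default-argument call such as curand(Float64, 1000) would then amount to
# something like curand(default_rng(), Float64, 1000).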

joaquimg commented 9 years ago

Just tested #1 and #2 as first described; neither works on the second run.

about:

#1
rng = create_generator()
d_a = curand(rng,Float64, 1000)
destroy_generator(rng)
a = to_host(d_a)
sleep(5)
result = devices(dev->capability(dev)[1]>=2) do devlist
end

and

#2
rng = create_generator()
d_a = curand(rng,Float64, 1000)
destroy_generator(rng)
a = to_host(d_a)
device_reset()
result = devices(dev->capability(dev)[1]>=2) do devlist
end

Both return this different error. In the terminal:

WARNING: cuRAND operation failed with status: CURAND_STATUS_INITIALIZATION_FAILED (203)

 in statuscheck at /Users/joaquimdiasgarcia/.julia/v0.4/CURAND/src/wrappers.jl:28
 in create_generator at /Users/joaquimdiasgarcia/.julia/v0.4/CURAND/src/wrappers.jl:36
 in create_generator at /Users/joaquimdiasgarcia/.julia/v0.4/CURAND/src/wrappers.jl:35
ERROR: "cuRAND operation failed with status: CURAND_STATUS_INITIALIZATION_FAILED"
 in statuscheck at /Users/joaquimdiasgarcia/.julia/v0.4/CURAND/src/wrappers.jl:29
 in create_generator at /Users/joaquimdiasgarcia/.julia/v0.4/CURAND/src/wrappers.jl:36
 in create_generator at /Users/joaquimdiasgarcia/.julia/v0.4/CURAND/src/wrappers.jl:35

and in IJulia:

WARNING: cuRAND operation failed with status: CURAND_STATUS_INITIALIZATION_FAILED (203)

LoadError: "cuRAND operation failed with status: CURAND_STATUS_INITIALIZATION_FAILED"
while loading In[3], in expression starting on line 1

 in statuscheck at /Users/joaquimdiasgarcia/.julia/v0.4/CURAND/src/wrappers.jl:29
 in create_generator at /Users/joaquimdiasgarcia/.julia/v0.4/CURAND/src/wrappers.jl:36
 in create_generator at /Users/joaquimdiasgarcia/.julia/v0.4/CURAND/src/wrappers.jl:35

 in statuscheck at /Users/joaquimdiasgarcia/.julia/v0.4/CURAND/src/wrappers.jl:28
 in create_generator at /Users/joaquimdiasgarcia/.julia/v0.4/CURAND/src/wrappers.jl:36
 in create_generator at /Users/joaquimdiasgarcia/.julia/v0.4/CURAND/src/wrappers.jl:35
 in include_string at loading.jl:266
 in execute_request_0x535c5df2 at /Users/joaquimdiasgarcia/.julia/v0.4/IJulia/src/execute_request.jl:177
 in eventloop at /Users/joaquimdiasgarcia/.julia/v0.4/IJulia/src/IJulia.jl:141
 in anonymous at task.jl:447
dfdx commented 9 years ago

Forgot to ask: how many devices do you have? I.e. what is the length of devlist?

joaquimg commented 9 years ago

Only 1 device! The list only contains the number zero.

dfdx commented 9 years ago

OK, one more experiment. Please enter the following and check the result:

using CUDArt

device()
device()

device(1)
device(1)

device()

reset_device()
reset_device()

device()
device(1)
device()

If I understand it correctly, at least some of these pairs should give an error, but it's better to make sure.

joaquimg commented 9 years ago

Well,

device(1)

gives an error anyway, since the only device I have is device(0). Also, reset_device() gives an error because the command is actually device_reset().

Finally,

using CUDArt

device()
device()

device(0)
device(0)

device()

device_reset()
device_reset()

device()
device(0)
device()

gives no error

dfdx commented 9 years ago

Ah, sorry for providing incorrect commands. I should get my laptop back today, so I will be able to run further experiments myself.

dfdx commented 8 years ago

It becomes curiouser and curiouser. I modified your example just a little bit and the error seems to be gone:

using CUDArt
using CURAND

d_a = curand(Float64, 1000)
a = to_host(d_a)

result = devices() do devlist
end

Note that this time the call to devices() doesn't take a criteria lambda function. I tried calls to curand(), to_host and devices() in all possible combinations, but couldn't make it fail.

The exact criteria function doesn't matter; even the stub function dev->true results in the error you described. I'm investigating it.

dfdx commented 8 years ago

Well, it makes sense, because devices() do ... end actually calls a method that simply filters the device list, i.e. the function in do ... end is treated as the criteria function, not the function we want to apply.
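
To make that concrete, here is a self-contained sketch of why the two call forms behave differently. devices_sketch is a hypothetical stand-in written for this illustration, not CUDArt's actual source, and it assumes a single device 0:

# Sketch only: stand-ins for the two dispatch cases, assuming device 0.
devices_sketch(criteria::Function) = filter(criteria, [0])   # just filters the device list

function devices_sketch(f::Function, criteria::Function)
    devlist = filter(criteria, [0])
    # CUDArt.init(devlist) would run here
    try
        f(devlist)
    finally
        # CUDArt.close(devlist) -- and eventually device_reset() -- would run here
    end
end

devices_sketch() do devlist          # one argument: the do-block becomes `criteria`,
    true                             # so no init/close (and no reset) ever runs
end

devices_sketch(dev -> true) do devlist   # two arguments: the do-block is the work
end                                      # function, and close/reset runs afterwards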

However, I was able to get a more interesting error message:

using CUDArt
using CURAND

d_a = curand(Float64, 1000)
a = to_host(d_a)

# stripped out version of `devices(dev->true) do devlist end`
devlist = [0]
CUDArt.init(devlist)
CUDArt.close(devlist)

d_a = curand(Float64, 1000)
a = to_host(d_a)

The last call results in the following stack trace:

julia> a = to_host(d_a)
WARNING: CUDA error triggered from:

  in checkerror at /home/slipslop/.julia/v0.4/CUDArt/src/libcudart-6.5.jl:15
  in copy! at /home/slipslop/.julia/v0.4/CUDArt/src/arrays.jl:152
  in to_host at /home/slipslop/.julia/v0.4/CUDArt/src/arrays.jl:87
ERROR: "an illegal memory access was encountered"
  in checkerror at /home/slipslop/.julia/v0.4/CUDArt/src/libcudart-6.5.jl:16
  in copy! at /home/slipslop/.julia/v0.4/CUDArt/src/arrays.jl:152
  in to_host at /home/slipslop/.julia/v0.4/CUDArt/src/arrays.jl:87
dfdx commented 8 years ago

Some more updates. I can reproduce this error in pure C:


#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>

#define CUDA_CALL(expr) {                       \
    int status = (expr);                        \
    if (status != cudaSuccess) {                    \
      printf("Error at %s:%d -- %d\n", __FILE__, __LINE__, status); \
      exit(EXIT_FAILURE);                       \
    }                                   \
  }

#define CURAND_CALL(expr) {                     \
    int status = (expr);                        \
    if (status != CURAND_STATUS_SUCCESS) {              \
      printf("Error at %s:%d -- %d\n", __FILE__, __LINE__, status); \
      exit(EXIT_FAILURE);                       \
    }                                   \
  }

int device()
{
  int dev;
  CUDA_CALL(cudaGetDevice(&dev))
  return dev;
}

void device_reset()
{
  CUDA_CALL(cudaDeviceReset());
}

void print_array(float* a, int n)
{
  int i;
  for(i = 0; i < n; i++) {
    printf("%1.4f ", a[i]);
  }
  printf("\n");
}

int main(int argc, char *argv[])
{
  size_t n = 10;
  int dev;
  float *A;
  float *d_A;
  curandGenerator_t gen;

  // buffer allocation
  A = (float *)calloc(n, sizeof(float));
  CUDA_CALL(cudaMalloc((void **)&d_A, n * sizeof(float)));

  // CURAND
  CURAND_CALL(curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT));
  CURAND_CALL(curandGenerateUniform(gen, d_A, n));
  CUDA_CALL(cudaMemcpy(A, d_A, n * sizeof(float), cudaMemcpyDeviceToHost));
  print_array(A, n);

  // CUDArt.init()
  dev = device();
  printf("Using device #%d\n", dev);

  // CUDArt.close()
  device_reset();

  // CURAND again
  // CUDA_CALL(cudaMalloc((void **)&d_A, n * sizeof(float)));  -- reallocating d_A doesn't change the result
  CURAND_CALL(curandGenerateUniform(gen, d_A, n));
  CUDA_CALL(cudaMemcpy(A, d_A, n * sizeof(float), cudaMemcpyDeviceToHost));
  print_array(A, n);

  return EXIT_SUCCESS;
}

Compiling and running it gives:

$ nvcc curand_test.cu -lcurand -o test.out && ./test.out 
0.7402 0.9210 0.0390 0.9690 0.9251 0.4464 0.6673 0.1099 0.4702 0.5132 
Using device #0
Error at curand_test.cu:73 -- 77

Here 77 stands for cudaErrorIllegalAddress, so it's exactly the same error.

If, however, we move device_reset() to the end, everything works smoothly. So I have serious doubts that cuRAND can actually "survive" a cudaDeviceReset(). At the moment I would suggest that any program using CUDArt.jl and CURAND.jl simply not use devices() do ... end, but instead manually initialize CUDA with CUDArt.init() and finalize it at the end with CUDArt.close().
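
A minimal sketch of that manual init/close pattern, assuming a single GPU (device 0) as in this issue:

using CUDArt
using CURAND

devlist = [0]                 # assuming one GPU, device 0
CUDArt.init(devlist)          # initialize CUDA once, up front
try
    d_a = curand(Float64, 1000)
    a = to_host(d_a)
    # ... the rest of the GPU work goes here ...
finally
    CUDArt.close(devlist)     # finalize (and reset the device) only at the very end
end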

dfdx commented 8 years ago

My bad: in the C code I provided I didn't recreate the RNG. With the RNG recreated, the code works.

But interestingly, the Julia code from one of the previous examples works fine for me too!

using CUDArt
using CURAND

rng = create_generator()
d_a = curand(rng,Float64, 1000)
destroy_generator(rng)
a = to_host(d_a)
device_reset()
result = devices(dev->capability(dev)[1]>=2) do devlist
end

I can run this code any number of times in the REPL without any error. @joaquimg Could you please open a fresh Julia session (without any previous commands) and try it out?

It looks like once the CUDA runtime has hit a memory error, it gets into an inconsistent state, and any further operations (including reallocating memory and recreating the RNG) start behaving incorrectly, producing strange and illogical errors. So any testing should always be done in a fresh session: once an error has occurred, all further results are compromised.

joaquimg commented 8 years ago

I started the REPL, and right on the second run I got the error:

WARNING: cuRAND operation failed with status: CURAND_STATUS_INITIALIZATION_FAILED (203)

 in statuscheck at C:\Users\joaquimgarcia.julia\v0.4\CURAND\src\wrappers.jl:28
 in create_generator at C:\Users\joaquimgarcia.julia\v0.4\CURAND\src\wrappers.jl:36
 in create_generator at C:\Users\joaquimgarcia.julia\v0.4\CURAND\src\wrappers.jl:35
ERROR: "cuRAND operation failed with status: CURAND_STATUS_INITIALIZATION_FAILED"
 in statuscheck at C:\Users\joaquimgarcia.julia\v0.4\CURAND\src\wrappers.jl:29


dfdx commented 8 years ago

Can you try the C version?

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>

#define CUDA_CALL(expr) {                       \
    int status = (expr);                        \
    if (status != cudaSuccess) {                    \
      printf("Error at %s:%d -- %d\n", __FILE__, __LINE__, status); \
      exit(EXIT_FAILURE);                       \
    }                                   \
  }

#define CURAND_CALL(expr) {                     \
    int status = (expr);                        \
    if (status != CURAND_STATUS_SUCCESS) {              \
      printf("Error at %s:%d -- %d\n", __FILE__, __LINE__, status); \
      exit(EXIT_FAILURE);                       \
    }                                   \
  }

int device()
{
  int dev;
  CUDA_CALL(cudaGetDevice(&dev))
  return dev;
}

void device_reset()
{
  CUDA_CALL(cudaDeviceReset());
}

void print_array(float* a, int n)
{
  int i;
  for(i = 0; i < n; i++) {
    printf("%1.4f ", a[i]);
  }
  printf("\n");
}

int main(int argc, char *argv[])
{
  size_t n = 10;
  int dev;
  float *A;
  float *d_A;
  curandGenerator_t gen;

  // buffer allocation
  A = (float *)calloc(n, sizeof(float));
  CUDA_CALL(cudaMalloc((void **)&d_A, n * sizeof(float)));

  // CURAND
  CURAND_CALL(curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT));
  CURAND_CALL(curandGenerateUniform(gen, d_A, n));
  CUDA_CALL(cudaMemcpy(A, d_A, n * sizeof(float), cudaMemcpyDeviceToHost));
  print_array(A, n);

  // CUDArt.init()
  dev = device();
  printf("Using device #%d\n", dev);

  // CUDArt.close()
  device_reset();

  // CURAND again
  CURAND_CALL(curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT));
  CUDA_CALL(cudaMalloc((void **)&d_A, n * sizeof(float)));  // reallocating d_A doesn't change the result
  CURAND_CALL(curandGenerateUniform(gen, d_A, n));
  CUDA_CALL(cudaMemcpy(A, d_A, n * sizeof(float), cudaMemcpyDeviceToHost));
  print_array(A, n);

  return EXIT_SUCCESS;
}

Copy and paste this code into a file called "curand_test.cu", then compile and run it with:

nvcc curand_test.cu -lcurand -o test.out && ./test.out 
joaquimg commented 8 years ago

OK, I could build the program, but how do I run that on Windows? Sorry for the probably stupid question.


joaquimg commented 8 years ago

OK, I generated an .exe and got this result:

0.7402 0.9210 0.0390 0.9690 0.9251 0.4464 0.6673 0.1099 0.4702 0.5132
Using device #0
Error at curand_test.cu:73 -- 203


dfdx commented 8 years ago

So it's the same error as previously. Now we know that:

- the issue is not in the Julia wrappers, but in the CUDA libraries themselves
- the issue is specific either to Windows, or to your specific machine

Is there anybody with another Windows machine who can test it and see if the issue is reproduced?

joaquimg commented 8 years ago

That's a good result! I will try on my Mac later this week... Have you been testing it on Linux?

Maybe it's time to ask for help on julia-users or julia-dev?


dfdx commented 8 years ago

Good idea, I posted a request for the experiment in the old topic: https://groups.google.com/forum/#!searchin/julia-users/curand%2420cudart/julia-users/mJjjTyU7cQ0/GZ6SxTfeAwAJ

As for my config, yes, I'm on Linux with CUDA 6.5.

joaquimg commented 8 years ago

The C code returned the same on my Mac with CUDA 7.0:

0.7402 0.9210 0.0390 0.9690 0.9251 0.4464 0.6673 0.1099 0.4702 0.5132
Using device #0
Error at curand_test.cu:72 -- 203


joaquimg commented 8 years ago

I just got the C code running on my Mac with CUDA 7.5, and someone else did it on Windows with CUDA 7.5:

http://stackoverflow.com/questions/33904554/curand-error-while-alternating-with-device-initialization-and-reset-in-cuda-7-0?noredirect=1#comment55573150_33904554