ddemidov / vexcl

VexCL is a C++ vector expression template library for OpenCL/CUDA/OpenMP
http://vexcl.readthedocs.org
MIT License
702 stars 82 forks

Failed to create directories / read only file system #276

Closed: DABH closed this issue 4 years ago

DABH commented 4 years ago

I'm getting an error from VexCL (via AMGCL) when attempting to do a linear solve on a node with some GPUs (using the VexCL backend of AMGCL):

terminate called after throwing an instance of 'boost::filesystem::filesystem_error'
  what():  boost::filesystem::create_directories: Read-only file system: "/home/foo/.vexcl/e9"

This causes the program to crash. I'm not sure what ~/.vexcl is or how it's used. I tried chmod -R 777 /home/foo/.vexcl, but that made no difference, so I changed it back.

The code works fine on my laptop but not on the server to which I've copied it, so it could be an issue with the server configuration. Regardless, any ideas on how to troubleshoot, or ideally resolve, this kind of issue? Thanks in advance for any ideas you might have!

ddemidov commented 4 years ago

VexCL caches the binaries for the compiled kernels in ~/.vexcl, so you don't have to wait for the compilation again next time you run the same program. I think a similar problem was reported to me some time ago (by @mmoelle1) and the resolution was to use a different version of boost (or, in fact, to use boost libraries compiled from sources). Could this work in your case?

The path to the offline cache is defined here:

https://github.com/ddemidov/vexcl/blob/0ef6a6c6ae7acc857861549cd5238a635a7f3ad1/vexcl/backend/common.hpp#L214-L222

You could change it to some other directory and see if that works, but I suspect the problem here is in boost::filesystem and not in the specific directory.
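
For reference, the failing call in the report is essentially of the following shape. This is an illustrative sketch only (not VexCL's actual code), with the "e9" subdirectory standing in for what is presumably a prefix of the kernel-source hash:

#include <boost/filesystem.hpp>
#include <cstdlib>
#include <iostream>
#include <string>

int main() {
    namespace fs = boost::filesystem;

    // Hypothetical hash prefix standing in for the "e9" seen in the error.
    const std::string hash_prefix = "e9";

    const char *home = std::getenv("HOME");
    fs::path cache_dir = fs::path(home ? home : ".") / ".vexcl" / hash_prefix;

    try {
        // This is the call that throws on a read-only filesystem.
        fs::create_directories(cache_dir);
        std::cout << "cache directory ready: " << cache_dir << "\n";
    } catch (const fs::filesystem_error &e) {
        std::cerr << e.what() << "\n";
        return 1;
    }
}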

DABH commented 4 years ago

I'm using Boost 1.67.0, which I compiled from source on the target machine. I guess I could try something like 1.71.0 and see if that works? Will let you know. Alternatively, yeah, I'll try changing the cache path to another filesystem and see if that helps. Thanks!

ddemidov commented 4 years ago

I would also check with ldd that the libboost_filesystem being linked is actually the one you compiled yourself. Another thing to check is that you are not mixing debug and release builds of boost and vexcl.

DABH commented 4 years ago

It turns out the filesystem I was attempting to write to was mounted read-only, so it was really an issue with that server. Apologies.

However, I'm now getting

terminate called after throwing an instance of 'std::runtime_error'
  what():  nvcc invocation failed

which nvcc does show that nvcc is on the path, though. Is there any other reason we'd get "nvcc invocation failed" from vexcl?

It seems to happen for lots of kernels, such as:

extern "C" __global__ void vexcl_vector_kernel
(
  ulong n,
  double * prm_1,
  double * prm_2_expr_1,
  ulong * prm_2_slice_1
)
{
  for
  (
    ulong idx = blockDim.x * blockIdx.x + threadIdx.x, grid_size = blockDim.x * gridDim.x;
    idx < n;
    idx += grid_size
  )
  {
    prm_1[idx] = prm_2_expr_1[prm_2_slice_1[idx]];
  }
}

DABH commented 4 years ago

One thought -- on the server I am compiling/running with CUDA 10, whereas on my laptop I'm using CUDA 9. Is VexCL known to work (or not work) with CUDA 10?

ddemidov commented 4 years ago

VexCL compiles its compute kernels on the fly, at runtime. In the case of the CUDA backend this means that nvcc (the CUDA compiler) has to be available at runtime.
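
To make the "nvcc at runtime" point concrete, here is a rough sketch of the general pattern (this is not VexCL's implementation, just the idea, and the "dummy" kernel is purely illustrative): the generated kernel source is written to a file, nvcc is invoked as an external process to produce PTX, and the result is loaded through the CUDA driver API. If nvcc is not reachable from the process environment, the failure only shows up at this point, at runtime.

#include <cuda.h>
#include <cstdlib>
#include <fstream>
#include <stdexcept>

int main() {
    // 1. Dump the generated kernel source to a file.
    {
        std::ofstream src("kernel.cu");
        src << "extern \"C\" __global__ void dummy(double *p) {"
               " p[threadIdx.x] = 42.0; }\n";
    }

    // 2. Shell out to nvcc. This is the step that fails with
    //    "nvcc invocation failed" when nvcc cannot be found or errors out.
    if (std::system("nvcc -ptx kernel.cu -o kernel.ptx") != 0)
        throw std::runtime_error("nvcc invocation failed");

    // 3. Load the compiled module through the CUDA driver API.
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    CUmodule mod;
    if (cuModuleLoad(&mod, "kernel.ptx") != CUDA_SUCCESS)
        throw std::runtime_error("failed to load compiled kernel");

    CUfunction fun;
    cuModuleGetFunction(&fun, mod, "dummy");

    cuCtxDestroy(ctx);
    return 0;
}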

It sounds like you are running on a compute cluster, where you compile your program on the login node and run it from a batch job on one of the compute nodes? If that is the case, is nvcc available on the compute nodes? Do you need to load a cuda module from your batch script?

If nvcc is available on the login node but not on the compute nodes, you could preload the offline vexcl cache (~/.vexcl) by running the program once on the login node. However, the cuda/nvidia driver version is one of the things vexcl checks when using the cache, so those have to coincide for the cache to work.

If OpenCL is available on the server, you could try to use that instead of CUDA, since the OpenCL compiler is embedded in the OpenCL library and you don't need any external executables at runtime.
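
For contrast, a minimal OpenCL host-side sketch (assuming a single available platform and device; again, not VexCL code) shows that the compilation happens inside the OpenCL runtime library itself, via clBuildProgram, with no external compiler executable involved:

#include <CL/cl.h>
#include <cstdio>

int main() {
    const char *src =
        "kernel void dummy(global float *p) { p[get_global_id(0)] = 42.0f; }";

    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    cl_int err;
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);

    // The kernel is compiled here, inside the OpenCL library; no nvcc needed.
    err = clBuildProgram(prog, 1, &device, "", NULL, NULL);
    std::printf("clBuildProgram returned %d\n", err);

    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}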

Is VexCL known to work (or not work) with CUDA 10?

yes, it should work with cuda 10 (I have it installed on my machine).

DABH commented 4 years ago

Thanks so much. I'm still debugging but will keep you posted. I tried out CUDA 9.2.148 just to see, and that yields a different error,

terminate called after throwing an instance of 'vex::backend::cuda::error'
  what():  /home/foo/external_libraries/vexcl/vexcl/backend/cuda/device_vector.hpp:142
        CUDA Driver API Error (Unknown error 700)

It may be the same or a similar error to the one reported by CUDA 10, just surfaced less clearly. nvcc is definitely available on the compute nodes; I added which nvcc to my job and it prints out the right paths.

Also, I was able to get a successful run when using only 2 compute nodes instead of 8, on a much smaller problem size. So I wonder if there is possibly some issue related to either the number of nodes/GPUs (I have 6 GPUs per node) or the size of the problem (e.g. out of memory -- though I think I am using enough nodes... [edit: I'm testing with 8 ranks, the linear system I'm solving on each rank takes 12GB of RAM, and each of the 6 GPUs has 16GB of RAM, so assuming the matrix is partitioned across the GPUs (roughly 2GB per GPU) it should be way under the memory limit of the GPUs]).

I'll keep trying to get CUDA to work but may give the OpenCL backend a try too. Not sure if it's available but will investigate. Thanks again for all the help.

[edit 2: now getting the 700 error with CUDA 10 as well. no longer seeing the "nvcc invocation failed" errors for now at least...]

ddemidov commented 4 years ago

CUDA error 700 is defined in cuda.h to be:

    /**
     * While executing a kernel, the device encountered a
     * load or store instruction on an invalid memory address.
     * This leaves the process in an inconsistent state and any further CUDA work
     * will return the same error. To continue using CUDA, the process must be terminated
     * and relaunched.
     */
    CUDA_ERROR_ILLEGAL_ADDRESS                = 700,

So it looks like there is an out-of-bounds error somewhere. I would rerun the same tests with the builtin backend to see if it is possible to pinpoint the exact location of the error.
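
As an aside, the numeric code can also be translated at runtime with the driver API itself; a small standalone snippet (not part of the code in this thread):

#include <cuda.h>
#include <cstdio>

int main() {
    // Translate a CUDA driver API error code (e.g. 700) into its
    // symbolic name and human-readable description.
    const CUresult err = static_cast<CUresult>(700);

    const char *name = NULL;
    const char *desc = NULL;
    cuGetErrorName(err, &name);    // e.g. "CUDA_ERROR_ILLEGAL_ADDRESS"
    cuGetErrorString(err, &desc);

    std::printf("%d: %s - %s\n", static_cast<int>(err),
                name ? name : "?", desc ? desc : "?");
    return 0;
}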

DABH commented 4 years ago

Eventually this error was resolved, but it seems like VexCL is having some issues with AMGCL+MPI: https://github.com/ddemidov/amgcl/issues/140. Going to close this, though, since these specific issues seem to be resolved. Thanks again.