deeplearningais / CUV

Matrix library for CUDA in C++ and Python
www.ais.uni-bonn.de

Pulling to Host Python #7

Open JT17 opened 9 years ago

JT17 commented 9 years ago

I'm trying to convert from a tensor to a numpy array, but whenever I do, I get an error:

python: /home/mobile/Downloads/CUV/src/cuv/basics/tensor.hpp:1138: bool cuv::tensor<V, M, L>::copy_memory(const cuv::tensor<V, _M, _L>&, bool, cudaStream_t) [with OM = cuv::dev_memory_space; OL = cuv::row_major; V = float; M = cuv::host_memory_space; L = cuv::row_major; cudaStream_t = CUstream_st*]: Assertion `m_memory.get()' failed. Aborted (core dumped)

Even with something as simple as example1.py:

import cuv_python as cp
import numpy as np

h = np.zeros((1,256))                                   # create numpy matrix
d = cp.dev_tensor_float(h)                              # constructs by copying numpy_array

h2 = np.zeros((1,256)).copy("F")                        # create numpy matrix
d2 = cp.dev_tensor_float_cm(h2)                         # creates dev_tensor_float_cm (column-major float) object

cp.fill(d,1)                                            # terse form
cp.apply_nullary_functor(d,cp.nullary_functor.FILL,1)   # verbose form

h = d.np                                                # pull and convert to numpy
print type(h)
assert(np.sum(h) == 256)
assert(cp.sum(d) == 256)
d.dealloc()                                             # explicitly deallocate memory (optional)

I don't have any problems with the C++ examples, but whenever a program runs into something such as h=d.np, I get that assertion-failed error. Any ideas?

temporaer commented 9 years ago

Can you try initializing the device first, using e.g. cp.initCUDA(0)? The following works for me:

import cuv_python as cp
cp.initCUDA(0)
x=cp.dev_tensor_float([4,5])
x.np
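
Applied to your example1.py, the change would look roughly like this (just a sketch; the only intended difference is calling initCUDA(0) before any tensor is created):

import cuv_python as cp
import numpy as np

cp.initCUDA(0)                 # initialize the device first

h = np.zeros((1,256))          # create numpy matrix
d = cp.dev_tensor_float(h)     # constructs by copying numpy_array
cp.fill(d,1)                   # fill the device tensor with ones

h = d.np                       # pull back to the host as a numpy array
assert(np.sum(h) == 256)
d.dealloc()                    # explicitly deallocate memory (optional)
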
JT17 commented 9 years ago

When I run the same code, I still get the failed assertion error. Does yours work when you don't initialize?

temporaer commented 9 years ago

Mine works even w/o initialization.


JT17 commented 9 years ago

Okay. I can also run the basic C++ examples, if that means anything. When the m_memory.get() assertion fails, that is when the host is trying to get the data off of the device, right?

temporaer commented 9 years ago

Yes. The assertion is very basic; it just verifies that the internal pointer to the data is not NULL. I'm at a loss as to why this should behave differently in the Python bindings than in native C++, especially since you seem to be able to perform other operations. Maybe it's a similar issue to last time: can you check import cuv_python as cp; print cp.__path__? Is the result current? Does the .so Python library in that directory link to a current version of libcuv (you can find the path using ldd _cuv_python.so)?
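
Concretely, something like this (the grep is only there to narrow down the ldd output):

import cuv_python as cp
print cp.__path__              # should point at the directory of your current install

# then, from a shell in that directory:
#   ldd _cuv_python.so | grep libcuv
# and verify that the libcuv.so it resolves to comes from the same build you are running.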

JT17 commented 9 years ago

So my cuv_python uses libcuv.so.0, but I also have a libcuv.so.0.9. I'm not sure which one the C++ version uses, but could having both libcuv.so.0 and libcuv.so.0.9 be the issue? Both versions are from the same build.

temporaer commented 9 years ago

One should be a symbolic link to the other.

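A quick way to verify that (a generic sketch; the path below is a placeholder for wherever your libcuv.so.0 actually lives):

import os
# '/path/to/libcuv.so.0' is a placeholder; use the path that ldd reported
print os.path.realpath('/path/to/libcuv.so.0')   # should resolve to the libcuv.so.0.9 from the same build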

tambetm commented 9 years ago

I'm having the same problem when running the first DBM example from https://github.com/deeplearningais/CUV/tree/master/examples/rbm - the training goes well, but just before it finishes it outputs:

....................................................................................................Iter:  99900 Err: 0.850475 |W| = 553689.00  1.3250e-05s/img ; 7.5472e+04 img/s
/tmp/dbm
python: /home/hpc_tambet/include/cuv/basics/tensor.hpp:1138: bool cuv::tensor<V, M, L>::copy_memory(const cuv::tensor<V, _M, _L>&, bool, CUstream_st*) [with OM = cuv::dev_memory_space, OL = cuv::column_major, V = float, M = cuv::host_memory_space, L = cuv::column_major]: Assertion `m_memory.get()' failed.
...................................................................................................Finished Layer  0

The same happens with the MLP example, but immediately after starting it:

Calculating statistics for minibatch...
/home/hpc_tambet/src/CUV/examples/rbm/minibatch_provider.py:28: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  if teacher != None:
.......................................................................................................................................................done.
/tmp/mlp_rprop: Epoch  1 / 200
python: /home/hpc_tambet/include/cuv/basics/tensor.hpp:1138: bool cuv::tensor<V, M, L>::copy_memory(const cuv::tensor<V, _M, _L>&, bool, CUstream_st*) [with OM = cuv::dev_memory_space, OL = cuv::column_major, V = float, M = cuv::host_memory_space, L = cuv::column_major]: Assertion `m_memory.get()' failed.

I checked my _cuv_python.so and it appears to be linked correctly:

(sandbox)[hpc_tambet@juur cuv_python]$ ldd _cuv_python.so
        linux-vdso.so.1 =>  (0x00007fff5e5ff000)
        libcuv.so.0 => /home/hpc_tambet/lib/libcuv.so.0 (0x00007f207e1ad000)
        libboost_date_time.so.1.57.0 => /home/hpc_tambet/lib/libboost_date_time.so.1.57.0 (0x00007f207df99000)
        libboost_python.so.1.57.0 => /home/hpc_tambet/lib/libboost_python.so.1.57.0 (0x00007f207dd48000)
        libpython2.7.so.1.0 => /usr/lib64/libpython2.7.so.1.0 (0x00007f207d96c000)
        libcublas.so.5.0 => /usr/local/cuda/lib64/libcublas.so.5.0 (0x00007f2079f74000)
        libblas.so.3 => /usr/lib64/libblas.so.3 (0x00007f2079d1d000)
        libX11.so.6 => /usr/lib64/libX11.so.6 (0x00007f20799e0000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f20797c2000)
        libpng12.so.0 => /usr/lib64/libpng12.so.0 (0x00007f207959c000)
        libz.so.1 => /lib64/libz.so.1 (0x00007f2079386000)
        libtp_cudaconv2.so.0 => /home/hpc_tambet/lib/libtp_cudaconv2.so.0 (0x00007f2078923000)
        libcblas.so.3 => /usr/lib64/atlas/libcblas.so.3 (0x00007f2078703000)
        libtp_theano.so.0 => /home/hpc_tambet/lib/libtp_theano.so.0 (0x00007f207843a000)
        libcudart.so.5.0 => /usr/local/cuda/lib64/libcudart.so.5.0 (0x00007f20781df000)
        libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007f207727b000)
        libboost_unit_test_framework.so.1.57.0 => /home/hpc_tambet/lib/libboost_unit_test_framework.so.1.57.0 (0x00007f2076fb3000)
        libboost_serialization.so.1.57.0 => /home/hpc_tambet/lib/libboost_serialization.so.1.57.0 (0x00007f2076d48000)
        libboost_thread.so.1.57.0 => /home/hpc_tambet/lib/libboost_thread.so.1.57.0 (0x00007f2076b26000)
        libboost_system.so.1.57.0 => /home/hpc_tambet/lib/libboost_system.so.1.57.0 (0x00007f2076923000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f207661c000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f2076398000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f2076182000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f2075ded000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f2075be5000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007f20759e1000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f20757dd000)
        libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x00007f20754eb000)
        libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x00007f20752cc000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f20824b4000)
        libatlas.so.3 => /usr/lib64/atlas/libatlas.so.3 (0x00007f2074bbe000)
        libXau.so.6 => /usr/lib64/libXau.so.6 (0x00007f20749ba000)

I'm running it on our university cluster, so I had to compile Boost, PyUblas and CUV all by myself.

I also got some errors while running ctest, maybe these are related?

7/20 Testing: theano_ops
7/20 Test: theano_ops
Command: "/home/hpc_tambet/src/CUV/build/debug/src/tests/theano_ops"
Directory: /home/hpc_tambet/src/CUV/build/debug/src/tests
"theano_ops" start time: Jan 06 17:14 EET
Output:
----------------------------------------------------------
Testing on device=0
Running 2 test cases...
init cuda and py
 1 dim tensor finished
init cuda and py
unknown location(0): fatal error in "test_flip_dims": memory access violation at address: 0x00000001: no mapping at fault address
/home/hpc_tambet/src/CUV/src/tests/lib_theano_ops.cpp(276): last checkpoint

*** 1 failure detected in test suite "example"
<end of output>
Test time =   4.64 sec
----------------------------------------------------------
Test Failed.
"theano_ops" end time: Jan 06 17:14 EET
"theano_ops" time elapsed: 00:00:04
----------------------------------------------------------

In addition to theano_ops, the optimize, conv_op, lib_cimg, lib_sep_conv, hog and nose_tests tests also failed. But since all the other tests seemed to work (including rbm), I decided to give it a try.

The Linux I'm using is "Scientific Linux CERN SLC release 6.6 (Carbon)", with CUDA toolkit 5.0 and an NVidia Tesla K20 GPU.