crisbal / docker-torch-rnn

Docker images for using torch-rnn
https://hub.docker.com/r/crisbal/torch-rnn/

'THCudaCheck FAIL' Using Cuda7.5 Docker Image #1

Open spadavec opened 8 years ago

spadavec commented 8 years ago

After installing the NVIDIA Docker image and starting the torch-rnn Docker container via:

nvidia-docker run --rm -ti crisbal/torch-rnn:cuda7.5 bash

and preprocessing via

root@3da15ad69af8:~/torch-rnn# python scripts/preprocess.py --input_txt data/library.txt --output_h5 data/library.h5 --output_json data/library.json

Attempting to train the system results in the following:

root@3da15ad69af8:~/torch-rnn# th train.lua -input_h5 data/library.h5 -input_json data/library.json
Running with CUDA on GPU 0
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9234/cutorch/lib/THC/THCGeneral.c line=608 error=8 : invalid device function
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/nn/Container.lua:67:
In 2 module of nn.Sequential:
./LSTM.lua:128: cuda runtime error (8) : invalid device function at /tmp/luarocks_cutorch-scm-1-9234/cutorch/lib/THC/THCGeneral.c:608
stack traceback:
    [C]: in function 'resize'
    ./LSTM.lua:128: in function <./LSTM.lua:118>
    [C]: in function 'xpcall'
    /root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    train.lua:130: in function 'opfunc'
    /root/torch/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
    train.lua:187: in main chunk
    [C]: in function 'dofile'
    /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

WARNING: If you see a stack trace below, it doesn't point to the place where this error occured. Please use only the one above.
stack traceback:
    [C]: in function 'error'
    /root/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
    /root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    train.lua:130: in function 'opfunc'
    /root/torch/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
    train.lua:187: in main chunk
    [C]: in function 'dofile'
    /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

crisbal commented 8 years ago

I think this is an issue that one needs to report to the main torch-rnn repo (https://github.com/jcjohnson/torch-rnn) and not on this one.

First of all, are you sure you are running a CUDA-capable video card? If so, let's try something: what happens if you run nvidia-smi inside the container? Does it show any relevant info?

spadavec commented 8 years ago

@crisbal thanks for the heads up, I will post this to the torch-rnn repo instead. For what it's worth, I do have a GPU installed:

root@9be35619d034:~/torch-rnn# nvidia-smi
Mon Jul 11 19:17:26 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.27                 Driver Version: 367.27                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0      On |                  N/A |
| 28%   41C    P8     7W / 180W |    725MiB /  8113MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

crisbal commented 8 years ago

Let me know if in the end it is my fault or theirs :)

One random thought I had: since you have a 1080, maybe it needs a newer version of CUDA that is not yet well supported by either nvidia-docker or torch.

spadavec commented 8 years ago

@crisbal it looks like the issue is that a newer version of CUDA is needed:

https://github.com/jcjohnson/torch-rnn/issues/122

Did you have any plans to make a CUDA8 version of the docker? Thanks for all the work you've done!
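
For context, `invalid device function` is the error one would expect when the CUDA kernels in the image were not compiled for the GPU in use. The GTX 1080 is a Pascal card (compute capability 6.1), and Pascal support arrived with the CUDA 8.0 toolkit, so an image built on CUDA 7.5 cannot contain native code for it. A minimal Python sketch of that check follows; the function names and the single rule it models are illustrative assumptions, not part of torch-rnn or any NVIDIA API.

```python
# Hypothetical helper, for illustration only. It models a single rule:
# GPUs with compute capability 6.x (Pascal) need CUDA toolkit 8.0 or newer.
PASCAL_MIN_CUDA = (8, 0)


def parse_version(text):
    """Turn a dotted version string such as '7.5' into a tuple (7, 5)."""
    return tuple(int(part) for part in text.split("."))


def toolkit_supports(cuda_version, compute_capability):
    """Return True if the toolkit can build native code for the GPU.

    Only Pascal (compute capability 6.x) is modeled in this sketch.
    """
    major = parse_version(compute_capability)[0]
    if major != 6:
        raise ValueError("only compute capability 6.x is modeled here")
    return parse_version(cuda_version) >= PASCAL_MIN_CUDA


if __name__ == "__main__":
    # The cuda7.5 image from this issue versus a CUDA 8.0 build.
    for cuda in ("7.5", "8.0"):
        print(cuda, toolkit_supports(cuda, "6.1"))
```

Under this rule the `cuda7.5` image is rejected for a GTX 1080 and a CUDA 8.0 image is accepted, which matches the conclusion reached in the linked torch-rnn issue.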

crisbal commented 8 years ago

As soon as I get my hands on a Cuda machine and on fast Internet I will. Sorry I can't do it ASAP.

xoryouyou commented 7 years ago

@spadavec I had the same issue and built this today: https://hub.docker.com/r/xoryouyou/torch-rnn/

HandsomeDevilv112 commented 7 years ago

I got this error today as I'm using a 1080 and have cuda 8 installed. @xoryouyou, I tried the command on the page you posted, but I'm getting an error:

docker pull xoryouyou/torch-rnn
Using default tag: latest
Error response from daemon: manifest for xoryouyou/torch-rnn:latest not found

xoryouyou commented 7 years ago

@HandsomeDevilv112 yeah, the image was only tagged as 1.0 and not latest. I updated it.
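
The failed pull above comes from Docker's default tag: when no tag is given, `docker pull` asks for `latest`, and that tag did not exist for this image at the time. A minimal Python sketch of how an untagged reference is expanded; `with_default_tag` is a hypothetical helper written for this illustration, not a Docker API.

```python
def with_default_tag(image):
    """Append ':latest' when an image reference has no tag or digest.

    Simplified illustration of the default-tag behaviour; it does not
    validate the reference.
    """
    if "@" in image:
        # Pinned by digest (name@sha256:...), so no tag is implied.
        return image
    # Only the last path component can carry a tag; a colon earlier in the
    # reference belongs to a registry port such as localhost:5000.
    last_component = image.rsplit("/", 1)[-1]
    if ":" in last_component:
        return image
    return image + ":latest"


if __name__ == "__main__":
    print(with_default_tag("xoryouyou/torch-rnn"))
    print(with_default_tag("crisbal/torch-rnn:cuda7.5"))
```

Pulling with an explicit tag, such as the `1.0` tag mentioned above, sidesteps the problem when an image has no `latest` tag.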

HandsomeDevilv112 commented 7 years ago

@xoryouyou: Cool! Much obliged. That seems to have done the trick. My apologies if there was a way for me to fix that myself and I just didn't catch it.

valentinvieriu commented 7 years ago

@xoryouyou Do you think you could share the Dockerfile as well? I'd like to see how you built your image. I'm trying to use https://github.com/crisbal/docker-torch-rnn/blob/master/CUDA/8.0/Dockerfile but it does not build. It fails at this section:

RUN git clone https://github.com/jcjohnson/torch-rnn && \
    pip install -r torch-rnn/requirements.txt

xoryouyou commented 7 years ago

@valentinvieriu sorry, I currently don't have access to the machine where I built the torch-rnn image, but I'll see if I can recreate your issue.

valentinvieriu commented 7 years ago

This is the issue that pops out:

'''
copying h5py/tests/hl/test_file.py -> build/lib.linux-x86_64-2.7/h5py/tests/hl
running build_ext
Traceback (most recent call last):
  File "", line 1, in
  File "/tmp/pip-build-MyYa9y/h5py/setup.py", line 140, in
    cmdclass = CMDCLASS,
  File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
    dist.run_commands()
  File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
  File "/usr/lib/python2.7/dist-packages/wheel/bdist_wheel.py", line 179, in run
    self.run_command('build')
  File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
  File "/usr/lib/python2.7/distutils/command/build.py", line 128, in run
    self.run_command(cmd_name)
  File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
  File "/tmp/pip-build-MyYa9y/h5py/setup_build.py", line 140, in run
    from Cython.Build import cythonize
ImportError: No module named Cython.Build
'''

This is when using https://github.com/crisbal/docker-torch-rnn/blob/master/CUDA/8.0/Dockerfile

As mentioned, it fails at this section:

RUN git clone https://github.com/jcjohnson/torch-rnn && \
    pip install -r torch-rnn/requirements.txt

Any help is appreciated. I'm not very familiar with the dependencies, I plan only to use this as a tool.

Thank you @xoryouyou

valentinvieriu commented 7 years ago

OK, for future reference, this fixed the build issue on Ubuntu 16.04. Replace

RUN git clone https://github.com/jcjohnson/torch-rnn && \
    pip install -r torch-rnn/requirements.txt

from https://github.com/crisbal/docker-torch-rnn/blob/master/CUDA/8.0/Dockerfile with

#torch-rnn and python requirements
# we use https://github.com/jcjohnson/torch-rnn/blob/master/requirements.txt as a guideline
WORKDIR /root
RUN apt-get install -y cython
RUN pip install --upgrade pip
RUN pip install Cython==0.23.4
RUN pip install numpy==1.10.4
RUN pip install argparse==1.2.1
RUN HDF5_DIR=/usr/lib/x86_64-linux-gnu/hdf5/serial/ pip install h5py==2.5.0
RUN pip install six==1.10.0
RUN git clone https://github.com/jcjohnson/torch-rnn

I will work on a Docker image and share it with the rest when it's finished
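
The traceback shows why the order matters: h5py's build step imports `Cython.Build`, so Cython has to be installed before pip tries to build h5py. A minimal Python sketch of that ordering rule, assuming (as in the fix above) that Cython and numpy should come first; `order_for_install` and `BUILD_FIRST` are hypothetical names for this illustration and not part of pip.

```python
# Packages that must already be installed before anything else is built.
# Assumption for this sketch: Cython first, then numpy, as in the fix above.
BUILD_FIRST = ("cython", "numpy")


def requirement_name(requirement):
    """Extract the lower-cased package name from a 'name==version' pin."""
    return requirement.split("==")[0].strip().lower()


def order_for_install(requirements):
    """Return the pins with build-time dependencies moved to the front."""
    first = [r for r in requirements if requirement_name(r) in BUILD_FIRST]
    rest = [r for r in requirements if requirement_name(r) not in BUILD_FIRST]
    # Keep the build dependencies in the order given by BUILD_FIRST.
    first.sort(key=lambda r: BUILD_FIRST.index(requirement_name(r)))
    return first + rest


if __name__ == "__main__":
    pins = ["h5py==2.5.0", "six==1.10.0", "numpy==1.10.4",
            "Cython==0.23.4", "argparse==1.2.1"]
    for pin in order_for_install(pins):
        print("pip install " + pin)
```

Installing the pins one at a time in this order is the same idea as the separate `RUN pip install` lines in the Dockerfile fragment above.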

xoryouyou commented 7 years ago

@valentinvieriu I am currently building with the crisbal/docker-torch-rnn image on arch and it looks to build fine. Will report when done.

xoryouyou commented 7 years ago

Built on Linux 4.12.8-2-ARCH #1 SMP PREEMPT Fri Aug 18 14:08:02 UTC 2017 x86_64 GNU/Linux with Docker version 17.06.0-ce, build 3dfb8343.

build_log.txt