AaronJackson / vrn

:man: Code for "Large Pose 3D Face Reconstruction from a Single Image via Direct Volumetric CNN Regression"
http://aaronsplace.co.uk/papers/jackson2017recon/
MIT License

cuda runtime error (2) : out of memory at #107

Closed OswaldoBornemann closed 5 years ago

OswaldoBornemann commented 5 years ago

I have three GPUs, but only one (11 GB) is free. I added the code below to 'face-aligment/utils.lua':

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2'

When I run './run.sh', the output shows:

Scanning directory for data...
Found 5 images
5 images require a face detector
Initialising python libs...
Initialising detector...
Cropped and scaled AFLW_image00046.jpg
Cropped and scaled AFLW_image00095.jpg
Cropped and scaled AFLW_image00190.jpg
Cropped and scaled AFLW_image00656.jpg
Cropped and scaled asj.jpg
THCudaCheck FAIL file=/home/zeng_ruihong/torch/extra/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/zeng_ruihong/torch/install/bin/luajit: /home/zeng_ruihong/torch/install/share/lua/5.1/nn/utils.lua:11: cuda runtime error (2) : out of memory at /home/zeng_ruihong/torch/extra/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
        [C]: in function 'resize'
        /home/zeng_ruihong/torch/install/share/lua/5.1/nn/utils.lua:11: in function 'torch_Storage_type'
        /home/zeng_ruihong/torch/install/share/lua/5.1/nn/utils.lua:57: in function 'recursiveType'
        ...e/zeng_ruihong/torch/install/share/lua/5.1/nn/Module.lua:160: in function 'type'
        /home/zeng_ruihong/torch/install/share/lua/5.1/nn/utils.lua:45: in function 'recursiveType'
        /home/zeng_ruihong/torch/install/share/lua/5.1/nn/utils.lua:41: in function 'recursiveType'
        ...e/zeng_ruihong/torch/install/share/lua/5.1/nn/Module.lua:160: in function 'cuda'
        process.lua:18: in main chunk
        [C]: in function 'dofile'
        ...hong/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50
ls: cannot access '*.raw': No such file or directory
OswaldoBornemann commented 5 years ago

I switched from GPU to CPU, and the output is below:

zeng_ruihong@GPU-server:~/vrn$ ./run.sh
Scanning directory for data...
Found 5 images
5 images require a face detector
Initialising python libs...
Initialising detector...
/home/zeng_ruihong/torch/install/bin/luajit: ...eng_ruihong/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
.../zeng_ruihong/torch/install/share/lua/5.1/cudnn/init.lua:171: assertion failed!
stack traceback:
        [C]: in function 'assert'
        .../zeng_ruihong/torch/install/share/lua/5.1/cudnn/init.lua:171: in function 'toDescriptor'
        ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:123: in function 'createIODescriptors'
        ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:188: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:186>
        [C]: in function 'xpcall'
        ...eng_ruihong/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        ...ng_ruihong/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'func'
        ..._ruihong/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
        ..._ruihong/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
        main.lua:63: in main chunk
        [C]: in function 'dofile'
        ...hong/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
        [C]: in function 'error'
        ...eng_ruihong/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
        ...ng_ruihong/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'func'
        ..._ruihong/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
        ..._ruihong/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
        main.lua:63: in main chunk
        [C]: in function 'dofile'
        ...hong/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50
Cropped and scaled AFLW_image00046.jpg
Cropped and scaled AFLW_image00095.jpg
Cropped and scaled AFLW_image00190.jpg
Cropped and scaled AFLW_image00656.jpg
Cropped and scaled asj.jpg
Processed AFLW_image00190.
Processed asj.
Processed AFLW_image00095.
Processed AFLW_image00656.
Processed AFLW_image00046.
: cannot connect to X server
: cannot connect to X server
: cannot connect to X server
: cannot connect to X server
: cannot connect to X server 
AaronJackson commented 5 years ago

CPU mode isn't supported. Run nvidia-smi and confirm that you do actually have memory free on the GPU.

OswaldoBornemann commented 5 years ago

May I ask how I can specify the GPU device that Torch7 uses? I am new to Lua. Thanks @AaronJackson

AaronJackson commented 5 years ago

You had the right idea in Python, but to make it apply to all applications run from the current shell, you need to export the device you want to use, i.e.:

export CUDA_VISIBLE_DEVICES=2
./run.sh
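
To illustrate why exporting is the usual approach, here is a minimal Python sketch, independent of vrn: a variable placed in the environment is inherited by child processes, which is how the setting normally reaches a program started from the shell. A child that reassigns the variable itself would override the inherited value. The helper name `child_sees` is hypothetical.

```python
import os
import subprocess
import sys


def child_sees(device):
    """Launch a child Python process with CUDA_VISIBLE_DEVICES set and
    return the value the child reads from its own environment."""
    # Copy the current environment and add the variable, as `export` would.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(device))
    result = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
        env=env, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Calling `child_sees(2)` returns the string the child process observed, without touching the parent's own environment.
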
OswaldoBornemann commented 5 years ago

@AaronJackson Following your instructions, I got the same error:

zeng_ruihong@GPU-server:~/vrn$ export CUDA_VISIBLE_DEVICES=2
zeng_ruihong@GPU-server:~/vrn$ ./run.sh
Scanning directory for data...
Found 5 images
5 images require a face detector
Initialising python libs...
Initialising detector...
THCudaCheck FAIL file=/home/zeng_ruihong/torch/extra/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/zeng_ruihong/torch/install/bin/luajit: .../zeng_ruihong/torch/install/share/lua/5.1/torch/File.lua:351: cuda runtime error (2) : out of memory at /home/zeng_ruihong/torch/extra/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
        [C]: in function 'read'
        .../zeng_ruihong/torch/install/share/lua/5.1/torch/File.lua:351: in function <.../zeng_ruihong/torch/install/share/lua/5.1/torch/File.lua:245>
        [C]: in function 'read'
        .../zeng_ruihong/torch/install/share/lua/5.1/torch/File.lua:351: in function 'readObject'
        .../zeng_ruihong/torch/install/share/lua/5.1/torch/File.lua:369: in function 'readObject'
        ...e/zeng_ruihong/torch/install/share/lua/5.1/nn/Module.lua:192: in function 'read'
        .../zeng_ruihong/torch/install/share/lua/5.1/torch/File.lua:351: in function 'readObject'
        .../zeng_ruihong/torch/install/share/lua/5.1/torch/File.lua:369: in function 'readObject'
        .../zeng_ruihong/torch/install/share/lua/5.1/torch/File.lua:369: in function 'readObject'
        ...e/zeng_ruihong/torch/install/share/lua/5.1/nn/Module.lua:192: in function 'read'
        .../zeng_ruihong/torch/install/share/lua/5.1/torch/File.lua:351: in function 'readObject'
        ...
        .../zeng_ruihong/torch/install/share/lua/5.1/torch/File.lua:353: in function 'readObject'
        .../zeng_ruihong/torch/install/share/lua/5.1/torch/File.lua:369: in function 'readObject'
        .../zeng_ruihong/torch/install/share/lua/5.1/torch/File.lua:369: in function 'readObject'
        ..._ruihong/torch/install/share/lua/5.1/nngraph/gmodule.lua:495: in function 'read'
        .../zeng_ruihong/torch/install/share/lua/5.1/torch/File.lua:351: in function 'readObject'
        .../zeng_ruihong/torch/install/share/lua/5.1/torch/File.lua:409: in function 'load'
        main.lua:29: in main chunk
        [C]: in function 'dofile'
        ...hong/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50
Cropped and scaled AFLW_image00046.jpg
Cropped and scaled AFLW_image00095.jpg
Cropped and scaled AFLW_image00190.jpg
Cropped and scaled AFLW_image00656.jpg
Cropped and scaled asj.jpg
THCudaCheck FAIL file=/home/zeng_ruihong/torch/extra/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/zeng_ruihong/torch/install/bin/luajit: /home/zeng_ruihong/torch/install/share/lua/5.1/nn/utils.lua:11: cuda runtime error (2) : out of memory at /home/zeng_ruihong/torch/extra/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
        [C]: in function 'resize'
        /home/zeng_ruihong/torch/install/share/lua/5.1/nn/utils.lua:11: in function 'torch_Storage_type'
        /home/zeng_ruihong/torch/install/share/lua/5.1/nn/utils.lua:57: in function 'recursiveType'
        ...e/zeng_ruihong/torch/install/share/lua/5.1/nn/Module.lua:160: in function 'type'
        /home/zeng_ruihong/torch/install/share/lua/5.1/nn/utils.lua:45: in function 'recursiveType'
        /home/zeng_ruihong/torch/install/share/lua/5.1/nn/utils.lua:41: in function 'recursiveType'
        ...e/zeng_ruihong/torch/install/share/lua/5.1/nn/Module.lua:160: in function 'cuda'
        process.lua:18: in main chunk
        [C]: in function 'dofile'
        ...hong/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50
: cannot connect to X server
: cannot connect to X server
: cannot connect to X server
: cannot connect to X server
: cannot connect to X server 
AaronJackson commented 5 years ago

Show me the output of nvidia-smi please

OswaldoBornemann commented 5 years ago

@AaronJackson

Mon Oct 15 10:06:03 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:02:00.0 Off |                  N/A |
| 56%   83C    P2   112W / 250W |  11787MiB / 12207MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 00000000:04:00.0 Off |                  N/A |
| 87%   88C    P2   201W / 250W |  11462MiB / 12207MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 00000000:84:00.0 Off |                  N/A |
| 22%   35C    P8    18W / 250W |     11MiB / 12207MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     81088      C   python                                     11774MiB |
|    1    105752      C   python                                     11451MiB |
+-----------------------------------------------------------------------------+
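
The same check can be scripted. As a minimal sketch, the helper below parses the CSV text produced by `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits` and returns the index of the GPU with the most free memory. The function names are hypothetical, and `SAMPLE` simply restates the memory figures from the table above in that CSV layout.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Return a list of (index, free_mib) tuples from nvidia-smi CSV output."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append((index, total - used))
    return gpus


def freest_gpu(csv_text):
    """Return the index of the GPU with the most free memory."""
    return max(parse_gpu_memory(csv_text), key=lambda gpu: gpu[1])[0]


def query_nvidia_smi():
    """Run nvidia-smi and return its CSV output (needs an NVIDIA driver)."""
    return subprocess.run(QUERY, capture_output=True, text=True,
                          check=True).stdout


# Memory figures from the table above: index, used MiB, total MiB
SAMPLE = "0, 11787, 12207\n1, 11462, 12207\n2, 11, 12207\n"
best = freest_gpu(SAMPLE)  # GPU 2 has the most free memory here
```

On a machine with the driver installed, `freest_gpu(query_nvidia_smi())` would apply the same logic to live data.
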
AaronJackson commented 5 years ago

Ah, it has been a while since I looked at that script. The variable is exported in run.sh, so if you open it and change the CUDA_VISIBLE_DEVICES line to 2, you should be good to go.
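
As a sketch of that edit: the exact line in run.sh is not quoted in this thread, so the snippet below assumes it has the form `export CUDA_VISIBLE_DEVICES=<n>` and rewrites it inside a string. It also suggests why exporting the variable in the shell beforehand had no effect: a script that sets the variable itself overrides whatever value it inherited. The function name and the sample script text are hypothetical.

```python
import re

# Matches an assignment such as: export CUDA_VISIBLE_DEVICES=0
PATTERN = re.compile(r"^([ \t]*export[ \t]+CUDA_VISIBLE_DEVICES=).*$",
                     re.MULTILINE)


def set_visible_devices(script_text, device):
    """Return script_text with its CUDA_VISIBLE_DEVICES export set to device."""
    # A function replacement avoids ambiguity between the group
    # reference and the digits of the device number.
    return PATTERN.sub(lambda m: m.group(1) + str(device), script_text)


# Hypothetical stand-in for the relevant part of run.sh
sample = "#!/bin/bash\nexport CUDA_VISIBLE_DEVICES=0\necho start\n"
updated = set_visible_devices(sample, 2)
```

Editing the file by hand is just as good; the point is only that the value inside the script is the one that takes effect.
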

OswaldoBornemann commented 5 years ago

@AaronJackson Glad to hear that. I am very grateful. Thanks!! The output is now:

zeng_ruihong@GPU-server:~/vrn$ ./run.sh
Scanning directory for data...
Found 5 images
5 images require a face detector
Initialising python libs...
Initialising detector...
Cropped and scaled AFLW_image00046.jpg
Cropped and scaled AFLW_image00095.jpg
Cropped and scaled AFLW_image00190.jpg
Cropped and scaled AFLW_image00656.jpg
Cropped and scaled asj.jpg
Processed AFLW_image00190.
Processed asj.
Processed AFLW_image00095.
Processed AFLW_image00656.
Processed AFLW_image00046.
qt.qpa.screen: QXcbConnection: Could not connect to display
Could not connect to any X display.
qt.qpa.screen: QXcbConnection: Could not connect to display
Could not connect to any X display.
qt.qpa.screen: QXcbConnection: Could not connect to display
Could not connect to any X display.
qt.qpa.screen: QXcbConnection: Could not connect to display
Could not connect to any X display.
qt.qpa.screen: QXcbConnection: Could not connect to display
Could not connect to any X display.
OswaldoBornemann commented 5 years ago

After I set export QT_QPA_PLATFORM='offscreen', the output is now:

zeng_ruihong@GPU-server:~/vrn$ ./run.sh
Scanning directory for data...
Found 5 images
5 images require a face detector
Initialising python libs...
Initialising detector...
Cropped and scaled AFLW_image00046.jpg
Cropped and scaled AFLW_image00095.jpg
Cropped and scaled AFLW_image00190.jpg
Cropped and scaled AFLW_image00656.jpg
Cropped and scaled asj.jpg
Processed AFLW_image00190.
Processed asj.
Processed AFLW_image00095.
Processed AFLW_image00656.
Processed AFLW_image00046.
./run.sh: line 90: 91420 Segmentation fault      (core dumped) python ../vis.py --image ../$INPUT/scaled/$fname.jpg --volume $fname.raw
./run.sh: line 90: 91424 Segmentation fault      (core dumped) python ../vis.py --image ../$INPUT/scaled/$fname.jpg --volume $fname.raw
./run.sh: line 90: 91428 Segmentation fault      (core dumped) python ../vis.py --image ../$INPUT/scaled/$fname.jpg --volume $fname.raw
./run.sh: line 90: 91432 Segmentation fault      (core dumped) python ../vis.py --image ../$INPUT/scaled/$fname.jpg --volume $fname.raw
./run.sh: line 90: 91436 Segmentation fault      (core dumped) python ../vis.py --image ../$INPUT/scaled/$fname.jpg --volume $fname.raw

@AaronJackson

AaronJackson commented 5 years ago

Well, the vis script can't display anything because there is no X server. Either connect with X11 forwarding or modify the scripts to use raw2obj instead of vis.
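
One way to guard against this in a wrapper, as a minimal sketch: check whether the `DISPLAY` environment variable is set before launching anything that opens a window, and fall back to a file export otherwise. The function name and the mode labels `'vis'` and `'obj'` are hypothetical, and how raw2obj is invoked is not covered here.

```python
import os


def choose_output_mode(env=None):
    """Return 'vis' when an X display is advertised via DISPLAY, else 'obj'.

    A set DISPLAY does not guarantee the connection will succeed;
    this only catches the common headless-server case.
    """
    env = os.environ if env is None else env
    return "vis" if env.get("DISPLAY") else "obj"
```

On a headless server without X11 forwarding, `DISPLAY` is typically unset, so this would select the file export path.
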

OswaldoBornemann commented 5 years ago

Thanks @AaronJackson. Everything is fine now. But when I open the .obj file, the object is all black rather than colored. How can I output a texture image?