AaronJackson / vrn

:man: Code for "Large Pose 3D Face Reconstruction from a Single Image via Direct Volumetric CNN Regression"
http://aaronsplace.co.uk/papers/jackson2017recon/
MIT License

(Invalid numpy data type 9) Segmentation Fault, Python Mess? #59

Closed aminemarref closed 2 years ago

aminemarref commented 6 years ago

Hello,

I went through the threads about not being able to run run.sh and getting a segmentation fault, but I could not find a solution there. After two weeks of trying to run the script, I am giving up and reaching out for help.

I installed a fresh Ubuntu 16.04 for this purpose on an i5 machine with a GeForce GTX 1050. I followed the install instructions to the letter (including the supported CUDA/cuDNN versions). In particular, the required Python libraries were installed this way (in case the fresh installation comes with conflicting Python modules):

sudo apt remove python-matplotlib 
sudo apt remove python-numpy 
sudo apt auto-remove
pip install --user dlib matplotlib numpy visvis imageio

When everything had finished, I got the following error running "run.sh":

amine@Dell-Optiplex-990:~/Work/vrn$ ./run.sh
./run.sh: line 31: 27379 Segmentation fault      (core dumped) th main.lua -model 2D-FAN-300W.t7 -input ../$INPUT/ -detectFaces true -mode generate -output ../$INPUT/ -device gpu -outputFormat txt
ls: cannot access '*.txt': No such file or directory
Found Environment variable CUDNN_PATH = /usr/local/cuda/lib64/libcudnn.so.5ls: cannot access '*.raw': No such file or directory

Stepping through the code in Torch yields:

amine@Dell-Optiplex-990:~/Work/vrn/face-alignment$ th main.lua 
Segmentation fault (core dumped)
amine@Dell-Optiplex-990:~/Work/vrn/face-alignment$ th

  ______             __   |  Torch7 
 /_  __/__  ________/ /   |  Scientific computing for Lua. 
  / / / _ \/ __/ __/ _ \  |  Type ? for help 
 /_/  \___/_/  \__/_//_/  |  https://github.com/torch 
                          |  http://torch.ch 

th> require 'torch'
    // Stuff
                                                                      [0.0017s]
th> require 'nn'
    // Stuff
                                                                      [0.0664s]
th> require 'nngraph'
    // Stuff
                                                                      [0.0031s]
th> require 'paths'
    // Stuff
                                                                      [0.0007s]
th> require 'image'
    // Stuff
                                                                      [0.0042s]
th> require 'xlua'
    // Stuff
                                                                      [0.0002s]
th> local utils = require 'utils'
Fatal Python error: ceval: tstate mix-up
Segmentation fault (core dumped)

Stepping through again right after the previous attempt yields the following (notice that the "tstate mix-up" error disappears):

amine@Dell-Optiplex-990:~/Work/vrn/face-alignment$ th

  ______             __   |  Torch7 
 /_  __/__  ________/ /   |  Scientific computing for Lua. 
  / / / _ \/ __/ __/ _ \  |  Type ? for help 
 /_/  \___/_/  \__/_//_/  |  https://github.com/torch 
                          |  http://torch.ch 

th> require 'torch'
    // Stuff
                                                                      [0.0009s]
th> require 'nn'
    // Stuff
                                                                      [0.0533s]
th> require 'nngraph'
    // Stuff
                                                                      [0.0031s]
th> require 'paths'
    // Stuff
                                                                      [0.0012s]
th> require 'image'
    // Stuff
                                                                      [0.0042s]
th> require 'xlua'
    // Stuff
                                                                      [0.0001s]
th> local utils = require 'utils'
Segmentation fault (core dumped)
amine@Dell-Optiplex-990:~/Work/vrn/face-alignment$ 

Another Torch run complained about some null state, but I could not reproduce the error to attach it here.

So my conclusion was that I am unable to get past line 8 of "main.lua".

I decided to install the Python libraries by another route: installing numpy and matplotlib through apt (ignore the unnecessary steps; after two weeks of being annoyed I developed the habit of not trusting the Python/Linux relationship):

pip uninstall dlib 
pip uninstall matplotlib 
pip uninstall numpy 
pip uninstall visvis 
pip uninstall imageio
sudo apt remove python-matplotlib 
sudo apt remove python-numpy 
sudo apt auto-remove
pip install --user dlib visvis imageio
sudo apt install python-matplotlib python-numpy

After this, I re-installed Torch, THPP, and fblualib, and when I ran the script "run.sh" I got:

amine@Dell-Optiplex-990:~/Work/vrn$ ./run.sh
Found Environment variable CUDNN_PATH = /usr/local/cuda/lib64/libcudnn.so.5...ork/usr/local/torch/install/share/lua/5.1/trepl/init.lua:389: module 'matio' not found:No LuaRocks module found for matio
    no field package.preload['matio']
    no file '/home/amine/.luarocks/share/lua/5.1/matio.lua'
    no file '/home/amine/.luarocks/share/lua/5.1/matio/init.lua'
    no file '/home/amine/Work/usr/local/torch/install/share/lua/5.1/matio.lua'
    no file '/home/amine/Work/usr/local/torch/install/share/lua/5.1/matio/init.lua'
    no file './matio.lua'
    no file '/home/amine/Work/usr/local/torch/install/share/luajit-2.1.0-beta1/matio.lua'
    no file '/usr/local/share/lua/5.1/matio.lua'
    no file '/usr/local/share/lua/5.1/matio/init.lua'
    no file '/home/amine/.luarocks/lib/lua/5.1/matio.so'
    no file '/home/amine/Work/usr/local/torch/install/lib/lua/5.1/matio.so'
    no file '/home/amine/Work/usr/local/torch/install/lib/matio.so'
    no file './matio.so'
    no file '/usr/local/lib/lua/5.1/matio.so'
    no file '/usr/local/lib/lua/5.1/loadall.so' 
warning: <matio> could not be loaded (is it installed?) 
...ork/usr/local/torch/install/share/lua/5.1/trepl/init.lua:389: module 'npy4th' not found:No LuaRocks module found for npy4th
    no field package.preload['npy4th']
    no file '/home/amine/.luarocks/share/lua/5.1/npy4th.lua'
    no file '/home/amine/.luarocks/share/lua/5.1/npy4th/init.lua'
    no file '/home/amine/Work/usr/local/torch/install/share/lua/5.1/npy4th.lua'
    no file '/home/amine/Work/usr/local/torch/install/share/lua/5.1/npy4th/init.lua'
    no file './npy4th.lua'
    no file '/home/amine/Work/usr/local/torch/install/share/luajit-2.1.0-beta1/npy4th.lua'
    no file '/usr/local/share/lua/5.1/npy4th.lua'
    no file '/usr/local/share/lua/5.1/npy4th/init.lua'
    no file '/home/amine/.luarocks/lib/lua/5.1/npy4th.so'
    no file '/home/amine/Work/usr/local/torch/install/lib/lua/5.1/npy4th.so'
    no file '/home/amine/Work/usr/local/torch/install/lib/npy4th.so'
    no file './npy4th.so'
    no file '/usr/local/lib/lua/5.1/npy4th.so'
    no file '/usr/local/lib/lua/5.1/loadall.so' 
warning: <npy4th> could not be loaded (is it installed?)    
Scanning directory for data...  
Found 5 images  
5 images require a face detector    
Initialising python libs... 
Initialising detector...    
/home/amine/Work/usr/local/torch/install/bin/luajit: main.lua:51: Invalid numpy data type 9
stack traceback:
    [C]: in function 'detect'
    main.lua:51: in main chunk
    [C]: in function 'dofile'
    ...ocal/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405d50
ls: cannot access '*.txt': No such file or directory
Found Environment variable CUDNN_PATH = /usr/local/cuda/lib64/libcudnn.so.5ls: cannot access '*.raw': No such file or directory

So now the script executes up to line 51 of "main.lua" and then complains about "Invalid numpy data type", an error for which I found exactly two Google search results, neither of which was terribly useful given my limited understanding.

At this stage I could not think anymore. It looks as if the numpy/matplotlib libraries installed through apt get me further in code execution, but the complaint at line 51 is mysterious.

For the sake of completeness (or verbosity), I show the initial install process.

# Install NVIDIA Driver.
    $ sudo apt-get purge nvidia* 
    $ sudo add-apt-repository ppa:graphics-drivers
    $ sudo apt-get update 
    $ sudo apt remove libappstream3 [If previous command complains]
    $ sudo apt-get update [If previous command needed]
    $ sudo apt-get install nvidia-390 [At the time of doing this, 390 is the latest driver]
    $ reboot
    $ lsmod | grep nvidia 
# Install Cuda, Cudnn
   $ sudo ./1_cuda_8.0.61_375.26_linux.run 
   $ sudo ./2_cuda_8.0.61.2_linux.run 
   $ sudo cp cuda/include/cudnn.h /usr/local/cuda/include
   $ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
   $ sudo chmod a+r /usr/local/cuda/include/cudnn.h
   $ sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
# VRN
    # Install some dependencies for later. 
        sudo apt install libgoogle-glog-dev libboost-all-dev
        sudo apt update && sudo apt -y upgrade
        sudo apt install python-pip
        sudo apt install cmake
        sudo apt remove python-matplotlib 
        sudo apt remove python-numpy 
        sudo apt auto-remove
        pip install --user dlib matplotlib numpy visvis imageio
    # Install the Torch distribution.
        mkdir -p $HOME/Work/usr/{local,src}
        cd $HOME/Work/usr/local
        sudo apt install git
        git clone https://github.com/torch/distro.git
        mv distro torch
        cd torch
        sudo ./install-deps
        sudo ./install.sh
        source $HOME/Work/usr/local/torch/install/bin/torch-activate
    # Install THPP and fb.python for the face alignment code
        cd $HOME/Work/usr/src
        git clone https://github.com/1adrianb/thpp.git
        cd thpp/thpp        
        sudo THPP_NOFB=1 ./build.sh
    # Install fb.python.
        cd $HOME/Work/usr/src
        git clone https://github.com/facebook/fblualib.git
        cd fblualib/fblualib/python     
        luarocks make rockspec/*
    # vrn.
        cd $HOME/Work
        git clone --recursive https://github.com/AaronJackson/vrn.git
        cd vrn
        ./download.sh
        ./run.sh

S.O.S.

AaronJackson commented 6 years ago

Thanks for the easy-to-read GitHub issue. :+1:

Please try removing these lines from utils.lua:

from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import matplotlib.patches as patches

aminemarref commented 6 years ago

Thank you for your prompt reply,

I performed the required modifications to the file utils.lua, re-installed numpy and matplotlib through pip, recompiled torch, thpp, and fblualib; and now running run.sh yields:

amine@Dell-Optiplex-990:~/Work/vrn$ ./run.sh
Found Environment variable CUDNN_PATH = /usr/local/cuda/lib64/libcudnn.so.5...ork/usr/local/torch/install/share/lua/5.1/trepl/init.lua:389: module 'matio' not found:No LuaRocks module found for matio
    no field package.preload['matio']
    no file '/home/amine/.luarocks/share/lua/5.1/matio.lua'
    no file '/home/amine/.luarocks/share/lua/5.1/matio/init.lua'
    no file '/home/amine/Work/usr/local/torch/install/share/lua/5.1/matio.lua'
    no file '/home/amine/Work/usr/local/torch/install/share/lua/5.1/matio/init.lua'
    no file './matio.lua'
    no file '/home/amine/Work/usr/local/torch/install/share/luajit-2.1.0-beta1/matio.lua'
    no file '/usr/local/share/lua/5.1/matio.lua'
    no file '/usr/local/share/lua/5.1/matio/init.lua'
    no file '/home/amine/.luarocks/lib/lua/5.1/matio.so'
    no file '/home/amine/Work/usr/local/torch/install/lib/lua/5.1/matio.so'
    no file '/home/amine/Work/usr/local/torch/install/lib/matio.so'
    no file './matio.so'
    no file '/usr/local/lib/lua/5.1/matio.so'
    no file '/usr/local/lib/lua/5.1/loadall.so' 
warning: <matio> could not be loaded (is it installed?) 
...ork/usr/local/torch/install/share/lua/5.1/trepl/init.lua:389: module 'npy4th' not found:No LuaRocks module found for npy4th
    no field package.preload['npy4th']
    no file '/home/amine/.luarocks/share/lua/5.1/npy4th.lua'
    no file '/home/amine/.luarocks/share/lua/5.1/npy4th/init.lua'
    no file '/home/amine/Work/usr/local/torch/install/share/lua/5.1/npy4th.lua'
    no file '/home/amine/Work/usr/local/torch/install/share/lua/5.1/npy4th/init.lua'
    no file './npy4th.lua'
    no file '/home/amine/Work/usr/local/torch/install/share/luajit-2.1.0-beta1/npy4th.lua'
    no file '/usr/local/share/lua/5.1/npy4th.lua'
    no file '/usr/local/share/lua/5.1/npy4th/init.lua'
    no file '/home/amine/.luarocks/lib/lua/5.1/npy4th.so'
    no file '/home/amine/Work/usr/local/torch/install/lib/lua/5.1/npy4th.so'
    no file '/home/amine/Work/usr/local/torch/install/lib/npy4th.so'
    no file './npy4th.so'
    no file '/usr/local/lib/lua/5.1/npy4th.so'
    no file '/usr/local/lib/lua/5.1/loadall.so' 
warning: <npy4th> could not be loaded (is it installed?)    
Scanning directory for data...  
Found 5 images  
5 images require a face detector    
Initialising python libs... 
Initialising detector...    
/home/amine/Work/usr/local/torch/install/bin/luajit: main.lua:51: Invalid numpy data type 9
stack traceback:
    [C]: in function 'detect'
    main.lua:51: in main chunk
    [C]: in function 'dofile'
    ...ocal/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405d50
ls: cannot access '*.txt': No such file or directory
Found Environment variable CUDNN_PATH = /usr/local/cuda/lib64/libcudnn.so.5ls: cannot access '*.raw': No such file or directory
amine@Dell-Optiplex-990:~/Work/vrn$ python -c "import numpy; print(numpy.version.version); print(numpy.__file__)"
1.14.2
/home/amine/.local/lib/python2.7/site-packages/numpy/__init__.pyc
amine@Dell-Optiplex-990:~/Work/vrn$ 

So on the plus side, I am able to execute main.lua further down using the pip-installed packages, i.e. without resorting to the apt Python packages; but on the minus side, I am again hit by the "Invalid numpy data type" error. Sigh...
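
A slightly fuller version of the one-liner above can help when pip-installed and distribution-installed copies of the same package coexist. This is only a minimal sketch and not part of vrn: the helper name describe_modules is made up for illustration, and it reports what the interpreter running it would import, which is assumed (not verified) to match what Torch's embedded Python picks up.

```python
import importlib


def describe_modules(names):
    """Return {name: (version, path)} per module, or None if it cannot be imported."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
            continue
        # Not every module exposes these attributes, so fall back to placeholders.
        version = getattr(module, "__version__", "unknown")
        path = getattr(module, "__file__", "built-in")
        report[name] = (version, path)
    return report


if __name__ == "__main__":
    modules = ["numpy", "matplotlib", "dlib"]
    for name, info in sorted(describe_modules(modules).items()):
        if info is None:
            print("%s: not importable" % name)
        else:
            print("%s: version %s from %s" % (name, info[0], info[1]))
```

A path under ~/.local points at a pip --user install, while a path under /usr/lib points at a package installed by apt or yum.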

AaronJackson commented 6 years ago

Hmm, yes a few people have had that error. I'm not sure what causes it. Are you running Ubuntu? It seems to happen to Ubuntu users.

aminemarref commented 6 years ago

I have made yet another fresh installation, this time of CentOS 7 (configured as a development workstation), followed the vrn installation guide, and got the usual segmentation fault. When I deleted the three lines from utils.lua, I got the "Invalid numpy data type" error. So the error is reproducible on CentOS 7 as well.

I noticed that when executing the command sudo ./install-deps, which is part of Torch's installation, the numpy and matplotlib packages get installed even though they had already been installed via pip, as shown in the following extract of the standard output:

Dependencies Resolved

================================================================================
 Package                  Arch   Version                             Repository
                                                                           Size
================================================================================
Installing:
 python-ipython           noarch 3.2.1-1.el7                         epel  13 k
Installing for dependencies:
 PyQt4                    x86_64 4.10.1-13.el7                       base 2.9 M
 agg                      x86_64 2.5-18.el7                          base 145 k
 atlas                    x86_64 3.10.1-12.el7                       base 4.5 M
 blas                     x86_64 3.4.2-8.el7                         base 399 k
 kde-filesystem           x86_64 4-47.el7                            base  48 k
 lapack                   x86_64 3.4.2-8.el7                         base 5.4 M
 numpy                    x86_64 1:1.7.1-11.el7                      base 2.8 M
 pexpect                  noarch 2.3-11.el7                          base 142 k
 phonon                   x86_64 4.6.0-10.el7                        base 205 k
 phonon-backend-gstreamer x86_64 2:4.6.3-3.el7                       base 140 k
 python-ipython-console   noarch 3.2.1-1.el7                         epel 1.6 M
 python-ipython-gui       noarch 3.2.1-1.el7                         epel 177 k
 python-matplotlib        x86_64 1.2.0-15.el7                        base  26 M
 python-mistune           x86_64 0.8.3-1.el7                         epel 137 k
 python-nose              noarch 1.3.7-1.el7                         base 276 k
 python-path              noarch 5.2-1.el7                           epel  47 k
 python-pillow            x86_64 2.0.0-19.gitd1c6db8.el7             base 438 k
 python-pygments          noarch 1.4-10.el7                          base 599 k
 python-repoze-lru        noarch 0.4-3.el7                           epel  13 k
 python-simplegeneric     noarch 0.8-7.el7                           epel  12 k
 python-zmq               x86_64 14.3.1-1.el7                        epel 468 k
 python2-jsonschema       noarch 2.5.1-3.el7                         epel  75 k
 sip                      x86_64 4.14.6-4.el7                        base 122 k
 t1lib                    x86_64 5.1.2-14.el7                        base 166 k
 texlive-base             noarch 2:2012-38.20130427_r30134.el7       base 325 k
 texlive-dvipng           noarch 2:svn26689.1.14-38.el7              base  44 k
 texlive-dvipng-bin       x86_64 2:svn26509.0-38.20130427_r30134.el7 base  63 k
 texlive-kpathsea         noarch 2:svn28792.0-38.el7                 base 140 k
 texlive-kpathsea-bin     x86_64 2:svn27347.0-38.20130427_r30134.el7 base  40 k
 texlive-kpathsea-lib     x86_64 2:2012-38.20130427_r30134.el7       base  78 k

Transaction Summary
================================================================================

I thought I would remove them before building torch, thpp, and fblualib. So I deleted torch, and re-performed the following steps.

    # Install the Torch distribution.
        $ cd $HOME/Work/usr/local
        $ git clone https://github.com/torch/distro.git
        $ mv distro torch
        $ cd torch
        $ sudo ./install-deps
        [New!] $ sudo yum remove numpy [hit tab for full package name]
        [New!] $ sudo yum remove python-matplotlib [hit tab for full package name]
        $ sudo ./install.sh
        $ source $HOME/Work/usr/local/torch/install/bin/torch-activate
    # Install THPP and fb.python for the face alignment code
        $ cd $HOME/Work/usr/src
        $ git clone https://github.com/1adrianb/thpp.git
        $ cd thpp/thpp
        $ export Torch_DIR="/home/amine/Work/usr/local/torch/pkg/torch/build/cmake-exports" [if needed]
        $ export Torch_DIR="/home/amine/Work/usr/local/torch/install/share/cmake/torch" [xor if needed]
        $ THPP_NOFB=1 ./build.sh [sudo does not work here]
    # Install fb.python.
        $ cd $HOME/Work/usr/src
        $ git clone https://github.com/facebook/fblualib.git
        $ cd fblualib/fblualib/python
        $ luarocks make rockspec/*

The above process did not complain about the two packages I removed and finished successfully. When I ran run.sh (after removing the three suggested lines), I got the same error.

I have focused heavily on the Python library setup because, from reading the related threads, Python's configuration appeared to be the culprit.

Anyway, that's about what my time and expertise allow me to do. I hope someone can share with me the exact versions/configurations of anything related (OS, Python, etc.) and in which month, in which day, at what time, and what the exact Cartesian coordinates of the coffee cup on the desk were for the successful installation of this tool chain (ideally the coffee brand as well).

Cheers.

AaronJackson commented 6 years ago

I managed to debug this on someone's Ubuntu 14.04 workstation today. The changes required to get it working are to face-alignment/utils.lua

-       local detections = py.reval('[np.asarray([d.left(), d.top(), d.right(), d.bottom()]) for i, d in enumerate(dets)]',{dets=dets})                     
+       local detections = py.reval('[np.asarray([d.left(), d.top(), d.right(), d.bottom()],dtype=float) for i, d in enumerate(dets)]',{dets=dets})  

If you are also having this problem on CentOS then try the above. Hopefully it'll sort it out.
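
For anyone wondering what the "9" in the error refers to: in NumPy's C-level type enumeration, type number 9 is the C long long integer type, while dtype=float produces type number 12 (double). The sketch below only decodes those numbers. Why the dlib coordinates end up as long long, and why fb.python rejects that type, is not established in this thread, so treat any causal reading as an assumption. The names NPY_TYPE_NAMES and type_name are made up for illustration.

```python
# First entries of NumPy's C-level NPY_TYPES enumeration, keyed by type number.
NPY_TYPE_NAMES = {
    0: "bool",
    1: "byte",
    2: "ubyte",
    3: "short",
    4: "ushort",
    5: "int",
    6: "uint",
    7: "long",
    8: "ulong",
    9: "longlong",
    10: "ulonglong",
    11: "float",
    12: "double",
}


def type_name(num):
    """Return the C-style name for a NumPy type number, or 'unknown'."""
    return NPY_TYPE_NAMES.get(num, "unknown")


if __name__ == "__main__":
    print(type_name(9))   # longlong
    print(type_name(12))  # double

    # Optional cross-check against an installed NumPy.
    try:
        import numpy as np
    except ImportError:
        np = None
    if np is not None:
        # Stand-in values for d.left(), d.top(), d.right(), d.bottom().
        box = [10, 20, 110, 120]
        print(np.asarray(box, dtype=np.longlong).dtype.num)  # 9
        print(np.asarray(box, dtype=float).dtype.num)        # 12
```

Forcing dtype=float, as in the change above, means the array handed back to Lua always has type number 12, regardless of which integer type the coordinates started as.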

aminemarref commented 6 years ago

Thanks Aaron,

I confirm that this fix (on file facedetection_dlib.lua by the way) works both on Ubuntu 16 and CentOS 7.

So to recap, the following change has been performed on vrn/face-alignment/utils.lua:

REMOVE LINE: from mpl_toolkits.mplot3d import Axes3D
REMOVE LINE: import matplotlib.pyplot as plt
REMOVE LINE: import matplotlib.patches as patches

and the following change has been performed on vrn/face-alignment/facedetection_dlib.lua:

REPLACE LINE: local detections = py.reval('[np.asarray([d.left(), d.top(), d.right(), d.bottom()]) for i, d in enumerate(dets)]',{dets=dets})                     
BY LINE: local detections = py.reval('[np.asarray([d.left(), d.top(), d.right(), d.bottom()],dtype=float) for i, d in enumerate(dets)]',{dets=dets})  

Now I get out-of-CUDA-memory issues, but that's a story for another thread perhaps.

Cheers.

AaronJackson commented 6 years ago

Nice! Thanks for confirming that this fix works. What GPU are you using? If you have 2GB then you can run the face-alignment network but not the 3D reconstruction network. However, the 3D reconstruction will work fairly well on the CPU anyway, so you can change gpu to cpu in the run.sh file.
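
Editing run.sh by hand is the simplest way to do this. For scripted setups, the switch could be sketched as below. The only command line visible in this thread is the face-alignment call from the first post, which passes -device gpu, so the assumption here is that every call in run.sh selects its device through that same flag. The function name switch_device is made up for illustration.

```python
import re


def switch_device(script_text, device="cpu"):
    """Replace the value given to every `-device` flag in a shell script's text."""
    # A function replacement avoids backreference surprises in the device string.
    return re.sub(
        r"(-device\s+)\w+",
        lambda match: match.group(1) + device,
        script_text,
    )


if __name__ == "__main__":
    # Command line as quoted in the segmentation-fault message in the first post.
    line = (
        "th main.lua -model 2D-FAN-300W.t7 -input ../$INPUT/ -detectFaces true "
        "-mode generate -output ../$INPUT/ -device gpu -outputFormat txt"
    )
    print(switch_device(line))
```

Reading run.sh, passing its contents through switch_device, and writing the result to a copy keeps the original script untouched in case the GPU path is needed again.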

aminemarref commented 6 years ago

Yep you guessed it right, I have a teeny-tiny GPU memory :-) I changed the device from gpu to cpu and it works great now. You saved me a lot of hair-tearing troubleshooting :-) Thanks a lot.

AaronJackson commented 6 years ago

:+1: I'm going to leave this open for a while to stop people asking the same question. :)