cjweeks / tensorflow-cmake

Integrate TensorFlow with CMake projects effortlessly
MIT License

Build fail with the external-project example #1

Closed by antogerva 8 years ago

antogerva commented 8 years ago

First, I found this project thanks to this post on Stack Overflow.

Environment info

OS: I'm running Docker with Ubuntu 14.04.4 LTS x86_64 (gcc/g++ version 4.8.4). No CUDA or cuDNN installed.

Please note that I had already managed to compile TensorFlow with Bazel before following this tutorial.

I tried to run the examples with both the master and r0.10 branches of TensorFlow and got the same issue. Here's what I did:

$ sudo apt-get install cmake
$ sudo apt-get install autoconf automake libtool curl make g++ unzip  # Protobuf Dependencies
$ sudo apt-get install python-numpy swig python-dev python-wheel      # TensorFlow Dependencies

$ #run the example
$ cd ~/
$ rm -rf ~/git
$ mkdir -p ~/git
$ cd ~/git

$ # Note: using -b r0.10 leads to the same result
$ git clone https://github.com/tensorflow/tensorflow 
$ cd tensorflow

$ # Add build rule for libtensorflow_all.so
$ printf '\n# Added build rule
cc_binary(
    name = "libtensorflow_all.so",
    linkshared = 1,
    linkopts = ["-Wl,--version-script=tensorflow/tf_version_script.lds"],
    deps = [
        "//tensorflow/cc:cc_ops",
        "//tensorflow/core:framework_internal",
        "//tensorflow/core:tensorflow",
    ],
)\n' >> tensorflow/BUILD

$ bazel clean                                   # Clean project
$ export CC="/usr/bin/gcc"                      # Set location of C compiler
$ export CXX="/usr/bin/g++"                     # Set location of C++ compiler
$ bazel build tensorflow:libtensorflow_all.so   # Rebuild project

$ # This runs ./configure with Python location /usr/bin/python, no Google Cloud Platform support, and no GPU support
$ printf '\nN\nN\n' | ./configure 

$ bazel build tensorflow:libtensorflow_all.so
$ sudo cp bazel-bin/tensorflow/libtensorflow_all.so /usr/local/lib

$ sudo mkdir -p /usr/local/include/google/tensorflow
$ sudo cp -r tensorflow /usr/local/include/google/tensorflow/
$ sudo find /usr/local/include/google/tensorflow/tensorflow -type f  ! -name "*.h" -delete

$ sudo cp bazel-genfiles/tensorflow/core/framework/*.h  /usr/local/include/google/tensorflow/tensorflow/core/framework
$ sudo cp bazel-genfiles/tensorflow/core/kernels/*.h  /usr/local/include/google/tensorflow/tensorflow/core/kernels
$ sudo cp bazel-genfiles/tensorflow/core/lib/core/*.h  /usr/local/include/google/tensorflow/tensorflow/core/lib/core
$ sudo cp bazel-genfiles/tensorflow/core/protobuf/*.h  /usr/local/include/google/tensorflow/tensorflow/core/protobuf
$ sudo cp bazel-genfiles/tensorflow/core/util/*.h  /usr/local/include/google/tensorflow/tensorflow/core/util
$ sudo cp bazel-genfiles/tensorflow/cc/ops/*.h  /usr/local/include/google/tensorflow/tensorflow/cc/ops

$ sudo cp -r third_party /usr/local/include/google/tensorflow/
$ sudo rm -r /usr/local/include/google/tensorflow/third_party/py
$ sudo rm -r /usr/local/include/google/tensorflow/third_party/avro

$ cd ~/git
$ git clone https://github.com/cjweeks/tensorflow-cmake.git
$ cd tensorflow-cmake

$ # This will generate / copy Eigen.cmake, Eigen_VERSION.cmake, Protobuf.cmake, and Protobuf_VERSION.cmake
$ ./eigen.sh generate external ~/git/tensorflow examples/external-project/cmake/Modules examples/external-project/cmake/Modules
$ ./protobuf.sh generate external ~/git/tensorflow examples/external-project/cmake/Modules examples/external-project/cmake/Modules

$ cd examples/external-project/

$ mkdir build
$ cd build
$ cmake ..
$ make

Then it fails. Here's the last output of the make command:

...
[ 94%] Completed 'Protobuf'
[ 94%] Built target Protobuf
Scanning dependencies of target external-project
[100%] Building CXX object CMakeFiles/external-project.dir/main.cc.o
In file included from /usr/local/include/google/tensorflow/tensorflow/core/public/session.h:22:0,
                 from /root/git/tensorflow-cmake/examples/external-project/main.cc:1:
/usr/local/include/google/tensorflow/tensorflow/core/framework/graph.pb.h:146:3: error: 'PROTOBUF_DEPRECATED_ATTR' does not name a type
   PROTOBUF_DEPRECATED_ATTR void clear_version();
   ^
/usr/local/include/google/tensorflow/tensorflow/core/framework/graph.pb.h:147:3: error: 'PROTOBUF_DEPRECATED_ATTR' does not name a type
   PROTOBUF_DEPRECATED_ATTR static const int kVersionFieldNumber = 3;
   ^
/usr/local/include/google/tensorflow/tensorflow/core/framework/graph.pb.h:148:3: error: 'PROTOBUF_DEPRECATED_ATTR' does not name a type
   PROTOBUF_DEPRECATED_ATTR ::google::protobuf::int32 version() const;
   ^
/usr/local/include/google/tensorflow/tensorflow/core/framework/graph.pb.h:149:3: error: 'PROTOBUF_DEPRECATED_ATTR' does not name a type
   PROTOBUF_DEPRECATED_ATTR void set_version(::google::protobuf::int32 value);
   ^
/usr/local/include/google/tensorflow/tensorflow/core/framework/graph.pb.h:443:37: error: no 'void tensorflow::GraphDef::clear_version()' member function declared in class 'tensorflow::GraphDef'
 inline void GraphDef::clear_version() {
                                     ^
/usr/local/include/google/tensorflow/tensorflow/core/framework/graph.pb.h:446:54: error: no 'google::protobuf::int32 tensorflow::GraphDef::version() const' member function declared in class 'tensorflow::GraphDef'
 inline ::google::protobuf::int32 GraphDef::version() const {
                                                      ^
/usr/local/include/google/tensorflow/tensorflow/core/framework/graph.pb.h:450:66: error: no 'void tensorflow::GraphDef::set_version(google::protobuf::int32)' member function declared in class 'tensorflow::GraphDef'
 inline void GraphDef::set_version(::google::protobuf::int32 value) {
                                                                  ^
make[2]: *** [CMakeFiles/external-project.dir/main.cc.o] Error 1
make[1]: *** [CMakeFiles/external-project.dir/all] Error 2
make: *** [all] Error 2

It looks like there's an error with the set_version function. You can see the full output here.

Furthermore, there's an empty <SOURCE_DIR> folder inside the build folder:

$ ls -la ~/git/tensorflow-cmake/examples/external-project/build/
total 48
drwxr-xr-x 5 root root  4096 Aug  8 11:42 .
drwxr-xr-x 6 root root  4096 Aug  8 11:42 ..
drwxr-xr-x 2 root root  4096 Aug  8 11:42 <SOURCE_DIR>
-rw-r--r-- 1 root root 13113 Aug  8 11:42 CMakeCache.txt
drwxr-xr-x 8 root root  4096 Aug  8 11:54 CMakeFiles
-rw-r--r-- 1 root root  5672 Aug  8 11:42 Makefile
-rw-r--r-- 1 root root  1678 Aug  8 11:42 cmake_install.cmake
drwxr-xr-x 3 root root  4096 Aug  8 11:42 root

This looks a bit wrong... shouldn't <SOURCE_DIR> point to a project source directory? Maybe some configuration is missing from those .cmake files. I did a quick search, but I'm unsure what exactly to do to fix the issue:

$ grep -R '<SOURCE_DIR>' ~/git/tensorflow-cmake/
/root/git/tensorflow-cmake/examples/external-project/cmake/Modules/Protobuf.cmake:        INSTALL_DIR <SOURCE_DIR>
/root/git/tensorflow-cmake/Protobuf.cmake:        INSTALL_DIR <SOURCE_DIR>

Though maybe this isn't directly relevant to the set_version problem.

That being said, for the record, my main goal is to debug tensorflow-cmake/examples/external-project/main.cc with the CLion debugger. I'm not there yet, but if I can get the CMake setup to work, I could probably make some progress on that.

cjweeks commented 8 years ago

The PROTOBUF_DEPRECATED_ATTR macro only exists in earlier versions of Protobuf; it has recently been changed to GOOGLE_PROTOBUF_DEPRECATED_ATTR. However, TensorFlow requires this earlier version and makes references to PROTOBUF_DEPRECATED_ATTR. The protobuf.sh script scans the tensorflow repository for references to Protobuf, trying to find the commit hash to reset to. The external project builds and runs correctly in my environment (Ubuntu 16.04 LTS), but I will attempt to reproduce using your system.

The <SOURCE_DIR> issue is unrelated and stems from a small error in Protobuf.cmake, which I will soon correct.

cjweeks commented 8 years ago

I believe I have resolved this problem.

Cause

There was an error in the CMake modules for the external project; the external/include directory was not being included. This caused the project to search for the required files in either /usr/local/include or /usr/include. If you had a different version of Protobuf installed on the machine, those header files were used instead, and this resulted in the PROTOBUF_DEPRECATED_ATTR macro not being found.
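
To check whether a system-wide Protobuf installation might be shadowing the bundled headers, one option is to search the system include directories for the old macro name. The sketch below is only an illustration: find_macro_headers is a hypothetical helper (not part of tensorflow-cmake), and it assumes GNU grep is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every header under the given include roots
# that mentions the macro name passed as the first argument.
find_macro_headers() {
    local macro="$1"
    shift
    local root
    for root in "$@"; do
        # Skip roots that do not exist on this machine.
        [ -d "$root" ] || continue
        # -r: recurse, -l: print file names only, -w: whole-word match,
        # so GOOGLE_PROTOBUF_DEPRECATED_ATTR does not count as a hit
        # when searching for PROTOBUF_DEPRECATED_ATTR.
        grep -rlw --include='*.h' -e "$macro" "$root" 2>/dev/null || true
    done
    return 0
}
```

For example, assuming Protobuf headers are installed under the conventional google/protobuf subdirectory, `find_macro_headers PROTOBUF_DEPRECATED_ATTR /usr/local/include/google/protobuf /usr/include/google/protobuf` lists the system headers that still mention the old macro. If a directory holds Protobuf headers but produces no matches, a newer copy is probably installed there.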

Fix

I made some minor changes to the CMake logic to include the required directories. I tested the fix on my Ubuntu 16.04 LTS machine as well as a Docker container using Ubuntu 14.04.4 LTS; they both worked as expected.

@antogerva, please pull the latest version of this repository and test this yourself.

antogerva commented 8 years ago

OK, I pulled the latest version and now it works. I was able to run the example project correctly, so I'm closing this issue.

That being said, in the CLion IDE, if I execute the "external-project" target, I get the expected result as long as I copy graph.pb into the bin folder (and add casts to a few calls to the TensorFlow API). However, if I try to debug the "external-project" target, it won't break on any line with a breakpoint. Maybe I'm missing a debug flag somewhere? The breakpoints also seem to be ignored if I run the project directly with gdb. I tried using cmake -DCMAKE_BUILD_TYPE=Debug .. but it looks like this isn't enough.

Also, I believe there's no "fast build" target directly available from CMake, so every time I build the "external-project" target, Eigen and Protobuf get rebuilt as well.

In any case, I'll probably do more tests on the matter later this week and if I'm stuck, I'll create a new issue about these problems.

cjweeks commented 8 years ago

Regarding the Graph File

The C++ file in the example projects is quite naive; it tries to open graph.pb in the current directory. The default behavior of CLion is to execute the program from the bin directory, which forced you to copy the graph file. Simply changing the working directory in the build configuration works as well:

(screenshot: external-project-config)
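
Outside the IDE, the same effect can be had by starting the program from the directory that holds graph.pb. The following is a minimal sketch, not something the repository ships: run_from_graph_dir is a hypothetical wrapper, and the paths in the usage note are assumptions based on the layout described in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a binary with its working directory set to the
# folder that contains graph.pb, without changing the caller's directory.
run_from_graph_dir() {
    local graph_dir="$1" binary="$2"
    shift 2
    if [ ! -f "$graph_dir/graph.pb" ]; then
        echo "graph.pb not found in $graph_dir" >&2
        return 1
    fi
    # Resolve the binary to an absolute path before changing directory,
    # so a relative path such as ./bin/external-project keeps working.
    local abs_binary
    abs_binary="$(cd "$(dirname "$binary")" && pwd)/$(basename "$binary")"
    # The subshell confines the cd to this single invocation.
    ( cd "$graph_dir" && "$abs_binary" "$@" )
}
```

Assuming graph.pb sits in the example's source directory, a call might look like `run_from_graph_dir ~/git/tensorflow-cmake/examples/external-project ./bin/external-project`.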

Regarding the Debugger

I debugged the external project with CLion's built-in gdb (7.11.1). The program did stop at each breakpoint, although it was slow at times. I am currently using CLion 2016.2 (build #CL-162-1236-16). If you find any more information as to the source of your debugger problems, let me know.

Regarding the Lack of Fast Builds

This is the problem with the External Project model as opposed to installing to a local directory. However, once CMake has built Eigen and Protobuf the first time, the process should be much faster in future builds. For Eigen, CMake sees that it has already downloaded and built Eigen, and since the hashes are the same, no action is taken. Protobuf is slightly different; CMake first realizes that it has already been downloaded and the commit hash is the one requested. CMake still, however, performs the configure step, which takes about 10 - 30 seconds to complete. Once the configure step is finished, Protobuf is then rebuilt, although this takes almost no time at all, since it has already been compiled.

The first build of the external project can take anywhere from 15 minutes to an hour, depending on the speed of your computer. However, subsequent builds should take less than a minute, as Protobuf and Eigen have already been compiled (but the configure step for Protobuf is still executed, which does take time).

Is this the behavior you are experiencing?

antogerva commented 8 years ago

Thanks for the detailed answer. Regarding graph.pb, the working directory was indeed the issue, so there's no need to copy the file. My bad.

Regarding the debugger, I have exactly the same CLion and gdb versions/builds. Yet when I go to Run > Debug 'external-project', it simply won't break on any of my breakpoints.

Furthermore, I get the same result from /usr/bin/gdb (still version 7.7.1 on my Ubuntu system). Here's an attempt to break on line 19:

$ gdb ./bin/external-project -q 
Reading symbols from ./bin/external-project...done.
(gdb) break 19
Breakpoint 1 at 0x409fc5: file /root/git/tensorflow-cmake/examples/external-project/main.cc, line 19.
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x0000000000409fc5 in main(int, char**) at /root/git/tensorflow-cmake/examples/external-project/main.cc:19
(gdb) run
Starting program: /root/git/tensorflow-cmake/examples/external-project/bin/external-project 
warning: Error disabling address space randomization: Operation not permitted
Success: 42!
During startup program exited normally.
(gdb) 

As you can see, during this session I get the program's output directly, without stopping at any breakpoint. I've also tried breaking on other line numbers, without success. (Note: the warning about address space randomization can be fixed with set disable-randomization off within gdb, so this isn't relevant.)

My guess is that the makefile generated by CMake fails to produce the correct symbols, but I'm not sure how to deal with this. One idea might be to add a debug flag to Bazel, such as bazel build -c dbg tensorflow:libtensorflow_all.so, but if you didn't need to do this, then I'll need to figure out why your setup actually works.

Finally, about the fast builds, you're right. Once the first build is done, subsequent builds only perform a small reconfigure step, and the time taken is reasonable.

cjweeks commented 8 years ago

If your command-line gdb example was run in your Docker container, have you tried running the container with the --privileged flag? I encountered your exact error at first:

# 'tfcf' is an Ubuntu 14.04.2 LTS image with TensorFlow and tensorflow-cmake
$ sudo docker run -t -i cjweeks/tfcf  
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from bin/external-project...done.
(gdb) break 19
Breakpoint 1 at 0x409fc5: file /root/git/tensorflow-cmake/examples/external-project/main.cc, line 19.
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x0000000000409fc5 in main(int, char**) 
                                                   at /root/git/tensorflow-cmake/examples/external-project/main.cc:19
(gdb) run
Starting program: /root/git/tensorflow-cmake/examples/external-project/bin/external-project 
warning: Error disabling address space randomization: Operation not permitted
Success: 42!
During startup program exited normally.
(gdb) quit

But after inserting the flag, I got the expected result:

$ sudo docker run --privileged -t -i cjweeks/tfcf
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from bin/external-project...done.
(gdb) break 19
Breakpoint 1 at 0x409fc5: file /root/git/tensorflow-cmake/examples/external-project/main.cc, line 19.
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x0000000000409fc5 in main(int, char**) 
                                                   at /root/git/tensorflow-cmake/examples/external-project/main.cc:19
(gdb) run
Starting program: /root/git/tensorflow-cmake/examples/external-project/bin/external-project 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, main (argc=1, argv=0x7fffffffed08)
    at /root/git/tensorflow-cmake/examples/external-project/main.cc:20
20      tf::Status status = tf::NewSession(tf::SessionOptions(), &session);
(gdb) continue 
Continuing.
[New Thread 0x7fffef586700 (LWP 12)]
[New Thread 0x7fffeed85700 (LWP 13)]
[New Thread 0x7fffee584700 (LWP 14)]
[New Thread 0x7fffedd83700 (LWP 15)]
[New Thread 0x7fffed582700 (LWP 16)]
[New Thread 0x7fffecd81700 (LWP 17)]
[New Thread 0x7fffec580700 (LWP 18)]
[New Thread 0x7fffebd7f700 (LWP 19)]
[New Thread 0x7fffeb57e700 (LWP 20)]
Success: 42!
[Thread 0x7fffebd7f700 (LWP 19) exited]
[Thread 0x7fffec580700 (LWP 18) exited]
[Thread 0x7fffecd81700 (LWP 17) exited]
[Thread 0x7fffed582700 (LWP 16) exited]
[Thread 0x7fffedd83700 (LWP 15) exited]
[Thread 0x7fffee584700 (LWP 14) exited]
[Thread 0x7fffeed85700 (LWP 13) exited]
[Thread 0x7fffef586700 (LWP 12) exited]
[Thread 0x7ffff7feb780 (LWP 8) exited]
[Inferior 1 (process 8) exited normally]
(gdb) quit

Let me know if this fixes your problem. Regarding CLion's debugger, I really have no idea, especially since mine works perfectly well with the same versions.
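
As a quick check from inside a container, the sketch below tests whether the environment lets a process run with address space randomization disabled, which is the request behind the gdb warning shown above. It is an illustration under assumptions: can_disable_aslr is a hypothetical helper, and it relies on the setarch tool from util-linux being installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of gdb or Docker.
# Returns 0 if a process may run with ASLR disabled in this environment,
# 1 if the request is refused, and 2 if setarch is not installed.
can_disable_aslr() {
    command -v setarch >/dev/null 2>&1 || return 2
    # -R (--addr-no-randomize) asks the kernel to disable address space
    # randomization for the child process, here the no-op `true`.
    if setarch "$(uname -m)" -R true >/dev/null 2>&1; then
        return 0
    fi
    return 1
}
```

A status of 1 inside the container would be consistent with the "Operation not permitted" warning, while a status of 0 suggests the container already has the permissions gdb needs for this step.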

antogerva commented 8 years ago

Wow, I had no idea about the --privileged flag. Indeed, this fixed my debugging issue! Thanks!

cjweeks commented 8 years ago

No problem!