finalfusion / finalfusion-tensorflow-ops

Tensorflow Op for finalfusion embeddings
https://finalfusion.github.io/
Other
2 stars 0 forks source link

Add initialization and close Ops. #2

Closed sebpuetz closed 4 years ago

sebpuetz commented 4 years ago

These builds fail with

[100%] Linking CXX shared library libfinalfusion_tf.dylib

cd /Users/travis/build/finalfusion/finalfusion-tensorflow/build/finalfusion-tf && /usr/local/Cellar/cmake/3.12.3/bin/cmake -E cmake_link_script CMakeFiles/finalfusion_tf.dir/link.txt --verbose=1

/Applications/Xcode-10.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
 -DNDEBUG -dynamiclib -Wl,-headerpad_max_install_names  -o libfinalfusion_tf.dylib -install_name 
@rpath/libfinalfusion_tf.dylib CMakeFiles/finalfusion_tf.dir/lookup/FFLookup.cc.o 
CMakeFiles/finalfusion_tf.dir/kernel/FFLookupKernels.cc.o CMakeFiles/finalfusion_tf.dir
/ops/FFLookupOps.cc.o ../finalfusion-cxx/src/libfinalfusion_cxx.a -L/Users/travis/build/finalfusion
/finalfusion-tensorflow/venv/lib/python3.7/site-packages/tensorflow 
-l:libtensorflow_framework.1.dylib ../finalfusion-cxx/finalfusion-ffi/finalfusion-ffi/release/libfinalfusion.a 
-ldl -lm -lpthread 

ld: library not found for -l:libtensorflow_framework.1.dylib
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [finalfusion-tf/libfinalfusion_tf.dylib] Error 1
make[1]: *** [finalfusion-tf/CMakeFiles/finalfusion_tf.dir/all] Error 2
make: *** [all] Error 2
The command "ci/script.sh" exited with 2.

on OSX. I can't really look into this since I don't have access to a Mac around here. @danieldk Could you check whether the library is actually located where tf.sysconfig.get_link_flags() specifies it and if the name is correct?

danieldk commented 4 years ago
$ python -c 'import tensorflow as tf; print(tf.sysconfig.get_link_flags())'
['-L/nix/store/fcj24bdwdyv5hdnkz436cqlhxlzm8lhj-python3.7-tensorflow-1.13.1/lib/python3.7/site-packages/tensorflow', '-ltensorflow_framework']
$ ls /nix/store/fcj24bdwdyv5hdnkz436cqlhxlzm8lhj-python3.7-tensorflow-1.13.1/lib/python3.7/site-packages/tensorflow
__init__.py  aux-bin   core  libtensorflow_framework.so  tools
__pycache__  compiler  examples  lite
_api         contrib   include   python

So the path is correct. The f-up that you are bumping into is that apparently in the Python module on macOS libtensorflow_framework uses .so as an extension? The compiler will look for a .dylib with -l.

sebpuetz commented 4 years ago

Weird, that's how it looks like on travis-ci:

$ ci/script.sh
+ls /Users/travis/build/finalfusion/finalfusion-tensorflow/venv/lib/python3.7/site-packages/tensorflow
__init__.py             include
__pycache__             libtensorflow_framework.1.14.0.dylib
_api                    libtensorflow_framework.1.dylib
compiler                libtensorflow_framework.dylib
contrib                 lite
core                    python
examples                tools

So the file is there, under the correct name, too.

(see https://travis-ci.org/finalfusion/finalfusion-tensorflow/jobs/584347682)

danieldk commented 4 years ago

Just to add, this Nix derivation was built from the upstream Tensorflow wheel on macOS, so no difference there:

$ nix eval nixpkgs.python3Packages.tensorflow.src.urls
[ "https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.13.1-py3-none-any.whl" ]

(On Linux, the Python package is built from source in current nixpkgs.)

danieldk commented 4 years ago

Weird, that's how it looks like on travis-ci:

Oh, maybe that changed with Tensorflow 1.14.0? So did you try what

python -c 'import tensorflow as tf; print(tf.sysconfig.get_link_flags())'

Gives on Travis-CI?

danieldk commented 4 years ago

By the way, that link flag in the Travis CI output is really weird:

-l:libtensorflow_framework.1.dylib

It should be:

-ltensorflow_framework

which is also what it gives on my Mac (as above).

sebpuetz commented 4 years ago

Weird, that's how it looks like on travis-ci:

Oh, maybe that changed with Tensorflow 1.14.0? So that you try what

python -c 'import tensorflow as tf; print(tf.sysconfig.get_link_flags())'

Gives on Travis-CI?

I just looked at the verbose output of CMake, the flags are -L/Users/travis/build/finalfusion/finalfusion-tensorflow/venv/lib/python3.7/site-packages/tensorflow -l:libtensorflow_framework.1.dylib. Then I added ls /Users/travis/build/finalfusion/finalfusion-tensorflow/venv/lib/python3.7/site-packages/tensorflow to the beginning of the CI script, and the .1.dylib file shows up in that location.

Fwiw, the passing linux build also appends the file-ending to the `-l' flag.

There's definitely a difference between versions, pip installs 1.14 per default. Iirc, 1.13 doesn't compile for another reason, ResourceBase::DebugString is not const until 1.14.

sebpuetz commented 4 years ago

Could it be that clang and gcc act differently when looking for libraries to link? That's another difference between linux and osx builds.

danieldk commented 4 years ago

Fwiw, the passing linux build also appends the file-ending to the `-l' flag.

It's a different linker, so it probably puts up with different things. However, I am wondering where the weird link flags come from, because on Linux I also get ugly flags:

$ python -c 'import tensorflow as tf; print(tf.sysconfig.get_link_flags())'
['-L/nix/store/g7v4cvncc832c3sxapmam4d3bhal5fi8-python3.7-tensorflow-1.14.0/lib/python3.7/site-packages/tensorflow', '-l:libtensorflow_framework.so.1']

Maybe some regression in 1.14? IIRC they added the proper OS-specific suffix and SO version in 1.14. Maybe they didn't adjust the get_link_flags() logic?

danieldk commented 4 years ago

https://github.com/tensorflow/tensorflow/pull/30656

Maybe for now: just get the library path from Tensorflow, and always use the -ltensorflow_framework library flag?

danieldk commented 4 years ago

(BTW, they do weird things in that PR, adding the SO version is not needed.)

sebpuetz commented 4 years ago

They definitely touched the link flags in 1.14 since that's where they introduced the .1 suffix for 1.x releases to differentiate from 2.x.

I just tried -ltensorflow_framework (while getting the folder from the python package) locally, with that change libtensorflow_framework is not linked correctly:

$ python -c "import tensorflow as tf; tf.load_op_library('./finalfusion-tf/libfinalfusion_tf.so')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: libtensorflow_framework.so: cannot open shared object file: No such file or directory
$ ldd finalfusion-tf/libfinalfusion_tf.so                                                
    linux-vdso.so.1 (0x00007fffbef85000)
    libtensorflow_framework.so => not found
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ff4665f1000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ff4663d2000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ff466049000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ff465cab000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ff465a93000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff4656a2000)
    /lib64/ld-linux-x86-64.so.2 (0x00007ff466b76000)

So it looks like we need the .1 suffix on linux to get correct linkage?

danieldk commented 4 years ago

I just tried -ltensorflow_framework (while getting the folder from the python package) locally, with that change libtensorflow_framework is not linked correctly:

$ python -c "import tensorflow as tf; tf.load_op_library('./finalfusion-tf/libfinalfusion_tf.so')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: libtensorflow_framework.so: cannot open shared object file: No such file or directory

So it looks like we need the .1 suffix on linux to get correct linkage?

Ah yes, now I understand why they added the suffix in the PR. Normally, ldconfig would generate the .so file. E.g. if you have libfoo.so.2.3, ldconfig will generate symlinks libfoo.so.2 and libfoo.so. But since this is hidden in a Python package, the symlinks are never generated.

tl;dr: don't hide your libraries in Python packages when you want people to link against them ;).

sebpuetz commented 4 years ago

So, what's a possible work-around? Roll back to importing the lib as a target in CMake and locate it through tf.sysconfig.get_lib() and CMake's find_library()?

danieldk commented 4 years ago

So, what's a possible work-around? Roll back to importing the lib as a target in CMake and locate it through tf.sysconfig.get_lib() and CMake's find_library()?

Oh, I thought just using the lib dir as is and then explicitly add -ltensorflow_framework.so.1. TF2 is still a moving target anyway, and once there is a 1.14.1 or 1.15.0, we can rely on the output of Tensorflow again given that PR.

sebpuetz commented 4 years ago

How does that solve the failing OSX builds tho? If I'm not misunderstanding anything, that'll resemble the behaviour of taking the linkflags straight out of tf.sysconfig. pip-installed 1.14 (on travis) doesn't have a .so.1 either.

Do you suggest to target 1.13 instead until a fix might be released?

danieldk commented 4 years ago

I was thinking you could get the library extension through CMake, but that's still tricky, since in macOS the library version comes first.

Do you suggest to target 1.13 instead until a fix might be released?

Or target Linux until there is a fix (as an alternative).

sebpuetz commented 4 years ago

This could work as a temporary solution:

l_flags = tf.sysconfig.get_link_flags()
if platform.system() == "Darwin":
    l_flags[1] = "-l:"+l_flags[1].strip(".dylib").strip("-l:lib")
print(" ".join(l_flags), tf.sysconfig.get_include(), tf.sysconfig.CXX11_ABI_FLAG, sep=";", end="")
sebpuetz commented 4 years ago

Now we're almost there, builds are now properly linked but I hardcoded libfinalfusion.so to be loaded in python. Once that's fixed, we should see confirmation that the Op can be built and loaded. We could then add a pytest that confirms that opening and closing the resource works.

sebpuetz commented 4 years ago

I keep forgetting about that.

I just pushed a fixed version and an additional commit that adds a pytest to verify that we can initialize and close embeddings.

Something we still have to consider is whether we want a pure Python wrapper around the library. That would have implications on where we'd add documentation for the ops. If we directly expose the generated ops, we'll have to add documentation in the REGISTER_OP invocations. Otherwise we could simply add docstrings in a Python package.