Closed sebpuetz closed 4 years ago
$ python -c 'import tensorflow as tf; print(tf.sysconfig.get_link_flags())'
['-L/nix/store/fcj24bdwdyv5hdnkz436cqlhxlzm8lhj-python3.7-tensorflow-1.13.1/lib/python3.7/site-packages/tensorflow', '-ltensorflow_framework']
$ ls /nix/store/fcj24bdwdyv5hdnkz436cqlhxlzm8lhj-python3.7-tensorflow-1.13.1/lib/python3.7/site-packages/tensorflow
__init__.py aux-bin core libtensorflow_framework.so tools
__pycache__ compiler examples lite
_api contrib include python
So the path is correct. The f-up that you are bumping into is that apparently in the Python module on macOS libtensorflow_framework
uses .so
as an extension? The compiler will look for a .dylib
with -l
.
Weird, that's how it looks like on travis-ci:
$ ci/script.sh
+ls /Users/travis/build/finalfusion/finalfusion-tensorflow/venv/lib/python3.7/site-packages/tensorflow
__init__.py include
__pycache__ libtensorflow_framework.1.14.0.dylib
_api libtensorflow_framework.1.dylib
compiler libtensorflow_framework.dylib
contrib lite
core python
examples tools
So the file is there, under the correct name, too.
(see https://travis-ci.org/finalfusion/finalfusion-tensorflow/jobs/584347682)
Just to add, this Nix derivation was built from the upstream Tensorflow wheel on macOS, so no difference there:
$ nix eval nixpkgs.python3Packages.tensorflow.src.urls
[ "https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.13.1-py3-none-any.whl" ]
(On Linux, the Python package is built from source in current nixpkgs.)
Weird, that's how it looks like on travis-ci:
Oh, maybe that changed with Tensorflow 1.14.0? So did you try what
python -c 'import tensorflow as tf; print(tf.sysconfig.get_link_flags())'
Gives on Travis-CI?
By the way, that link flag in the Travis CI output is really weird:
-l:libtensorflow_framework.1.dylib
It should be:
-ltensorflow_framework
which is also what it gives on my Mac (as above).
Weird, that's how it looks like on travis-ci:
Oh, maybe that changed with Tensorflow 1.14.0? So that you try what
python -c 'import tensorflow as tf; print(tf.sysconfig.get_link_flags())'
Gives on Travis-CI?
I just looked at the verbose output of CMake, the flags are -L/Users/travis/build/finalfusion/finalfusion-tensorflow/venv/lib/python3.7/site-packages/tensorflow -l:libtensorflow_framework.1.dylib
. Then I added ls /Users/travis/build/finalfusion/finalfusion-tensorflow/venv/lib/python3.7/site-packages/tensorflow
to the beginning of the CI script, and the .1.dylib
file shows up in that location.
Fwiw, the passing linux build also appends the file-ending to the `-l' flag.
There's definitely a difference between versions, pip installs 1.14 per default. Iirc, 1.13 doesn't compile for another reason, ResourceBase::DebugString
is not const
until 1.14.
Could it be that clang and gcc act differently when looking for libraries to link? That's another difference between linux and osx builds.
Fwiw, the passing linux build also appends the file-ending to the `-l' flag.
It's a different linker, so it probably puts up with different things. However, I am wondering where the weird link flags come from, because on Linux I also get ugly flags:
$ python -c 'import tensorflow as tf; print(tf.sysconfig.get_link_flags())'
['-L/nix/store/g7v4cvncc832c3sxapmam4d3bhal5fi8-python3.7-tensorflow-1.14.0/lib/python3.7/site-packages/tensorflow', '-l:libtensorflow_framework.so.1']
Maybe some regression in 1.14? IIRC they added the proper OS-specific suffix and SO version in 1.14. Maybe they didn't adjust the get_link_flags()
logic?
https://github.com/tensorflow/tensorflow/pull/30656
Maybe for now: just get the library path from Tensorflow, and always use the -ltensorflow_framework
library flag?
(BTW, they do weird things in that PR, adding the SO version is not needed.)
They definitely touched the link flags in 1.14 since that's where they introduced the .1
suffix for 1.x
releases to differentiate from 2.x
.
I just tried -ltensorflow_framework
(while getting the folder from the python package) locally, with that change libtensorflow_framework
is not linked correctly:
$ python -c "import tensorflow as tf; tf.load_op_library('./finalfusion-tf/libfinalfusion_tf.so')"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: libtensorflow_framework.so: cannot open shared object file: No such file or directory
$ ldd finalfusion-tf/libfinalfusion_tf.so
linux-vdso.so.1 (0x00007fffbef85000)
libtensorflow_framework.so => not found
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ff4665f1000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ff4663d2000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ff466049000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ff465cab000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ff465a93000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff4656a2000)
/lib64/ld-linux-x86-64.so.2 (0x00007ff466b76000)
So it looks like we need the .1
suffix on linux to get correct linkage?
I just tried
-ltensorflow_framework
(while getting the folder from the python package) locally, with that changelibtensorflow_framework
is not linked correctly:$ python -c "import tensorflow as tf; tf.load_op_library('./finalfusion-tf/libfinalfusion_tf.so')" Traceback (most recent call last): File "<string>", line 1, in <module> File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library lib_handle = py_tf.TF_LoadLibrary(library_filename) tensorflow.python.framework.errors_impl.NotFoundError: libtensorflow_framework.so: cannot open shared object file: No such file or directory
So it looks like we need the
.1
suffix on linux to get correct linkage?
Ah yes, now I understand why they added the suffix in the PR. Normally, ldconfig
would generate the .so
file. E.g. if you have libfoo.so.2.3
, ldconfig
will generate symlinks libfoo.so.2
and libfoo.so
. But since this is hidden in a Python package, the symlinks are never generated.
tl;dr: don't hide your libraries in Python packages when you want people to link against them ;).
So, what's a possible work-around? Roll back to importing the lib as a target in CMake and locate it through tf.sysconfig.get_lib()
and CMake's find_library()
?
So, what's a possible work-around? Roll back to importing the lib as a target in CMake and locate it through
tf.sysconfig.get_lib()
and CMake'sfind_library()
?
Oh, I thought just using the lib dir as is and then explicitly add -ltensorflow_framework.so.1
. TF2 is still a moving target anyway, and once there is a 1.14.1 or 1.15.0, we can rely on the output of Tensorflow again given that PR.
How does that solve the failing OSX builds tho? If I'm not misunderstanding anything, that'll resemble the behaviour of taking the linkflags straight out of tf.sysconfig
. pip-installed 1.14 (on travis) doesn't have a .so.1
either.
Do you suggest to target 1.13 instead until a fix might be released?
I was thinking you could get the library extension through CMake, but that's still tricky, since in macOS the library version comes first.
Do you suggest to target 1.13 instead until a fix might be released?
Or target Linux until there is a fix (as an alternative).
This could work as a temporary solution:
l_flags = tf.sysconfig.get_link_flags()
if platform.system() == "Darwin":
l_flags[1] = "-l:"+l_flags[1].strip(".dylib").strip("-l:lib")
print(" ".join(l_flags), tf.sysconfig.get_include(), tf.sysconfig.CXX11_ABI_FLAG, sep=";", end="")
Now we're almost there, builds are now properly linked but I hardcoded libfinalfusion.so
to be loaded in python. Once that's fixed, we should see confirmation that the Op can be built and loaded.
We could then add a pytest that confirms that opening and closing the resource works.
I keep forgetting about that.
I just pushed a fixed version and an additional commit that adds a pytest to verify that we can initialize and close embeddings.
Something we still have to consider is whether we want a pure Python wrapper around the library. That would have implications on where we'd add documentation for the ops. If we directly expose the generated ops, we'll have to add documentation in the REGISTER_OP
invocations. Otherwise we could simply add docstrings in a Python package.
These builds fail with
on OSX. I can't really look into this since I don't have access to a Mac around here. @danieldk Could you check whether the library is actually located where
tf.sysconfig.get_link_flags()
specifies it and if the name is correct?