Closed sebpuetz closed 4 years ago
Not sure what's going on here. The OSX build succeeds and local builds work, but the Linux CI build breaks down with a segfault in the graph-mode test. Apparently there's a mismatch between tensorflow-rocm and tensorflow. I also get the segfault when I install tensorflow via pip instead of tensorflow-rocm.
Whoops, pressed a button ;).
If I comment the SetShapeFn out in ops/FFLookupOps.cc, the segfault goes away. So it looks like something's messed up with that.
Following this description, it's possible to get shapes through ::tensorflow::shape_inference::InferenceContext. Unfortunately, any usage of c in the following leads to segfaults:
.SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
  ShapeHandle strings_shape = c->input(1);
  ShapeHandle output_shape;
  int embedding_len;
  TF_RETURN_IF_ERROR(c->GetAttr("embedding_len", &embedding_len));
  TF_RETURN_IF_ERROR(
      c->Concatenate(strings_shape, c->Vector(embedding_len), &output_shape));
  ShapeHandle embeds = c->output(0);
  TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 0, &embeds));
  c->set_output(0, output_shape);
  return Status::OK();
});
According to https://github.com/tensorflow/tensorflow/issues/29951#issuecomment-505729905, downgrading the compiler to g++ 4.8 supposedly fixes it.
A maintainer in the thread suggested following the instructions at https://github.com/tensorflow/custom-op, which means using a Docker image with the correct environment. That Docker image comes with gcc-4.8, which doesn't support make_unique yet. This is quite confusing to me, since they state that these are the environments under which the official wheels are built, but the tf codebase contains quite a lot of make_uniques. I'm somewhat confused by this whole situation.
It might make sense to restructure the project and follow the custom-op example. Although that wouldn't solve my confusion about the compiler version / C++ features either.
It would be great if you could also take a quick look at this and give your opinion!
edit: Seems to work both in the custom-op docker container and on Ubuntu with gcc 4.8.4, using set(CMAKE_CXX_STANDARD 11) and replacing the make_uniques with unique_ptr<T>(new T). Making this work on CI requires changing finalfusion-cxx accordingly...
That docker comes with gcc-4.8

That is old. But they probably want to build with an old glibc/gcc for compatibility with older distributions.

which doesn't support make_unique yet. Which is quite confusing to me since they state that these are the environments under which the official wheels are built, but the tf codebase contains quite a lot of make_uniques. I'm somewhat confused by this whole situation.

They probably get make_unique from abseil:
The ABI between compiler versions is not guaranteed to be stable. So, if Tensorflow is compiled with a different version than an op, there may be subtle ABI incompatibilities. You generally don't notice; if you are lucky, you get segmentation faults, and if you are unlucky, there is silent data corruption. This is why Rust prefers static linking of units compiled with the same compiler version.
At any rate, when you compile with a newer g++, you could try to force an older ABI with -fabi-version. According to https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html you want -fabi-version=7. Of course, this assumes that Tensorflow wasn't compiled with a different ABI version. If setting the ABI version works, I think this is vastly preferable to complicating builds with Docker images (which just plain sucks) or ancient Linux distributions (though Ubuntu 18.04 is not ancient, of course).
There already is a CXX11_ABI_FLAG that we are setting. Without setting this to the correct value (we get that from the python package), symbols from libtensorflow_framework aren't found.
I tried setting various -fabi-versions, but none solve the segfaulting. The only approach working so far was using g++-4.8, which worked on both Ubuntu 18.04 and 14.04. Is there any obvious downside to this?
The custom op repo also states that newer (manylinux2010-compatible) wheels are built with Ubuntu 16.04 docker images, which come with g++-5.4.
Fwiw, not setting the ABI in cmake should allow people to compile against self-compiled tf packages with a compiler of their choice.
This AWS project also pins gcc to 4.8 to build their custom ops: https://github.com/aws/sagemaker-tensorflow-extensions
There already is a CXX11_ABI_FLAG that we are setting. Without setting this to the correct value (we get that from the python package), symbols from libtensorflow_framework aren't found.
That's another knob. This sets _GLIBCXX_USE_CXX11_ABI, which uses newer C++11-compatible declarations of some classes. This changes the C++ standard library API and thus the ABI.
So, the library can change the ABI and the compiler can change the ABI (e.g. by changing alignment preferences).
I tried setting various -fabi-versions, but none solve the segfaulting.

I think it is standard for Google to compile C++ without exceptions, so besides setting the ABI you might also want to add -fno-exceptions.
Fwiw, not setting the ABI in cmake should allow people to compile against self-compiled tf packages with a compiler of their choice.
It could just be a CMake option.
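Such a CMake option could look roughly like this — a sketch with hypothetical option/variable names, assuming tf.sysconfig.CXX11_ABI_FLAG is available in the installed tensorflow package:

```cmake
# Hedged sketch: expose the ABI setting as a CMake option instead of
# hard-coding it (option and variable names here are hypothetical).
option(QUERY_TF_ABI "Take _GLIBCXX_USE_CXX11_ABI from the installed tensorflow package" ON)

if(QUERY_TF_ABI)
  execute_process(
    COMMAND python -c "import tensorflow as tf; print(tf.sysconfig.CXX11_ABI_FLAG)"
    OUTPUT_VARIABLE TF_ABI
    OUTPUT_STRIP_TRAILING_WHITESPACE)
  add_definitions(-D_GLIBCXX_USE_CXX11_ABI=${TF_ABI})
endif()
```

Turning the option off would let people building against a self-compiled tensorflow pick their own compiler and ABI.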
More info on the upstream library:
$ readelf -p .comment /nix/store/s27l1m937bsw60fcdv3cabzbcdy4bpy4-python3.7-tensorflow-1.14.0/lib/python3.7/site-packages/tensorflow/libtensorflow_framework.so.1
String dump of section '.comment':
[ 1] GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
So, they are actually using gcc 5.4.0. (Which confirms the statement.) Did you try to compile on CI with 5.4 and no additional flags?
That's different for me:
readelf -p .comment /home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/libtensorflow_framework.so.1
String dump of section '.comment':
[ 1] GCC: (Ubuntu 4.8.5-4ubuntu8~14.04.2) 4.8.5
5.4 is what is supposedly used for packages built after August 1st, 2019. I haven't tried 5.4.0 yet.
Maybe they used different build environments for different Python versions? You are on 3.6, while I am on 3.7.
I think it is standard for Google to compile C++ without exceptions, so besides setting the ABI you might also want to add -fno-exceptions.
We're throwing an exception in the constructor if constructing the embeddings in Rust fails. So with that flag we can't build unless we introduce some Status type.
From what I can tell there is also no such flag set in the custom op example repo: https://github.com/tensorflow/custom-op/blob/master/Makefile
TF_CFLAGS := $(shell $(PYTHON_BIN_PATH) -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))')
TF_LFLAGS := $(shell $(PYTHON_BIN_PATH) -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))')
CFLAGS = ${TF_CFLAGS} -fPIC -O2 -std=c++11
LDFLAGS = -shared ${TF_LFLAGS}

ZERO_OUT_TARGET_LIB = tensorflow_zero_out/python/ops/_zero_out_ops.so

# zero_out op for CPU
zero_out_op: $(ZERO_OUT_TARGET_LIB)

$(ZERO_OUT_TARGET_LIB): $(ZERO_OUT_SRCS)
	$(CXX) $(CFLAGS) -o $@ $^ ${LDFLAGS}
Maybe they used different build environments for different Python versions? You are on 3.6, while I am on 3.7.
I verified that: 3.7.3 is built on 16.04 with g++-5.4.
So then the probably absolutely correct way would be to build for Python 3.6 using gcc 4.8 and for Python 3.7 using gcc 5.4. (Since there are two compiler ABI changes in between. Not sure whether they matter in this case.)
I found some more info on this at https://github.com/tensorflow/tensorflow/issues/27067 and https://github.com/tensorflow/community/pull/77. For now, matching the compiler version and flags seems like the only way to guarantee working builds.
We can get the compiler version from the python package, too:
import tensorflow as tf
tf.version.COMPILER_VERSION
returns this, so I guess we can use that to select the correct compiler, too.
With https://github.com/tensorflow/community/pull/77 it would be possible to build packages independent of what was used to compile pip tensorflow.
I put together a setup.py by following what Uber is doing with horovod (https://github.com/horovod/horovod/blob/master/setup.py): they set a min compiler version through tf.version.COMPILER_VERSION and an optional max version by checking whether the major version of COMPILER_VERSION == 4. Then they select the highest compatible gcc / g++ version found in $PATH and export those as CC / CXX. If either CC or CXX is exported at the invocation of setup.py, the explicitly chosen compiler is used instead.
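The selection logic described above might look roughly like this — a simplified sketch with hypothetical names, not the actual horovod or setup.py code:

```python
import os
import re
import shutil


def parse_major(compiler_version):
    """Extract the major version from a string like '4.8.5' or '5.4.0 20160609'."""
    match = re.search(r"(\d+)\.(\d+)", compiler_version)
    return int(match.group(1)) if match else None


def pick_cxx(tf_compiler_version, candidates=("g++-4.8", "g++-5", "g++")):
    """Pick a C++ compiler: honor an explicit CXX from the environment,
    otherwise take the first candidate found on $PATH, constrained to a
    gcc 4.x compiler when tensorflow itself was built with gcc 4.x."""
    if os.environ.get("CXX"):
        return os.environ["CXX"]
    major = parse_major(tf_compiler_version)
    for cxx in candidates:
        if shutil.which(cxx) is None:
            continue
        if major == 4 and "4." not in cxx:
            continue  # stay on the old ABI when TF was built with gcc 4.x
        return cxx
    return None
```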
That'll be part of a future PR, so I think the only question for this PR is how to set up the CI and whether what I did is correct for our purposes.
Sounds good, especially with the possibility to override the compiler.
I think for this PR then, it would be good enough to use a fixed compiler version (corresponding to what the Python module needs).
This PR now has two Linux builds, one for Python 3.6 and one for 3.7. For the 3.6 build I export g++-4.8 as CXX; for 3.7 I just leave the default compiler (7.4) since there's no ABI incompatibility. I think for building the shared library, setting the compiler version explicitly is fine; we just might want to document this somewhere. Automated detection and setup could be left to setup.py.
Does that sound good to you?
Excellent!
This PR adds functionality to do embedding lookups in tensorflow.