finalfusion / finalfusion-tensorflow-ops

Tensorflow Op for finalfusion embeddings
https://finalfusion.github.io/

Add Lookup Op and unit tests. #3

Closed sebpuetz closed 4 years ago

sebpuetz commented 4 years ago

This PR adds functionality to do embedding lookups in tensorflow.

sebpuetz commented 4 years ago

Not sure what's going on here. The OSX build succeeds and local builds work, but the Linux CI build breaks down with a segfault in the graph-mode test.

sebpuetz commented 4 years ago

Apparently it's not just a mismatch between tensorflow-rocm and tensorflow: I also get the segfault when I install tensorflow via pip instead of tensorflow-rocm.

danieldk commented 4 years ago

Whoops, pressed a button ;).

sebpuetz commented 4 years ago

If I comment out the SetShapeFn in ops/FFLookupOps.cc, the segfault goes away, so it looks like something is going wrong there.

sebpuetz commented 4 years ago

Following this description, it's possible to get shapes through ::tensorflow::shape_inference::InferenceContext. Unfortunately, any usage of c in the following leads to segfaults:

    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      ShapeHandle strings_shape = c->input(1);
      ShapeHandle output_shape;
      int embedding_len;
      TF_RETURN_IF_ERROR(c->GetAttr("embedding_len", &embedding_len));
      TF_RETURN_IF_ERROR(
        c->Concatenate(strings_shape, c->Vector(embedding_len), &output_shape)
      );
      ShapeHandle embeds = c->output(0);
      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 0, &embeds));
      c->set_output(0, output_shape);
      return Status::OK();
    });
sebpuetz commented 4 years ago

According to https://github.com/tensorflow/tensorflow/issues/29951#issuecomment-505729905, downgrading the compiler to g++ 4.8 supposedly fixes it.

sebpuetz commented 4 years ago

A maintainer in the thread suggested following the instructions at https://github.com/tensorflow/custom-op, which means building inside a Docker image with the correct environment. That image ships gcc 4.8, which doesn't support make_unique yet. This is quite confusing to me, since they state that these are the environments under which the official wheels are built, yet the tf codebase contains quite a lot of make_uniques.

It might make sense to restructure the project and follow the custom-op example. Although that wouldn't solve my confusion about the compiler version / c++ features either.

It would be great if you could also take a quick look at this and give your opinion!

edit: It seems to work both in the custom-op Docker container and on Ubuntu with gcc 4.8.4, set(CMAKE_CXX_STANDARD 11), and replacing the make_uniques with unique_ptr<T>(new T). Making this work on CI requires changing finalfusion-cxx accordingly...

danieldk commented 4 years ago

> That docker comes with gcc-4.8

That is old. But they probably want to build with an old glibc/gcc for compatibility with older distributions.

> which doesn't support make_unique yet. Which is quite confusing to me since they state that these are the environments under which the official wheels are built, but the tf codebase contains quite a lot of make_uniques. I'm somewhat confused by this whole situation.

They probably get make_unique from abseil:

https://abseil.io/

> It might make sense to restructure the project and follow the custom-op example. Although that wouldn't solve my confusion about the compiler version / c++ features either.

The ABI between compiler versions is not guaranteed to be stable. So, if Tensorflow is compiled with a different version than an op, there may be subtle ABI incompatibilities. You generally don't notice. If you are lucky, you get segmentation faults; if you are unlucky, there is silent data corruption. This is why Rust prefers static linking of units compiled with the same compiler version.

At any rate, when you compile with a newer g++, you could try to force an older ABI with -fabi-version. According to

https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html

you want -fabi-version=7. Of course, this assumes that Tensorflow wasn't compiled with a different ABI version. If setting the ABI version works, I think this is vastly preferable to complicating builds with Docker images (which just plain sucks) or ancient Linux distributions (though Ubuntu 18.04 is not ancient, of course).

sebpuetz commented 4 years ago

> The ABI between compiler versions is not guaranteed to be stable. So, if Tensorflow is compiled with a different version than an op, there may be subtle ABI incompatibilities. You generally don't notice. If you are lucky, you get segmentation faults, if you are unlucky, there is silent data corruption. This is why Rust prefers statically linking of units compiled with the same compiler version.
>
> At any rate, when you compile with a newer g++, you could try to force an older ABI with -fabi-version. According to
>
> https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html
>
> you want -fabi-version=7. Of course, assuming that Tensorflow wasn't compiled with a different ABI version. If setting the ABI version works, I think this is vastly preferred over complicating builds with Docker images (which just plain sucks) or ancient Linux distributions (though Ubuntu 18.04 is not ancient, of course).

There already is a CXX11_ABI_FLAG that we are setting. Without setting this to the correct value (we get that from the python package), symbols from libtensorflow_framework aren't found.

I tried setting various -fabi-versions, but none solve the segfaulting. The only approach working so far was using g++-4.8, which worked on both Ubuntu 18.04 and 14.04. Is there any obvious downside to this?

The custom op repo also states that newer (manylinux2010-compatible) wheels are built with Ubuntu 16.04 docker images, which come with g++ 5.4.

Fwiw, not setting the ABI in cmake should allow people to compile against self-compiled tf packages with a compiler of their choice.

sebpuetz commented 4 years ago

This AWS project also pins gcc to 4.8 to build their custom ops: https://github.com/aws/sagemaker-tensorflow-extensions

danieldk commented 4 years ago

> There already is a CXX11_ABI_FLAG that we are setting. Without setting this to the correct value (we get that from the python package), symbols from libtensorflow_framework aren't found.

That's another knob. This sets _GLIBCXX_USE_CXX11_ABI, which selects the newer C++11-compatible declarations of some standard library classes. This changes the C++ standard library API and thus the ABI.

So, the library can change the ABI and the compiler can change the ABI (e.g. by changing alignment preferences).

> I tried setting various -fabi-versions, but none solve the segfaulting.

I think it is standard for Google to compile C++ without exceptions, so besides setting the ABI you might also want to add -fno-exceptions.

> Fwiw, not setting the ABI in cmake should allow people to compile against self-compiled tf packages with a compiler of their choice.

It could just be a CMake option.
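One way such an option could look (the option and variable names here are illustrative, not the project's actual CMake code):

```cmake
# Hypothetical sketch: make the ABI flag overridable instead of always
# deriving it from the installed tensorflow package.
set(TF_CXX11_ABI "" CACHE STRING
    "Value for _GLIBCXX_USE_CXX11_ABI; empty means: query the tensorflow pip package")
if(TF_CXX11_ABI STREQUAL "")
  # ... query tf.sysconfig.CXX11_ABI_FLAG via the Python interpreter ...
else()
  add_definitions(-D_GLIBCXX_USE_CXX11_ABI=${TF_CXX11_ABI})
endif()
```

With the cache variable left empty, the current behavior (taking the flag from the python package) would remain the default.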

danieldk commented 4 years ago

More info on the upstream library:

    $ readelf -p .comment /nix/store/s27l1m937bsw60fcdv3cabzbcdy4bpy4-python3.7-tensorflow-1.14.0/lib/python3.7/site-packages/tensorflow/libtensorflow_framework.so.1

    String dump of section '.comment':
      [     1]  GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609

So, they are actually using gcc 5.4.0, which confirms the statement from the custom-op repo. Did you try to compile on CI with 5.4 and no additional flags?

sebpuetz commented 4 years ago

That's different for me:

    readelf -p .comment /home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/libtensorflow_framework.so.1

    String dump of section '.comment':
      [     1]  GCC: (Ubuntu 4.8.5-4ubuntu8~14.04.2) 4.8.5

5.4 is what is supposedly used for packages built after August 1st 2019. I haven't tried 5.4.0 yet.

danieldk commented 4 years ago

> That's different for me:
>
>     readelf -p .comment /home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/libtensorflow_framework.so.1
>
>     String dump of section '.comment':
>       [     1]  GCC: (Ubuntu 4.8.5-4ubuntu8~14.04.2) 4.8.5
>
> 5.4 is what is supposedly used for packages built after August 1st 19. I haven't tried 5.4.0 yet.

Maybe they used different build environments for different Python versions? You are on 3.6, while I am on 3.7.

sebpuetz commented 4 years ago

> > I tried setting various -fabi-versions, but none solve the segfaulting.
>
> I think it is standard for Google to compile C++ without exceptions, so besides setting the ABI you might also want to add -fno-exceptions.

We're throwing an exception in the constructor if constructing the embeddings in Rust fails. So with that flag we can't build unless we introduce some Status type.

From what I can tell there is also no such flag set in the custom op example repo: https://github.com/tensorflow/custom-op/blob/master/Makefile

    TF_CFLAGS := $(shell $(PYTHON_BIN_PATH) -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))')
    TF_LFLAGS := $(shell $(PYTHON_BIN_PATH) -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))')

    CFLAGS = ${TF_CFLAGS} -fPIC -O2 -std=c++11
    LDFLAGS = -shared ${TF_LFLAGS}

    ZERO_OUT_TARGET_LIB = tensorflow_zero_out/python/ops/_zero_out_ops.so
    # zero_out op for CPU
    zero_out_op: $(ZERO_OUT_TARGET_LIB)

    $(ZERO_OUT_TARGET_LIB): $(ZERO_OUT_SRCS)
        $(CXX) $(CFLAGS) -o $@ $^ ${LDFLAGS}

> Maybe they used different build environments for different Python versions? You are on 3.6, while I am on 3.7.

I verified that: 3.7.3 is built on 16.04 with g++ 5.4.

danieldk commented 4 years ago

So then the strictly correct way would probably be to build for Python 3.6 using gcc 4.8 and for Python 3.7 using gcc 5.4 (since there are two compiler ABI changes in between; not sure whether they matter in this case).

sebpuetz commented 4 years ago

I found some more info on this at https://github.com/tensorflow/tensorflow/issues/27067 and https://github.com/tensorflow/community/pull/77. For now, matching the compiler version and flags seems like the only way to guarantee working builds.

We can get the compiler version from the python package, too:

    import tensorflow as tf
    tf.version.COMPILER_VERSION

returns the compiler version, so I guess we can use that to select the correct compiler, too.

With https://github.com/tensorflow/community/pull/77 it would be possible to build packages independent of what was used to compile pip tensorflow.

sebpuetz commented 4 years ago


I put together a setup.py following what Uber does with horovod (https://github.com/horovod/horovod/blob/master/setup.py): they set a minimum compiler version through tf.version.COMPILER_VERSION and an optional maximum version by checking whether the major version of COMPILER_VERSION is 4. Then they select the highest compatible gcc / g++ version found in $PATH and export those as CC / CXX. If either CC or CXX is exported at the invocation of setup.py, the explicitly chosen compiler is used instead.

That'll be part of a future PR, so I think the only question for this PR is how to set up the CI and whether what I did is correct for our purposes.
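A rough sketch of that selection logic, loosely following the horovod approach described above. The version-string format and the function name are assumptions for illustration; the real setup.py would discover candidates by scanning $PATH rather than taking a dict:

```python
import re


def pick_gxx(compiler_version, candidates):
    """Pick the newest g++ from `candidates` compatible with the compiler
    TensorFlow was built with.

    `compiler_version` is a string like tf.version.COMPILER_VERSION,
    e.g. "GCC 5.4.0" (the exact format is an assumption here);
    `candidates` maps executable names to (major, minor) versions.
    """
    match = re.search(r"(\d+)\.(\d+)", compiler_version)
    if match is None:
        return None
    major, minor = int(match.group(1)), int(match.group(2))
    # gcc changed its ABI between the 4.x and 5.x series, so cap the
    # candidates at major version 4 when TF was built with a 4.x compiler.
    max_major = 4 if major == 4 else None
    compatible = []
    for name, (cand_major, cand_minor) in candidates.items():
        if (cand_major, cand_minor) < (major, minor):
            continue  # older than what TF was built with
        if max_major is not None and cand_major > max_major:
            continue  # past the 4.x -> 5.x ABI break
        compatible.append((cand_major, cand_minor, name))
    # Highest compatible version wins.
    return max(compatible)[2] if compatible else None


if __name__ == "__main__":
    # Example with made-up available compilers:
    print(pick_gxx("GCC 5.4.0", {"g++-5": (5, 4), "g++-7": (7, 4)}))  # g++-7
```

An explicit CC/CXX export at invocation time would simply bypass this function, as in the horovod setup.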

danieldk commented 4 years ago

> I put together a setup.py by following what uber is doing with horovod (https://github.com/horovod/horovod/blob/master/setup.py), they set a min compiler version through tf.version.COMPILER_VERSION and an optional max version by checking whether the major version of COMPILER_VERSION == 4. Then they select the highest compatible gcc / g++ version found in $PATH and export those as CC / CXX. If either CC or CXX are exported at the invocation of setup.py, the explicitly chosen compiler is used instead.

Sounds good, especially with the possibility to override the compiler.

> That'll be part of a future PR, so I think the only question for this PR is how to set up the CI and whether what I did is correct for our purposes.

I think for this PR then, it would be good enough to use a fixed compiler version (corresponding to what the Python module needs).

sebpuetz commented 4 years ago

> I think for this PR then, it would be good enough to use a fixed compiler version (corresponding to what the Python module needs).

This PR now has two Linux builds, one for Python 3.6 and one for 3.7. For the 3.6 build I export g++-4.8 as CXX; for 3.7 I just leave the default compiler (7.4), since there's no ABI incompatibility. I think setting the compiler version explicitly is fine for building the shared library, we just might want to document this somewhere. Automated detection and setup could be left to setup.py.

Does that sound good to you?

danieldk commented 4 years ago

Excellent!