bigmlcom / sensenet


Tensorflow 2.9 Upgrade #37

Open charleslparker opened 1 year ago

charleslparker commented 1 year ago

Starting with TensorFlow 2.9.x, the TensorFlow builds set the compiler flag _GLIBCXX_USE_CXX11_ABI=1 by default. This is causing linker errors in the CI builds on GitHub, although the build works for me locally on a Mac and on the wintermute Linux build server. Specifically, on the GitHub CI builds everything appears to build fine, but when we run pytest -sv tests/test_tree.py we get the following error:

  ==================================== ERRORS ====================================
  _____________________ ERROR collecting tests/test_tree.py ______________________
  tests/test_tree.py:9: in <module>
      import sensenet.importers
  sensenet/importers.py:42: in <module>
      bigml_tf_module = tensorflow.load_op_library(treelib[0])
  /tmp/tmp.8RK8W1x4vs/venv/lib/python3.8/site-packages/tensorflow/python/framework/load_library.py:54: in load_op_library
      lib_handle = py_tf.TF_LoadLibrary(library_filename)
  E   tensorflow.python.framework.errors_impl.NotFoundError: /tmp/tmp.8RK8W1x4vs/venv/lib/python3.8/site-packages/bigml_tf_tree.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK10tensorflow8OpKernel11TraceStringERKNS_15OpKernelContextEb

The problem here is that the custom TensorFlow extension that handles the internal trees sometimes generated by deepnets has been built with "old ABI" compatibility, whereas TF 2.9.x uses the "new ABI" (see https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html).
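One quick way to sanity-check this reading is to look at the mangled name itself. GCC marks symbols that depend on the new libstdc++ ABI either with a `B5cxx11` abi tag on the function name or with the `std::__cxx11` namespace in a parameter type. The sketch below is only a string check; `has_cxx11_abi_marker` is a hypothetical helper, and the "tagged" spelling is what I would expect the new-ABI symbol to look like, assuming `TraceString` returns `std::string`. It has not been confirmed against the actual TF 2.9 library.

```python
# Markers GCC puts into mangled names that depend on the new libstdc++ ABI:
#   "B5cxx11"  - abi_tag on the function name (e.g. the return type is std::string)
#   "7__cxx11" - the std::__cxx11 inline namespace inside a parameter type
NEW_ABI_MARKERS = ("B5cxx11", "7__cxx11")


def has_cxx11_abi_marker(mangled_symbol: str) -> bool:
    """Return True if the mangled name carries a new-ABI marker."""
    return any(marker in mangled_symbol for marker in NEW_ABI_MARKERS)


# The undefined symbol from the pytest traceback above.
missing = "_ZNK10tensorflow8OpKernel11TraceStringERKNS_15OpKernelContextEb"

# ASSUMPTION: the spelling a new-ABI build would be expected to export.
tagged = "_ZNK10tensorflow8OpKernel11TraceStringB5cxx11ERKNS_15OpKernelContextEb"

print(has_cxx11_abi_marker(missing))
print(has_cxx11_abi_marker(tagged))
```

An untagged name does not prove an old-ABI build on its own, since functions that never touch `std::string` or `std::list` carry no marker under either ABI. Here it is at least consistent with the extension asking for the old-ABI symbol while TF exports a tagged one.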

It's odd, because I verified that the correct flag gets passed to the compile step here (when using TF 2.9.x):

https://github.com/charleslparker/sensenet/blob/master/setup.py#L54

and purposely overriding it (by replacing the =1 with =0 for the flag in compile_args) causes the same test to break with other linker errors on my local machine and on the Linux server. So something strange is going on with the compile step on GitHub specifically. Maybe the Docker images used by cibuildwheel on GitHub have an old version of libstdc++?
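To compare what each environment actually asks for, a small check along these lines could be run on the Mac, the Linux server, and inside the CI container. It is a sketch: `abi_from_flags` is a hypothetical helper, not something in setup.py, and the only TensorFlow call used is `tf.sysconfig.get_compile_flags()`.

```python
import re
from typing import Iterable, Optional

_ABI_FLAG = re.compile(r"^-D_GLIBCXX_USE_CXX11_ABI=(\d)$")


def abi_from_flags(flags: Iterable[str]) -> Optional[int]:
    """Return the _GLIBCXX_USE_CXX11_ABI value found in flags, or None if unset."""
    value = None
    for flag in flags:
        match = _ABI_FLAG.match(flag)
        if match:
            # Keep the last occurrence: with GCC a later -D overrides an earlier one.
            value = int(match.group(1))
    return value


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("tensorflow is not installed; nothing to check")
    else:
        flags = tf.sysconfig.get_compile_flags()
        print("ABI requested by TF compile flags:", abi_from_flags(flags))
```

If this prints 1 everywhere, including on the CI image, the flag itself is probably not the culprit, and the toolchain in the container becomes the more likely suspect.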

The exact linker error we get is documented here: https://pgaleone.eu/tensorflow/bazel/abi/c++/2021/04/01/tensorflow-custom-ops-bazel-abi-compatibility/

where they say you have to rebuild tensorflow to fix it. I refuse to believe this!
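Before rebuilding anything, the old-libstdc++ guess above could be tested directly. The dual ABI arrived with GCC 5.1, whose libstdc++ exports version symbols up to GLIBCXX_3.4.21, so a library that tops out below that cannot provide the new-ABI symbols. The sketch below scans a library's bytes for those version strings, much like `strings libstdc++.so.6 | grep GLIBCXX`. The helper names are hypothetical, and the library path varies by image, so it is left to the caller.

```python
import re
from typing import Optional, Tuple

_GLIBCXX = re.compile(rb"GLIBCXX_(\d+)\.(\d+)\.(\d+)")

# GLIBCXX version that shipped with GCC 5.1, the first release with the dual ABI.
FIRST_DUAL_ABI = (3, 4, 21)


def max_glibcxx_version(blob: bytes) -> Optional[Tuple[int, ...]]:
    """Return the highest three-part GLIBCXX_x.y.z version found in blob, or None."""
    versions = [tuple(int(part) for part in m.groups()) for m in _GLIBCXX.finditer(blob)]
    return max(versions) if versions else None


def supports_new_abi(blob: bytes) -> bool:
    """True if the library is new enough to contain new-ABI symbols at all."""
    version = max_glibcxx_version(blob)
    return version is not None and version >= FIRST_DUAL_ABI


if __name__ == "__main__":
    import sys

    # Usage: python check_libstdcxx.py /path/to/libstdc++.so.6  (path is an assumption)
    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as fh:
            data = fh.read()
        print(max_glibcxx_version(data), supports_new_abi(data))
```

This is a necessary condition only: a new enough libstdc++ does not guarantee that the compiler in the image honors the =1 setting, but an old one would explain the CI-only failure.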

Popping up the stack a bit: this op is only used when deepnets generate these internal trees (e.g., when you set "tree embedding = True" when training a deepnet, or use "Automatic structure search"). This extension has been a pain so many times that maybe we should remove it.

charleslparker commented 1 year ago

Incidentally, versions need to be updated in two places:

https://github.com/charleslparker/sensenet/blob/master/setup.py#L15

https://github.com/charleslparker/sensenet/blob/master/pyproject.toml#L3