Summary of Changes
Use dlopen to open the TensorFlow library for TF+ZenDNN
Remove extra linkages in libraries
Fix shutdown race condition with gRPC
Motivation
An upcoming change in the TensorFlow version used by tfzendnn creates a symbol conflict between it and the inference server due to mismatching protobuf symbols. After the change, the correct protobuf symbols live in another TF library, but they are never resolved from there because the server already links protobuf.
Implementation
@amuralee-amd and I explored many options to find a workable solution to the symbol conflict. For future reference, here's what I tried:
Using RTLD_DEEPBIND for all workers. This is a good idea in theory: the library version mismatch showed up with tfzendnn this time, but it can happen again with other workers, and loading every worker with this flag should isolate them. Unfortunately, it creates a number of problems. std::cout stops working in the loaded shared library and certain functions in libstdc++ raise bad_cast exceptions.
Not linking the workers to libamdinfer.so was done in part to address another issue with using RTLD_DEEPBIND which resulted in some global symbols like the logger not being correctly initialized in the loaded library. By not linking it, the worker would refer back to the version in the global scope instead.
Using dlmopen instead of dlopen to load the library in a different namespace creates different problems. For example, gdb can't easily peer into the loaded library. There are also other posts online discussing the various issues around using dlmopen.
Certain fixes I tried worked in some cases but not others. Currently, there are the Python examples (which may or may not start the server from Python), the C++ examples, and the tests. I don't know enough about how the load-time process works in C++ to compare it with how Python loads a wrapped library made with pybind11.
Building the workers with -nodefaultlibs and similar flags to avoid linking the standard library (and hopefully let it resolve from the main scope) also didn't work at compile time. Manually editing the .dynamic section to remove libraries with patchelf also didn't work (though removing libc did have an effect in that free() stopped working).
Using a trampoline utility like Implib.so didn't work either. You get missing symbols, possibly related to the vtables, but generating the vtables for libtensorflow_cc.so took too long.
I found a Python function, sys.setdlopenflags(), that I needed to use to change the flags Python uses when loading extension modules so that it works with libraries that have been opened with RTLD_DEEPBIND.
Having TF produce a single library instead of two would require a lot of changes to the TF build system and would raise issues with the legal scan (@amuralee-amd).
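For readers unfamiliar with the flag discussed above, here is a minimal sketch of loading a shared library with RTLD_DEEPBIND on Linux/glibc. It is not the code from this PR: the helper names openIsolated and findSymbol are hypothetical, and the choice of RTLD_NOW | RTLD_LOCAL alongside RTLD_DEEPBIND is an assumption about sensible defaults.

```cpp
#include <dlfcn.h>

#include <stdexcept>
#include <string>

// Hypothetical helper (not from this PR): open a shared library so that it
// prefers its own symbols, and those of its dependencies, over symbols with
// the same name that are already present in the global scope.
void* openIsolated(const std::string& path) {
  // RTLD_DEEPBIND is a glibc extension. RTLD_LOCAL keeps the library's
  // symbols from being used to resolve references in later-loaded libraries.
  void* handle = dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL | RTLD_DEEPBIND);
  if (handle == nullptr) {
    const char* err = dlerror();
    throw std::runtime_error(err != nullptr ? err : "dlopen failed");
  }
  return handle;
}

// Hypothetical helper: look up a symbol in a library opened above.
void* findSymbol(void* handle, const std::string& name) {
  dlerror();  // clear any stale error before the lookup
  void* symbol = dlsym(handle, name.c_str());
  const char* err = dlerror();
  if (err != nullptr) {
    throw std::runtime_error(err);
  }
  return symbol;
}
```

A caller would pass the path of the library to isolate, look up an entry point with findSymbol, and cast it to the expected function type. On older glibc versions the program may need to be linked with -ldl.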