Xilinx / inference-server

https://xilinx.github.io/inference-server/
Apache License 2.0
43 stars 13 forks source link

Use RTLD_DEEPBIND for TF+ZenDNN #111

Closed varunsh-xilinx closed 1 year ago

varunsh-xilinx commented 1 year ago

Summary of Changes

Motivation

An upcoming change in the Tensorflow version used by tfzednn creates a symbol conflict between it and the inference server due to mismatching protobuf symbols. In the change, the correct protobuf symbols are located in another TF library but these symbols aren't found because the server is already linking protobuf.

Implementation

I and @amuralee-amd explored many options to find a workable solution to resolve the symbol conflict. For future reference, here's what I tried:

  1. Using RTLD_DEEPBIND for all workers. This is a good idea in theory because this library version mismatch occurred now with tfzendnn but it can happen again with other workers. Using this option for loading all workers should isolate them. Unfortunately, this creates a number of problems. std::cout stops working in the loaded shared library and certain functions in libstdc++ raise bad_cast exceptions.
  2. Not linking the workers to libamdinfer.so was done in part to address another issue with using RTLD_DEEPBIND which resulted in some global symbols like the logger not being correctly initialized in the loaded library. By not linking it, the worker would refer back to the version in the global scope instead.
  3. Using dlmopen instead of dlopen to load the library in a different namespace creates different problems. For example, gdb can't easily peer into the loaded library. There are also other posts online discussing the various issues around using dlmopen
  4. Certain fixes I tried worked in some cases but not others. Currently, there are the Python examples (which may or may not start the server from Python), the C++ examples, and the tests. I don't know enough about how the load-time process works in C++ to begin to compare it to how Python is doing it to wrapped library made with Pybind11.
  5. Building the workers with -nodefaultlibs and similar flags to avoid linking the standard library (and hopefully let it resolve from the main scope) also didn't work at compile time. Manually editing the .dynamic section to remove libraries with patchelf also didn't work (though removing libc did have an effect in that free() stopped working).
  6. Using a trampoline utility like Implib.so didn't work either. You get missing symbols possibly related to the vtables but adding the vtables for libtensorflow_cc.so took too long to generate.
  7. I found a Python command os.setdlopenflags() that I needed to use to change the flags used by Python to work with libraries that have been opened with RTLD_DEEPBIND.
  8. Having TF produce a single library instead of two would require a lot of changes to the TF build system and issues with the legal scan - @amuralee-amd
gbuildx commented 1 year ago

Build failed!

gbuildx commented 1 year ago

Build failed!

gbuildx commented 1 year ago

Build failed!

varunsh-xilinx commented 1 year ago

This passed after running again