Fahey-McLay / xalt


GPU tracing: potential DCGM and Protobuf incompatibility #36

Closed samcmill closed 5 years ago

samcmill commented 6 years ago

On our cluster we encountered what is hopefully a rare edge case. When GPU tracking is enabled (using DCGM), the user binary uses the Google Protobuf library, and the system Protobuf library is older than version 2.6 (the CentOS 7 default is 2.5), the user will see the following error:

[libprotobuf FATAL google/protobuf/stubs/common.cc:61] This program requires version 2.6.0 of the Protocol Buffer runtime library, but the installed version is 2.5.0.  Please update your library.  If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library.  (Version verification failed in "_out/Linux_amd64_release/nvcm.pb.cc".)
terminate called after throwing an instance of 'google::protobuf::FatalException'
  what():  This program requires version 2.6.0 of the Protocol Buffer runtime library, but the installed version is 2.5.0.  Please update your library.  If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library.  (Version verification failed in "_out/Linux_amd64_release/nvcm.pb.cc".)
Aborted

All of the above conditions must be fulfilled to get this error.
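
To check whether a given binary meets the second condition (dynamic linking against Protobuf), something like the following Python sketch may help. It only scans text in the format printed by `ldd`. The function name `dynamic_protobuf_libs` and the sample output are hypothetical, made up for illustration, and the sketch does not check the DCGM or version conditions.

```python
import re
from typing import List


def dynamic_protobuf_libs(ldd_output: str) -> List[str]:
    """Return the libprotobuf sonames listed in `ldd`-style output.

    Hypothetical helper: it assumes the usual "name => path (address)"
    line layout and ignores libprotobuf-lite.
    """
    found = []
    for line in ldd_output.splitlines():
        match = re.match(r"\s*(libprotobuf\.so[.\d]*)\s+=>", line)
        if match:
            found.append(match.group(1))
    return found


if __name__ == "__main__":
    # Made-up sample output, not taken from a real system.
    sample = (
        "\tlinux-vdso.so.1 (0x00007ffd00000000)\n"
        "\tlibprotobuf.so.8 => /usr/lib64/libprotobuf.so.8 (0x00007f0000000000)\n"
        "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0000100000)\n"
    )
    print(dynamic_protobuf_libs(sample))
```

An empty result suggests the binary does not pull in the system libprotobuf directly, although a library it depends on still could.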

The root cause is that DCGM statically links with Protobuf. If the user binary is dynamically linked with Protobuf, the linker "tricks" DCGM into using the symbols from the system libprotobuf rather than its own statically linked copies. Protobuf has an internal version compatibility check and will abort with the above error message if the system libprotobuf is older than the version statically linked into DCGM (version 2.6).
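
To illustrate the version check described above, here is a rough Python model of the comparison. It is a sketch, not Protobuf's actual code. It assumes versions are encoded as a single integer (major * 1000000 + minor * 1000 + micro), and the names `decode` and `verify_version` are hypothetical.

```python
def decode(version: int) -> str:
    """Turn an integer such as 2006000 into a string such as '2.6.0'."""
    major, rest = divmod(version, 1_000_000)
    minor, micro = divmod(rest, 1_000)
    return f"{major}.{minor}.{micro}"


def verify_version(runtime_version: int, min_library_version: int) -> None:
    """Raise if the loaded runtime is older than what the generated code needs.

    runtime_version: version of the libprotobuf that actually got loaded
                     (here, the system library).
    min_library_version: minimum version required by the generated code
                         (here, the code built into DCGM).
    """
    if runtime_version < min_library_version:
        raise RuntimeError(
            f"This program requires version {decode(min_library_version)} "
            "of the Protocol Buffer runtime library, but the installed "
            f"version is {decode(runtime_version)}."
        )


if __name__ == "__main__":
    verify_version(2006000, 2006000)  # matching versions: no error
    try:
        verify_version(2005000, 2006000)  # system 2.5.0 vs. required 2.6.0
    except RuntimeError as err:
        print(err)
```

In this model the check passes only when the library that ends up being used is at least as new as the one the generated code was built against, which is why a system libprotobuf 2.5 trips it.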

A bug has been filed against DCGM to change how it links with Protobuf to avoid this issue.

samcmill commented 5 years ago

A workaround for this problem is to build XALT with the --with-staticLibs configure option (available in XALT 2.4 and later). This links XALT with DCGM statically.

rtmclay commented 5 years ago

Closing this issue