Closed (apeforest closed this 4 years ago)
Saw this on email list and got curious...
Looks like problem is probably this commit: https://github.com/apache/incubator-mxnet/commit/4ed14e2b749743a014121f57b265675fa7b4c06d#diff-875aa4c013dbd73b044531e439e8afdd
Basically MXAPIHandleException used to be defined inline in the header file, so all consumers had to do was #include <mxnet/c_api_error.h>. But as of 3 days ago it is not an inline function anymore, meaning that consumers need to make sure to link against c_api_error.o to get the symbol.
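To make the inline versus non-inline difference concrete, here is a minimal single-file sketch. The names HandleInline and HandleOutOfLine are hypothetical stand-ins, not MXNet's actual code, and the "header" and "library" parts are collapsed into one file only so the sketch compiles on its own.

```cpp
#include <exception>
#include <stdexcept>

// "Before": the handler is defined inline in the header. Every translation
// unit that includes the header carries its own definition, so the consumer
// needs nothing extra from the library at link time.
inline int HandleInline(const std::exception& e) {
  (void)e;    // a real handler would record e.what() somewhere
  return -1;  // error code returned to the C API caller
}

// "After": the header only declares the function...
int HandleOutOfLine(const std::exception& e);

// ...and the definition lives in a separate .cc file that is compiled into
// the shared library. A consumer now depends on the library exporting this
// symbol; if it is missing, the result is an "undefined symbol" error.
int HandleOutOfLine(const std::exception& e) {
  (void)e;
  return -1;
}
```

In a real build the last definition would sit in its own source file, which is why the object file (and the exported symbol) suddenly matters to downstream projects.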
I don't know enough about the build system that produces these nightly builds (does it use the CMake one or the Makefile one?), but my hunch is that either c_api_error.o is not getting built into libmxnet.so, or it is, but it is presented to the linker before MXAPIHandleException is used, so that symbol isn't included in libmxnet.so.
@stephenrawls currently a Makefile build is used. You can find it at https://github.com/apache/incubator-mxnet/tree/master/tools/staticbuild
We are working on migrating to the CMake build in the future, though.
@szha
Thanks @stephenrawls for the analysis. Here are the causes of the problem:
1) Horovod uses MX_API_BEGIN() and MX_API_END() from mxnet/c_api_error.h to catch and throw errors in Horovod APIs: https://github.com/horovod/horovod/blob/master/horovod/mxnet/mpi_ops.cc#L224
2) MX_API_BEGIN() is a macro that calls MXAPIHandleException: https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/c_api_error.h#L36
3) Before #17128, MXAPIHandleException was an inline function. Therefore, when #17128 introduced a new function call, NormalizeError(), inside MXAPIHandleException, it broke the Horovod integration because the NormalizeError symbol is not whitelisted by the MXNet distribution.
4) #17208 removed NormalizeError() from MXAPIHandleException and made it not inline: https://github.com/apache/incubator-mxnet/pull/17208/files#diff-875aa4c013dbd73b044531e439e8afddR67. This time the error becomes an undefined symbol for MXAPIHandleException.
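As a rough illustration of the macro pattern described in the list above, here is a simplified sketch. DEMO_API_BEGIN, DEMO_API_END, DemoHandleException, demo_op and g_last_error are all hypothetical names, and the exact shape of MXNet's real macros is assumed rather than copied from the header.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical error slot, standing in for whatever the library uses to
// remember the last error message.
static std::string g_last_error;

// Hypothetical handler: records the message and returns a C-style error code.
int DemoHandleException(const std::exception& e) {
  g_last_error = e.what();
  return -1;
}

// Simplified begin/end pair: the body between them runs inside a try block,
// and any std::exception is turned into an error code by the handler.
#define DEMO_API_BEGIN() try {
#define DEMO_API_END()                 \
  } catch (const std::exception& e) {  \
    return DemoHandleException(e);     \
  }                                    \
  return 0;

// A consumer-side function written the way an extension might use the macros.
int demo_op(bool fail) {
  DEMO_API_BEGIN();
  if (fail) throw std::runtime_error("allreduce failed");
  DEMO_API_END();
}
```

Because the macro expands inside the consumer's own source file, the call to the handler is compiled into the consumer, which is why the handler (or anything an inline handler calls) has to be resolvable from the library at load time.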
So to summarize, the problem is not that Horovod requires the MXAPIHandleException function to be inline. The root cause is that MXNet did not export the symbol *MXAPIHandleException* in its whitelist, only the symbols that are used inside the MXAPIHandleException function. That was okay while MXAPIHandleException was inline, but became a problem once it was not. A good practice is to whitelist the symbol *MXAPIHandleException* instead of its internals.
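The idea of exporting the entry point rather than its internals can be sketched in C++ as well. Note that this uses a different mechanism from the symbol whitelist discussed in this thread: GCC/Clang visibility attributes. All names here (DEMO_EXPORT, DEMO_HIDDEN, LibSetLastError, LibHandleException, LibLastError) are hypothetical.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Visibility attributes are a GCC/Clang feature; on other compilers these
// macros expand to nothing, so the sketch still compiles.
#if defined(__GNUC__)
#define DEMO_EXPORT __attribute__((visibility("default")))
#define DEMO_HIDDEN __attribute__((visibility("hidden")))
#else
#define DEMO_EXPORT
#define DEMO_HIDDEN
#endif

static std::string g_lib_last_error;

// Internal detail: hidden, so it can be renamed or removed without breaking
// downstream consumers of the shared library.
DEMO_HIDDEN void LibSetLastError(const char* msg) { g_lib_last_error = msg; }

// Public entry point: this is the symbol consumers are meant to link
// against, so it is the one that gets exported.
DEMO_EXPORT int LibHandleException(const std::exception& e) {
  LibSetLastError(e.what());
  return -1;
}

// Exported accessor so callers can read the recorded message.
DEMO_EXPORT const char* LibLastError() { return g_lib_last_error.c_str(); }
```

With this layout, changing what the handler does internally does not change the set of symbols a consumer needs, which is the property the whitelist fix is after.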
@szha I will create a PR to fix this.
Description
Cannot run Horovod with the latest nightly wheel. This could mean the 1.6 release has the same problem too. The last working nightly wheel was from 12/30/2019.
Error Message
To Reproduce
Steps to reproduce
What have you tried to solve it?
Environment