Closed sammy-tri closed 8 years ago
can you point me to the failure on drake005 so I can take a look?
IIRC it was here:
mv: rename nlopt.scm to nlopt.scm.in: No such file or directory make[5]: *\ [nlopt.scm.in] Error 1
i suspect that drake005 is catching a real bug on that one. drake005 has lots of cores, and uses all of them in a parallel build. It is often the machine that catches missing dependencies in cmake files. On quick inspection, this looks like one of them.
I don't think this is likely to be cmake related (though it might have something to do with running in parallel). NLopt compilation is entirely inside its own autoconf world... It's invoked by cmake as part of the superbuild, but while my knowledge of the superbuild is admittedly limited, I'm not sure what cmake would be doing inside the NLopt compilation.
(From my phone, and from memory) The build failed on a missing file (.scm?). That file is autogenerated earlier in the build. But i suspect the dependency between the generator and the consumer is not written properly?
My read-through of the Makefile.am didn't turn up any missing dependencies. From inspection and experimentation, I believe the swig call on L46 generates nlopt.scm
immediately prior to the next line trying to rename it?
I was wondering if it actually the problem was that the drake005 filesystem didn't work? That the swig output was lost because the disk was full?
Or maybe it was that swig was bombing for some other reason? Perhaps an input file was missing, though its dependencies on L45 look pretty tight to me.
Or maybe --without-guile
is actually broken in nlopt upstream? Does turning that off make swig/Makefile.am doomed? Edit: Though I guess this would be contraindicated by other platforms building it successfully.
Second edit: These ideas are all wrong. See below
Aha (I think?)!
The problem may be in nlopt's Makefile.am
rule.
The same rule declares dual targets, nlopt-guile.cpp
and nlopt.scm.in
. When under -j2
or whatever, make
will the rule twice in parallel -- invoking two instances of swig
on L46:
swig -I../api -outdir . -c++ -guile -scmstub -o nlopt-guile.cpp ./nlopt.i
swig -I../api -outdir . -c++ -guile -scmstub -o nlopt.scm.in ./nlopt.i
Each program is writing the same two output files, so when the two instances are run in parallel, they can fight and overwrite each others' files.
For now, perhaps CMake should avoid passing along the jobserver when building nlopt. Once someone confirms me on this, we should with it with nlopt upstream.
http://www.gnu.org/savannah-checkouts/gnu/automake/manual/html_node/Multiple-Outputs.html would seem to be the solution on point, to keep it as one rule.
Probably simpler, though, is to have one rule and command for -c++
and a separate second set for -guile
.
Nice find. That's exactly the type of thing Drake005 has been catching for us
--without-guile is already in our CMakeLists.txt nlopt_BUILD_COMMAND, as is disabling every other external binding I could find (--without-matlab --without-python --without-octave --without-guile). The theory that upstream is broken in that it's building swig related dependencies when it doesn't need seem to use them (it looks like nlopt-guile.cpp is created and not compiled, for example).
How did you verify that swig was being run mulltiple times in parallel builds? When compiling locally, even with large-ish values of -j (say, -j10) I still only see the following bit once:
swig -I../api -outdir . -c++ -guile -scmstub -o nlopt-guile.cpp ./nlopt.i ../api/nlopt.hpp:122: Warning 401: Nothing known about base class 'std::runtime_error'. Ig nored. ../api/nlopt.hpp:127: Warning 401: Nothing known about base class 'std::runtime_error'. Ignored. ../api/nlopt.hpp:296: Warning 503: Can't wrap 'operator =' unless renamed to a valid identifier. rm -f nlopt.scm.in mv nlopt.scm nlopt.scm.in
I'm running: drake-distro% make VERBOSE=1 nlopt -j10
without VERBOSE=1 I don't see anything about the swig or rm invocations. (like when it runs on drake005... Should we consider turning on more verbose output there?) It's also possible that different cmake/make installations on different OSX installs produce different results (as in the case here where the travis builder had no problems and drake005 does).
It would be helpful to have a local reproduction scenario, but I will not be surprised if this is not possible.
Further, I'm unclear on the mechanist of your proposed resolution. I've read the doc linked above, and I believe you're proposing having nlopt.scm.in depend on nlopt-guild.cpp and chaning the rules so that the parallel builder does nothing wrong. Which is fine... Is the proposal that I personally fork nlopt on github until we've (1) confirmed that this is the problem and (2) gotten it merged upstream?
Probably simpler, though, is to have one rule and command for -c++ and a separate second set for -guile.
Actually, I do not understand. -c++ is "enable c++ processing" and occurs on both the guile and python command lines in swig/Makefile.am. It seems to have nothing to do with the part of the build with "rm -f nlopt.scm.in; mv nlopt.scm nlopt.scm.i"?
Also, I have no idea why the rm/mv step occurs at all, so if your Makefile.am spelunking has shed any light on that step, it would probably also be good to know.
I will work though this offline with @sammy-tri.
I've reproduced the issue locally (fresh checkout of NLopt not using the superbuild, compiled with -j a second time so that ccache massively speeds up the compilation).
@jwnimmer-tri is correct that splitting up the build rules makes this problem go away.
Created a PR for NLopt upstream: https://github.com/stevengj/nlopt/pull/62
Hi Sammy,
Russ mentioned that you are working on porting inverse kinematics code to C++. I worked on that before. if you have any questions, please email me at daihongkai@gmail.com. We can meet at TRI or MIT to explain what the inverseKinTraj is doing also.
Best, Hongkai
While getting #1962 integrated, NLopt had to be disabled on apple builds, as it was working in the travis build and failing on drake005. Once drake005 is out of the picture, retest under jenkins and re-enable (or fix the build).