RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu
Other
3.25k stars 1.25k forks source link

runAtlasWalking segfaults on my mac #2165

Closed RussTedrake closed 7 years ago

RussTedrake commented 8 years ago

inside of the instantaneousQPController .. this is a high priority as it could impact the onboarding of valkyrie next week. @rdeits , @patmarion , @gizatt , @tkoolen -- can you see if you can reproduce?

[  7] 0x0000000000000000                                   <unknown-module>+00000000
[  8] 0x0000000132635c10 /Users/russt/drake-distro/drake/systems/controllers/constructQPDataPointerMex.mexmaci64+00039952 _ZN25InstantaneousQPControllerC2ENSt3__110unique_ptrI13RigidBodyTreeNS0_14default_deleteIS2_EEEERKNS0_12basic_stringIcNS0_11char_traitsIcEENS0_9allocatorIcEEEE+00000736
[  9] 0x00000001326356ae /Users/russt/drake-distro/drake/systems/controllers/constructQPDataPointerMex.mexmaci64+00038574 mexFunction+00000862
[ 10] 0x0000000111345cc5 /Applications/MATLAB_R2015b.app/bin/maci64/libmex.dylib+00068805 mexRunMexFile+00000085
patmarion commented 8 years ago

Is the segfault repeatable every time? I just ran the runAtlasWalking test several times on my macbook without crashes using master from April 2nd. Now I'll update to today's master and test again.

RussTedrake commented 8 years ago

it’s repeatable every time for me.

patmarion commented 8 years ago

I compiled current master, still no crash for me after two trials. Mac OSX 10.10 (Yosemite). Matlab 2014b.

Enabled options:

WITH_AVL:BOOL=ON WITH_BOT_CORE_LCMTYPES:BOOL=ON WITH_BULLET:BOOL=ON WITH_DIRECTOR:BOOL=ON WITH_EIGEN:BOOL=ON WITH_GOOGLETEST:BOOL=ON WITH_GUROBI:BOOL=ON WITH_LCM:BOOL=ON WITH_LIBBOT:BOOL=ON WITH_NLOPT:BOOL=ON WITH_SEDUMI:BOOL=ON WITH_SIGNALSCOPE:BOOL=ON WITH_SNOPT:BOOL=ON WITH_SPOTLESS:BOOL=ON WITH_SWIGMAKE:BOOL=ON WITH_SWIG_MATLAB:BOOL=ON WITH_XFOIL:BOOL=ON WITH_YAML_CPP:BOOL=ON

$ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -v Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn) Target: x86_64-apple-darwin14.5.0 Thread model: posix

rdeits commented 8 years ago

It also crashes for me (El Capitan, 2015b). Just before the crash, it reports:

objc[63037]: Class JavaLaunchHelper is implemented in both /Applications/MATLAB_R2015b.app/sys/java/jre/maci64/jre/bin/java and /Applications/MATLAB_R2015b.app/sys/java/jre/maci64/jre/lib/jli/libjli.dylib. One of the two will be used. Which one is undefined.
rdeits commented 8 years ago

This change inexplicably fixes the issue for me: https://github.com/rdeits/drake/commit/6de25e992a321521af21cb75ac90314e248409e4

patmarion commented 8 years ago

That looks sketchy :) If you compile in debug mode can you get a line number for the crash? Is it reproducible in any c++ only (non-matlab) tests?

rdeits commented 8 years ago

Actually, the crash doesn't occur if I compile in debug mode. And the only c++-only version of this code lives inside oh-distro and doesn't compile on OSX.

patmarion commented 8 years ago

Perhaps the c++ test could be run under valgrind? Sometimes things that crash in release mode but not in debug mode will at least produce warnings in valgrind.

RussTedrake commented 8 years ago

@rdeits -- thanks for looking into this. agreed that this is very sketchy. should we merge your fix or investigate further?

RussTedrake commented 8 years ago

possibly (probably?) related -- https://github.com/RobotLocomotion/drake/pull/2376 and https://drake-cdash.csail.mit.edu/viewTest.php?onlydelta&buildid=21003 , which was a seemingly unrelated change. it seems very likely that we have a memory ghost in instantaneousQP .

RussTedrake commented 8 years ago

adding @ggould-tri @jamiesnape to the conversation

david-german-tri commented 8 years ago

There are segfaults in constructQPDataPointerMex on Linux as well, in Debug mode only:

https://drake-cdash.csail.mit.edu/viewTest.php?onlyfailed&buildid=76358

tkoolen commented 8 years ago

On OSX in Release mode, I get a segfault with:

[  8] 0x0000000186a36ec0 /Users/twan/code/drake-distro/drake/systems/controllers/constructQPDataPointerMex.mexmaci64+00036544 _ZN25InstantaneousQPControllerC2ENSt3__110unique_ptrI13RigidBodyTreeNS0_14default_deleteIS2_EEEERKNS0_12basic_stringIcNS0_11char_traitsIcEENS0_9allocatorIcEEEE+00000736
[  9] 0x0000000186a3695e /Users/twan/code/drake-distro/drake/systems/controllers/constructQPDataPointerMex.mexmaci64+00035166 mexFunction+00000862

I haven't investigated that segfault further. Instead, I tried running in Debug mode, where I didn't get the segfault above, but instead got:

[ 11] 0x00000001b988c557 /Users/twan/Library/Caches/CLion2016.2/cmake/generated/drake-8d5e0e9a/8d5e0e9a/Debug/systems/controllers/libdrakeZMPUtil.dylib+00267607 _ZN5Eigen9DenseBaseINS_5BlockINS_6MatrixIdLin1ELin1ELi0ELin1ELin1EEELi1ELin1ELb0EEEE6resizeEll+00000167

in the matlab dump and

Assertion failed: (rows == this->rows() && cols == this->cols() && "DenseBase::resize() does not actually allow to resize."), function resize, file /Users/twan/code/drake-distro/build/install/include/eigen3/Eigen/src/Core/DenseBase.h, line 257.

on the command line.

This was due to a bug in zmpUtil.cpp (that may have crept in after the bug that this issue started with). We check that the degree of all of the segments of the PiecewisePolynomial zmp_trajectory is d (so all its segments have k = d + 1 coefficients), which is good. Then we compute

PiecewisePolynomial<double> zbar_pp = zmp_trajectory - zmp_tf;

where zmp_tf is a constant matrix. One would think that the degrees of the segments of zbar_pp would be the same as those of zmp_trajectory, but the operator-= for Polynomial calls MakeMonomialsUnique, which erases monomials with coefficients equal to zero, so it actually can change the degree of the polynomial as returned by the GetDegree method, as well as the size of the vector of coefficients. Later on we were doing the following:

MatrixXd poly_coeffs = MatrixXd::Zero(nq, k);
...
auto poly_mat = zbar_pp.getPolynomialMatrix(j);
...
poly_coeffs.row(x) = poly_mat(x).GetCoefficients().transpose();

But the left hand side was a bigger row vector than the right hand side in that assignment due to the erased monomials.

@manuelli and @siyuanfeng-tri could potentially be affected by this bug.

After making some changes to handle this case better, I was able to get to the end of the plan without crashing Matlab. I'll submit a pull request to fix this part soon. The test still doesn't pass though; now I get the following matlab error:

Undefined function or variable 'smooth'.

Error in Biped/plotWalkingTraj (line 23)
  x_smooth(i,:) =
  smooth(x_breaks(i,:),15,'lowess');

Error in Biped/simulateWalking (line 31)
[com, rms_com] = obj.plotWalkingTraj(ytraj,
walking_plan_data);

Error in Biped/runWalkingDemo (line 52)
[ytraj, com, rms_com] =
obj.simulateWalking(walking_plan_data,
walking_options);

Error in runAtlasWalking (line 30)
r.runWalkingDemo(walking_options);
tkoolen commented 8 years ago

matlab error: I just didn't have the curve fitting toolbox.

tkoolen commented 8 years ago

Regarding the segfault in release mode, it indeed appears that this happens when param_sets gets move-assigned (i.e. assigned from an rvalue), but not when it's being copy-assigned from an lvalue (as demonstrated by https://github.com/rdeits/drake/commit/6de25e992a321521af21cb75ac90314e248409e4). There are real differences between the implementation of a std::map's move assignment operator and its copy assignment operator, so it's not that weird that there is a difference in behavior.

I haven't figured it out completely yet, but it could be this: https://stackoverflow.com/questions/38403189/move-assigning-a-stdmap-with-an-aligned-value-type-segfaults.

In any case, there are some pretty pervasive Eigen alignment issues that should be resolved regardless:

Just fixing the issues above didn't get rid of the segfault though. I'll continue later.

tkoolen commented 8 years ago

I can reproduce the minimal example in https://stackoverflow.com/questions/38403189/move-assigning-a-stdmap-with-an-aligned-value-type-segfaults on my machine, and it only results in a segfault in release mode, not in debug, consistent with what we're seeing.

jamiesnape commented 8 years ago

Can you reproduce the issue with -O1 or -O2 compiler flags (Release sets -O3)?

tkoolen commented 8 years ago

For the minimal example: fine with O0, bad with O1, O2 and O3.

I also created the following example involving Eigen:

#include <iostream>
#include <map>
#include <Eigen/Core>

using Eigen::Vector4d;

struct Foo {
  double a;
  Vector4d b; // fixed size vectorizable type
  EIGEN_MAKE_ALIGNED_OPERATOR_NEW;
};

template <typename Key, typename T>
using eigen_aligned_map =
std::map<Key, T, std::less<Key>, Eigen::aligned_allocator<std::pair<Key const, T> > >;

int main() {
//  std::map<int, Foo> m{{0, {1., Vector4d::Random()}}};
//  std::map<int, Foo> n{{1, {1., Vector4d::Random()}}};

  eigen_aligned_map<int, Foo> m{{0, {1., Vector4d::Random()}}};
  eigen_aligned_map<int, Foo> n{{1, {1., Vector4d::Random()}}};

  std::cout << "GOT HERE" << std::endl;
  m = std::move(n);  //  segfault happens here
  std::cout << "GOT HERE TOO" << std::endl;
}

Strangely enough, for this second example, things are fine for O0 and O1, but bad for O2 and O3, regardless of whether I use std::map or eigen_aligned_map. So following Eigen's standard rules for using fixed-size vectorizable types in STL containers (https://eigen.tuxfamily.org/dox-devel/group__TopicStlContainers.html) does not work for this case.

I'm not sure whose bug this is (Eigen's or Clang's). In any case, we can work around it by using the copy-assignment operator instead, as @rdeits noticed already.

rdeits commented 8 years ago

Wow, nice find! I'm astonished that my stupid hack actually did something relevant.

RussTedrake commented 7 years ago

super excited to see the progress on this! thanks @tkoolen !!

tkoolen commented 7 years ago

Update on the move assignment bug report I submitted to Apple:

Engineering has provided the following feedback regarding this issue: Please pass the following on to the developer. There is undefined behavior (UB) in std::map that seems to hit in this case. We are already tracking this issue.