borglab / gtsam

GTSAM is a library of C++ classes that implement smoothing and mapping (SAM) in robotics and vision, using factor graphs and Bayes networks as the underlying computing paradigm rather than sparse matrices.
http://gtsam.org

Factors for motion-only bundle adjustment #907

Closed: JeffR1992 closed this issue 2 years ago

JeffR1992 commented 2 years ago

Feature

I've read a few VIO/VSLAM papers (e.g. ORB-SLAM2 and VINS-Mono, to name two well-known implementations) that mention using "motion-only bundle adjustment" as part of their pipelines. This essentially optimizes the camera pose against known landmark positions (i.e. triangulated landmark positions are kept constant during the optimization). It seems to produce better results than PnP (which uses only the previous and current frames, also with constant triangulated landmark positions), as it can take landmark data associations over more than two frames into account.

Motivation

To speed up VIO/VSLAM pipelines, it would be great if gtsam included a factor that allowed users to perform motion-only BA (similar to the EdgeStereoSE3ProjectXYZOnlyPose and EdgeSE3ProjectXYZOnlyPose edge types provided by g2o).

Alternatives

As my VSLAM pipeline uses a stereo setup, I've tried using the standard GenericStereoFactor provided by gtsam with noise/uncertainty set to zero in the noise model, but this doesn't seem to produce any speed benefits. Are there alternatives to GenericStereoFactor that would allow me to perform motion-only BA? I've had a look at the NonlinearEquality class but it seemed like this might not be what I'm looking for.
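
For concreteness, one alternative to the zero-noise trick is a small custom unary factor that stores the triangulated landmark as a constant, so only the pose enters the optimization. Below is a minimal mono-pinhole sketch; PoseOnlyProjectionFactor is a hypothetical name for illustration, not an existing GTSAM class:

#include <gtsam/geometry/Cal3_S2.h>
#include <gtsam/geometry/PinholePose.h>
#include <gtsam/nonlinear/NonlinearFactor.h>

using namespace gtsam;

// Hypothetical unary factor: reprojection error of a landmark that is held
// constant, so only the Pose3 is optimized (the landmark never gets a key).
class PoseOnlyProjectionFactor : public NoiseModelFactor1<Pose3> {
  Point2 measured_;        // observed pixel
  Point3 landmark_;        // triangulated point, kept fixed
  Cal3_S2::shared_ptr K_;  // fixed intrinsics

 public:
  PoseOnlyProjectionFactor(Key poseKey, const Point2& measured,
                           const Point3& landmark,
                           const Cal3_S2::shared_ptr& K,
                           const SharedNoiseModel& model)
      : NoiseModelFactor1<Pose3>(model, poseKey),
        measured_(measured), landmark_(landmark), K_(K) {}

  Vector evaluateError(const Pose3& pose,
                       boost::optional<Matrix&> H = boost::none) const override {
    // PinholePose keeps the calibration fixed, so the Jacobian H is 2x6,
    // w.r.t. the pose only
    PinholePose<Cal3_S2> camera(pose, K_);
    return camera.project(landmark_, H, boost::none, boost::none) - measured_;
  }
};

With one such factor per tracked landmark, e.g. graph.emplace_shared<PoseOnlyProjectionFactor>(Symbol('x', 0), uv, point, K, noise);, the graph contains a single Pose3 variable and the linear solve stays tiny.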

dellaert commented 2 years ago

I forget whether there is a specific factor for this, but with expressions this should be fairly easy. Something like:

Point3_ known_point1(Point3(1,2,3));  // landmarks enter as constants
Point3_ known_point2(Point3(...));
...
Expression<Camera> camera1(Key(1));  // the only unknown in the graph

...
ExpressionFactorGraph fg;
// project2 composes the measurement function; measured1, measured2 and
// noise (the observed pixels and the noise model) are assumed to exist
fg.addExpressionFactor(project2(camera1, known_point1), measured1, noise);
fg.addExpressionFactor(project2(camera1, known_point2), measured2, noise);
...
JeffR1992 commented 2 years ago

Hmm, I'm getting some errors when trying my hand at using Expressions. This is what I have so far:

    gtsam::ExpressionFactorGraph expression_graph;

    gtsam::Expression<gtsam::StereoCamera> stereo_camera(gtsam::Key(0));
    gtsam::Point3 landmark_0(gtsam::Point3(1.5, 2.3, 2.43));

    gtsam::SharedIsotropic noise_model = gtsam::noiseModel::Isotropic::Sigma(2, 5.0);
    expression_graph.addExpressionFactor(stereo_camera, landmark_0, noise_model);

which is currently giving me compilation errors. I've looked through the gtsam header files for more info, and also searched around the internet for well-defined examples that use Expressions and ExpressionFactors for motion-only BA, but unfortunately I haven't been able to find anything.

I feel like my issue lies with gtsam::Expression<gtsam::StereoCamera> stereo_camera(gtsam::Key(0)); so I thought I'd ask some questions regarding this:

1) When instantiating the stereo_camera Expression object, I'm assuming I need to somehow give this object a camera pose in the world frame as a gtsam::Pose3 object. It also seems like I'd need to give the stereo_camera object an intrinsic calibration matrix (e.g. a gtsam::Cal3_S2Stereo object) to make sure the landmark is within the camera's FOV. How does gtsam::Key(0) link to a camera's intrinsics and its pose in the world frame?

2) Looking at the explanation of addExpressionFactor(), it expects an Expression that implements a measurement function h(x); this would be the stereo_camera Expression, since it implements a project() measurement function. However, the StereoCamera class implements several different measurement functions (e.g. project(), backproject(), etc.), and I'm wondering how to tell addExpressionFactor() which one to use: I can't call stereo_camera.project(), since stereo_camera is an Expression over a StereoCamera, not a StereoCamera object itself.

It would be great if an example of using Expressions to perform motion-only BA could be added to the list of examples, as it seems like a lot of the state-of-the-art VIO and VSLAM algorithms make use of this technique.

dellaert commented 2 years ago

Hey @JeffR1992 , maybe you should be the one to create that example, perhaps not with stereo-cameras but just regular cameras, to make things easier. Perhaps that will also help you grok expressions in a simpler case. It should be a straightforward modification (in a different file) of the current SFM+Expressions example...
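
A sketch of what such a mono example might look like, using the expression helpers in gtsam/slam/expressions.h (the calibration and pixel values below are made up for illustration):

#include <gtsam/geometry/Cal3_S2.h>
#include <gtsam/geometry/Pose3.h>
#include <gtsam/nonlinear/ExpressionFactorGraph.h>
#include <gtsam/nonlinear/LevenbergMarquardtOptimizer.h>
#include <gtsam/nonlinear/Values.h>
#include <gtsam/slam/expressions.h>

using namespace gtsam;

int main() {
  ExpressionFactorGraph graph;

  // The only unknown: the camera pose at Key 0
  Pose3_ pose_(Key(0));

  // Intrinsics and landmark enter as constants, so neither is optimized
  Cal3_S2_ K_(Cal3_S2(500.0, 500.0, 0.0, 320.0, 240.0));
  Point3_ landmark_(Point3(1.5, 2.3, 2.43));

  // Measurement function h(x): world point -> camera frame -> normalized
  // image point -> pixel; composing expressions selects which functions run
  Point2_ prediction_ = uncalibrate(K_, project(transformTo(pose_, landmark_)));

  auto noise = noiseModel::Isotropic::Sigma(2, 1.0);
  Point2 measured(315.0, 242.0);  // illustrative pixel measurement
  graph.addExpressionFactor(prediction_, measured, noise);
  // ... one factor per observed landmark ...

  Values initial;
  initial.insert(Key(0), Pose3());

  Values result = LevenbergMarquardtOptimizer(graph, initial).optimize();
  result.at<Pose3>(Key(0)).print("optimized pose:\n");
  return 0;
}

Because only the Pose3 is keyed, the intrinsics and landmarks stay fixed, which also addresses the two questions above: Values supplies whatever a Key refers to, and composing transformTo/project/uncalibrate is how the measurement function is selected.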

catproof commented 2 years ago

@JeffR1992 did you end up figuring this out? :)

JeffR1992 commented 2 years ago

Unfortunately I haven't had time to do this. However, coming back to this after reading a few more papers, it sounds like motion-only BA is equivalent to structureless BA, in which case the paper "On-Manifold Preintegration for Real-Time Visual-Inertial Odometry", which states that it uses "structureless vision factors", is useful to read. @dellaert are "structureless vision factors" equivalent to the SmartStereoProjectionPoseFactor types found in gtsam_unstable/slam? If so, would you advise using these smart factors instead of going down the expression factor route?

dellaert commented 2 years ago

Sure. This is what several people do, although the students/postdocs doing it here at Georgia Tech have long since moved on :-) Some more papers from those folks:

  1. Mining Structure Fragments for Smart Bundle Adjustment, Luca Carlone, Pablo Fernández Alcantarilla, Han-Pang Chiu, Zsolt Kira, and Frank Dellaert, British Machine Vision Conference (BMVC), 2014
  2. Eliminating Conditionally Independent Sets in Factor Graphs: A Unifying Perspective based on Smart Factors, Luca Carlone, Zsolt Kira, Chris Beall, Vadim Indelman, and Frank Dellaert, IEEE International Conference on Robotics and Automation (ICRA), 2014
  3. Incremental Light Bundle Adjustment for Robotics Navigation, Vadim Indelman, Andrew Melim, and Frank Dellaert, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2013
  4. Probabilistic Analysis of Incremental Light Bundle Adjustment, Vadim Indelman, Richard Roberts, and Frank Dellaert, IEEE Workshop on Robot Vision (WoRV), 2013

See https://dellaert.github.io/publications/
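
For readers who land here, a minimal sketch of the smart-factor route using the mono SmartProjectionPoseFactor (the stereo variant SmartStereoProjectionPoseFactor mentioned above lives in gtsam_unstable); calibration and pixel values are illustrative:

#include <gtsam/geometry/Cal3_S2.h>
#include <gtsam/inference/Symbol.h>
#include <gtsam/nonlinear/NonlinearFactorGraph.h>
#include <gtsam/slam/SmartProjectionPoseFactor.h>

using namespace gtsam;

typedef SmartProjectionPoseFactor<Cal3_S2> SmartFactor;

int main() {
  // Shared, fixed calibration: smart factors optimize poses only
  Cal3_S2::shared_ptr K(new Cal3_S2(500.0, 500.0, 0.0, 320.0, 240.0));
  auto noise = noiseModel::Isotropic::Sigma(2, 1.0);

  NonlinearFactorGraph graph;

  // One smart factor per landmark: the 3D point is triangulated and
  // marginalized out internally, so it never appears as a variable
  // ("structureless")
  SmartFactor::shared_ptr factor(new SmartFactor(noise, K));
  factor->add(Point2(310.0, 245.0), Symbol('x', 0));  // observation from pose x0
  factor->add(Point2(315.0, 242.0), Symbol('x', 1));  // same landmark from x1
  graph.push_back(factor);

  return 0;
}

Each smart factor collects all observations of one landmark across poses and eliminates the point via a Schur complement at linearization time, so the graph only ever contains pose variables.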

JeffR1992 commented 2 years ago

Thanks for the links to those papers, I've read (2) which was great, but I'll take a look at the others as well when I get some time.

Apologies if I didn't fully understand your previous reply @dellaert . When you said "although the students/postdocs doing it here at Georgia Tech have long since moved on", were you implying that the use of smart factors for motion-only/structureless BA with GTSAM is not regarded as a "best practice" anymore, and that there are better ways to perform this (e.g. expression factors that you mentioned originally)? Or are you saying that the students/postdocs that worked on smart factors for the GTSAM library have simply graduated or left the lab, but that smart factors are still the best way to go for motion-only/structureless BA? Thanks!

dellaert commented 2 years ago

Ha :-) The latter: the students/postdocs literally moved to other places. The ideas are still valid and are still being actively used, e.g., @lucacarlone and his team have recently added new smart factor variants to deal with rolling shutter etc.

JeffR1992 commented 2 years ago

Haha, thanks for clearing that up, I'll take a look at smart factors and see if I can get something to work. Thanks for your suggestions!

dellaert commented 2 years ago

OK, I will then close this issue for now.

GabiShahmayster commented 1 year ago

Hi @JeffR1992, I'm interested in implementing motion-only BA using GTSAM as well. Were you able to figure out how to use smart factors (structure-less BA) for the task? Thanks, Gabi

laxnpander commented 1 year ago

@GabiShahmayster I have written a motion-only BA for my own framework. There are some custom types in this code snippet, but it should be straightforward to work out. I have another implementation in ceres, which does the same thing and the results are the same. So it should be somewhat legit. This is for the monocular pinhole case only though.

#include <gtsam/geometry/Cal3_S2.h>
#include <gtsam/geometry/Point2.h>
#include <gtsam/geometry/Point3.h>
#include <gtsam/geometry/Pose3.h>
#include <gtsam/inference/Symbol.h>
#include <gtsam/nonlinear/ExpressionFactorGraph.h>
#include <gtsam/nonlinear/LevenbergMarquardtOptimizer.h>
#include <gtsam/nonlinear/Values.h>
#include <gtsam/slam/expressions.h>

using namespace spear::gtsam::motion;

Problem::Problem()
{
  // Noise for 2D features is 1 px in x and y
  m_noise_fts  = ::gtsam::noiseModel::Isotropic::Sigma(2, 1.0);
  m_huber_loss = ::gtsam::noiseModel::Robust::Create(::gtsam::noiseModel::mEstimator::Huber::Create(1.345), m_noise_fts);

  // Noise for 3D pose
  Vector6d sigmas; sigmas << Vector3d(0.1,0.1,0.1), Vector3d(0.05,0.05,0.05);
  m_noise_pose = ::gtsam::noiseModel::Diagonal::Sigmas(sigmas);
}

spear::SE3r Problem::solve()
{
  std::cout << "Solving motion only problem..." << std::endl;
  ::gtsam::Values initial;
  initial.insert(::gtsam::Symbol('t', 0), m_twc);

  std::cout << "Initial error: " << m_graph.error(initial) << std::endl;
  ::gtsam::Values result = ::gtsam::LevenbergMarquardtOptimizer(m_graph, initial).optimize();
  std::cout << "Final error: " << m_graph.error(result) << std::endl;

  auto twc = result.at<::gtsam::Pose3>(::gtsam::Symbol('t', 0));

  return SE3r{twc.matrix()};
}

void Problem::setCamera(const std::shared_ptr<const Camera> &cam)
{
  // Currently only pinhole is supported
  auto pinhole = std::dynamic_pointer_cast<const Pinhole>(cam);

  // Initialise pose and calibration values + expressions
  m_twc  = ::gtsam::Pose3(pinhole->twc().matrix3x4());
  m_twc_ = std::make_unique<::gtsam::Expression<::gtsam::Pose3>>('t', 0);
  m_K    = ::gtsam::Cal3_S2(pinhole->fx(), pinhole->fy(), 0.0, pinhole->cx(), pinhole->cy());
  m_K_   = std::make_unique<::gtsam::Expression<::gtsam::Cal3_S2>>(m_K);

  m_graph.addExpressionFactor(*m_twc_, m_twc, m_noise_pose);
}

void Problem::addObservation(const Vector2d &kpt, const std::shared_ptr<const Landmark> &lm)
{
  // The landmark is a constant in the expression, so only the pose m_twc_
  // is optimized: world point -> camera frame -> normalized image -> pixel
  Vector3d X((double)lm->x(), (double)lm->y(), (double)lm->z());
  ::gtsam::Expression<::gtsam::Point2> prediction_ = ::gtsam::uncalibrate(*m_K_, ::gtsam::project(::gtsam::transformTo(*m_twc_, X)));
  m_graph.addExpressionFactor(prediction_, kpt, m_huber_loss);
}

nnop commented 1 year ago

@JeffR1992 @dellaert Regarding the benefit of motion-only BA over PnP (I think you mean a DLT solver?), I don't think it comes from the fact that:

it is able to take into account landmark data associations over more than 2 frames.

since both take as input the matched 3D map points and the 2D keypoints of the current frame. Rather, I think the benefit comes from using robust nonlinear optimization of the reprojection error instead of solving a linear system. Do you agree?