gaoxiang12 / slambook-en

The English version of 14 lectures on visual SLAM.

Question regarding ch. 14 Semantic SLAM #28

Open samehmohamed88 opened 3 years ago

samehmohamed88 commented 3 years ago

Firstly, I would like to extend my gratitude for your excellent work, dedication, and effort. This book is truly a great resource for people from many different backgrounds.

I have just finished my first reading of the VO chapters, after a deep dive into the Lie Algebra and Nonlinear Optimization chapters.

My first instinct regarding ORB (Oriented FAST) corners was: why not use an object detector such as YOLOv4 or YOLOv5 instead?

I skimmed ahead to Chapter 14 and found the section on Semantic SLAM, so my instinct was not far off. My questions are as follows:

  1. Since YOLO object detection is now quite fast even on mobile devices, while also being accurate, do you think the center of a detected bounding box could serve as a replacement for an Oriented FAST feature point? Maybe the descriptor could even be smaller than a 128-bit binary descriptor, since the detection already carries some semantic meaning? (A rough sketch of what I have in mind follows after these questions.)

  2. I was looking at the ORB_SLAM2 paper and GitHub repository and noticed that the latest commit was 4 years ago, and the more recent OpenVSLAM was discontinued yesterday (Feb 25th, 2021). This leads me to believe that the robotics industry (vacuum robots, drones) is not necessarily using these techniques. Can you please offer some insight on this? Do you believe the trend in industry has already shifted towards deep learning, or are companies possibly using more in-house systems?

  3. Your book was written in 2016, and since then several papers have been published on unsupervised learning of depth and ego-motion, such as SfMLearner and Monodepth2. Do you know of any SLAM systems that use such techniques for the VO step? In your view, has the industry adopted this, or is it not yet mature enough?
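
To make question 1 concrete, here is a rough sketch of what I have in mind. Everything here (the `Detection` struct, using the class id as a tiny "descriptor", and the class-plus-proximity matching rule) is my own assumption for illustration, not code from the book or from any existing detector:

```cpp
// Hypothetical sketch: treating detection centers as "semantic keypoints".
#include <opencv2/core.hpp>
#include <cmath>
#include <utility>
#include <vector>

struct Detection {
  cv::Rect2f box;  // bounding box from a detector such as YOLO
  int class_id;    // semantic label acting as a (very small) "descriptor"
  float score;     // detection confidence
};

struct SemanticKeypoint {
  cv::Point2f center;  // box center used in place of an Oriented FAST corner
  int class_id;
  float score;
};

// Turn detections into candidate keypoints (box centers).
std::vector<SemanticKeypoint> ToKeypoints(const std::vector<Detection>& dets) {
  std::vector<SemanticKeypoint> kps;
  for (const auto& d : dets) {
    cv::Point2f c(d.box.x + 0.5f * d.box.width, d.box.y + 0.5f * d.box.height);
    kps.push_back({c, d.class_id, d.score});
  }
  return kps;
}

// Naive matching: same class id and nearest center within a pixel radius.
std::vector<std::pair<int, int>> Match(const std::vector<SemanticKeypoint>& a,
                                       const std::vector<SemanticKeypoint>& b,
                                       float max_dist_px = 50.0f) {
  std::vector<std::pair<int, int>> matches;
  for (int i = 0; i < static_cast<int>(a.size()); ++i) {
    int best = -1;
    float best_d = max_dist_px;
    for (int j = 0; j < static_cast<int>(b.size()); ++j) {
      if (a[i].class_id != b[j].class_id) continue;
      float d = std::hypot(a[i].center.x - b[j].center.x,
                           a[i].center.y - b[j].center.y);
      if (d < best_d) { best_d = d; best = j; }
    }
    if (best >= 0) matches.emplace_back(i, best);
  }
  return matches;
}
```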

Finally, please allow me to contribute to your work in some way, either financially or with time and effort. I don't see a link for financial contributions, and since I can't read the Chinese version, I would not be able to purchase it. Can you please advise on how I can support your work, either with my time or with some minor funding?

Thanks and keep up the good work!

samehmohamed88 commented 3 years ago

I may have answered part of my question with a Google search for "deep learning for local features", but I would still be very glad if you could comment on my questions based on your experience.

dimaxano commented 3 years ago

Off topic: I would also gladly donate to the author for such a great book!

gaoxiang12 commented 3 years ago

Hi @aboarya, here are just my opinions on your questions:

  1. The common output of detection networks is the bounding box, and most bounding boxes are not accurate at the pixel level. Detection nets are trained and evaluated against box overlap: the mAP metric and the box-regression losses measure how well the detected box overlaps the annotation (IoU), not whether any single pixel is localized precisely; see the small IoU sketch after this list. It is also difficult to annotate boxes at the pixel level. So, if you are interested in using the boxes to estimate the camera pose, take a look at QuadricSLAM and CubeSLAM, where the object inside the box is modeled as an ellipsoid/cube. On the other hand, if you are interested in using deep-learning-based features for SLAM, SuperPoint SLAM may help you.
  2. ORB-SLAM is still being developed: the most recent version is ORB_SLAM3, which adds IMU integration to VSLAM. DSO from TUM also has many variants (Stereo DSO, LDSO, DSO for rolling shutter), but some versions are not open-sourced for commercial reasons. I think it is the same for many other VSLAM systems in industry, like the VSLAM used in many cellphones (ARCore, ARKit, etc.).
  3. Deep learning has been used in SLAM in many ways. You can train an end-to-end VO like DeepVO, learn a depth map from monocular images, or use learning-based features for VO matching, point cloud registration, and many other tasks. Some collections on this topic are also available on GitHub, e.g. deep-learning-localization-mapping. But most of this work is still experimental and not stable enough for industrial use. The networks rely heavily on the image data they were trained on, so a VO trained in indoor environments is probably not suitable for self-driving cars. They are not as general as traditional approaches like ICP or a PnP solver (a small sketch combining a predicted depth map with a classical PnP solver also follows after this list).
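
To make the overlap point in item 1 concrete, here is a minimal IoU (intersection-over-union) sketch. Detection training and evaluation reward this kind of box overlap, which says nothing about whether the box center is repeatable at the pixel level like a corner. The function is only an illustration, not code from any of the systems mentioned:

```cpp
// Minimal IoU between a detected box and its annotation.
#include <opencv2/core.hpp>

double IoU(const cv::Rect2f& detected, const cv::Rect2f& annotation) {
  cv::Rect2f inter = detected & annotation;  // intersection rectangle
  double inter_area = inter.area();
  double union_area = detected.area() + annotation.area() - inter_area;
  return union_area > 0.0 ? inter_area / union_area : 0.0;
}
```

A detector can score a high IoU even when the box center drifts by several pixels between frames, which is why the centers are poor substitutes for pixel-accurate corners.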
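
For item 3, here is a hedged sketch of one common hybrid: take a depth map predicted by some monocular depth network (represented here only as a `cv::Mat` of metric depths; the network itself is assumed, not specified) and feed the lifted 3D points into a standard OpenCV PnP solver:

```cpp
// Sketch: depth from a (hypothetical) monocular depth network + classical PnP.
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

bool PoseFromDepthAndMatches(
    const cv::Mat& depth,                  // HxW float: predicted depth of frame 1
    const std::vector<cv::Point2f>& pts1,  // matched pixels in frame 1
    const std::vector<cv::Point2f>& pts2,  // matched pixels in frame 2
    const cv::Mat& K,                      // 3x3 camera intrinsics (CV_64F)
    cv::Mat& rvec, cv::Mat& tvec) {        // output: pose of frame 2 w.r.t. frame 1
  const double fx = K.at<double>(0, 0), fy = K.at<double>(1, 1);
  const double cx = K.at<double>(0, 2), cy = K.at<double>(1, 2);

  std::vector<cv::Point3f> obj;  // 3D points lifted from frame 1
  std::vector<cv::Point2f> img;  // their observations in frame 2
  for (size_t i = 0; i < pts1.size(); ++i) {
    float d = depth.at<float>(cvRound(pts1[i].y), cvRound(pts1[i].x));
    if (d <= 0.f) continue;                      // skip invalid depth
    obj.emplace_back((pts1[i].x - cx) * d / fx,  // pinhole back-projection
                     (pts1[i].y - cy) * d / fy, d);
    img.push_back(pts2[i]);
  }
  if (obj.size() < 4) return false;              // PnP needs at least 4 points

  std::vector<int> inliers;
  return cv::solvePnPRansac(obj, img, K, cv::Mat(), rvec, tvec,
                            /*useExtrinsicGuess=*/false, /*iterationsCount=*/100,
                            /*reprojectionError=*/4.0f, /*confidence=*/0.99, inliers);
}
```

The geometric part stays classical; only the depth prediction is learned, which is roughly how many of the experimental hybrid systems are structured.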

About the donation, well, I think it is just fine as it is. I'm glad to know you love this book. I have had some income from the Chinese version in recent years (about 30,000 copies, I think). Springer will publish the English version someday, but I'm not sure how long it will take. Maybe you can buy it on Amazon in the future, but for now I only have the Chinese version in paper form.

Also, if you find some interesting books on SLAM/robotics, let me know and I'm willing to do some translation work. Thanks for your support!