Inputs: Video scan sequence of the object and query images
Output: Estimated object's pose
Annotate the object's bounding box and the camera poses from the video scans in AR
Reconstruct a sparse 3D point-cloud model of the object from the video sequence with Structure from Motion (SfM), while also building the 2D-3D correspondence graphs
Train the Graph Attention Networks (GATs) on these graphs to directly predict the 2D-3D correspondence maps
Test the model with a query image as input and predict its 2D-3D correspondence map
Use the map to solve the object pose estimation problem with Perspective-n-Point (PnP), as sketched below
Evaluate the results
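A minimal sketch of the PnP step above in Python, assuming the matching stage has already produced arrays of matched 3D model points and 2D query keypoints together with the camera intrinsics (function and variable names here are illustrative, not OnePose's actual API); OpenCV's solvePnPRansac recovers the camera pose with respect to the object model:

```python
import cv2
import numpy as np

def solve_pose_pnp_ransac(pts_3d, pts_2d, K, dist_coeffs=None):
    """Estimate the camera pose w.r.t. the object model from 2D-3D matches.

    pts_3d: (N, 3) matched points from the SfM point cloud
    pts_2d: (N, 2) corresponding keypoints in the query image
    K:      (3, 3) camera intrinsic matrix
    Returns (R, t, inliers): rotation matrix, translation vector, inlier indices.
    """
    if dist_coeffs is None:
        dist_coeffs = np.zeros(4)  # assume an undistorted (or pre-rectified) image
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d.astype(np.float64), pts_2d.astype(np.float64),
        K.astype(np.float64), dist_coeffs,
        reprojectionError=8.0, iterationsCount=1000, flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP failed: not enough consistent correspondences")
    R, _ = cv2.Rodrigues(rvec)  # axis-angle -> 3x3 rotation matrix
    return R, tvec.reshape(3), inliers.ravel()
```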
Note: Check out the PowerPoint presentation for more information on each step
Brief description of the techniques used:
SfM: technique used to reconstruct the 3D structure of an object from a sequence of 2D images. This is done by using corresponding image points across multiple views, together with the known camera calibration and poses (usually expressed as projection matrices), and reconstructing the corresponding 3D points via triangulation.
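For illustration, a minimal sketch of the triangulation step only (not OnePose's actual SfM pipeline), assuming two views with known 3x4 projection matrices and already-matched image points:

```python
import cv2
import numpy as np

def triangulate(P1, P2, pts1, pts2):
    """Triangulate matched image points from two calibrated views.

    P1, P2:     (3, 4) projection matrices K [R | t] of the two views
    pts1, pts2: (N, 2) corresponding image points
    Returns an (N, 3) array of 3D points.
    """
    # OpenCV expects the points with shape (2, N)
    X_h = cv2.triangulatePoints(P1, P2,
                                pts1.T.astype(np.float64),
                                pts2.T.astype(np.float64))
    X = (X_h[:3] / X_h[3]).T  # de-homogenize: (4, N) -> (N, 3)
    return X
```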
Feature extraction and matching: feature extraction is handled via SuperPoint, while matching is handled via SuperGlue
Focal Loss function: loss optimized during training to find the correct correspondences among the possible matches.
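For reference, a generic binary focal loss in PyTorch; the exact form and weighting used by OnePose to supervise its matching scores may differ from this sketch:

```python
import torch

def focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-8):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    probs:   predicted matching probabilities in [0, 1]
    targets: binary ground-truth labels (1 = correct correspondence), same shape
    Down-weights easy examples so training focuses on hard, ambiguous matches.
    """
    p_t = probs * targets + (1.0 - probs) * (1.0 - targets)      # prob. of the true class
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)  # class-balance weight
    loss = -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=eps))
    return loss.mean()
```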
PnP: technique that determines the position and orientation of a camera given its intrinsic parameters and a set of N correspondences between 3D points and their 2D projections. It yields the camera rotation and translation that best align the 3D points with their 2D projections; this information, combined with the known camera world pose, is then used to derive the object's position and orientation in the world coordinate system.
In this case it is combined with RANSAC, which significantly improves the reliability of the pose estimation by handling outliers in the correspondences, mitigating the impact of erroneous matches between 2D image points and 3D model points.
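Continuing the PnP sketch above: PnP returns the transform from the object's model coordinates to the query camera frame, so the object's pose in world coordinates follows by composing it with the tracked camera-to-world pose. A minimal sketch with illustrative names:

```python
import numpy as np

def to_homogeneous(R, t):
    """Pack a rotation matrix and translation vector into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = np.asarray(t).reshape(3)
    return T

def object_pose_in_world(R_obj_to_cam, t_obj_to_cam, T_cam_to_world):
    """Object pose in world coordinates.

    PnP gives the transform from object (model) coordinates to camera coordinates;
    composing it with the camera-to-world pose (e.g. tracked by the AR session)
    yields the object's position and orientation in the world frame.
    """
    T_obj_to_cam = to_homogeneous(R_obj_to_cam, t_obj_to_cam)
    return T_cam_to_world @ T_obj_to_cam  # T_obj_to_world
```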
Evaluation metrics: the 5cm-5deg metric, which deems a predicted pose correct if the translation error is below 5 cm and the rotation error is below 5°. Following the same definition, this criterion was further tightened to 1cm-1deg and 3cm-3deg to obtain stricter metrics for pose estimation in AR applications.
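These metrics can be computed directly from the ground-truth and predicted poses; a small sketch (translation error in centimeters, rotation error as the geodesic angle between the two rotations), with thresholds passed in so the same function covers the 1cm-1deg, 3cm-3deg and 5cm-5deg criteria:

```python
import numpy as np

def pose_error(R_gt, t_gt, R_pred, t_pred):
    """Return (translation error in cm, rotation error in degrees).

    Translations are assumed to be in meters; the rotation error is the angle
    of the relative rotation R_pred @ R_gt^T.
    """
    trans_err_cm = np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)) * 100.0
    cos_angle = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    rot_err_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return trans_err_cm, rot_err_deg

def pose_is_correct(R_gt, t_gt, R_pred, t_pred, max_cm=5.0, max_deg=5.0):
    """cm-deg criterion: correct only if both errors are below the thresholds."""
    trans_err_cm, rot_err_deg = pose_error(R_gt, t_gt, R_pred, t_pred)
    return trans_err_cm < max_cm and rot_err_deg < max_deg
```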
Strengths:
Model-free, object-agnostic approach
Can detect multiple objects at a time, as it retains information from the old scans (as can be seen in this video)
General model that maintains robustness, precision and speed (about 58 ms for any object instance or category in the validation set)
Applicable to AR
Thanks to its precision, it achieves great results in grasping tasks
Limitations:
OnePose is not very robust at detecting objects that are low-texture or textureless, as it relies on local feature matching (see OnePose++ for how this problem was addressed)
OnePose has some scalability limits, as it struggles to handle extreme changes of scale between the video scans and the test sequence
PowerPoint presentation:
OnePose PowerPoint presentation: OnePose.pptx
See also Dope, as it uses synthetically generated data to train its network: Dope.pptx
Note: I will probably upload a presentation on OnePose++ and improve the one on OnePose in the future