JisuHann commented 3 years ago

Next Best View Prediction

Supervised Learning of the Next-Best-View for 3D Object Reconstruction
Learn-to-Score: Efficient 3D Scene Exploration by Predicting View Utility
Next Best View Planning for Object Recognition in Mobile Robotics
3D Attention-Driven Depth Acquisition for Object Identification

JisuHann commented 3 years ago

1. Supervised Learning of the Next-Best-View for 3D Object Reconstruction

Problem Setup

Application:
- Reconstruction → Recognition (adapt the reconstruction method to recognition)
- Modify the oracle algorithm so that it returns the view with the best recognition probability instead.
  - Aggregation module … if we want analogous baseline for recognition
Givens:
- (training) Oracle, which needs the 3D model
- Partial occupancy grid
  - Only contains the part reconstructed so far
Unknown:
- Next best view
Constraint:
- Additional constraint ⇒ pairwise overlap between adjacent views > threshold
Assumptions:
- (1) Tabletop objects
- (2) Opaque objects

Insights

Trained as a classification problem over a reduced set of discretized poses
- Hypothesis: does reducing the pose set degrade the result?
A learning-based model, supervised by an "oracle" algorithm that the authors designed.
Dataset: each label is the argmax over N finite samples
- A large fixed set of poses over which every single instance is evaluated (labels obtained by brute force)
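
A minimal sketch of the brute-force labeling idea noted above, not the authors' oracle: every candidate pose is scored and the argmax becomes the label. The names `oracle_nbv`, `visible`, `seen` and `min_overlap` are hypothetical, and both the score (number of newly seen surface points) and the overlap measure (fraction of a view's visible points that are already reconstructed) are assumptions made for illustration.

```python
import numpy as np

def oracle_nbv(visible, seen, min_overlap=0.3):
    """Pick the candidate view that adds the most unseen surface points.

    visible: (N, M) boolean array, visible[i, j] is True if candidate
             view i observes surface point j of the ground-truth model.
    seen:    (M,) boolean array of points already reconstructed.
    Returns the index of the best feasible view, or -1 if none is feasible.
    """
    visible = np.asarray(visible, dtype=bool)
    seen = np.asarray(seen, dtype=bool)

    new_points = (visible & ~seen).sum(axis=1)   # gain of each candidate
    counts = visible.sum(axis=1)                 # points seen by each candidate
    shared = (visible & seen).sum(axis=1)        # points shared with the partial model

    # Assumed overlap measure: shared / counts, defined as 0 for empty views.
    overlap = np.divide(shared, counts,
                        out=np.zeros(len(visible)), where=counts > 0)

    feasible = overlap > min_overlap             # overlap constraint
    if not feasible.any():
        return -1
    scores = np.where(feasible, new_points, -1)  # mask infeasible views
    return int(np.argmax(scores))                # brute-force argmax over N poses
```

Running this for every partial model in the training set would give one class label per instance, which is what makes the classification formulation possible.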

JisuHann commented 3 years ago

1. NBV-Net: Supervised Learning of the Next-Best-View for 3D Object Reconstruction

task: 3D object reconstruction, with emphasis on planning (NBV prediction)
- sensor positioning -> sensing (perception) -> registration -> planning for the next sensor location
- Planning for the next sensor location: an optimization problem, a search for the view that maximizes a utility function (information gain)

Idea

: Next-best-view planning scheme based on supervised learning

Automatic generation of the dataset
3D-CNN: used to learn the NBV (sensor pose, posed as a classification problem)
- input: uniform probabilistic occupancy grid

Next-best-view learning problem

1. sensor positioning: the sensor is placed at a sensor pose (position and orientation)
2. sensing (perception): the object surface is measured and a point cloud of the shape is obtained -> registered to a single model, stored in a partial model
  - Uniform probabilistic occupancy grid (each voxel has an associated probability that represents the likelihood that part or all of the object's surface is inside the voxel's volume)
3. registration
4. planning for the next sensor location
  - directly predict the NBV from the information in the partial model, instead of searching for the view that maximizes a metric
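
To make the network input concrete, here is a rough sketch of turning a registered point cloud into a uniform probabilistic occupancy grid. It is an illustration only: the function name `occupancy_grid`, the 32-voxel resolution, the `[-1, 1)` bounds and the probabilities 0.5 (unknown) and 0.9 (surface hit) are all assumptions, and free-space updates along sensor rays are left out.

```python
import numpy as np

def occupancy_grid(points, resolution=32, lo=-1.0, hi=1.0, p_hit=0.9):
    """Build a cubic grid of occupancy probabilities from 3D points.

    Every voxel starts at 0.5 (unknown). Voxels that contain at least
    one measured point are set to p_hit. Points outside [lo, hi) are ignored.
    """
    grid = np.full((resolution, resolution, resolution), 0.5)
    pts = np.asarray(points, dtype=float).reshape(-1, 3)

    # Keep only points that fall inside the grid bounds.
    inside = np.all((pts >= lo) & (pts < hi), axis=1)

    # Map metric coordinates to integer voxel indices.
    idx = np.floor((pts[inside] - lo) / (hi - lo) * resolution).astype(int)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = p_hit
    return grid
```

A 3D-CNN classifier would then take this grid as input and output a score for each discretized sensor pose.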

Results

NBV prediction on several unknown objects reaches a reconstruction coverage higher than 90 percent

JisuHann commented 3 years ago

4. 3D Attention-Driven Depth Acquisition for Object Identification

task: Fine-grained Object classification

Recurrent 3D attention model (3D-RAM)

: 3D attention model for active 3D object identification with multi-view depth acquisition

input: depth image
output:
1. shape classification based on a classification hierarchy
2. NBV (most informative region)
trained on synthetic 3D models

Shape hierarchy

organizes the shape collection with a hierarchy of coarse-to-fine MV-CNN classifiers
each node learns enhanced features that are particularly effective for the fine-grained task at its level (by fine-tuning what it inherits from its parent node)

Part-level attention

attention over the views in 3D space pointing at a 3D object
tolerates object occlusion (robust against partial occlusion)
each node learns focus-driven features

1. View-level attention: MV-RNN

: select NBV for depth acquisition targeting at an object of interest, sequential NBV regression based on RNN

each step:
1. acquires a depth image
2. integrates (aggregates) the order-aware information of all past views
3. conducts the shape classification (see Shape hierarchy)
4. regresses the NBV (see Recurrent attention model for NBV regression)
Recurrent attention model for NBV regression, repeated at each step:
1. input processing
2. view (information) aggregation network & view glimpse network
  - max-pooling based recurrent units are devised to achieve powerful view aggregation
  - the glimpse integration of depth images and view parameters is postponed until after view aggregation
3. action generation: NBV network
  - input: current state of the recurrent network above
  - output: vector of NBV parameters
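
The max-pooling aggregation in step 2 can be illustrated with a stripped-down running maximum over per-view feature vectors. This is only an assumption-laden sketch: the name `max_pool_aggregate` is hypothetical, and the learned weights of the recurrent units in the paper are left out. As a result the final state here does not depend on view order, unlike the order-aware aggregation described above.

```python
import numpy as np

def max_pool_aggregate(view_features):
    """Return the aggregated state after each view.

    view_features: iterable of equal-length feature vectors, one per view.
    The state after step t is the element-wise maximum of features 1..t.
    """
    states = []
    state = None
    for feat in view_features:
        feat = np.asarray(feat, dtype=float)
        # First view initializes the state, later views are max-pooled in.
        state = feat if state is None else np.maximum(state, feat)
        states.append(state)
    return states
```

In the full model, the state after each step would be the input to the NBV network, which regresses the parameters of the next view.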

View-based observation

to achieve continuous view planning, views are chosen over the full viewing space, parameterized on a viewing sphere around a 3D model during training or around an object of interest during testing
input: 2.5D depth image
Output: captured 2.5D depth image for the object of interest
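
A small sketch of what a viewing-sphere parameterization can look like: two angles and a radius are mapped to a camera position that looks at the object center. The function name `view_to_camera`, the (azimuth, elevation) convention with z up, and the object sitting at the origin are assumptions for illustration, not details taken from the paper.

```python
import math

def view_to_camera(azimuth, elevation, radius=1.0):
    """Map viewing-sphere parameters (radians) to a camera pose.

    Returns (position, direction): the camera position on a sphere of the
    given radius around the origin, and the unit vector pointing from the
    camera towards the origin.
    """
    x = radius * math.cos(elevation) * math.cos(azimuth)
    y = radius * math.cos(elevation) * math.sin(azimuth)
    z = radius * math.sin(elevation)
    position = (x, y, z)
    # The camera looks at the object center, i.e. along -position / radius.
    direction = (-x / radius, -y / radius, -z / radius)
    return position, direction
```

Because the two angles are continuous, an NBV network can regress them directly instead of picking from a fixed set of poses.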

2. Region-level attention

: concentrates on the discriminative regions in each view for part-based recognition

instance-level shape classification, after the view aggregation layer

JisuHann commented 3 years ago

2. Recurrent 3D attentional networks for end-to-end active object recognition

Goal

task: object recognition
issues with existing approaches:
1. unobserved views must be sampled and the corresponding data must be synthesized from a learned generative model
2. the object recognition model is typically learned independently of the view planner

Idea

Multi-view depth-based active object recognition using an attention mechanism
trained on a 3D shape dataset --> gives the best views targeting an object of interest for recognizing it
differentiable rendering (the depth image is made differentiable with respect to the viewing parameters) → loss backpropagation
Recurrent 3D attentional architecture
- STN (Conv2D + FC loc): for NBV selection in 3D space. Builds on end-to-end attentional networks that actively predict image locations for 2D object detection and recognition; used here as the localization network
- SC(Conv2D + FC class): for object recognition
- RNN: simultaneous object recognition, modeling the sequential dependencies between consecutive views
loss: cross-entropy loss
training
- pre-train SC (depth image → NBV selection)
- feature extraction from Conv2D
- randomly select a dozen views from depth image → image classification
- joint training 3D-STN, SC and RNN
- train 3D-STN, images → (class label, update to the current view)
input: a view parameterized in the local spherical coordinate system (rendered using a ray casting algorithm, with a random initial view from our selected 50 views)
depth layer: generates the depth image
- bridges the gap between the loss gradient with respect to the depth image and the loss gradient with respect to the camera view
- Ray casting: determining the hit points of a shape intersected by a ray
- Conv2D: extract its feature from the depth image
aggregates information of past views on RNN hidden layer
FC class: prediction of the categorical label
FC loc: for NBV selection in 3D space, produce an update to the current view for future observations
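
The ray casting step of the depth layer comes down to finding where a camera ray first hits the shape. Below is a minimal sketch for a single ray against a sphere, chosen only because its intersection has a closed form. The name `ray_sphere_depth` and the sphere stand-in for the shape are assumptions, and the differentiability with respect to the view parameters described above is not modeled.

```python
import math

def ray_sphere_depth(origin, direction, center, radius):
    """Distance along the ray to the first hit on a sphere, or None on a miss.

    origin, direction, center: 3-tuples; direction need not be normalized.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    d = [c / norm for c in direction]              # unit ray direction
    oc = [o - c for o, c in zip(origin, center)]   # center-to-origin vector

    # Solve |oc + t*d|^2 = radius^2 for t (quadratic with leading coefficient 1).
    b = sum(a * k for a, k in zip(oc, d))
    c = sum(a * a for a in oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None                                # ray misses the sphere

    t = -b - math.sqrt(disc)                       # nearest intersection
    if t < 0:
        t = -b + math.sqrt(disc)                   # origin is inside the sphere
    return t if t >= 0 else None                   # hit behind the origin counts as a miss
```

Casting one such ray per pixel from the current view would give the depth image that Conv2D takes as input.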

JisuHann / One-day-One-paper

Active Vision #33
