Semantic Mapping
The depth observation is used to compute a point cloud. Each point in the point cloud is associated with the predicted semantic categories.
Open question: do we need a depth sensor, or can we use the output of a depth-estimation network? HabitatNav suggests the method does not depend on highly accurate depth readings.
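As a minimal sketch of this step, the backprojection below converts a depth frame into a camera-frame point cloud; the camera intrinsics, image size, and the helper name depth_to_point_cloud are illustrative assumptions, not the actual implementation.

```python
import torch

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """depth: (H, W) tensor of metric depths. Returns an (H*W, 3) point cloud."""
    H, W = depth.shape
    vs, us = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth
    x = (us - cx) * z / fx   # right
    y = (vs - cy) * z / fy   # down
    return torch.stack([x, y, z], dim=-1).reshape(-1, 3)

# Dummy 480x640 depth frame; each resulting point can later be tagged with the
# semantic label predicted for its source pixel.
points = depth_to_point_cloud(torch.rand(480, 640) * 5.0,
                              fx=320.0, fy=320.0, cx=320.0, cy=240.0)
```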
The semantic categories are predicted by a pretrained Mask R-CNN [18] applied to the RGB observation.
Use a model pre-trained on COCO, restricted to a selected subset of labels; this subset still needs to be chosen.
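One way to obtain the per-pixel categories is torchvision's COCO pre-trained Mask R-CNN, keeping only the masks for a chosen subset of category IDs. The sketch below assumes torchvision ≥ 0.13 (for weights="DEFAULT"), and the six categories shown (chair, couch, potted plant, bed, toilet, tv) are an illustrative placeholder until the label set is fixed.

```python
import torch
import torchvision

# Hypothetical mapping from COCO category id to semantic channel; verify the
# ids against the COCO label list before use.
SELECTED_COCO_IDS = {62: 0, 63: 1, 64: 2, 65: 3, 70: 4, 72: 5}

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def semantic_channels(rgb, score_thresh=0.5, mask_thresh=0.5):
    """rgb: (3, H, W) float tensor in [0, 1]. Returns (C, H, W) binary masks."""
    out = model([rgb])[0]
    sem = torch.zeros(len(SELECTED_COCO_IDS), *rgb.shape[1:])
    for label, score, mask in zip(out["labels"], out["scores"], out["masks"]):
        ch = SELECTED_COCO_IDS.get(int(label))
        if ch is not None and score > score_thresh:
            # Merge all detections of the same category into one channel.
            sem[ch] = torch.maximum(sem[ch], (mask[0] > mask_thresh).float())
    return sem

sem = semantic_channels(torch.rand(3, 480, 640))   # one channel per selected label
```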
Each point in the point cloud is then projected in 3D space using differentiable geometric computations to get the voxel representation. The voxel representation is then converted to the semantic map.
Open question: the exact mechanics of this conversion are not yet settled; a sketch of one possibility follows the next paragraph.
Summing over the height dimension of the voxel representation — over the obstacle-height band, over all cells, and over each category separately — gives the obstacle, explored-area, and per-category channels of the projected semantic map.
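The following is a simplified sketch of the projection and the height sum, under assumed grid sizes (a 240×240 map with 5 cm cells and 30 height bins) and an assumed obstacle-height band. The real module would use fully differentiable geometric operations; hard binning of coordinates as below is only differentiable with respect to the per-point semantic scores.

```python
import torch

def points_to_map(points, sem, map_size=240, height_bins=30, cell=0.05):
    """points: (N, 3) metric coordinates (x, y, z with z up); sem: (N, C) per-point
    category scores. Returns (C + 2, map_size, map_size) map channels."""
    N, C = sem.shape
    ix = (points[:, 0] / cell).long().clamp(0, map_size - 1)
    iy = (points[:, 1] / cell).long().clamp(0, map_size - 1)
    iz = (points[:, 2] / cell).long().clamp(0, height_bins - 1)

    # Accumulate point counts and per-point semantic scores into voxels.
    voxels = torch.zeros(height_bins, map_size, map_size)
    voxels.index_put_((iz, iy, ix), torch.ones(N), accumulate=True)
    sem_vox = torch.zeros(C, height_bins, map_size, map_size)
    for c in range(C):
        sem_vox[c].index_put_((iz, iy, ix), sem[:, c], accumulate=True)

    # Sum over the height dimension: an assumed agent-height band gives the
    # obstacle channel, all cells give the explored-area channel, and each
    # category's accumulated mass gives its own channel.
    obstacles = voxels[2:].sum(dim=0).clamp(max=1.0)
    explored = voxels.sum(dim=0).clamp(max=1.0)
    categories = sem_vox.sum(dim=1).clamp(max=1.0)
    return torch.cat([obstacles[None], explored[None], categories], dim=0)

pts = torch.rand(1000, 3) * torch.tensor([12.0, 12.0, 1.5])   # 12 m x 12 m x 1.5 m scene
sem = torch.rand(1000, 6)
proj_map = points_to_map(pts, sem)                            # (8, 240, 240)
```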
The projected semantic map is then passed through a denoising neural network to get the final semantic map prediction. The map is aggregated over time using spatial transformations and channel-wise pooling as described in [10].
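Below is a rough sketch of the temporal aggregation, assuming the spatial transformation can be expressed as an affine warp of the per-step map into the global frame followed by channel-wise max pooling; the pose convention (normalized map coordinates plus a rotation angle) is an assumption for illustration and not the exact formulation of [10].

```python
import math
import torch
import torch.nn.functional as F

def aggregate(global_map, local_map, pose):
    """global_map, local_map: (1, C, M, M); pose: (x, y, theta) with x, y in
    normalised [-1, 1] map coordinates and theta in radians."""
    x, y, theta = pose
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Warp the per-step map into the global frame according to the pose.
    affine = torch.tensor([[cos_t, -sin_t, x],
                           [sin_t,  cos_t, y]]).unsqueeze(0)
    grid = F.affine_grid(affine, global_map.shape, align_corners=False)
    warped = F.grid_sample(local_map, grid, align_corners=False)
    # Channel-wise pooling: keep the maximum evidence seen so far in each cell.
    return torch.maximum(global_map, warped)

global_map = torch.zeros(1, 8, 240, 240)
local_map = torch.rand(1, 8, 240, 240)
global_map = aggregate(global_map, local_map, pose=(0.1, -0.2, math.pi / 6))
```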
The Semantic Mapping module is trained using supervised learning with cross-entropy loss on the semantic segmentation as well as semantic map prediction.
The geometric projection is implemented using differentiable operations so that the loss on the semantic map prediction can be backpropagated through the entire module if desired.
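The following is a hedged sketch of the training signal: per-pixel cross-entropy on the first-person segmentation plus a per-channel loss on the predicted map. Treating each map channel as an independent binary target (BCE with logits) and weighting the two terms equally are assumptions about the exact form of the loss; the point is simply that the map loss can flow back through the differentiable projection if desired.

```python
import torch
import torch.nn.functional as F

B, K, C, H, W, M = 2, 16, 8, 480, 640, 240   # batch, seg classes, map channels, image size, map size

seg_logits = torch.randn(B, K, H, W, requires_grad=True)   # first-person segmentation prediction
seg_target = torch.randint(0, K, (B, H, W))

map_logits = torch.randn(B, C, M, M, requires_grad=True)   # denoised semantic map prediction
map_target = torch.randint(0, 2, (B, C, M, M)).float()     # ground-truth map channels

loss_seg = F.cross_entropy(seg_logits, seg_target)                       # segmentation supervision
loss_map = F.binary_cross_entropy_with_logits(map_logits, map_target)   # map supervision
loss = loss_seg + loss_map
loss.backward()   # gradients reach everything upstream of the map prediction
```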