Closed rockywind closed 1 year ago
Hi, note that we only invert the pose of the first frame (the one from which we make the prediction). This inverted pose is then broadcast and multiplied with the poses of all frames, including its own. This means that the relative poses between the different frames / views remain the same, but the input frame now sits at the origin, i.e. its pose becomes the identity matrix. This is not strictly necessary, but it makes querying the network easier, as you don't have to transform the points into the coordinate system of the input frame.
(Side note: poses have shape (N, V, 4, 4), where N is the batch size and V is the number of frames / views per sample.)
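To make the broadcasting concrete, here is a minimal NumPy sketch rather than the repository's actual code. It assumes the poses are camera-to-world matrices and that the inverted first-frame pose is multiplied from the left; `make_pose` and the example values are hypothetical and only serve the illustration.

```python
import numpy as np

def make_pose(yaw, translation):
    """Hypothetical helper: 4x4 pose from a yaw angle (radians) and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose

# Made-up batch with N = 1 sample and V = 3 frames, shape (N, V, 4, 4).
poses = np.stack([
    make_pose(0.3, [1.0, 2.0, 0.0]),
    make_pose(0.5, [2.0, 2.5, 0.0]),
    make_pose(0.9, [3.0, 3.5, 0.1]),
])[None]

# Invert only the first frame's pose. Slicing with :1 keeps the V axis
# (length 1), so the result broadcasts against all V frames.
first_inv = np.linalg.inv(poses[:, :1])   # (N, 1, 4, 4)
normalized = first_inv @ poses            # (N, V, 4, 4)

# Only the first frame becomes the identity.
print(np.allclose(normalized[:, 0], np.eye(4)))
print(np.allclose(normalized[:, 1], np.eye(4)))

# The relative pose between frames 1 and 2 is the same before and after.
rel_before = np.linalg.inv(poses[:, 1]) @ poses[:, 2]
rel_after = np.linalg.inv(normalized[:, 1]) @ normalized[:, 2]
print(np.allclose(rel_before, rel_after))
```

Only frame 0 is multiplied by its own inverse. Every other frame is multiplied by the inverse of frame 0, which gives its pose relative to the input frame.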
Hi, I am confused: the pose matrix is inverted and then multiplied by itself, so isn't the result just the identity matrix?