Closed miraclebiu closed 1 year ago
We don't need to estimate the depth of each image token. We compute the nearest neighbor based on the 2D distance between the projected 3D grid points and the image tokens, assuming that 2D adjacency reflects 3D adjacency to some extent.
Frankly, this definitely introduces some error, and there is a lot of room for improvement in how to construct a non-learnable 2D-to-3D one-to-one mapping.
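To make the rule-based mapping concrete, here is a minimal numpy sketch of one plausible reading of it: project the 3D grid points into the image with fixed camera parameters, then assign each grid point the image token whose center is nearest in 2D. All names, camera values, and grid sizes below are made up for illustration and do not come from the repo; note also that this sketch is many-to-one, whereas the paper's mapping is described as one-to-one.

```python
import numpy as np

# Hypothetical pinhole intrinsics and identity extrinsics (assumed values).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 0.0])

# Hypothetical 3D grid points in front of the camera (z > 0).
rng = np.random.default_rng(0)
grid = rng.random((1000, 3)) * [10, 10, 10] + [-5, -5, 1]

# Project every 3D grid point onto the image plane.
cam = grid @ R.T + t
uv = cam @ K.T
uv = uv[:, :2] / uv[:, 2:3]          # (N, 2) pixel coordinates

# Image-token centers for a 640x480 image split into 40x40 patches
# (16 x 12 = 192 tokens), mimicking ViT patch centers.
xs = np.arange(20, 640, 40, dtype=float)
ys = np.arange(20, 480, 40, dtype=float)
tokens = np.stack(np.meshgrid(xs, ys), -1).reshape(-1, 2)  # (192, 2)

# Each grid point is assigned the token whose center is nearest in 2D;
# no depth is ever estimated for the image tokens.
d2 = ((uv[:, None, :] - tokens[None, :, :]) ** 2).sum(-1)  # (N, 192)
assign = d2.argmin(axis=1)           # token index per grid point
print(assign.shape)
```

Because `K`, `R`, and `t` are fixed per camera, the assignment depends only on the geometry, not on image or lidar content, which is exactly the behavior discussed below.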
Do I understand correctly that, for a given set of camera extrinsic parameters (so for a given dataset), we will always assign the same depth to a given pixel, regardless of what the camera sees or what is in the lidar?
Yes, your understanding is correct. This is a fairly rudimentary rule-based mapping, since we prefer not to introduce learnable parameters for depth estimation.
Since each pixel in the image represents a ray in 3D space, I don't understand how to compute the nearest neighbor between the pseudo 3D grid points and the image tokens in the section on 3D geometric space. Do we need to estimate the depth of each image token?