astra-vision / MonoScene

[CVPR 2022] "MonoScene: Monocular 3D Semantic Scene Completion": 3D Semantic Occupancy Prediction from a single image
https://astra-vision.github.io/MonoScene/
Apache License 2.0

how to infer a single rgb image #7

Closed akk-123 closed 2 years ago

akk-123 commented 2 years ago

[figure from the paper showing the Features Line of Sight Projection]

If I understand correctly, the Features Line of Sight Projection needs to know the 3D voxel positions in order to project them onto the 2D image. But when inferring on an RGB image that is not from NYU, we don't know the 3D voxel positions, so how can I do the projection?

anhquancao commented 2 years ago

Hi @akk-123, thank you for your interest in our work! You can sample the positions of the 3D voxels in camera coordinates, then project them onto the image using the camera intrinsics.

akk-123 commented 2 years ago

Hi, thanks for the reply. The key problem is how to sample the positions: the only input is RGB images, with no extra information, so how can the positions be sampled effectively? There is no description of this in the paper.

anhquancao commented 2 years ago

It's a good question. MonoScene requires the camera intrinsics, which allow us to compute the camera viewing frustum of the image. Then, we sample the grid that contains the camera viewing frustum.
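For concreteness, a minimal numpy sketch of computing the frustum extent from the intrinsics (all values here are illustrative, not the ones used in the paper):

```python
import numpy as np

# Illustrative intrinsics and image size -- replace with your camera's values.
fx = fy = 518.9
cx, cy = 320.0, 240.0
img_W, img_H, max_depth = 640, 480, 4.8  # pixels, pixels, meters

# Back-project the four image corners at max_depth to get the frustum extent.
corners = np.array([[0, 0], [img_W, 0], [0, img_H], [img_W, img_H]], dtype=float)
x = (corners[:, 0] - cx) / fx * max_depth
y = (corners[:, 1] - cy) / fy * max_depth

# An axis-aligned voxel grid spanning these ranges contains the whole frustum;
# sample the voxel centers inside it.
print("x-range at max depth:", x.min(), x.max())
print("y-range at max depth:", y.min(), y.max())
print("z-range:", 0.0, max_depth)
```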

Usama3059 commented 2 years ago

Hi @anhquancao, thanks for sharing this fantastic work! Can you please tell me the commands to run inference on a single RGB image for NYU, if I don't have the .bin file for that image?

anhquancao commented 2 years ago

Hi @Usama3059, thanks for your interest! You can check the inference code here: https://huggingface.co/spaces/CVPR/MonoScene/blob/main/app.py#L47. Specifically, adapt the parameters passed to vox2pix in the get_projection function (https://huggingface.co/spaces/CVPR/MonoScene/blob/main/helpers.py#L141) to use the NYU camera parameters, as in https://github.com/cv-rits/MonoScene/blob/master/monoscene/data/NYU/nyu_dataset.py#L84
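For illustration, the adaptation might look roughly like the sketch below. It assumes vox2pix keeps the signature from the repo's monoscene/data/utils/helpers.py; the identity cam_E and zero vox_origin are placeholders for the no-.bin-file case, not values from the codebase.

```python
import numpy as np
from monoscene.data.utils.helpers import vox2pix  # signature assumed from the repo's helpers.py

# NYU camera intrinsics, as in nyu_dataset.py
cam_k = np.array([[518.8579, 0.0, 320.0],
                  [0.0, 518.8579, 240.0],
                  [0.0, 0.0, 1.0]])

cam_E = np.eye(4)         # placeholder pose: scene expressed directly in camera coordinates
vox_origin = np.zeros(3)  # placeholder origin when no .bin file is available

projected_pix, fov_mask, pix_z = vox2pix(
    cam_E, cam_k, vox_origin,
    voxel_size=0.08,             # NYU voxel size in meters
    img_W=640, img_H=480,        # NYU image resolution
    scene_size=(4.8, 4.8, 2.88), # NYU scene extent in meters
)
```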

someone-TJ commented 1 year ago

Hi @akk-123, I agree with you. When I look at the code, it seems that sampling the positions still requires both the camera pose and the camera intrinsics, e.g. for NYUv2: https://github.com/astra-vision/MonoScene/blob/b3b12cf7e0df12cf68b95b0cc39f1d9ef1041666/monoscene/data/utils/helpers.py#L53 And I am confused by this note in that function:

vox_origin: (3,)
        world(NYU)/lidar(SemKITTI) coordinates of the voxel at index (0, 0, 0)

Is the default voxel origin positioned at (0, 0, 0) in world coordinates? In NYUv2, vox_origin actually has specific values. So if, on another dataset, I only have the RGB image and the camera intrinsics and don't know the location of the voxel origin, how do I use it as a reference point for the projection? @anhquancao

anhquancao commented 1 year ago

Hi @someone-TJ, a more proper way is to transform all the ground truth into camera coordinates using voxel_origin and camera_pose before training. The 3D scene is then expressed in camera coordinates, which removes these two variables from the computation of the projection.
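A minimal numpy sketch of that transform (assuming cam_pose is a 4x4 camera-to-world matrix; invert it first if yours goes the other way):

```python
import numpy as np

def voxels_to_cam(vox_origin, voxel_size, dims, cam_pose):
    """Express voxel centers in camera coordinates.

    vox_origin: (3,) world coordinates of the voxel at index (0, 0, 0)
    dims:       grid dimensions, e.g. (60, 60, 36)
    cam_pose:   (4, 4), assumed camera-to-world
    """
    ii, jj, kk = np.meshgrid(*(np.arange(d) for d in dims), indexing="ij")
    centers_w = vox_origin + (np.stack([ii, jj, kk], axis=-1) + 0.5) * voxel_size
    centers_w = centers_w.reshape(-1, 3)

    world2cam = np.linalg.inv(cam_pose)
    homog = np.hstack([centers_w, np.ones((centers_w.shape[0], 1))])
    return (homog @ world2cam.T)[:, :3]  # (N, 3) voxel centers in camera coordinates
```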

someone-TJ commented 1 year ago

Thanks for your reply. I know what you mean, and such a transform is possible. But my doubt is this: if MonoScene only takes an RGB image as input, then even with known camera intrinsics you still can't recover the voxel coordinates (X, Y, Z), because depth is lost in the projection from camera coordinates to the pixel plane.

anhquancao commented 1 year ago

Hi, the projection is used to gather features from the 2D image; the 3D CNN then learns on these features to predict the class of each voxel. The depth is thus learned by the network from the dataset.
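To make the gathering step concrete, here is a simplified single-scale sketch (not the repo's exact FLoSP code) that samples one feature vector per projected voxel center; grid_sample's zero padding takes care of voxels that land outside the image:

```python
import torch
import torch.nn.functional as F

def gather_2d_features(feat2d, projected_pix, img_W, img_H):
    """Gather one feature vector per voxel from a 2D feature map.

    feat2d:        (1, C, H, W) image features from the 2D encoder
    projected_pix: (N, 2) pixel coordinates of the projected voxel centers
    Returns (N, C); samples falling outside the image come back as zeros.
    """
    grid = projected_pix.float().clone()
    grid[:, 0] = 2.0 * grid[:, 0] / (img_W - 1) - 1.0  # normalize u to [-1, 1]
    grid[:, 1] = 2.0 * grid[:, 1] / (img_H - 1) - 1.0  # normalize v to [-1, 1]
    grid = grid.view(1, 1, -1, 2)                      # (1, 1, N, 2)
    sampled = F.grid_sample(feat2d, grid, mode="bilinear",
                            padding_mode="zeros", align_corners=True)
    return sampled.squeeze(0).squeeze(1).T             # (N, C)
```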

someone-TJ commented 1 year ago

Hi, my expression may not have been clear. I mean: assuming the voxel origin is (0, 0, 0), I know the relative positions of the other voxels can be obtained from the unit voxel size and the scene size, but I don't know the actual distance from each voxel to the camera's optical center in the real scene. So how do I make sure that the projection points computed from the intrinsics correspond to the right image pixels? As you said in the paper, some voxels may even be projected outside the image.

anhquancao commented 1 year ago

You simply take a scene volume in front of the camera, e.g. 4.8m x 4.8m x 3.2m, and voxelize it. Voxels that project outside the image are assigned zero features.
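A minimal end-to-end sketch of that recipe (the intrinsics, axis assignment, and feature dimension are illustrative, not MonoScene's actual configuration):

```python
import numpy as np

# Voxelize a 4.8m x 4.8m x 3.2m volume in front of the camera.
voxel_size = 0.08
xs = np.arange(-2.4, 2.4, voxel_size)        # width:  4.8 m
ys = np.arange(-1.6, 1.6, voxel_size)        # height: 3.2 m
zs = np.arange(voxel_size, 4.8, voxel_size)  # depth, in front of the camera
X, Y, Z = np.meshgrid(xs, ys, zs, indexing="ij")
pts = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)

# Pinhole projection with illustrative intrinsics.
fx = fy = 518.9
cx, cy = 320.0, 240.0
img_W, img_H = 640, 480
u = fx * pts[:, 0] / pts[:, 2] + cx
v = fy * pts[:, 1] / pts[:, 2] + cy

# Voxels projecting outside the image keep zero features.
in_fov = (u >= 0) & (u < img_W) & (v >= 0) & (v < img_H)
n_channels = 64  # illustrative feature dimension
feats = np.zeros((pts.shape[0], n_channels), dtype=np.float32)
# feats[in_fov] = <2D features sampled at (u[in_fov], v[in_fov])>
```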