astra-vision / MonoScene

[CVPR 2022] "MonoScene: Monocular 3D Semantic Scene Completion": 3D Semantic Occupancy Prediction from a single image
https://astra-vision.github.io/MonoScene/
Apache License 2.0

How to adapt the HuggingFace repo for inference on own interior (NYU-like) RGB images #59

Closed OthmanLoutfi closed 1 year ago

OthmanLoutfi commented 1 year ago

Hello, I saw that you gave directions on how to get inference results on interior images that aren't in the NYU or KITTI datasets using this repo: https://huggingface.co/spaces/CVPR/MonoScene/tree/main

Could you explain further what we are supposed to modify in the code?

anhquancao commented 1 year ago

Hi @OthmanLoutfi, you need to perform the reconstruction in camera coordinates. To do so, transform the 3D ground truth into the camera frame using the GT pose. Then you can define a 3D volume in front of the camera, in camera coordinates, as the output volume. You can see something similar in SceneRF: https://github.com/astra-vision/SceneRF#teaser. I will update the code for SceneRF in the coming months.
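
For example, a minimal sketch of that world-to-camera transform (assuming the ground-truth pose is a 4x4 camera-to-world matrix and the ground-truth voxel centers are available as an Nx3 array of world-frame points; the names are illustrative, not from the repo):

```python
import numpy as np

def world_to_cam(points_world: np.ndarray, cam2world: np.ndarray) -> np.ndarray:
    """Transform Nx3 world-frame points into the camera frame.

    cam2world is the 4x4 ground-truth pose (camera -> world); applying its
    inverse brings the world-frame ground-truth voxel centers into camera
    coordinates.
    """
    world2cam = np.linalg.inv(cam2world)
    pts_h = np.concatenate([points_world, np.ones((points_world.shape[0], 1))], axis=1)
    return (world2cam @ pts_h.T).T[:, :3]
```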

OthmanLoutfi commented 1 year ago

> Hi @OthmanLoutfi, you need to perform the reconstruction in camera coordinates. To do so, transform the 3D ground truth into the camera frame using the GT pose. Then you can define a 3D volume in front of the camera, in camera coordinates, as the output volume. You can see something similar in SceneRF: https://github.com/astra-vision/SceneRF#teaser. I will update the code for SceneRF in the coming months.

So if I understand correctly, we must have volume data for each image (in the form of a .bin file) in order to get inference results. Could using SceneRF help generate those files?

anhquancao commented 1 year ago

Yes, the current implementation of MonoScene on NYU requires the ground-truth pose from the .bin file to get the corresponding 3D volume. It eases the evaluation process.

If you want to do something similar to the demo, you need to:

  • For each scene, transform the voxels to camera coordinates using the ground-truth pose.
  • Then you can define the 3D reconstruction volume as the area right in front of the camera; e.g. 4.8m x 4.8m x 3.6m is enough to contain all the transformed 3D ground truth in camera coordinate space. This removes the need for the ground-truth pose from the .bin file.

SceneRF does everything in the camera coordinate space, so it doesn't require those files.
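
For the second point, a minimal sketch of such a camera-frame output volume (the 4.8m x 4.8m x 3.6m extent and the 0.08m voxel size follow the numbers above; the axis convention, with z pointing forward from the camera, is my assumption):

```python
import numpy as np

# Output volume defined directly in the camera frame:
# x in [-2.4, 2.4] m, y in [-2.4, 2.4] m, z in [0, 3.6] m (forward).
voxel_size = 0.08                          # meters per voxel
vol_origin = np.array([-2.4, -2.4, 0.0])   # corner of the volume, camera coordinates
vol_dims = np.array([60, 60, 45])          # 4.8m x 4.8m x 3.6m at 0.08m resolution

# Camera-frame center of every voxel in the output volume
ii, jj, kk = np.meshgrid(*[np.arange(d) for d in vol_dims], indexing="ij")
voxel_centers_cam = vol_origin + (np.stack([ii, jj, kk], axis=-1) + 0.5) * voxel_size

def inside_volume(pts_cam: np.ndarray) -> np.ndarray:
    """Check which camera-frame points (e.g. the transformed ground truth) fall inside the volume."""
    return np.all((pts_cam >= vol_origin) & (pts_cam <= vol_origin + vol_dims * voxel_size), axis=1)
```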

I hope this helps.

OthmanLoutfi commented 1 year ago
> • For each scene, transform the voxels to camera coordinates using the ground-truth pose.

Are these the camera coordinates? https://github.com/astra-vision/MonoScene/blob/4447ea4f52554482e24d8aa46d290747c030ceba/monoscene/data/utils/helpers.py#L104-L106

> • Then you can define the 3D reconstruction volume as the area right in front of the camera; e.g. 4.8m x 4.8m x 3.6m is enough to contain all the transformed 3D ground truth in camera coordinate space. This removes the need for the ground-truth pose from the .bin file.

I'm not sure I understand this. Could you elaborate further on what needs to be done at this step?

anhquancao commented 1 year ago

For NYU, the voxels are in world coordinates. I don't think changing that is enough: you need to write code to convert the 3D ground truth to camera coordinates, modify the training pipeline, and retrain the model.
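
A rough, purely illustrative sketch of that conversion step (the grid shapes, the 0.08m voxel size, the camera-frame volume and the nearest-voxel assignment are assumptions, not the repo's actual code):

```python
import numpy as np

def gt_world_to_cam_grid(labels_world, vox_origin, cam2world,
                         voxel_size=0.08, cam_grid_shape=(60, 60, 45),
                         cam_origin=(-2.4, -2.4, 0.0)):
    """Resample a world-frame semantic voxel grid into a camera-frame grid.

    labels_world : (X, Y, Z) integer label grid in world coordinates
    vox_origin   : world coordinates of the grid's corner voxel (3,)
    cam2world    : 4x4 ground-truth camera pose (camera -> world)
    cam_origin   : corner of the output volume, expressed in the camera frame
    """
    # World-frame centers of every occupied voxel (0 assumed to mean "empty")
    idx = np.argwhere(labels_world > 0)
    centers_world = np.asarray(vox_origin) + (idx + 0.5) * voxel_size

    # Bring the centers into the camera frame using the inverse of the GT pose
    world2cam = np.linalg.inv(cam2world)
    pts_h = np.concatenate([centers_world, np.ones((len(idx), 1))], axis=1)
    centers_cam = (world2cam @ pts_h.T).T[:, :3]

    # Assign each center to its nearest voxel in the camera-frame output grid
    cam_idx = np.floor((centers_cam - np.asarray(cam_origin)) / voxel_size).astype(int)
    labels_cam = np.zeros(cam_grid_shape, dtype=labels_world.dtype)
    inside = np.all((cam_idx >= 0) & (cam_idx < np.array(cam_grid_shape)), axis=1)
    labels_cam[tuple(cam_idx[inside].T)] = labels_world[tuple(idx[inside].T)]
    return labels_cam
```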