Hi @OthmanLoutfi, you need to perform the reconstruction in camera coordinates. To do so, transform the 3D ground truth to camera coordinates using the GT pose. Then you can define a 3D volume in front of the camera, in camera coordinates, as the output volume. You can expect something similar to what is shown here in SceneRF: https://github.com/astra-vision/SceneRF#teaser. I will update the code for SceneRF in the coming months.
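To make that first step concrete, here is a minimal sketch (my own, not code from the repo) of moving ground-truth voxels into the camera frame. It assumes the GT pose is a 4x4 camera-to-world matrix, that `vox_origin` is the world-space position of voxel (0, 0, 0), and a 0.08 m voxel size; all names are placeholders.

```python
import numpy as np

def voxels_world_to_camera(labels, vox_origin, cam_pose, voxel_size=0.08):
    """Return (N, 3) voxel-center coordinates in the camera frame plus the
    semantic label of each occupied voxel.

    Assumptions (not taken from the repo): `labels` is a 3D array of class
    ids with 0 = empty, `cam_pose` is a 4x4 camera-to-world transform, and
    `vox_origin` is the world-space position of voxel (0, 0, 0).
    """
    occupied = np.argwhere(labels > 0)                       # (N, 3) indices
    centers_world = vox_origin + (occupied + 0.5) * voxel_size
    # Homogeneous coordinates, then world -> camera via the inverse pose.
    centers_h = np.hstack([centers_world, np.ones((len(centers_world), 1))])
    centers_cam = (np.linalg.inv(cam_pose) @ centers_h.T).T[:, :3]
    return centers_cam, labels[labels > 0]
```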
So if I understand correctly, we must have volume data for each image (in the form of a .bin file) in order to get inference results. Could using SceneRF help generate those files?
Yes, the current implementation of MonoScene on NYU requires the ground-truth pose from the .bin file to get the corresponding 3D volume. It eases the evaluation process.
If you want to do something similar to the demo, you need to:

- For each scene, transform the voxels to camera coordinates using the ground-truth pose.
- Then you can define the 3D reconstruction volume as the area right in front of the camera; e.g. 4.8m x 4.8m x 3.6m is enough to contain all of the transformed 3D ground truth in camera coordinate space. This removes the need for the ground-truth pose from the .bin file (see the sketch below).

SceneRF does everything in camera coordinate space, so it doesn't require those files.
I hope this helps.
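As an illustration of the second step, a rough sketch of such a camera-frame output volume could look like the following. The extents follow the numbers above; the 0.08 m voxel size and the axis convention (x right, y down, z pointing forward from the camera) are assumptions on my part, not values taken from the repo.

```python
import numpy as np

voxel_size = 0.08                        # meters per voxel (assumed)
scene_size = np.array([4.8, 4.8, 3.6])   # extent in camera coordinates (m)
dims = np.round(scene_size / voxel_size).astype(int)   # e.g. (60, 60, 45)

# Put the volume directly in front of the camera: x and y centered on the
# optical axis, z starting at the camera and extending forward.
vol_origin_cam = np.array([-scene_size[0] / 2, -scene_size[1] / 2, 0.0])

# Voxel-center coordinates of the whole grid, expressed in the camera frame.
ix, iy, iz = np.meshgrid(*[np.arange(d) for d in dims], indexing="ij")
centers_cam = vol_origin_cam + (np.stack([ix, iy, iz], axis=-1) + 0.5) * voxel_size
```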
Are these the camera coordinates? https://github.com/astra-vision/MonoScene/blob/4447ea4f52554482e24d8aa46d290747c030ceba/monoscene/data/utils/helpers.py#L104-L106
Also, I'm not sure I understand the second step (defining the volume in front of the camera); could you elaborate further on what needs to be done there?
For NYU, the voxels are in world coordinates. I don't think changing them is enough; you need to write code to convert the 3D ground truth to camera coordinates, modify the training pipeline, and retrain the model.
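For what it's worth, one way such a conversion could be written, purely as a sketch under the same assumptions as the snippets above and not the authors' actual pipeline, is to map every voxel center of the new camera-frame grid back into the original world-frame grid and copy the label found there:

```python
import numpy as np

def resample_labels_to_camera_grid(labels_world, vox_origin, cam_pose,
                                   centers_cam, voxel_size=0.08, empty_label=0):
    """Fill a camera-frame grid with labels from the world-frame ground truth.

    `centers_cam` are the camera-frame voxel centers from the sketch above;
    `cam_pose` is again assumed to be a 4x4 camera-to-world transform.
    """
    flat = centers_cam.reshape(-1, 3)
    flat_h = np.hstack([flat, np.ones((len(flat), 1))])
    world = (cam_pose @ flat_h.T).T[:, :3]            # camera -> world

    # World coordinates -> integer indices into the original label grid.
    idx = np.floor((world - vox_origin) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(labels_world.shape)), axis=1)

    out = np.full(len(flat), empty_label, dtype=labels_world.dtype)
    out[inside] = labels_world[idx[inside, 0], idx[inside, 1], idx[inside, 2]]
    return out.reshape(centers_cam.shape[:-1])
```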
Hello, I saw that you gave directions on how to get inference results on interior images that aren't in the NYU or KITTI datasets, using this repo: https://huggingface.co/spaces/CVPR/MonoScene/tree/main
Could you explain further what we are supposed to modify in the code?