EPFL-VILAB / omnidata

A Scalable Pipeline for Making Steerable Multi-Task Mid-Level Vision Datasets from 3D Scans [ICCV 2021]

The estimated depth results of ScanNet using the pretrained model #12

Closed lzhnb closed 2 years ago

lzhnb commented 2 years ago

Thanks for your amazing work!

I've downloaded the v2 checkpoint and run demo.py to get the estimated depth. Attached are the input RGB image and the shaded depth map. The model captures the details well!

However, I found that the scale of the estimated depth is very small (see the attached screenshot). To our knowledge, the depth range for an indoor scene should be roughly 0~3 m. How can I recover the real scale of the estimated depth map?

In your demo there are scenes at many different scales (indoor scenes at 0~3 m, outdoor scenes at 0~100 m, and rendered scenes from cartoons). If you kept the real depth scale across these scenes, I think the model would be too difficult to train. How do you handle scale in your dataset?

Good luck :)

alexsax commented 2 years ago

Hi! Great question!

The depth models are trained using the MiDaS loss. The networks are trained to predict a "normalized" depth image that needs per-image scale + shift parameters. Just like the MiDaS papers, we supply the best-possible scale + shift parameters during training and evaluation. You could do the same, or you could train a network to estimate the scale + shift for ScanNet. We haven't tried anything with Omnidata + ScanNet yet--I'd be curious how this works out!
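To make the alignment concrete, here is a minimal sketch (my own illustration, not our exact evaluation code) of recovering metric depth from the network output when some ground truth is available, e.g. a ScanNet sensor depth frame:

```python
import torch

def align_to_metric(pred, gt_depth, valid_mask):
    """Fit a per-image scale s and shift t so that s * pred + t ~= gt_depth
    (ordinary least squares over the valid pixels), then return the metric
    prediction. pred and gt_depth are (H, W); valid_mask is a bool (H, W)."""
    p = pred[valid_mask].flatten()
    g = gt_depth[valid_mask].flatten()
    A = torch.stack([p, torch.ones_like(p)], dim=1)            # (n, 2) design matrix
    s, t = torch.linalg.lstsq(A, g.unsqueeze(1)).solution.squeeze(1)
    return s * pred + t                                         # metric depth estimate
```

Without ground truth you need some other source of the per-image scale and shift, which is why training a small network to predict them for your domain is a reasonable option.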

By the way, here's our implementation of the MiDaS loss. It's been 2 years and the original paper still hasn't released its training code. We'd love to hear how our implementation works for you, and whether you end up making any important tweaks.
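In case it helps while comparing against our code, the core idea of a scale-and-shift-invariant loss fits in a few lines (a simplified sketch with no trimming and no multi-scale gradient-matching term, so it is not a drop-in copy of our implementation):

```python
import torch

def ssi_mse_loss(pred, gt, eps=1e-6):
    """Scale-and-shift-invariant MSE in the spirit of the MiDaS loss.
    pred, gt: (B, H, W); a closed-form least-squares scale/shift per image
    removes the global scale ambiguity before the error is computed."""
    p, g = pred.flatten(1), gt.flatten(1)            # (B, N)
    n = p.shape[1]
    sum_p, sum_g = p.sum(1), g.sum(1)
    sum_pp, sum_pg = (p * p).sum(1), (p * g).sum(1)
    det = n * sum_pp - sum_p ** 2 + eps              # normal-equation determinant
    s = (n * sum_pg - sum_p * sum_g) / det           # per-image scale
    t = (sum_pp * sum_g - sum_p * sum_pg) / det      # per-image shift
    aligned = s[:, None] * p + t[:, None]
    return ((aligned - g) ** 2).mean()
```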

Lastly, I don't think you're using this right now, but if you end up using the Omnidata data for training: the depth images are stored as 16-bit single-channel images, and an integer value of 2**16 corresponds to a depth of 128 m. So the conversion is depth_metric = depth_png / 512.0. But this only applies to the Omnidata dataset--the pretrained models use the scale-aligning procedure outlined above.
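In code, that conversion is just (a quick sketch; the filename is only illustrative):

```python
import numpy as np
from PIL import Image

# Omnidata/Taskonomy depth is stored as 16-bit single-channel PNGs where
# 2**16 counts span 128 m, i.e. one count = 128 / 2**16 = 1/512 m.
depth_png = np.asarray(Image.open("example_depth_zbuffer.png"), dtype=np.float32)
depth_metric = depth_png / 512.0   # metres
```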

puyiwen commented 2 years ago


Hi, I want to use the Taskonomy dataset to train depth estimation, but I don't know the max depth for Taskonomy. I'd also like to know how you preprocess the depth data. Thank you very much!

alexsax commented 2 years ago

The max depth is 128m, and the preprocessing is depth_metric=depth_png/512.0.

puyiwen commented 2 years ago


Sorry to bother you again. I also ran demo.py on some indoor scene pictures and ran into the same issue: the depth values are between [0, 1]. I still don't know how to recover the true depth values. Can you give me more details? Thank you very much!!

LiXinghui-666 commented 2 years ago


Hello. I'd like to know: when you run this model on ScanNet, the ScanNet images are 640 x 480, which has a different aspect ratio from the output resolution of Omnidata. How do you handle the resolution so that the depth map and the RGB image correspond pixel by pixel?

Twilight89 commented 1 year ago

The max depth is 128m, and the preprocessing is depth_metric=depth_png/512.0.

Hi, but I see the preprocessing is 'depth / (2**16 - 1)' here
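For what it's worth, the two seem consistent once the 128 m max depth is factored back in. A quick check (my own arithmetic, assuming that loader just normalizes to [0, 1]):

```python
# depth / (2**16 - 1) gives a unitless value in [0, 1]; multiplying by the
# 128 m max depth recovers (up to the off-by-one in 2**16 - 1) the /512 rule.
depth_png = 2 ** 15                            # an arbitrary raw 16-bit value
metric_a = depth_png / (2 ** 16 - 1) * 128.0   # normalize, then rescale to metres
metric_b = depth_png / 512.0                   # the rule quoted above
print(metric_a, metric_b)                      # ~64.001 vs 64.0
```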

puyiwen commented 1 year ago

The max depth is 128m, and the preprocessing is depth_metric=depth_png/512.0.

Sorry to bother you, but I'd like to know the RGB camera intrinsic matrix of Taskonomy. Can you help me? Thank you very much!!