DepthAnything / Depth-Anything-V2

[NeurIPS 2024] Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
https://depth-anything-v2.github.io
Apache License 2.0

Metric depth, focal depth estimation output/input #152

Open calledit opened 2 months ago

calledit commented 2 months ago

So the metric depth version obviously outputs depth in meters. To be able to do that, the model must somehow estimate the focal depth.

The fact that this estimation is not part of the outputs/inputs causes some issues, as shown by the following scenario:

I know what focal depth the images I send to the model have. The model does not, so it estimates a focal depth and gives a depth map. When I get the depth map, I project it into a point cloud. At this point things are sometimes scaled wrong because the model estimated the wrong focal depth. If I knew what focal depth the model estimated, I could partially correct for it in hindsight, but since I don't know what it estimated, I have to estimate what it estimated.
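
For reference, this is roughly the back-projection I mean (a minimal sketch assuming a simple pinhole model; fx, fy, cx, cy are my camera's intrinsics, not anything the model exposes):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a metric depth map (H, W) into an (N, 3) point cloud
    with a pinhole model. If the model internally assumed a different
    focal length than fx/fy, the resulting cloud is scaled wrong."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```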

And to be clear most of the time the model estimates the focal depth correctly. But sometimes it is off by 10-20%.

I am not sure if there is a question here. Maybe I could somehow find the focal depth hidden in the model? (Unlikely, I suppose.) If there is no current solution, then my hope would be that the next model gets focal depth as input and output (actually xFOV and yFOV in degrees would be better, as those do not depend on image height and width, which causes issues when you use the model on downscaled images).

LiheYoung commented 2 months ago

Hi @calledit, thank you for your valuable advice! We’ll consider incorporating focal information into the model. For the current metric depth models, we train the indoor version using the Hypersim dataset and the outdoor version using the Virtual KITTI 2 dataset. You might use the focal information from these datasets and your test images (e.g., their ratio) to appropriately scale the predicted depth in your images.
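
As a minimal sketch of that kind of rescaling (assuming a pinhole model, where predicted metric depth scales linearly with focal length):

```python
def rescale_depth(pred_depth, focal_train, focal_test):
    """Rescale metric depth predicted under the training camera's focal length
    (e.g. from Hypersim or Virtual KITTI 2) to the test camera's focal length,
    assuming depth scales linearly with focal length (pinhole model)."""
    return pred_depth * (focal_test / focal_train)
```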

calledit commented 2 months ago

Thanks for your response.

After some investigation it seems like Virtual KITTI 2 is all based on the exact same camera intrinsics (i.e.):

Focal depth x = 725.0087, focal depth y = 725.0087, W/2 = 620.5, H/2 = 187
Which corresponds to:
Field of View (degrees):
  fov_x = 81.117°
  fov_y = 28.926°
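
(Those FOV numbers follow from the usual pinhole relation, fov = 2 · atan(half extent / focal length); a quick sketch:)

```python
import math

def fov_deg(focal_px, half_extent_px):
    """Full field of view in degrees: fov = 2 * atan(half_extent / focal)."""
    return math.degrees(2 * math.atan(half_extent_px / focal_px))

print(fov_deg(725.0087, 620.5))  # ~81.1 deg horizontal
print(fov_deg(725.0087, 187.0))  # ~28.9 deg vertical
```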

It is actually surprising that the model is as good as it is at determining the correct depth, given that it has never seen any variation of the FOV during its training. I guess it might be explained by the fact* that the model has seen variation in FOV before it was fine-tuned for metric output.

*I don't know if that is a fact.

calledit commented 1 month ago

I have now gained some more insight into this issue after testing the depth-pro model.

The depth-pro model has the same issue, except it actually has FOV as an output, so one can apply a correction. But using the depth-pro model and correcting for the incorrect FOV it reported made me realise how poorly correction works to fix FOV in hindsight.

So there are two ways of correcting for FOV issues:

  1. Correcting for FOV to get the general distances to objects correct: you multiply the depth map by a correction factor (see the sketch after this list). The issue with that is that you compress objects that are already at the correct scale (like humans or other objects that the model used to guess its FOV).
  2. The second way of correcting is applying trigonometric functions to pixels far to the left, right, top and bottom to fix rotation issues. While this fixes the rotation, it also introduces a lot of artefacts, like curving flat walls and things like that.
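
For concreteness, a minimal sketch of option 1 (global rescaling from the estimated to the true FOV); option 2 would additionally re-project each pixel ray, which is where the wall-curving artefacts come from:

```python
import numpy as np

def fov_to_focal_px(fov_deg, width_px):
    """Focal length in pixels implied by a horizontal FOV (pinhole model)."""
    return (width_px / 2) / np.tan(np.radians(fov_deg) / 2)

def correct_depth_global(depth, fov_est_deg, fov_true_deg, width_px):
    """Option 1: multiply the whole depth map by one factor so overall
    distances match the true FOV. Side effect: objects whose scale the
    model already got right get rescaled as well."""
    f_est = fov_to_focal_px(fov_est_deg, width_px)
    f_true = fov_to_focal_px(fov_true_deg, width_px)
    return depth * (f_true / f_est)
```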

So essentially what I am saying is that FOV output is not enough for a model that should be able to recreate things accurately. Such a model needs to take FOV as input. (Unless it can estimate FOV perfectly, but that is impossible in some cases.)

I was hoping to get some response from the depth-pro people; I am not sure they understood the issue. https://github.com/apple/ml-depth-pro/issues/21

Thanks for your time!

I3aer commented 1 month ago

> So the metric depth version obviously outputs depth in meters. To be able to do that, the model must somehow estimate the focal depth. [...]

What is the focal depth? I could not find anything about "focal depth". Are you talking about depth of focus?