iheanu opened this issue 1 month ago
Hi, from what I understand the code already gives you absolute depth; as the comment says, the prediction is in meters:

```python
# Run inference.
prediction = model.infer(image, f_px=f_px)
depth = prediction["depth"]  # Depth in [m].
```
In particular, in `ml-depth-pro/src/depth_pro/depth_pro.py`, in the `infer` function, one can see that the horizontal focal length $f_{px}$ is used to compute the absolute depth map $D$ from the canonical inverse depth map $C$ and the width $W$ using the formula

$$ D = \frac{f_{px}}{W C} $$
```python
inverse_depth = canonical_inverse_depth * (W / f_px)
f_px = f_px.squeeze()

if resize:
    inverse_depth = nn.functional.interpolate(
        inverse_depth, size=(H, W), mode=interpolation_mode, align_corners=False
    )

depth = 1.0 / torch.clamp(inverse_depth, min=1e-4, max=1e4)

return {
    "depth": depth.squeeze(),
    "focallength_px": f_px,
}
```
@Clod98 What if I have the full intrinsics, i.e. (fx, fy, cx, cy)? The model for now only supports one parameter, fx. Am I correct?
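For illustration, assuming that premise is right, with a full 3×3 intrinsics matrix $K$ only $f_x$ would enter `infer` (the matrix values below are made up):

```python
import torch

# Hypothetical intrinsics K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
K = torch.tensor([[1200.0,    0.0, 960.0],
                  [   0.0, 1195.0, 540.0],
                  [   0.0,    0.0,   1.0]])

f_px = K[0, 0]  # fx -- the only intrinsic the depth conversion uses

# prediction = model.infer(image, f_px=f_px)  # fy, cx, cy are not consumed
```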
Is the absolute depth extracted directly from the predicted depth map?
I am confused about why (W / f_px) can be used to transform canonical inverse depth to absolute inverse depth. Is there a mathematical proof of this?
In the paper https://arxiv.org/pdf/2307.10984, which the ml-depth-pro authors cite, the ground-truth depth $D$ is scaled by the ratio $w_d=\frac{f^c}{f}$, where $f^c$ is the canonical focal length (which they set to 1000) and $f$ is the real focal length. So the relationship between canonical depth and ground-truth depth is

$$ D_c = \frac{f^c}{f} D $$
Here $f_{px}$ is the focal length of the image, which can be estimated or given by the user, and I think the authors set the canonical focal length to the width of the image (resizing the image to the network resolution of 1536×1536 if it differs). So $W/f_{px}$ is just a simple scale factor that 'normalizes' the depth to resolve the ambiguity caused by different focal lengths. And since we are dealing with canonical inverse depth and inverse depth here, the formula would be

$$ D_{inv} = \frac{W}{f_{px}} D_{c_{inv}} $$
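To spell out the chain, taking the canonical focal length as $f^c = W$:

$$ D_c = \frac{W}{f_{px}} D \quad\Rightarrow\quad D_{c_{inv}} = \frac{1}{D_c} = \frac{f_{px}}{W}\cdot\frac{1}{D} \quad\Rightarrow\quad D_{inv} = \frac{1}{D} = \frac{W}{f_{px}} D_{c_{inv}}, $$

which matches both `inverse_depth = canonical_inverse_depth * (W / f_px)` in the code and $D = \frac{f_{px}}{W C}$ above.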
Please let me know if my understanding is wrong :)
That makes sense. So W is actually 1536, and $f_{px}$ is the estimated focal length, expressed in pixels along the x direction. Am I right?
Yes, as far as I understand.
Exactly. The network has two "heads": one predicting the canonical inverse depth $C$ and another predicting the focal length $f_{px}$, so it's able to predict absolute depth "in the wild" (i.e., for images from cameras with unknown intrinsics). If you already have the focal length, though, you can just plug it into the `infer` function and it will be used in the conversion from inverse depth to absolute depth.
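For reference, a minimal sketch of both modes, following the repo's README (the image path is a placeholder):

```python
import depth_pro

# Load the model and its preprocessing transform.
model, transform = depth_pro.create_model_and_transforms()
model.eval()

# load_rgb also tries to recover the focal length from EXIF; it may be None.
image, _, f_px = depth_pro.load_rgb("example.jpg")
image = transform(image)

# Case 1: unknown intrinsics -- the focal-length head estimates f_px,
# and the depth is still metric.
prediction = model.infer(image, f_px=None)
depth = prediction["depth"]              # depth in [m]
f_px_est = prediction["focallength_px"]  # estimated focal length in pixels

# Case 2: known intrinsics -- your fx (in pixels) replaces the estimate
# in the inverse-depth -> depth conversion.
prediction = model.infer(image, f_px=f_px)
```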
Can someone provide detailed steps?