apple / ml-depth-pro

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second.

How to predict absolute depth? #31

Open iheanu opened 1 month ago

iheanu commented 1 month ago

Can someone provide detailed steps?

Clod98 commented 1 month ago

Hi, from what I understand, the code already gives you the absolute depth; the comment in the example states that the prediction is in meters:

# Run inference.
prediction = model.infer(image, f_px=f_px)
depth = prediction["depth"]  # Depth in [m].

In particular, in ml-depth-pro/src/depth_pro/depth_pro.py, the infer function uses the horizontal focal length $f_{px}$ to compute the absolute depth map $D$ from the canonical inverse depth map $C$ and the image width $w$ with the formula

$$ D = \frac{f_{px}}{wC} $$

        inverse_depth = canonical_inverse_depth * (W / f_px)
        f_px = f_px.squeeze()

        if resize:
            inverse_depth = nn.functional.interpolate(
                inverse_depth, size=(H, W), mode=interpolation_mode, align_corners=False
            )

        depth = 1.0 / torch.clamp(inverse_depth, min=1e-4, max=1e4)

        return {
            "depth": depth.squeeze(),
            "focallength_px": f_px,
        }

xiaodongww commented 1 month ago

@Clod98 What if I have the full intrinsics, i.e. (fx, fy, cx, cy)? The model currently supports only one parameter, fx. Am I correct?
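
One place where the remaining intrinsics could still be used is after inference, when back-projecting the predicted depth map to 3D points with the pinhole model. The sketch below is an assumption, not part of ml-depth-pro: `backproject_pixel` is a hypothetical helper, the intrinsics are made-up values, and it assumes the predicted depth is the z-coordinate along the optical axis rather than the distance along the viewing ray.

```python
def backproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth to a camera-frame 3D point.

    Hypothetical helper using the standard pinhole model; depth_m is assumed
    to be the z-coordinate in meters.
    """
    x = (u - cx) * depth_m / fx  # horizontal offset from the principal point
    y = (v - cy) * depth_m / fy  # vertical offset from the principal point
    return (x, y, depth_m)


# Made-up intrinsics and pixel, for illustration only.
fx, fy, cx, cy = 1000.0, 1000.0, 320.0, 240.0
point = backproject_pixel(u=420, v=240, depth_m=2.0, fx=fx, fy=fy, cx=cx, cy=cy)
```

Here the pixel lies 100 px to the right of the principal point, so at 2 m depth it maps to roughly (0.2, 0.0, 2.0) in meters.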

iheanu commented 1 month ago

Is the absolute depth extracted directly from the predicted depth map?

kang-1-2-3 commented 1 month ago

I am confused about why (W / f_px) can be used to transform canonical inverse depth to absolute inverse depth. Is there any further mathematical proof?

KyuhoBae commented 1 month ago

In the paper https://arxiv.org/pdf/2307.10984, which the ml-depth-pro authors cite in their paper, the ground-truth depth $D$ is scaled by the ratio $w_d=\frac{f^c}{f}$, where $f^c$ is the canonical focal length (which they set to 1000) and $f$ is the real focal length. So the relationship between canonical depth and ground-truth depth is

$$ D_c=\frac{f^c}{f}D $$

Here $f_{px}$ is the focal length of the image, which can be estimated or given by the user, and I think the authors set the canonical focal length to the width of the image (they resize the image to the network resolution of 1536x1536 if it has a different size). So $W/f_{px}$ is just a simple scale factor that 'normalizes' the depth to resolve the depth ambiguity caused by different focal lengths. Since we are dealing with inverse canonical depth and inverse depth here, the formula would be

$$ D_{inv}=\frac{W}{f_{px}}D_{c_{inv}} $$

Please let me know if I'm understanding wrong :)
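
Inverting the relation above gives $D = \frac{f_{px}}{W \, D_{c_{inv}}}$, which can be checked numerically. The following is a minimal sketch in plain Python, not the repository's implementation: the function name `metric_depth_from_canonical` and the sample values are hypothetical, it works on nested lists instead of tensors, and the clamp bounds are copied from the snippet posted earlier in this thread.

```python
def metric_depth_from_canonical(canonical_inverse_depth, width, f_px,
                                min_inv=1e-4, max_inv=1e4):
    """Convert a canonical inverse depth map to metric depth in meters.

    Hypothetical helper: scales by width / f_px, clamps, then inverts.
    """
    scale = width / f_px  # the W / f_px factor discussed above
    depth = []
    for row in canonical_inverse_depth:
        out_row = []
        for c in row:
            # Clamp the inverse depth so the division below stays finite.
            inv = min(max(c * scale, min_inv), max_inv)
            out_row.append(1.0 / inv)
        depth.append(out_row)
    return depth


# Made-up 2x2 canonical inverse depth map, for illustration only.
C = [[0.5, 0.25], [1.0, 0.0]]
depth = metric_depth_from_canonical(C, width=1000, f_px=500.0)
```

With `width=1000` and `f_px=500`, the scale factor is 2, so a canonical inverse depth of 0.5 maps to about 1 m, and the zero entry is clamped to the maximum depth of about 10000 m.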

kang-1-2-3 commented 1 month ago

That makes sense. So W is actually 1536 and $f_{px}$ is the estimated focal length, expressed in pixels along the x direction. Am I right?

KyuhoBae commented 1 month ago

Yes as far as I understand.

Clod98 commented 1 month ago

> That makes sense. So W is actually 1536 and $f_{px}$ is the estimated focal length, expressed in pixels along the x direction. Am I right?

Exactly, the network has two "heads", one for predicting the canonical inverse depth $C$ and another for predicting the focal length $f_{px}$, so it is able to predict absolute depth "in-the-wild" (i.e. for images coming from cameras with unknown intrinsics). If you already have the focal length, though, you can pass it to the infer function, and it will be used in the conversion from canonical inverse depth to absolute depth.
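
If you only know the camera's horizontal field of view rather than its focal length in pixels, the standard pinhole relation $f_{px} = \frac{W/2}{\tan(\mathrm{FOV}/2)}$ converts between the two. A minimal sketch of that conversion follows; the helper name `focal_length_px_from_fov` and the example values are hypothetical, and this thread does not state whether the focal-length head outputs a field of view or pixels directly.

```python
import math


def focal_length_px_from_fov(fov_deg, width_px):
    """Horizontal focal length in pixels from a horizontal field of view.

    Hypothetical helper based on the pinhole camera model.
    """
    return 0.5 * width_px / math.tan(0.5 * math.radians(fov_deg))


# Example: a 90 degree horizontal field of view on a 1536-pixel-wide image.
f_px = focal_length_px_from_fov(fov_deg=90.0, width_px=1536)
```

For a 90 degree field of view the focal length is half the image width, so this gives approximately 768 px, a value that could then be passed as `f_px` to the infer call shown earlier in the thread.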