apple / ml-depth-pro

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second.

Field of view correction - model needs to take FOV as input #21

Closed calledit closed 1 month ago

calledit commented 1 month ago

If you send the field of view to the infer function, it does not get passed into the model; the only thing done with the field of view is that it is multiplied with the "canonical depth map": https://github.com/apple/ml-depth-pro/blob/b2cd0d51daa95e49277a9f642f7fd736b7f9e91d/src/depth_pro/depth_pro.py#L285
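For reference, roughly what that scaling amounts to (a minimal sketch using the standard pinhole relation; the function names and clamp values here are mine, not the library's exact code, see the linked line for the real implementation):

```python
import math

import numpy as np


def fov_to_f_px(fov_deg: float, width_px: int) -> float:
    """Horizontal FOV in degrees -> focal length in pixels (pinhole model)."""
    return 0.5 * width_px / math.tan(math.radians(fov_deg) / 2.0)


def canonical_to_metric_depth(canonical_inverse_depth: np.ndarray,
                              f_px: float,
                              width_px: int) -> np.ndarray:
    """Rescale a canonical inverse-depth map to metric depth using f_px.

    The FOV only enters as this multiplicative factor, applied after the
    network has already produced its canonical prediction.
    """
    inverse_depth = canonical_inverse_depth * (width_px / f_px)
    return 1.0 / np.clip(inverse_depth, 1e-4, 1e4)
```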

This causes issues as the model uses its own field of view to generate the depth map and not the true one.

I can give an example of this going wrong: let's say I have an image with a person standing 2 meters from the camera and a wall 3 meters behind that person. The model "believes" this was captured with a 40 deg FOV, but it was actually captured with a 65 deg FOV.

  1. If I multiply the depth map with the true fx value (the one that corresponds to 65 deg), the wall and the person will be the correct distance from each other, but the person will now have a depth of 5 cm.
  2. If I multiply the depth map with the model's estimated fx value (the one that corresponds to 40 deg), the person will have the correct depth but will be standing too far away from the wall.

There is not really a question here. I guess it's an argument for the fact that the model really needs to take FOV as input.
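For concreteness, here is what the two fx values in the example above look like under the standard pinhole relation (a small numeric sketch assuming Depth Pro's 1536-pixel input width and the post-hoc scaling described above, where recovered metric depth is proportional to the f_px used):

```python
import math

W = 1536  # Depth Pro's fixed input width in pixels

f_px_true = 0.5 * W / math.tan(math.radians(65) / 2)   # ~1206 px (real camera)
f_px_model = 0.5 * W / math.tan(math.radians(40) / 2)  # ~2110 px (model's guess)

# Metric depth recovered from the same canonical prediction scales with the
# focal length used, so the two choices disagree by roughly this factor:
print(f_px_model / f_px_true)  # ~1.75
```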

Amael commented 1 month ago
calledit commented 1 month ago

That is what the code does, indeed.

The issue that I am describing is deeper than that. The issue is that canonical_depth will be wrong due to the fact that the model uses its own "guess" about the FOV to generate canonical_depth.

I think my example described the issue quite well, but I guess it was not detailed enough. The issue stems from the fact that it is mathematically impossible to generate a correct canonical_depth without a correct FOV. If the model "estimates" a wrong FOV (the FOV will be a feature in its own latent space) the resulting canonical_depth will be incorrect.

From my example: the model knows the scale of a human, so it makes sure that the human has the correct depth at the FOV the model believes the image was taken at. A different area where the issue is quite visible (which might be a better example, as it is a simpler phenomenon) is the rotation of objects at the sides of the image. If the FOV of the image is wide but the model thinks the FOV is narrow, the model will rotate those objects incorrectly. The objects are facing the camera, so the model assumes they should be rotated mostly towards the forward optical axis; but since the real FOV is wider, the objects are not actually facing straight along the optical axis, they face the camera from the side, and the model can't know this.

TLDR; it is impossible to generate a correct canonical_depth without a correct FOV, so it would be good if an updated version of this model, or the next version, could take FOV as input.

Vincent630 commented 3 weeks ago

I am very interested in this problem and have done some preliminary research. From the code, it appears that the model predicts the FOV and then calculates the focal length in pixels (f_px), so the accuracy of the predicted FOV should affect the overall accuracy of depth estimation.

I tested Depth Pro on continuous frames and found that using the automatically predicted f_px results in significant depth errors. However, even when I set a fixed f_px (since I know the fixed depth values in certain regions across continuous frames), I still observe large depth prediction errors in these known regions, with discrepancies ranging from 10 to 50 cm.

Do you think this kind of error is normal, or is it that the current state of technology in monocular depth estimation is still far from achieving stable depth consistency across continuous frames? I would appreciate your insights on this matter. Thank you.


calledit commented 3 weeks ago

Giving the model the true FOV would most likely produce a more stable video; however, my guess is that the resulting depth video would still not be stable enough for most use cases in which you want a depth video.

If you wanted stable depth video, the easiest way would probably be to feed some type of output from a previous frame (or a few frames) back in as input to the model, essentially giving the model memory to remember relative depths between things in the image.
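To make that idea concrete, here is a minimal sketch of what such conditioning could look like (illustrative only, with a hypothetical `base_net`; this is not part of Depth Pro's API, and a real solution would also need matching training data):

```python
from typing import Optional

import torch
import torch.nn as nn


class TemporalDepthWrapper(nn.Module):
    """Sketch of the idea only: feed the previous frame's depth prediction
    back in as a fourth input channel, giving the network a simple form of
    memory. `base_net` is any network that accepts a 4-channel image."""

    def __init__(self, base_net: nn.Module):
        super().__init__()
        self.base_net = base_net

    def forward(self, rgb: torch.Tensor,
                prev_depth: Optional[torch.Tensor] = None) -> torch.Tensor:
        if prev_depth is None:
            # First frame: no memory yet, use a neutral all-zero depth channel.
            prev_depth = torch.zeros_like(rgb[:, :1])
        x = torch.cat([rgb, prev_depth], dim=1)  # (B, 4, H, W)
        return self.base_net(x)
```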

JUGGHM commented 3 weeks ago

According to the Metric3D formulation (which is also applied in Depth Pro), the FOV is de facto "fixed" from the model's perspective, as it takes a fixed-size image (1536x1536) as input with a fixed canonical camera focal length. The real-world FOV should be recovered according to the focal length ratio.
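If I understand the formulation being referenced, it amounts to something like the following (a hedged sketch in my own notation; the exact convention in the Metric3D and Depth Pro papers may differ, e.g. in whether depth or inverse depth is scaled):

```latex
% The network predicts depth D_c for the resized 1536x1536 input under a
% fixed canonical focal length f_c; metric depth for the real camera with
% focal length f (in the same pixel units) is then recovered via the
% focal-length ratio.
\[
  D_{\text{metric}}(u,v) \;=\; D_c(u,v)\,\frac{f}{f_c},
  \qquad
  f \;=\; \frac{W/2}{\tan\!\left(\mathrm{FOV}_x/2\right)} .
\]
```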

calledit commented 3 weeks ago

Not exactly sure what you are talking about. Seems like you are referring to something related to the ratio between xFOV and yFOV.

This thread is about an issue with how the model estimates depth, caused by the fact that it does not actually know what FOV the image was taken at.

The model internally estimates a FOV (we know this because it is mathematically impossible to generate an accurate depth map unless this is done). The model uses that internal estimate to calculate depth.

One of the issues that this causes can be described like this: Red is the true xFOV (something like 80 deg in this example image), green is the true position and rotation of a box as seen from above. Blue is the xFOV that the model estimated and internally used to generate the depth map (something like 40 deg in this example image). Pink is the box position and rotation as the model calculated it, based on the model's estimated xFOV.

Yellow is what happens when we rescale the model's output from its estimated xFOV into the true xFOV. It turns the square box into a parallelogram.

FOV_estimation_Explanation

This issue can be partially fixed by rescaling things using trigonometric functions, but doing so introduces artefacts like flat walls becoming concave or convex. The issue essentially can't be properly fixed by post-processing the output of the model.

JUGGHM commented 2 weeks ago

Hi calledit, thank you for your example and the detailed illustration of your point. But there is still one thing I must discuss here, without which I cannot sleep well tonight.

In the canonical transformation modelling, the Z-axis is squeezed or stretched, which means that the green box and the pink box can never both be square at the same time. Besides, their X and Y axes should remain the same. This can be derived directly from the equation Z/X = focal / delta_u: here X and delta_u remain unchanged, while Z and the focal change proportionally.
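Spelling out the relation being used here (a small derivation in my own notation):

```latex
% A point at lateral offset X and depth Z projects to a pixel offset
% \Delta u from the principal point under a pinhole camera:
\[
  \Delta u \;=\; f\,\frac{X}{Z}
  \quad\Longleftrightarrow\quad
  \frac{Z}{X} \;=\; \frac{f}{\Delta u} .
\]
% For a fixed image (fixed \Delta u) and fixed X, assuming a different
% focal length f' forces Z' = Z\,f'/f; back-projecting with that focal
% gives X' = \Delta u\,Z'/f' = X, so only the Z-axis is squeezed or
% stretched while X (and likewise Y) stays the same.
```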

Actually, in your case, the pink box should be much farther away than the green one, and it should be a parallelogram. At the same time, both of them should lie on the same vertical line.

The improper thing in your depiction is that the absolute depths of all three boxes are almost the same, which is not true according to the equation above.

calledit commented 2 weeks ago

You mean like this: FOV_estimation_Explanation_v4

If the model "knows" the size of the box (i.e. if the box is a type of object that the model recognises and knows typical sizes of), it would place it further away; however, there are scenarios where it would not do that. But I think you are right: placing the pink box further away is a better example for describing the issue discussed in this thread.

In the "canonical transformation" everything will always be stretched, yes (but to make sure we are on the same page, I want to say that nothing in my example pictures represents the depth map in its "canonical transformation" form).

The canonical transformation is mostly just pointless to end users since a depth map will always be bound to a specific FOV.

JUGGHM commented 2 weeks ago

This figure is closer but not exactly correct yet. I will show one for better understanding some time later.

JUGGHM commented 2 weeks ago

IMG_1942

Since it is always the Z-axis that is stretched, an object belonging to different FOVs (or focals) should lie at the exact same X position when back-projecting the depth (with whatever focal length) into the 3D world.

calledit commented 2 weeks ago

Right, you fixed what I thought you might fix: the fact that I did not extend the FOV triangle enough.

While that is a good correction, I think your image is missing the most important point (the entire point that I was trying to make): the rotation and the depth of the box in reality.

If the box dimensions are known to the model, let's say it is a postal box with known dimensions of 0.5 x 0.5 x 1.2 m, the model would make sure that the box has the right depth at the network-predicted FOV.

* If I project the model's depth map at the **FOV that the network thinks the image was taken at**, the box would measure 0.5 x 0.5 x 1.2 m.

* If I instead project the depth map at the **real FOV**, the box would instead measure, for example, 0.2 x 0.5 x 1.2 m, which is wrong (i.e. the position of the box would be more accurate, but its scale and rotation would not be).

For that image to illustrate the point of this "issue", you would need to add one more thing: the real rotation and scale of the box. Something like this (the yellow box would be the real box position and scale): corrected

JUGGHM commented 2 weeks ago

IMG_1943

Thank you for your reply and I recap my points here:

calledit commented 2 weeks ago
> • If not, it means that the depth itself is not correctly predicted (even under the wrong FOV perspective).

Now you are starting to touch on the issue at hand: the depth will be incorrectly predicted if the model thinks the FOV of the picture is different from the picture's actual FOV. I was never talking about how 3D projection from depth maps works. I was talking about how the model estimates the depth map, and describing a scenario where the model will generate the wrong depth map, i.e. the scenario where the model assumes an incorrect FOV; in such a scenario the model will always generate incorrect depth maps.

JUGGHM commented 2 weeks ago

Of course a wrong FOV will lead to a wrong final depth (D_f). That's for sure. But actually the output directly predicted by the model is composed of two parts: the canonical depth (D_c) and the focal/FOV (F).

Each part should be regarded quite independently in this system, and both of them contribute to the final depth. In reality, you will never get a perfect (D_c) or (F).

Under an ideal situation you could get a correct (F); but if the network gives you a wrong (F), the final depth (D_f) is wrong. However, if we can somehow get a well-calibrated (F), the final depth will be correct.

In other words, the prediction of (D_c) is independent of the real-world FOV. We can always use an ideal FOV/focal to transform a correct (D_c) back into the real world without distortion.

calledit commented 2 weeks ago

> Of course a wrong FOV will lead to a wrong final depth (D_f). That's for sure. But actually the output directly predicted by the model is composed of two parts: the canonical depth (D_c) and the focal/FOV (F). Each part should be regarded quite independently in this system, and both of them contribute to the final depth. In reality, you will never get a perfect (D_c) or (F).

The model can't generate a correct (D_c) from an image without a correct FOV; this is a mathematical fact. Machine learning models might seem like magic, but they are not: if something is mathematically impossible then it is impossible, and it does not matter whether you try to do it with AI or humans or whatever method you use; it will not happen. So while (D_c) and (F) may look independent in some ways (and the code of depth-pro assumes they are independent), they are definitely not.

> Under an ideal situation you could get a correct (F); but if the network gives you a wrong (F), the final depth (D_f) is wrong. However, if we can somehow get a well-calibrated (F), the final depth will be correct.

We don't know what FOV the model actually uses internally to estimate (D_c), but I think it is safe to assume that it is more or less the same as the FOV the model outputs as (F). From this we know that if the model gives you an incorrect (F), the intermediate (D_c) and therefore the final (D_f) will be wrong, since an incorrect (F) means the model used an incorrect FOV to generate (D_c), and as previously stated, without a correct FOV it is impossible to generate a correct (D_c).

JUGGHM commented 2 weeks ago

> impossible

About the independence of the predicted variables: to some extent, I agree that they are more or less related, because the FOV network applies features from the encoders. However, the overall modelling of predicting both D_f and FOV regards them as different objectives.

But what is the mathematical fact behind "The model can't generate a correct D_c from an image without a correct FOV"? If this refers to manually generating a canonical depth ground-truth label (D_c)_gt, I agree with you, because we must know the real focal first. However, if you are referring to the network prediction, I choose to reserve my opinion. Neural networks can never give perfect outputs. I can neither prove nor disprove your statement. Exposing the focal (FOV) to the network or using the canonical transformation are just two different ways to recover real-world metric depth. I cannot tell which one is better.

Besides, the canonical transformation method strictly depends on pinhole camera perspective projection, which is a solid mathematical basis. I insist that there is no significant defect in such a modelling.

Anyway, this is a fruitful technical discussion. Wish you have a good time.

calledit commented 2 weeks ago

> But what is the mathematical fact behind "The model can't generate a correct D_c from an image without a correct FOV"? If this refers to manually generating a canonical depth ground-truth label (D_c)_gt, I agree with you, because we must know the real focal first. However, if you are referring to the network prediction, I choose to reserve my opinion. Exposing the focal (FOV) to the network or using the canonical transformation are just two different ways to recover real-world metric depth. I cannot tell which one is better.

I can give you an example which shows that you need the FOV to generate depth: NextDocument Scenario 1 and scenario 2 have the exact same image but different (D_c), as the rotation of the boxes on the sides (box 1 & 2) requires two different (D_c) to encode. The images are, as I said, identical, and it is therefore impossible to figure out whether you are looking at scenario 1 or 2; you can therefore not create the required (D_c) by just looking at the image. TLDR: if you don't know the FOV you can't create a correct (D_c) from an image, as you need the FOV to know how objects on the sides of the image are rotated.

> Besides, the canonical transformation method strictly depends on pinhole camera perspective projection, which is a solid mathematical basis. I insist that there is no significant defect in such a modelling.

Right we agree here.

> Anyway, this is a fruitful technical discussion. Wish you have a good time.

Yeah, we are good.

JUGGHM commented 2 weeks ago

IMG_1942

I will use this figure as a counter example, and that...

JUGGHM commented 2 weeks ago

> Right, you fixed what I thought you might fix: the fact that I did not extend the FOV triangle enough.
>
> While that is a good correction, I think your image is missing the most important point (the entire point that I was trying to make): the rotation and the depth of the box in reality.
>
> If the box dimensions are known to the model, let's say it is a postal box with known dimensions of 0.5 x 0.5 x 1.2 m, the model would make sure that the box has the right depth at the network-predicted FOV.
>
> * If I project the model's depth map at the **FOV that the network thinks the image was taken at**, the box would measure 0.5 x 0.5 x 1.2 m.
>
> * If I instead project the depth map at the **real FOV**, the box would instead measure, for example, 0.2 x 0.5 x 1.2 m, which is wrong (i.e. the position of the box would be more accurate, but its scale and rotation would not be).
>
> For that image to illustrate the point of this "issue", you would need to add one more thing: the real rotation and scale of the box. Something like this (the yellow box would be the real box position and scale): corrected

For the case you used here to oppose my example, I would say that it is possible for boxes at different z-distances to look the same, but the matching box is not the yellow one. The example will be presented below.

This is because one plane of the box is not a sufficient clue to determine its exact position. It is not the fault of the unknown FOV; the FOV in the following figure is the same.

calledit commented 2 weeks ago

You sidestepped my point.

Let me clarify: each box is exactly 1 x 1 x 1 m, has a 1 x 1 m QR code on it, and you know the size of both the QR code and the box.

JUGGHM commented 2 weeks ago

IMG_1947

JUGGHM commented 2 weeks ago

> You sidestepped my point.
>
> Let me clarify: each box is exactly 1 x 1 x 1 m, has a 1 x 1 m QR code on it, and you know the size of both the QR code and the box.

For the box in the middle it is trivial: if we can observe one plane only, the depth and the focal are proportional, without any risk of distortion. This indicates that the canonical depth is unique. Next we need to consider the side boxes.

JUGGHM commented 2 weeks ago

IMG_1948

JUGGHM commented 2 weeks ago

OK, this problem is inherently ambiguous, but it is unrelated to scale ambiguity. I found another case with the same image and the same FOV, with boxes at a different depth and rotation. That's not the fault of the canonical transformation.

JUGGHM commented 2 weeks ago

IMG_1949

calledit commented 2 weeks ago

Seems like you are drawing something different. All boxes should always be facing the camera perfectly; that's what the QR codes are for. I.e. you know the boxes are rotated with one face perfectly towards the camera.

JUGGHM commented 2 weeks ago

> Seems like you are drawing something different. All boxes should always be facing the camera perfectly; that's what the QR codes are for. I.e. you know the boxes are rotated with one face perfectly towards the camera.

I think you are right. I did some derivations and proved that there exists some sort of rotated placement that causes ambiguity under different FOVs. Your geometrical instinct is better than mine.

JUGGHM commented 2 weeks ago

IMG_1952

In the special situation where both sides of the equation equal 1, the boxes are parallel to the image plane and the canonical depth is unique.