LiheYoung / Depth-Anything

[CVPR 2024] Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. Foundation Model for Monocular Depth Estimation
https://depth-anything.github.io
Apache License 2.0
6.36k stars · 484 forks

true metric depth values #36

Open · abhishekmonogram opened this issue 5 months ago

abhishekmonogram commented 5 months ago

Hi @LiheYoung,

This is super impressive work. I used the Hugging Face deployment to test out the network. I gave it a sample image from a camera with known intrinsics, and it output a depth map (or rather a disparity map, as the Hugging Face page says). I can see the per-pixel values of the depth/disparity map, but I do not know how to extract per-pixel true metric depth from them. Are the depth maps relative or true metric? If they are true metric, how can I extract per-pixel metric depth?

Abubakar17 commented 5 months ago

Hi @LiheYoung, I have the same query: how do I get metric depth information from the disparity maps? The information under metric_depth covers only evaluation results, not the true depth values. I would really appreciate it if you could share your insights on this.

BTW you guys did a really amazing job here.

loevlie commented 5 months ago

I believe using the "pred" model output from the evaluate.py script https://github.com/LiheYoung/Depth-Anything/blob/5935968f82018d68fff44946573d34cdf27db827/metric_depth/evaluate.py#L80 (assuming you assign the correct focal length in the line above the model output), together with https://github.com/LiheYoung/Depth-Anything/blob/main/metric_depth/zoedepth/utils/geometry.py, should be all you need.

Based on the ZoeDepth training pipeline, the model output is metric depth in units of meters.
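
For what it's worth, here is a minimal, untested sketch of that idea, using the builder/config helpers from the metric_depth folder (the checkpoint path and image name are placeholders):

import torch
from PIL import Image
import torchvision.transforms as transforms
from zoedepth.models.builder import build_model
from zoedepth.utils.config import get_config

# Build the metric-depth (ZoeDepth-style) model and load a local checkpoint.
config = get_config("zoedepth", "eval", "nyu")
config.pretrained_resource = "local::./checkpoints/depth_anything_metric_depth_indoor.pt"  # placeholder path
device = "cuda" if torch.cuda.is_available() else "cpu"
model = build_model(config).to(device).eval()

# Run a single image through the model; the output is per-pixel depth in meters.
image = transforms.ToTensor()(Image.open("example.jpg").convert("RGB")).unsqueeze(0).to(device)
with torch.no_grad():
    pred = model(image, dataset="nyu")
depth = pred.get("metric_depth", pred.get("out")) if isinstance(pred, dict) else pred
print(depth.squeeze().shape, float(depth.min()), "to", float(depth.max()), "meters")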

LiheYoung commented 5 months ago

Hi @abhishekmonogram and @Abubakar17, the demo on Hugging Face only outputs relative depth (disparity), not metric depth. As @loevlie mentioned, if you want to obtain metric depth values, please refer to https://github.com/LiheYoung/Depth-Anything/tree/main/metric_depth, as well as the files @loevlie pointed to.
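
(For completeness: relative disparity can only be turned into metric depth if an unknown scale and shift are recovered from some external reference, e.g. a few pixels with known metric depth. A generic least-squares sketch of that alignment, not part of this repo; disparity, known_depth_m and mask are placeholder arrays:)

import numpy as np

def align_disparity_to_metric(disparity, known_depth_m, mask):
    """Fit disparity ~ s * (1 / depth) + t on pixels where metric depth is known,
    then invert the fit to get metric depth everywhere (assumes enough valid pixels)."""
    d = disparity[mask].ravel()
    inv_z = 1.0 / known_depth_m[mask].ravel()
    A = np.stack([inv_z, np.ones_like(inv_z)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, d, rcond=None)
    return s / np.clip(disparity - t, 1e-6, None)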

abhishekmonogram commented 5 months ago

Thank you @loevlie for providing those resources. The evaluate function still only evaluates on a prepared dataset like NYU, right? Do you know if there is a script that directly runs inference on an arbitrary custom image?

@LiheYoung If Hugging Face outputs only disparity, how do I get the depth map from it? To convert disparity to depth you normally also need a baseline, which does not exist for a monocular camera.

Also, could you comment on the accuracy of the per-pixel true metric depth when you fine-tuned on your own dataset? I read through the table in the paper, but I found those metrics a little confusing to interpret.

loevlie commented 5 months ago

Hey @abhishekmonogram, I got the evaluate function to work on my own dataset mainly by following this article. I might write a script to run inference on a custom image; if I do, I will share it.

1ssb commented 5 months ago
[Screenshot: point-cloud visualisation]

Pretty good visualisation. @LiheYoung, is it possible to confirm that a depth value of, say, 4.54 is in metres and that there are no additional scale factors at play?

shizurumaya commented 5 months ago

I believe it would be extremely helpful to have a function that accepts an image path and a focal length (with the current default value) as inputs, and then generates a depth map with metric values. It can be quite challenging for someone who isn't deeply involved in this specific field to create such a method.

1ssb commented 5 months ago

I will post my code right here; just waiting for the author to confirm the correctness of the scale.


LiheYoung commented 5 months ago

Hi @1ssb, our Depth Anything models primarily focus on relative depth estimation. Thus, the output values from the models published on Hugging Face do not have any metric meaning. However, if you want to obtain metric depth information (in meters), you can use the models introduced here: https://github.com/LiheYoung/Depth-Anything/tree/main/metric_depth, just like @loevlie mentioned.

1ssb commented 5 months ago

Hi @LiheYoung, I am indeed using the metric depth models, and the point cloud I uploaded is from the ZoeDepth-style head. Can you kindly confirm that when these depth values read, for example, 4.35, they are indeed in metres without any need for further analysis/transformation?


LiheYoung commented 5 months ago

Yes, they are indeed in meters.

1ssb commented 5 months ago

OK, here is my code; let me know if you find any glitches. @LiheYoung, you can integrate this file as a commit, with changes, if you would like.

Edit: Revised, updated and simplified code which can handle any output size.

# infer.py
# Code by @1ssb
import argparse
import os
import glob
import torch
import numpy as np
from PIL import Image
import torchvision.transforms as transforms
import open3d as o3d
from tqdm import tqdm
from zoedepth.models.builder import build_model
from zoedepth.utils.config import get_config

# Global settings
FL = 715.0873                       # focal length used when NYU_DATA is True
FY = 256 * 0.6                      # focal length (y) used otherwise; specific to my setup
FX = 256 * 0.6                      # focal length (x) used otherwise; specific to my setup
NYU_DATA = False
FINAL_HEIGHT = 256                  # output resolution of the depth map / point cloud
FINAL_WIDTH = 256
P_x, P_y = 128, 128                 # principal point of the resized output
INPUT_DIR = './my_test/input'
OUTPUT_DIR = './my_test/output'
DATASET = 'nyu'                     # let's not pick a fight with the model's dataloader

def process_images(model):
    if not os.path.exists(OUTPUT_DIR):
        os.makedirs(OUTPUT_DIR)

    image_paths = glob.glob(os.path.join(INPUT_DIR, '*.png')) + glob.glob(os.path.join(INPUT_DIR, '*.jpg'))
    for image_path in tqdm(image_paths, desc="Processing Images"):
        try:
            color_image = Image.open(image_path).convert('RGB')
            original_width, original_height = color_image.size
            image_tensor = transforms.ToTensor()(color_image).unsqueeze(0).to('cuda' if torch.cuda.is_available() else 'cpu')

            pred = model(image_tensor, dataset=DATASET)
            if isinstance(pred, dict):
                pred = pred.get('metric_depth', pred.get('out'))
            elif isinstance(pred, (list, tuple)):
                pred = pred[-1]
            pred = pred.squeeze().detach().cpu().numpy()

            # Resize color image and depth to final size
            resized_color_image = color_image.resize((FINAL_WIDTH, FINAL_HEIGHT), Image.LANCZOS)
            resized_pred = Image.fromarray(pred).resize((FINAL_WIDTH, FINAL_HEIGHT), Image.NEAREST)

            focal_length_x, focal_length_y = (FX, FY) if not NYU_DATA else (FL, FL)
            x, y = np.meshgrid(np.arange(FINAL_WIDTH), np.arange(FINAL_HEIGHT))
            x = (x - P_x) / focal_length_x
            y = (y - P_y) / focal_length_y
            z = np.array(resized_pred)
            points = np.stack((np.multiply(x, z), np.multiply(y, z), z), axis=-1).reshape(-1, 3)
            colors = np.array(resized_color_image).reshape(-1, 3) / 255.0

            pcd = o3d.geometry.PointCloud()
            pcd.points = o3d.utility.Vector3dVector(points)
            pcd.colors = o3d.utility.Vector3dVector(colors)
            o3d.io.write_point_cloud(os.path.join(OUTPUT_DIR, os.path.splitext(os.path.basename(image_path))[0] + ".ply"), pcd)
        except Exception as e:
            print(f"Error processing {image_path}: {e}")

def main(model_name, pretrained_resource):
    config = get_config(model_name, "eval", DATASET)
    config.pretrained_resource = pretrained_resource
    model = build_model(config).to('cuda' if torch.cuda.is_available() else 'cpu')
    model.eval()
    process_images(model)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--model", type=str, default='zoedepth', help="Name of the model to test")
    parser.add_argument("-p", "--pretrained_resource", type=str, default='local::./checkpoints/depth_anything_metric_depth_indoor.pt', help="Pretrained resource to use for fetching weights.")

    args = parser.parse_args()
    main(args.model, args.pretrained_resource)

yyvhang commented 5 months ago

OK, here is my code; let me know if you find any glitches. @LiheYoung, you can integrate this file as a commit, with changes, if you would like.

# infer.py
# Code by @1ssb

import argparse
from tqdm import tqdm
import os, glob, torch
from PIL import Image
import torchvision.transforms as transforms
import numpy as np
import open3d as o3d
from zoedepth.models.builder import build_model
from zoedepth.utils.config import get_config

# Focal length settings
FL = 715.0873  # Default focal length, used if NYU_DATA is False
FY = 234.72  # Focal length in Y-axis
FX = 307.2   # Focal length in X-axis
NYU_DATA = False  # Flag to indicate if NYU data-specific settings are used

def infer(model, image, dataset):
    """
    Performs model inference on a single image.

    Args:
        model (torch.nn.Module): The depth estimation model.
        image (torch.Tensor): The input image tensor.
        dataset (str): The name of the dataset being used.

    Returns:
        torch.Tensor: Predicted depth map.
    """
    pred = model(image, dataset=dataset)
    return pred

def get_depth_from_prediction(pred):
    """
    Extracts the depth map from model prediction.

    Args:
        pred (torch.Tensor | list | tuple | dict): Model prediction.

    Returns:
        torch.Tensor: Extracted depth map.
    """
    if isinstance(pred, torch.Tensor):
        return pred
    elif isinstance(pred, (list, tuple)):
        return pred[-1]
    elif isinstance(pred, dict):
        return pred.get('metric_depth', pred.get('out'))
    else:
        raise TypeError(f"Unknown output type {type(pred)}")

def depth_to_point_cloud(depth, color_image):
    """
    Converts a depth map and a color image to a 3D point cloud.

    Args:
        depth (numpy.ndarray): The depth map.
        color_image (PIL.Image): The color image.

    Returns:
        tuple: Tuple containing points and colors for the point cloud.
    """
    height, width = depth.shape
    color_image = color_image.resize((width, height))
    focal_length_x, focal_length_y = (FL, FL) if NYU_DATA else (FX, FY)

    x, y = np.meshgrid(np.arange(width), np.arange(height))
    x = (x - width / 2) / focal_length_x
    y = (y - height / 2) / focal_length_y

    z = depth
    x = np.multiply(x, z)
    y = np.multiply(y, z)

    points = np.stack((x, y, z), axis=-1).reshape(-1, 3)
    colors = np.array(color_image).reshape(-1, 3) / 255.0

    return points, colors

def process_image(model, image_path, output_dir, dataset):
    """
    Processes a single image, performs depth estimation, and saves the resulting point cloud.

    Args:
        model (torch.nn.Module): The depth estimation model.
        image_path (str): Path to the image file.
        output_dir (str): Directory to save the point cloud.
        dataset (str): The name of the dataset being used.
    """
    color_image = Image.open(image_path).convert('RGB')
    image_tensor = transforms.ToTensor()(color_image).unsqueeze(0).to('cuda' if torch.cuda.is_available() else 'cpu')

    pred_dict = infer(model, image_tensor, dataset)
    pred = get_depth_from_prediction(pred_dict).squeeze().detach().cpu().numpy()

    points, colors = depth_to_point_cloud(pred, color_image)
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    pcd.colors = o3d.utility.Vector3dVector(colors)

    min_depth, max_depth = np.min(pred[pred > 0]), np.max(pred)
    print(f"Processed {image_path}: Min Depth: {min_depth}, Max Depth: {max_depth}")

    output_filename = os.path.join(output_dir, os.path.splitext(os.path.basename(image_path))[0] + ".ply")
    o3d.io.write_point_cloud(output_filename, pcd)

def main(config, input_dir, output_dir, dataset):
    """
    Main function to process all images in a directory.

    Args:
        config (dict): Configuration for the model.
        input_dir (str): Directory containing input images.
        output_dir (str): Directory to save point clouds.
        dataset (str): The name of the dataset being used.
    """
    model = build_model(config).to('cuda' if torch.cuda.is_available() else 'cpu')
    model.eval()

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    image_paths = glob.glob(os.path.join(input_dir, '*.png')) + glob.glob(os.path.join(input_dir, '*.jpg'))
    if not image_paths:
        print("No images found in the input directory.")
        return

    for image_path in tqdm(image_paths, desc="Processing Images"):
        try:
            process_image(model, image_path, output_dir, dataset)
        except Exception as e:
            print(f"Error processing {image_path}: {e}")

def test_model(model_name, pretrained_resource, input_dir, output_dir, dataset):
    """
    Tests a model with given parameters.

    Args:
        model_name (str): The name of the model.
        pretrained_resource (str): Path to pretrained model weights.
        input_dir (str): Directory containing input images.
        output_dir (str): Directory to save point clouds.
        dataset (str): The name of the dataset being used.
    """
    config = get_config(model_name, "eval", dataset)
    if pretrained_resource:
        config.pretrained_resource = pretrained_resource
    main(config, input_dir, output_dir, dataset)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Depth estimation and point cloud generation script.")
    parser.add_argument("-m", "--model", type=str, default='zoedepth', help="Name of the model to test")
    parser.add_argument("-p", "--pretrained_resource", type=str, default='local::./checkpoints/depth_anything_metric_depth_indoor.pt', help="Pretrained resource to use for fetching weights.")
    parser.add_argument("-d", "--dataset", type=str, default='nyu', help="Dataset to evaluate on")
    parser.add_argument("-i", "--input_dir", type=str, default='./my_test/input', help="Input directory containing images")
    parser.add_argument("-o", "--output_dir", type=str, default='./my_test/output', help="Output directory for point clouds")
    args = parser.parse_args()

    test_model(args.model, args.pretrained_resource, args.input_dir, args.output_dir, args.dataset)

Hi @1ssb, thanks for the code. I found that whatever the original size of the image is, the output depth has the shape (392, 518). Is there a way to obtain depth corresponding to the original image size? Does interpolation need to be performed here?

1ssb commented 5 months ago

Hi @yyvhang, please check the updated code.
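
(For reference, the interpolation back to the original resolution can also be done directly on the tensor; a minimal sketch, assuming pred is the model's 1x1xHxW depth tensor and original_height/original_width come from the input image:)

import torch.nn.functional as F

# Bilinearly resize the predicted depth back to the input resolution; the metric values are unchanged.
pred_full = F.interpolate(pred, size=(original_height, original_width),
                          mode="bilinear", align_corners=False)
depth_full = pred_full.squeeze().detach().cpu().numpy()  # HxW, in meters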

yyvhang commented 5 months ago

Hi @yyvhang, please check the updated code.

Thanks!

DiTo97 commented 5 months ago

I believe using the "pred" model output from the evaluate.py script

https://github.com/LiheYoung/Depth-Anything/blob/5935968f82018d68fff44946573d34cdf27db827/metric_depth/evaluate.py#L80

(assuming you assign the correct focal length in the line above the model output), together with https://github.com/LiheYoung/Depth-Anything/blob/main/metric_depth/zoedepth/utils/geometry.py, should be all you need. Based on the ZoeDepth training pipeline, the model output is metric depth in units of meters.

Are you sure that the focal argument (focal=focal) is necessary or even does anything? I just cannot see it being used in any forward method of the metric depth models, but only in the eval dataloader, @LiheYoung, @loevlie

Also, @LiheYoung, how much more accurate is inference with flip augmentation, which is the default in the evaluation script?

DiTo97 commented 5 months ago

[Quoting @1ssb's revised infer.py script above.]

@1ssb, in the original version of the code snippet you had different values for the focal lengths on the x and y axes, while in the revised version they are both set equal to the final image size times a scaling factor (fy = 256 * 0.6). Why change it, and why fix the scaling factor? Also, how could we get the focal-adjusted metric depth map instead of the focal-adjusted point cloud?

LiheYoung commented 5 months ago

Hi @1ssb, thank you so much for contributing this script! Would you mind making a pull request? You can put the file in our metric_depth folder and maybe name it depth_to_pointcloud.py? I will merge it into our main branch ASAP.

1ssb commented 5 months ago

Yeah sure, sending a pull request soon!


1ssb commented 5 months ago

Hi @DiTo97, "why change it, and why fix the scaling factor (fy = 256 * 0.6)?" This is specific to my application; don't worry about it.

"How could we get the focal-adjusted metric depth map instead of the focal-adjusted point cloud?" Instead of building the point cloud, normalise the depth map of the RGBD and multiply by 255 to get a colormap; use the script in the inference file to do this (a minimal sketch is below).

DiTo97 commented 5 months ago

Hi @DiTo97, "why change it, and why fix the scaling factor (fy = 256 * 0.6)?" This is specific to my application; don't worry about it.

"How could we get the focal-adjusted metric depth map instead of the focal-adjusted point cloud?" Instead of building the point cloud, normalise the depth map of the RGBD and multiply by 255 to get a colormap; use the script in the inference file to do this.


good to know, @1ssb.

As for the second question, I meant whether it is necessary to adjust the focal length of the generated metric depth map or not. I see you resizing the depth map, and consequently re-scaling it by the desired focal length, before projecting to a point cloud. Maybe it's related to your specific use case that you were mentioning, e.g., are the specific focal lengths you put in for x and y the focal lengths from the RGB camera intrinsics?

To sum up, if I provide the model with some RGB image, and want the generated metric depth map to have the same resolution, by interpolating afterwards, should I just do the interpolation or also re-scale the depth map values by the resolution change? In general, even if I didn't change the depth map resolution, could I use and trust those metric values as they are generated, or should I do some focal length re-scaling depending on the RGB camera device I am using?

1ssb commented 5 months ago

@DiTo97 Yes, you are right, and yes, you can trust it anywhere. An RGBD image is one where every pixel has a depth. The resizing needs an interpolation, which I have already done for you in my script, so don't worry about it; just update the global values appropriately. You simply need to convert the per-pixel depth to depth in 3D (viewing from a point), hence the transformation.
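
To make the resizing point concrete, here is the standard pinhole bookkeeping (not repo code; the 640x480 resolution and principal point below are made-up example values): the per-pixel metric depth values stay the same when the map is resized, and only the intrinsics used for unprojection are scaled.

def scale_intrinsics(fx, fy, cx, cy, old_size, new_size):
    """Scale pinhole intrinsics when the image/depth map is resized from old_size to new_size
    (both given as (width, height)). The metric depth values themselves are left untouched."""
    (w0, h0), (w1, h1) = old_size, new_size
    sx, sy = w1 / w0, h1 / h0
    return fx * sx, fy * sy, cx * sx, cy * sy

# Example: the intrinsics mentioned earlier in this thread, resized to the script's 256x256 output.
print(scale_intrinsics(838.14, 822.60, 320.0, 240.0, (640, 480), (256, 256)))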

zhongqiu1245 commented 5 months ago

@1ssb Thank you for your contribution! May I ask whether the weight (xxx.pt) in your script was fine-tuned on your own dataset? Could I load the original Depth Anything weight (xxxxx_vits14.pth)?

1ssb commented 5 months ago

Hi @zhongqiu1245, I have not checked any other pretrained resource. Remember that this model follows the ZoeDepth style, not the relative-scale setup, so the transformer weight vits.pth probably would not work: the relative model's last step is a layer norm (which in all probability gives you an unsqueezed, activated output) and is generally followed by a normalisation somewhere. Sorry for the long sentences.

SilenceGoo commented 5 months ago

thanks @1ssb, great contribution. I tried your code and successfully got a .ply file, but how can I open it? Windows 10's default viewer doesn't work with it.

1ssb commented 5 months ago

I am glad it works; make sure the file is not corrupted. I generally use MeshLab and it works well.


SilenceGoo commented 5 months ago

thanks @1ssb again, my bad; CloudCompare did the trick.

zhongqiu1245 commented 5 months ago

@1ssb thank you!

abhishekmonogram commented 5 months ago

@1ssb I took 4 images, keeping the object at different distances from the camera (40 cm, 50 cm, 60 cm and 70 cm). I have the camera intrinsics (FY = 822.59804231766066, FX = 838.14270160166848). When I test this with the NYU indoor checkpoint and the code you provided, the metric distances in the point cloud do not match the real distances well. Any thoughts on what could be going wrong?

[Images: 40cm.jpg, 50cm.jpg, 60cm.jpg, 70cm.jpg]

1ssb commented 5 months ago

For NYU, make sure you set the NYU_DATA flag to True so that the correct focal length, which I have provided directly in the script, is used. If you use your own fx and fy values, you will get a different distance from the origin, as expected.


LiheYoung commented 5 months ago

Hi @1ssb, Thank you a lot for your tremendous efforts!

DmitryPor commented 4 months ago

Can this be used to obtain depth or point cloud from a spherical panorama?

1ssb commented 4 months ago

@DmitryPor could you clarify what you mean by spherical panorama?

DmitryPor commented 4 months ago

@DmitryPor could you clarify what you mean by spherical panorama?

An equirectangular 360-degree image, for example (like the ZoeDepth demo: https://huggingface.co/spaces/shariqfarooq/ZoeDepth).

Yajiang commented 4 months ago

@DmitryPor could you clarify what you mean by spherical panorama?

An equirectangular 360-degree image, for example (like the ZoeDepth demo: https://huggingface.co/spaces/shariqfarooq/ZoeDepth).

I guess not. A spherical panorama is very different from a perspective image. You could split the spherical panorama into cube faces and then run this model on each face (a rough sketch of that splitting is below).
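
A rough, self-contained sketch of that splitting (gnomonic projection with nearest-neighbour sampling and 90-degree faces; 'pano.jpg' is a placeholder file, and a dedicated library such as py360convert would be a more robust choice in practice):

import numpy as np
from PIL import Image

def equirect_to_cube_face(pano, yaw_deg, pitch_deg=0.0, face_size=512, fov_deg=90.0):
    """Render one perspective (cube-style) face from an equirectangular panorama (HxWx3 array)."""
    H, W = pano.shape[:2]
    f = 0.5 * face_size / np.tan(np.radians(fov_deg) / 2)   # pinhole focal length of the face
    u, v = np.meshgrid(np.arange(face_size), np.arange(face_size))
    x = (u - face_size / 2) / f
    y = (v - face_size / 2) / f
    dirs = np.stack([x, y, np.ones_like(x)], axis=-1)       # camera rays, z forward, y down
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)], [0, 1, 0], [-np.sin(yaw), 0, np.cos(yaw)]])
    Rx = np.array([[1, 0, 0], [0, np.cos(pitch), -np.sin(pitch)], [0, np.sin(pitch), np.cos(pitch)]])
    d = dirs @ (Ry @ Rx).T

    lon = np.arctan2(d[..., 0], d[..., 2])                  # [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1, 1))              # [-pi/2, pi/2]
    px = ((lon / (2 * np.pi) + 0.5) * (W - 1)).astype(int)
    py = ((lat / np.pi + 0.5) * (H - 1)).astype(int)
    return pano[py, px]

pano = np.array(Image.open("pano.jpg").convert("RGB"))
faces = [equirect_to_cube_face(pano, yaw) for yaw in (0, 90, 180, 270)]  # four horizontal faces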

hgolestaniii commented 4 months ago

For NYU, make sure you set the NYU_DATA flag to True so that the correct focal length, which I have provided directly in the script, is used. If you use your own fx and fy values, you will get a different distance from the origin, as expected.

Hi @1ssb, @LiheYoung,

Thanks for your active contribution.

I am not sure I understand your comment completely. As you know, when a metric network is trained on a specific dataset with a specific focal length and resolution, you may not get correct metric depth values by running it on an image with a different focal length and resolution. I looked into your code and found that the focal length does not change the depth values (z); it changes only x and y. It would be interesting to know how to compensate for a different focal length, or a different image resolution.

I did some experiments and found that if you use the "outdoor pre-trained metric network" and feed it a KITTI image at full resolution (e.g., 1216x352), you get correct depth values. However, when you crop the image to something like 512x288 (no resize, only crop) and run the same network, you get wrong depth data. This means the field of view (image resolution) is important. The same is also true if you rescale KITTI images and feed them to the metric model; as you know, by resizing an image you effectively modify the focal length. I elaborated a bit more here: https://github.com/LiheYoung/Depth-Anything/issues/85

I guess it's still unclear how to use the pre-trained "metric" networks on an arbitrary image.
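
(One commonly used heuristic for the focal-length part, which is only an approximation and not something this repo implements or endorses: assume the model has implicitly baked in a training focal length and rescale the predicted depth proportionally.)

def rescale_depth_for_focal(depth_pred, f_actual, f_train):
    """Heuristic compensation: under a pinhole model the same apparent object size corresponds to
    a depth proportional to the focal length, so scale the prediction by f_actual / f_train.
    This is an assumption, not an exact correction, and it does not address field-of-view changes."""
    return depth_pred * (f_actual / f_train)

# e.g. depth_corrected = rescale_depth_for_focal(depth_pred, f_actual=838.14, f_train=715.0873)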

hgolestaniii commented 4 months ago

@1ssb I took 4 images, keeping the object at different distances from the camera (40 cm, 50 cm, 60 cm and 70 cm). I have the camera intrinsics (FY = 822.59804231766066, FX = 838.14270160166848). When I test this with the NYU indoor checkpoint and the code you provided, the metric distances in the point cloud do not match the real distances well. Any thoughts on what could be going wrong?

[Images: 40cm.jpg, 50cm.jpg, 60cm.jpg, 70cm.jpg]

Hi @abhishekmonogram,

Did you manage to get correct depth values (i.e., 40 cm, 50 cm, 60 cm, and 70 cm) from the indoor metric models? If so, could you please share how?

1ssb commented 4 months ago

Hi, let me get in on this, because it seems important to clarify a few things:

  1. My script only converts the distance map to a depth map in 3D. So if the principal-point depth, i.e. the depth where the optical axis passes through the scene (no transformations applied), is not accurate, there is nothing my script can help with. You can easily verify this by checking the depth at the midpoint index (a one-line check is sketched below).

  2. The depths are, by extension, not going to be perfect. That said, the metric representation should be correct provided the image belongs to the specific context the model was trained for, i.e. indoor/outdoor. There will be some error, because at the end of the day the model is still learning a relative understanding aggregated over the metric representation.

  3. Understand that deep learning models learn the distribution of their data. If the data did not cover certain cases, the model will probably not handle them: it obviously has not been trained on panoramas, so it will not give you good depth for those. Make sure you infer over the same domain of data the model was fed.

  4. Finally, a perfect metric depth estimator might well differentiate 10 cm of difference, but that is also a bit hit and miss. For metric depth, by definition, the loss function is not responsible for giving you centimetre-accurate values; it gives you metre-accurate values. Nothing stops the model from learning finer accuracy, but there is always the mean aggregation over scenes, which prevents overfitting.
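
For point 1, the check is a one-liner (assuming pred is the HxW metric depth map in metres):

h, w = pred.shape
print("depth at the image centre (approx. principal point):", pred[h // 2, w // 2], "m")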

hgolestaniii commented 4 months ago

[Quoting @1ssb's four points above.]

Hi @1ssb, @LiheYoung,

Many thanks for your reply.

It's understandable that the network is trained on full-resolution KITTI and is expected to deliver good results on that specific set. However, what's the point of training a depth estimation network if not to use it on the pictures you get from your own camera?

Let me give you an example. I tried the metric Depth Anything model with the pre-trained outdoor (KITTI) weights in three different settings, on the KITTI evaluation set of 1000 images (you can download it from the official KITTI website: https://s3.eu-central-1.amazonaws.com/avg-kitti/data_depth_selection.zip):

1. Applied the network to full-resolution images (1216x352) and compared the output with the ground truth: RMSE ≈ 2 meters.
2. Cropped the center of the images to 512x352, applied the network to the cropped images, and compared the output with the correspondingly cropped ground truth: RMSE ≈ 4 meters.
3. Cropped the center of the images to 512x288, applied the network to the cropped images, and compared the output with the correspondingly cropped ground truth: RMSE ≈ 6 meters.

All of the tested images are KITTI content with the same focal length (no resizing); only the field of view changes (by cropping). If I used different content, I would probably get much worse results.

The question is still valid: is it possible to use this network on an arbitrary image? If so, how do you compensate for a different focal length and field of view?

I have trained a network myself (based on MobileNet-V2), so I know the details of my training. I also worked out how to adapt for a focal-length difference (in my network, a resize is required before feeding the image in) and for a different field of view. If you give me an arbitrary image, I know how to get a correct depth map out of my network. I want to know how to do the same with Depth Anything.
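
(For reference, such RMSE comparisons are typically computed only on pixels with valid ground truth, since KITTI depth ground truth is sparse; a generic sketch:)

import numpy as np

def masked_rmse(pred_depth, gt_depth):
    """RMSE between predicted and ground-truth depth maps, evaluated only where the ground
    truth is valid (zero in the KITTI maps means no measurement)."""
    mask = gt_depth > 0
    return float(np.sqrt(np.mean((pred_depth[mask] - gt_depth[mask]) ** 2)))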

1ssb commented 4 months ago

@hgolestaniii congratulations, then you have made a distinct scientific finding, which you should publish, making sure you can claim it over both indoor and outdoor data before proclaiming that you do better than this model. Also make sure you are using the current setup correctly and appropriately. Looking forward to your work.

My two cents: give this work a bit of space, and think about adapting your input data into a canonical form close to the kind of input this model is good at. I will leave the rest for @LiheYoung to guide you on.

hgolestaniii commented 4 months ago

@1ssb thanks. My network is designed to be lightweight for edge devices. I can't say it performs better than Depth Anything, but my metric results are not far off. At least with my network it's clear how to use it on an image with a different resolution and focal length than KITTI. I guess someone who knows the details of this algorithm can tell us how to get Depth Anything working on an arbitrary image. Probably @LiheYoung can help at this step.

MalekWahidi commented 4 months ago

[Quoting @1ssb's four points and @hgolestaniii's reply above.]

You may find this paper interesting.

Abstract: "It has long been an ill-posed problem to predict absolute depth maps from single images in real (unseen) indoor scenes. We observe that it is essentially due to not only the scale-ambiguous problem but also the focal-ambiguous problem that decreases the generalization ability of monocular depth estimation. That is, images may be captured by cameras of different focal lengths in scenes of different scales. In this paper, we develop a focal-and-scale depth estimation model to well learn absolute depth maps from single images in unseen indoor scenes."

1ssb commented 4 months ago

Hi @MalekWahidi, as far as I know this is still very much an unsolved problem and too experimental to deploy. Thanks for the paper!

MalekWahidi commented 4 months ago

Hi @MalekWahidi, as far as I know this is still very much an unsolved problem and too experimental to deploy. Thanks for the paper!

Actually, I think Google recently made significant progress on this front with their new DMD model. But it's such a shame they didn't publish any code for this one. I guess some crossover between what they did in this paper and Depth Anything would make a really promising approach for generalizable monocular metric depth estimation.

1ssb commented 4 months ago

@MalekWahidi indeed, I have come across this paper, but the sheer amount of training and the 33% improvement over SOTA benchmarks seem hard to reconcile. In terms of deployability, a simple test is to take a stereo pair, run inference on both views, transform the point clouds and combine them, and see whether they are 3D-consistent; repeat this for any camera of your choice. Most methods, including Depth Anything, fail at this. It's a much fairer metric than predicting exact depths, because frankly, if depths are really that important, any practical person would use an RGBD camera.

MalekWahidi commented 4 months ago

@MalekWahidi indeed, I have come across this paper, but the sheer amount of training and the 33% improvement over SOTA benchmarks seem hard to reconcile. In terms of deployability, a simple test is to take a stereo pair, run inference on both views, transform the point clouds and combine them, and see whether they are 3D-consistent; repeat this for any camera of your choice. Most methods, including Depth Anything, fail at this. It's a much fairer metric than predicting exact depths, because frankly, if depths are really that important, any practical person would use an RGBD camera.

I didn't quite understand your first statement. And regarding point cloud/metric depth consistency across multiple inferences, I guess some form of temporal smoothing should work? Current RGBD cams are notoriously noisy as well.

1ssb commented 4 months ago

@MalekWahidi sorry, my bad. True, RGBD cameras are noisy, but calibration is far more straightforward, and so is outlier culling. These models, on the other hand, are far less trustworthy.

I am not sure what you mean by temporal smoothing. The idea is to match the overlapping points well enough that the 3D Procrustes error goes to zero (a generic sketch is below).
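
For concreteness, a generic sketch of that consistency check (plain NumPy, assuming you already have N corresponding 3D points P and Q from the two inferences, e.g. via known stereo extrinsics or feature matches):

import numpy as np

def rigid_alignment_rmse(P, Q):
    """Kabsch/Procrustes-style check: find the rigid transform that best aligns corresponding
    Nx3 point sets P and Q, and return the residual RMSE. A consistent pair of metric point
    clouds from the two views should give a residual close to zero."""
    Pc, Qc = P - P.mean(0), Q - Q.mean(0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ D @ U.T
    t = Q.mean(0) - R @ P.mean(0)
    residual = (P @ R.T + t) - Q
    return float(np.sqrt((residual ** 2).sum(axis=1).mean()))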

elevenjiang1 commented 3 months ago

I have run the code in depth_to_pointcloud.py, but I found that the estimated depth is off from the real depth by a factor of about 2; after scaling the estimated depth by 2, the RMSE(real_depth, 2 * estimated_depth) is around 150 mm, which matches the result in the paper. By the way, I would like to know whether depth_to_pointcloud can be deployed with OpenVINO?

1ssb commented 3 months ago

Hmm "but I found that the real depth is 2 times than real image, after estimate_depth2": this seems like a systemic problem and should not be happening if it's an indoor scene. Correct for the focal plane values. And compare the principal point depth with the real depth first.

Also, can you post the RGBD here?

elevenjiang1 commented 3 months ago

Hmm "but I found that the real depth is 2 times than real image, after estimate_depth2": this seems like a systemic problem and should not be happening if it's an indoor scene. Correct for the focal plane values. And compare the principal point depth with the real depth first.

Also, can you post the RGBD here

[Images: color_0.png, depth_0.png]

Thanks a lot! Here is my code, based on depth_to_pointcloud.py; only the process_images() function is changed.

import png  # pypng; extra import needed for the 16-bit PNG writer below

def process_images(model):
    # Everything else (DATASET, Image, torch, transforms, np) comes from depth_to_pointcloud.py.
    color_image = Image.open("color_0.png").convert('RGB')
    original_width, original_height = color_image.size
    image_tensor = transforms.ToTensor()(color_image).unsqueeze(0).to('cuda' if torch.cuda.is_available() else 'cpu')

    pred = model(image_tensor, dataset=DATASET)
    if isinstance(pred, dict):
        pred = pred.get('metric_depth', pred.get('out'))
    elif isinstance(pred, (list, tuple)):
        pred = pred[-1]
    pred = pred.squeeze().detach().cpu().numpy()
    resized_pred = Image.fromarray(pred).resize((original_width, original_height), Image.NEAREST)

    # Convert metres to millimetres and store as unsigned 16-bit (int16 would clip beyond ~32 m).
    np_resized_pred = np.array(resized_pred)
    np_resized_pred = (np_resized_pred * 1000).astype(np.uint16)

    # Save as a 16-bit grayscale PNG (depth in millimetres).
    with open("depth_0.png", 'wb') as f:
        writer = png.Writer(width=np_resized_pred.shape[1], height=np_resized_pred.shape[0],
                            bitdepth=16, greyscale=True)
        zgray2list = np_resized_pred.tolist()
        writer.write(f, zgray2list)