LiheYoung / Depth-Anything

[CVPR 2024] Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. Foundation Model for Monocular Depth Estimation
https://depth-anything.github.io
Apache License 2.0

Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

[**Lihe Yang**](https://liheyoung.github.io/)<sup>1</sup> · [**Bingyi Kang**](https://scholar.google.com/citations?user=NmHgX-wAAAAJ)<sup>2†</sup> · [**Zilong Huang**](http://speedinghzl.github.io/)<sup>2</sup> · [**Xiaogang Xu**](https://xiaogang00.github.io/)<sup>3,4</sup> · [**Jiashi Feng**](https://sites.google.com/site/jshfeng/)<sup>2</sup> · [**Hengshuang Zhao**](https://hszhao.github.io/)<sup>1*</sup>

<sup>1</sup>HKU · <sup>2</sup>TikTok · <sup>3</sup>CUHK · <sup>4</sup>ZJU

<sup>†</sup>project lead · <sup>*</sup>corresponding author

**CVPR 2024**

This work presents Depth Anything, a highly practical solution for robust monocular depth estimation by training on a combination of 1.5M labeled images and 62M+ unlabeled images.

*(teaser figure)*

Try our latest Depth Anything V2 models!

News

Features of Depth Anything

If you need other features, please first check the existing community support listed below.

Performance

Here we compare our Depth Anything with the previous best MiDaS v3.1 BEiT<sub>L-512</sub> model.

Please note that the latest MiDaS is also trained on KITTI and NYUv2, while our models are not.

| Method | Params | KITTI<br>AbsRel / $\delta_1$ | NYUv2<br>AbsRel / $\delta_1$ | Sintel<br>AbsRel / $\delta_1$ | DDAD<br>AbsRel / $\delta_1$ | ETH3D<br>AbsRel / $\delta_1$ | DIODE<br>AbsRel / $\delta_1$ |
|:-|:-|:-:|:-:|:-:|:-:|:-:|:-:|
| MiDaS | 345.0M | 0.127 / 0.850 | 0.048 / *0.980* | 0.587 / 0.699 | 0.251 / 0.766 | 0.139 / 0.867 | 0.075 / 0.942 |
| Ours-S | 24.8M | *0.080* / 0.936 | 0.053 / 0.972 | 0.464 / 0.739 | 0.247 / 0.768 | *0.127* / **0.885** | 0.076 / 0.939 |
| Ours-B | 97.5M | *0.080* / *0.939* | *0.046* / 0.979 | **0.432** / *0.756* | *0.232* / *0.786* | **0.126** / *0.884* | *0.069* / *0.946* |
| Ours-L | 335.3M | **0.076** / **0.947** | **0.043** / **0.981** | *0.458* / **0.760** | **0.230** / **0.789** | *0.127* / 0.882 | **0.066** / **0.952** |

We highlight the best and second best results in **bold** and *italic* respectively (better results: AbsRel $\downarrow$, $\delta_1 \uparrow$).

Pre-trained models

We provide three models of varying scales for robust relative depth estimation:

| Model | Params | Inference time on V100 (ms) | A100 (ms) | RTX4090, TensorRT (ms) |
|:-|:-|:-:|:-:|:-:|
| Depth-Anything-Small | 24.8M | 12 | 8 | 3 |
| Depth-Anything-Base | 97.5M | 13 | 9 | 6 |
| Depth-Anything-Large | 335.3M | 20 | 13 | 12 |

Note that the V100 and A100 inference times (without TensorRT) exclude the pre-processing and post-processing stages, whereas the RTX4090 times (with TensorRT) include both stages (please refer to Depth-Anything-TensorRT).

You can easily load our pre-trained models by:

```python
from depth_anything.dpt import DepthAnything

encoder = 'vits'  # can also be 'vitb' or 'vitl'
depth_anything = DepthAnything.from_pretrained('LiheYoung/depth_anything_{:}14'.format(encoder))
```
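If you want to roughly reproduce a model-only timing like the V100/A100 numbers above (i.e., excluding pre- and post-processing), the sketch below measures only the forward pass. The 518×518 input size and the warm-up/iteration counts are our own choices, not the exact protocol used for the table.

```python
# Rough model-only timing sketch (assumptions: 518x518 input, 20 warm-up runs, 100 timed runs).
import torch
from depth_anything.dpt import DepthAnything

encoder = 'vits'  # can also be 'vitb' or 'vitl'
model = DepthAnything.from_pretrained('LiheYoung/depth_anything_{:}14'.format(encoder)).cuda().eval()

x = torch.randn(1, 3, 518, 518).cuda()  # dummy input; height and width must be multiples of 14

with torch.no_grad():
    for _ in range(20):  # warm-up
        model(x)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(100):
        model(x)
    end.record()
    torch.cuda.synchronize()

print('average forward time: {:.1f} ms'.format(start.elapsed_time(end) / 100))
```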

Depth Anything is also supported in transformers. You can use it for depth prediction within 3 lines of code (credit to @niels).

No network connection, cannot load these models?

Solutions:

- First, manually download the three checkpoints: [depth-anything-large](https://huggingface.co/spaces/LiheYoung/Depth-Anything/blob/main/checkpoints/depth_anything_vitl14.pth), [depth-anything-base](https://huggingface.co/spaces/LiheYoung/Depth-Anything/blob/main/checkpoints/depth_anything_vitb14.pth), and [depth-anything-small](https://huggingface.co/spaces/LiheYoung/Depth-Anything/blob/main/checkpoints/depth_anything_vits14.pth).
- Second, upload the folder containing the checkpoints to your remote server.
- Lastly, load the model locally:

```python
import torch
from depth_anything.dpt import DepthAnything

model_configs = {
    'vitl': {'encoder': 'vitl', 'features': 256, 'out_channels': [256, 512, 1024, 1024]},
    'vitb': {'encoder': 'vitb', 'features': 128, 'out_channels': [96, 192, 384, 768]},
    'vits': {'encoder': 'vits', 'features': 64, 'out_channels': [48, 96, 192, 384]}
}

encoder = 'vitl'  # or 'vitb', 'vits'
depth_anything = DepthAnything(model_configs[encoder])
depth_anything.load_state_dict(torch.load(f'./checkpoints/depth_anything_{encoder}14.pth'))
```

Note that when loading locally this way, you do not need to install the ``huggingface_hub`` package. You can then also delete this [line](https://github.com/LiheYoung/Depth-Anything/blob/e7ef4b4b7a0afd8a05ce9564f04c1e5b68268516/depth_anything/dpt.py#L5) and the ``PyTorchModelHubMixin`` in this [line](https://github.com/LiheYoung/Depth-Anything/blob/e7ef4b4b7a0afd8a05ce9564f04c1e5b68268516/depth_anything/dpt.py#L169).

Usage

Installation

```bash
git clone https://github.com/LiheYoung/Depth-Anything
cd Depth-Anything
pip install -r requirements.txt
```

Running

```bash
python run.py --encoder <vits | vitb | vitl> --img-path <img-directory | single-img | txt-file> --outdir <outdir> [--pred-only] [--grayscale]
```

Arguments:

- ``--encoder``: the model to use, one of ``vits``, ``vitb``, or ``vitl``.
- ``--img-path``: an image directory, a single image, or a text file listing image paths.
- ``--outdir``: the directory to save the results to.
- ``--pred-only`` (optional): save only the predicted depth map; by default, the image and its depth map are visualized side by side.
- ``--grayscale`` (optional): save the grayscale depth map; by default, a color palette is applied to the depth map.

For example:

```bash
python run.py --encoder vitl --img-path assets/examples --outdir depth_vis
```

If you want to use Depth Anything on videos:

```bash
python run_video.py --encoder vitl --video-path assets/examples_video --outdir video_depth_vis
```
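Under the hood, video depth estimation simply applies the same per-frame transform and model as the image pipeline. Below is a rough, self-contained sketch of such a loop; it is our own simplification, not the actual `run_video.py`, and the input/output paths and color map are illustrative only.

```python
# Simplified frame-by-frame video depth sketch (illustrative, not the actual run_video.py).
import cv2
import numpy as np
import torch
import torch.nn.functional as F
from torchvision.transforms import Compose

from depth_anything.dpt import DepthAnything
from depth_anything.util.transform import Resize, NormalizeImage, PrepareForNet

model = DepthAnything.from_pretrained('LiheYoung/depth_anything_vits14').eval()
transform = Compose([
    Resize(width=518, height=518, resize_target=False, keep_aspect_ratio=True,
           ensure_multiple_of=14, resize_method='lower_bound',
           image_interpolation_method=cv2.INTER_CUBIC),
    NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    PrepareForNet(),
])

cap = cv2.VideoCapture('assets/examples_video/your_video.mp4')  # illustrative path
writer = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    h, w = frame.shape[:2]
    image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) / 255.0
    image = torch.from_numpy(transform({'image': image})['image']).unsqueeze(0)
    with torch.no_grad():
        depth = model(image)  # shape: 1 x H' x W'
    # Resize back to the frame resolution and normalize to 8-bit for visualization.
    depth = F.interpolate(depth[None], size=(h, w), mode='bilinear', align_corners=False)[0, 0]
    depth = ((depth - depth.min()) / (depth.max() - depth.min()) * 255).numpy().astype(np.uint8)
    vis = cv2.applyColorMap(depth, cv2.COLORMAP_INFERNO)
    if writer is None:  # assumes the output directory already exists
        writer = cv2.VideoWriter('video_depth_vis/depth.mp4', cv2.VideoWriter_fourcc(*'mp4v'),
                                 cap.get(cv2.CAP_PROP_FPS), (w, h))
    writer.write(vis)
cap.release()
writer.release()
```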

Gradio demo

To use our gradio demo locally:

```bash
python app.py
```

You can also try our online demo.
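If you want to wire a similar demo into your own application, here is a minimal Gradio sketch. It is our own example rather than the repository's `app.py`, and it uses the Hugging Face pipeline described further below.

```python
# Minimal Gradio depth demo sketch (our own example, not the repository's app.py).
import gradio as gr
from transformers import pipeline

pipe = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-small-hf")

def predict_depth(image):
    # The pipeline returns a dict; 'depth' is an 8-bit PIL image suitable for display.
    return pipe(image)["depth"]

demo = gr.Interface(fn=predict_depth,
                    inputs=gr.Image(type="pil"),
                    outputs=gr.Image(type="pil"),
                    title="Depth Anything (sketch)")
demo.launch()
```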

Import Depth Anything to your project

If you want to use Depth Anything in your own project, you can simply follow run.py to load our models and define data pre-processing.

Code snippet (note the difference between our data pre-processing and that of MiDaS):

```python
from depth_anything.dpt import DepthAnything
from depth_anything.util.transform import Resize, NormalizeImage, PrepareForNet

import cv2
import torch
from torchvision.transforms import Compose

encoder = 'vits'  # can also be 'vitb' or 'vitl'
depth_anything = DepthAnything.from_pretrained('LiheYoung/depth_anything_{:}14'.format(encoder)).eval()

transform = Compose([
    Resize(
        width=518,
        height=518,
        resize_target=False,
        keep_aspect_ratio=True,
        ensure_multiple_of=14,
        resize_method='lower_bound',
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    PrepareForNet(),
])

image = cv2.cvtColor(cv2.imread('your image path'), cv2.COLOR_BGR2RGB) / 255.0
image = transform({'image': image})['image']
image = torch.from_numpy(image).unsqueeze(0)

# depth shape: 1xHxW
depth = depth_anything(image)
```
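The output `depth` is a relative depth map at the network input resolution. As a small follow-up, in the spirit of `run.py`'s post-processing (the exact color mapping there may differ), you can resize it back to the original resolution and save an 8-bit visualization:

```python
# Continuing from the snippet above: resize back to the original resolution and save an 8-bit map.
import numpy as np
import torch.nn.functional as F

raw = cv2.imread('your image path')  # the same image as above
h, w = raw.shape[:2]

depth = F.interpolate(depth[None], size=(h, w), mode='bilinear', align_corners=False)[0, 0]
depth = (depth - depth.min()) / (depth.max() - depth.min()) * 255.0
cv2.imwrite('depth_gray.png', depth.detach().cpu().numpy().astype(np.uint8))
```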

Do not want to define image pre-processing or download model definition files?

Easily use Depth Anything through transformers within 3 lines of code! Please refer to these instructions (credit to @niels).

Note: If you encounter `KeyError: 'depth_anything'`, please install the latest `transformers` from source:

```bash
pip install git+https://github.com/huggingface/transformers.git
```

A brief demo:

```python
from transformers import pipeline
from PIL import Image

image = Image.open('Your-image-path')
pipe = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-small-hf")
depth = pipe(image)["depth"]
```
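As a small follow-up (our understanding of the pipeline output, not part of the official instructions): `depth` above is a PIL image that can be saved directly, and the raw values are also available under the `predicted_depth` key.

```python
# Assumption: the depth-estimation pipeline returns a dict with a PIL image ('depth')
# and the raw tensor ('predicted_depth').
result = pipe(image)
result["depth"].save("depth.png")   # quick 8-bit visualization
raw = result["predicted_depth"]     # torch.Tensor with the raw relative depth values
```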

Community Support

We sincerely appreciate all the extensions the community has built on top of Depth Anything. Thank you all!

Here we list the extensions we have found:

If you have a project that supports or improves (e.g., speeds up) Depth Anything, please feel free to open an issue. We will add it here.

Acknowledgement

We would like to express our deepest gratitude to AK (@_akhaliq) and the awesome Hugging Face team (@niels, @hysts, and @yuvraj) for helping improve the online demo and build the HF models.

We also thank the MagicEdit team for providing some video examples for video depth estimation, and Tiancheng Shen for evaluating the depth maps with MagicEdit.

Citation

If you find this project useful, please consider citing:

```bibtex
@inproceedings{depthanything,
      title={Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data},
      author={Yang, Lihe and Kang, Bingyi and Huang, Zilong and Xu, Xiaogang and Feng, Jiashi and Zhao, Hengshuang},
      booktitle={CVPR},
      year={2024}
}
```