LiheYoung / Depth-Anything

[CVPR 2024] Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. Foundation Model for Monocular Depth Estimation
https://depth-anything.github.io
Apache License 2.0

Inference time is slower than expected #49

Open · junbangliang opened this issue 7 months ago

junbangliang commented 7 months ago

Hi,

Thanks for sharing the work. When I try to run the vitl example on an A100 GPU, the inference time settles at around 120ms rather than the 13ms stated in the repo. Is there a reason for this? I've provided the experiment I ran below.

Thanks!

import cv2
import numpy as np
import os
import torch
import torch.nn.functional as F
from torchvision.transforms import Compose

from depth_anything.dpt import DepthAnything
from depth_anything.util.transform import Resize, NormalizeImage, PrepareForNet

import matplotlib.pyplot as plt

if __name__ == '__main__':

    os.environ['CUDA_VISIBLE_DEVICES'] = '0'

    DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

    gpu_name = torch.cuda.get_device_name(torch.cuda.current_device())
    print(f"GPU being used: {gpu_name}")

    encoder = 'vitl'
    depth_anything = DepthAnything.from_pretrained('LiheYoung/depth_anything_{}14'.format(encoder)).to(DEVICE).eval()

    total_params = sum(param.numel() for param in depth_anything.parameters())
    print('Total parameters: {:.2f}M'.format(total_params / 1e6))

    transform = Compose([
        Resize(
            width=518,
            height=518,
            resize_target=False,
            keep_aspect_ratio=True,
            ensure_multiple_of=14,
            resize_method='lower_bound',
            image_interpolation_method=cv2.INTER_CUBIC,
        ),
        NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        PrepareForNet(),
    ])

    filename = "assets/examples/demo1.png"

    raw_image = cv2.imread(filename)
    image = cv2.cvtColor(raw_image, cv2.COLOR_BGR2RGB) / 255.0

    h, w = image.shape[:2]

    image = transform({'image': image})['image']
    image = torch.from_numpy(image).unsqueeze(0).to(DEVICE)

    print(f"image shape: {image.shape}")

    with torch.no_grad():
        import time
        for i in range(1000):
            start = time.perf_counter()
            depth = depth_anything(image)
            print(f"inference time is: {time.perf_counter() - start}s")
GPU being used: NVIDIA A100-SXM4-80GB
Total parameters: 335.32M
image shape: torch.Size([1, 3, 518, 784])
inference time is: 3.4120892197825015s
inference time is: 0.014787798281759024s
inference time is: 0.01355740800499916s
inference time is: 0.10093897487968206s
inference time is: 0.12020917888730764s
inference time is: 0.11985550913959742s
inference time is: 0.12007139809429646s
inference time is: 0.1200293991714716s
inference time is: 0.12007084907963872s
inference time is: 0.12004875903949142s
inference time is: 0.12011446803808212s
LiheYoung commented 7 months ago

Hi, I tried your code and the output is similar to yours. Strangely, as shown in your log, the second and third loops require only 0.0147s and 0.0135s respectively, which is close to our reported inference time (yours are still a little larger because our inference time is measured at 518x518 resolution). Honestly, I also have no idea why the inference time suddenly becomes 10x larger from the fourth loop.

Intriguingly, if you insert the time counter in our run.py, you will find the obtained inference time is normal and always consistent with our reported numbers. I hope you can give it a try.

heyoeyo commented 7 months ago

I think testing speed this way (the perf_counter calls before and after calling the model) gives inconsistent results due to cuda synchronization. You need to force the GPU to sync up before printing the inference time, or else the print can happen before the gpu is done (which is likely where the 0.014/0.013 times are coming from).

You can force a sync by moving the data back to the cpu (i.e. using depth_anything(image).cpu()) or explicitly sync up:

start = time.perf_counter()
depth = depth_anything(image)
torch.cuda.synchronize()
print(f"inference time is: {time.perf_counter() - start}s")
junbangliang commented 7 months ago

With the gpu synchronization implemented, the script now gives the following inference times. Is there a way to speed up the inference for a large number of images?

GPU being used: NVIDIA A100-SXM4-80GB
Total parameters: 335.32M
image shape: torch.Size([1, 3, 518, 784])
inference time is: 3.3442822340875864s
inference time is: 0.12628611270338297s
inference time is: 0.12091294582933187s
inference time is: 0.11994686676189303s
inference time is: 0.12029408616945148s
inference time is: 0.1201697769574821s
inference time is: 0.12012322712689638s
inference time is: 0.12011239631101489s
inference time is: 0.12021539593115449s
inference time is: 0.12010658718645573s
inference time is: 0.12012855615466833s
heyoeyo commented 7 months ago

Is there a way to speed up the inference for a large amount of images?

The usual speed up for lots of images is to batch them together. You can do this with: image_batch = torch.cat((image1, image2, image3, ... etc)). This comes at the cost of higher VRAM usage, but should reduce the amount of back-and-forth between the cpu and gpu.
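As a minimal sketch of what that could look like (assuming `depth_anything`, the preprocessing, and `DEVICE` are set up as in the script above, and that `image1`/`image2`/`image3` stand for preprocessed 1x3xHxW tensors of the same size):

```python
import torch

# Sketch only: image1/image2/image3 are assumed to be 1x3xHxW tensors produced by
# the same transform + torch.from_numpy(...).unsqueeze(0) steps as the script above.
image_batch = torch.cat((image1, image2, image3), dim=0).to(DEVICE)  # Nx3xHxW

with torch.no_grad():
    depth_batch = depth_anything(image_batch)  # one forward pass for all N images
    torch.cuda.synchronize()                   # sync before timing, as discussed above
```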

You might also get a small speed up by changing the torch.no_grad() part to instead use torch.inference_mode(), if you're using a newer version of pytorch.

You can also try using the torch.channels_last memory format, though whether this helps will depend on the model, and very slightly alters the results (from what I've seen). This is something you do like setting the device: data.to(device, memory_format=torch.channels_last)

Lastly, you might get a big speed up by using torch.float16, usually at the expense of slightly worse results (and in some cases you can get NaN/inf results that wouldn't occur with the default float32 type; if that happens, torch.bfloat16 may work better). You also do this like setting the device: data.to(device, dtype=torch.float16)
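For concreteness, here's a minimal sketch that combines those suggestions, assuming `depth_anything` and `image` are already set up as in the original script (how much each change helps depends on the GPU and pytorch version, and it isn't anything from the repo itself):

```python
import torch

# Assumes `depth_anything` and `image` come from the earlier script.
dtype = torch.float16  # try torch.bfloat16 instead if float16 produces NaN/inf
depth_anything = depth_anything.to("cuda", dtype=dtype, memory_format=torch.channels_last)
image = image.to("cuda", dtype=dtype, memory_format=torch.channels_last)

with torch.inference_mode():      # slightly cheaper than torch.no_grad() on newer pytorch
    depth = depth_anything(image)
    torch.cuda.synchronize()      # sync so any timing around this block is meaningful
```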

kishore-greddy commented 7 months ago

@jlia904 Even after you corrected the code snippet with torch.cuda.synchronize(), your inference speed settles around 120ms, which is 10 times slower, albeit at a higher resolution. Did you try a resolution of 512x512 to see if you could reproduce the numbers reported by the authors?

junbangliang commented 7 months ago

@kishore-greddy Yes I did try 512x512 resolution. The inference speed is still over 100ms.

kishore-greddy commented 7 months ago

@LiheYoung As reported by @jlia904, I also tried inferring at 512x512 image resolution on a Tesla V100-DGXS-32GB, and my inference time was around 130ms, which is nowhere close to the 20ms you reported. Could you recheck your numbers or share the code snippet you use to get the 20ms inference time on a V100?


kishore-greddy commented 7 months ago

@jlia904 Thanks for the reply. Do you know the possible reason for it? Or do you think that the reported numbers are wrong?

junbangliang commented 7 months ago

@kishore-greddy I tried it on another A100 machine and can now get down to 70ms. Results vary between machines, but they're still not close to the numbers reported by the authors.

GPU being used: NVIDIA A100-SXM4-80GB
Total parameters: 335.32M
image shape: torch.Size([1, 3, 518, 518])
inference time is: 1.75464620799994s
inference time is: 0.08826967400000285s
inference time is: 0.08717626499992548s
inference time is: 0.07318027000019356s
inference time is: 0.07299402100011321s
inference time is: 0.07296579000012571s
inference time is: 0.07296102099985546s
inference time is: 0.07299358000000211s
inference time is: 0.07297067099989363s
inference time is: 0.07296242999996139s
inference time is: 0.07297045099994648s
inference time is: 0.07316351999998005s
inference time is: 0.07317171099998632s
inference time is: 0.07321707999994942s
inference time is: 0.07319737000011628s
inference time is: 0.07319433999987268s
inference time is: 0.07316986000000725s
inference time is: 0.07317408099993372s
inference time is: 0.07317609000006087s

heyoeyo commented 7 months ago

Do you know the possible reason for it?

It could just be that they've left out info about how they're running the model. If they use float16, that can knock ~50% off the time and batching can reduce that another ~25% (by comparison, inference_mode and channels_last memory formatting don't seem to do much for these models). Using xFormers knocks another 25% when using float16. With these changes, I get the following numbers on a 3090 @ 518x518:

vit-small:
```
GPU being used: NVIDIA GeForce RTX 3090
Total parameters: 24.79M
image dtype: torch.float16
image shape: torch.Size([32, 3, 518, 518])
batch size: 32
Per-image time: 7.4 ms
Per-image time: 3.3 ms
Per-image time: 3.2 ms
Per-image time: 3.2 ms
Per-image time: 3.0 ms
Per-image time: 3.1 ms
Per-image time: 3.2 ms
```

vit-base:
```
GPU being used: NVIDIA GeForce RTX 3090
Total parameters: 97.47M
image dtype: torch.float16
image shape: torch.Size([32, 3, 518, 518])
batch size: 32
Per-image time: 11.5 ms
Per-image time: 7.7 ms
Per-image time: 7.7 ms
Per-image time: 7.6 ms
Per-image time: 7.7 ms
Per-image time: 7.7 ms
Per-image time: 7.6 ms
```

vit-large:
```
GPU being used: NVIDIA GeForce RTX 3090
Total parameters: 335.32M
image dtype: torch.float16
image shape: torch.Size([32, 3, 518, 518])
batch size: 32
Per-image time: 25.5 ms
Per-image time: 23.2 ms
Per-image time: 23.1 ms
Per-image time: 23.0 ms
Per-image time: 23.0 ms
Per-image time: 23.0 ms
Per-image time: 23.1 ms
```

For reference, without these changes, vit-large takes around 94ms per image.

I'm not familiar with the A100/V100 and where they stand vs. the 3090, but these numbers seem reasonable compared to the reported 4090 numbers, assuming the tests were done with float16 or bfloat16.

kishore-greddy commented 7 months ago

@heyoeyo Thanks for your reply, I will try it out on my side as well with the change in precision and update the results

Bolt-Scripts commented 3 months ago

@heyoeyo Could you provide a bit more detail or share the modified code on how to add batching? I'm not super familiar with all the torch stuff and how you would fully implement these changes πŸ˜…

heyoeyo commented 3 months ago

Sure, here's a modified version of the code @jlia904 posted originally. The biggest change is just using a device_config dictionary in place of the original DEVICE value. This lets you set the data type (e.g. float16) when moving data to the gpu.

import time

import cv2
import torch
from torchvision.transforms import Compose

from depth_anything.dpt import DepthAnything
from depth_anything.util.transform import Resize, NormalizeImage, PrepareForNet

# Settings
encoder = "vits"
use_channels_last = False
use_batching_example = False
use_float16 = False

# Example of passing 1 or 4 images to the model
files_list = ["assets/examples/demo1.png"] # Batch of 1
if use_batching_example:
    files_list = [
        "assets/examples/demo1.png",
        "assets/examples/demo2.png",
        "assets/examples/demo3.png",
        "assets/examples/demo4.png",
    ]

# Set up device/data type for image & model weights
device_config = {
    "device": "cuda",
    "dtype": torch.float16 if use_float16 else torch.float32,
    "memory_format": torch.channels_last if use_channels_last else None
}

depth_anything = DepthAnything.from_pretrained('LiheYoung/depth_anything_{}14'.format(encoder)).eval()
depth_anything.to(**device_config)
transform = Compose([
    Resize(
        width=518,
        height=518,
        resize_target=False,
        keep_aspect_ratio=False,
        ensure_multiple_of=14,
        resize_method='lower_bound',
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    PrepareForNet(),
])

# Loading & pre-processing image data
image_list = []
for filename in files_list:
    raw_image = cv2.imread(filename)
    image = cv2.cvtColor(raw_image, cv2.COLOR_BGR2RGB) / 255.0
    image = transform({'image': image})['image']
    image = torch.from_numpy(image).unsqueeze(0)
    image_list.append(image)
image_batch = torch.cat(image_list).to(**device_config)

print("GPU:", torch.cuda.get_device_name(torch.cuda.current_device()))
print("Image batch shape:", image_batch.shape)
print("Device Config:", device_config)

# Computing depth results
batch_size = image_batch.shape[0]
with torch.no_grad():
    for i in range(24):
        start = time.perf_counter()
        depth = depth_anything(image_batch).cpu()
        total_time_ms = 1000 * (time.perf_counter() - start)
        print("Per-image time:", round(total_time_ms/batch_size, 1), "ms")

# For feedback, check if xformers is installed from: pip install xformers
# (model uses it automatically if available, it helps with float16 speed/memory use)
using_xformers = False
try:
    import xformers
    using_xformers = True
except ImportError:
    pass

# Some feedback at the end, for reference
print("Using channels last format:", use_channels_last)
print("Using batching:", use_batching_example)
print("Using float 16:", use_float16)
print("Using xformers:", using_xformers)

The script should be placed in the root of the DepthAnything folder, so that all the import/image paths work properly. You can adjust the settings at the top to toggle the options on and off to see what effect they have on the running speed. The batching in this case is only using 1 (no batching) or 4 images. You get more improvement by using higher batch sizes, but you need a use-case where you'd even have a bunch of images all ready at once to take advantage of this.

Also, shameless plug :p, but I have a repo that has some of these speed-ups built-in and includes a video script in case that's of any use.

Bolt-Scripts commented 3 months ago

@heyoeyo Thanks, much appreciated πŸ˜‹ Currently I'm working on a system for streaming depth frames from a video for realtime use actually. I think it'd maybe be helpful to be able to batch the frames ahead of the video to increase performance. Because as it is I end up with low gpu utilization and can only really process a bit over 15fps, almost regardless of model and resolution. It feels as if the gpu just spits stuff out faster than it can be supplied with new data. Using fp16 and stuff doesn't really make a difference, probably because the bottleneck is elsewhere, my guess being just excessive gpu sync from all the stuff going around. Which is why batching seems like it'd help a lot here if that's right.

But I don't really know enough about all this torch malarky to really understand what operations might cause issues or how to speed this up. I'm unclear on where certain operations even take place. For example, it looks like the image transformations and such happen on cpu since the .to(device) stuff happens afterwards, but maybe I'm wrong about that. Point being, idk if stuff like that is sucking away processing time or what, or if certain cpu-side things could be done on separate threads to have data always prepared for the gpu. idk man. So if you have any more tips on how to minimize downtime and increase gpu occupancy, that'd be great 😅

heyoeyo commented 3 months ago

depth frames from a video for realtime use

This can be a tricky use case to optimize I think. While batching can help, it directly opposes the requirement of it being realtime, since you'd need to wait for frames to form each batch. For example, if you form a batch of 32 (which seems to give a decent bump in performance), then that would lead to a ~1 second delay (assuming 30fps) in processing the first frame of that batch. So there may be a limit to the benefit of batching, depending on how strict your realtime requirement is.

can only really process a bit over 15fps, almost regardless of model and resolution

This seems surprising to me! There should be a very noticeable difference between the vit-small and vit-large processing speeds. I'd assume this means that the bottleneck may be reading frames from the video (this can be very slow for certain codecs, like h265), or that it's a result of how the time is being measured? It's hard to say.
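One way to check the frame-reading side in isolation is a decode-only loop, something like the sketch below (the video path is just a placeholder):

```python
import time
import cv2

# Decode-only benchmark (sketch): measures how fast frames can be read,
# with no color conversion, preprocessing, or model involved.
cap = cv2.VideoCapture("path/to/video.mp4")  # placeholder path

n_frames, t0 = 0, time.perf_counter()
while n_frames < 300:
    ok, frame = cap.read()
    if not ok:
        break
    n_frames += 1
cap.release()

print(f"Decode-only rate: {n_frames / (time.perf_counter() - t0):.1f} fps")
```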

Using fp16 and stuff doesn't really make a difference

This is also surprising. Just to be sure, that code I posted doesn't use float16 by default (the use_float16 variable needs to be set to True), so in case you ran it as-is and didn't see any difference, that might be why?

all this torch malarky to really understand like what operations might cause issues or how to speed this up

Ya the asynchronous stuff is confusing in pytorch, since it's not explicit (like async/await in javascript for example). When you need to move something to the cpu, there's actually a 'non-blocking' argument that can be passed in to delay the sync, though its behavior can be confusing! In the code above, instead of moving the depth data to the cpu using .cpu(), you can do something like: depth = depth_anything(image_batch).to("cpu", non_blocking=True) This seems to let the code 'run ahead' until the next line where the depth variable is actually needed. It doesn't do much for the code above, but may be helpful for your use case, since it's a bit like using multiple threads.

it looks like the image transformations and such happen on cpu

Yes that's correct. There was another post about moving these operations to the GPU (issue #173) and that poster said they got some noticeable improvements. They posted a link to their updated code, so may be worth checking out.
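For reference, here's a rough sketch of the kind of GPU-side preprocessing being described (this is not the code from issue #173 or the repo's transform, just an approximation: it resizes straight to 518x518, which is already a multiple of 14, and uses the usual ImageNet normalization values):

```python
import torch
import torch.nn.functional as F

# ImageNet normalization constants (same values the repo's transform uses)
_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def preprocess_on_gpu(bgr_frame, size=518, device="cuda", dtype=torch.float32):
    """Rough GPU-side preprocessing: BGR uint8 HxWx3 (numpy) -> normalized 1x3xSxS tensor."""
    x = torch.from_numpy(bgr_frame).to(device)             # HxWx3 uint8, moved to GPU first
    x = x.flip(-1)                                         # BGR -> RGB
    x = x.permute(2, 0, 1).unsqueeze(0).to(dtype) / 255.0  # 1x3xHxW in [0, 1]
    x = F.interpolate(x, size=(size, size), mode="bilinear", align_corners=False)
    return (x - _MEAN.to(device, dtype)) / _STD.to(device, dtype)
```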

any more tips on how to minimize downtime and increase gpu occupancy

I'd recommend double checking that there's not some issue with reading frames quickly enough (e.g. if the frames are being read at 15fps, then no amount of GPU optimization can produce an output faster than 15fps). I usually just drop perf_counter() calls all over the place to figure out what's taking the most time (though this can be tricky with the pytorch/cuda stuff). Otherwise, there's also other runtimes (e.g. tensorRT) which seem to make better use of the GPU, so that's worth considering (that image transformation issue is worth checking out for this too, since that user seemed to be using tensorRT).
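In case it helps, this is the kind of timing helper I mean (just a sketch, not anything from the repo; it folds in the torch.cuda.synchronize() point from earlier so queued GPU work is actually counted):

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def timed(label):
    """Print how long the wrapped block takes, syncing CUDA afterwards so
    queued GPU work is included in the measurement."""
    start = time.perf_counter()
    yield
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    print(f"{label}: {1000 * (time.perf_counter() - start):.1f} ms")

# Usage (names assume the video/model variables discussed above):
# with timed("read frame"):
#     ok, frame = cap.read()
# with timed("inference"):
#     depth = depth_anything(image_batch)
```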

Bolt-Scripts commented 2 weeks ago

@heyoeyo Finally getting back around to this project, it's a bit daunting 😅 But I just wanted to say that I really appreciate your input.

Just to hit on some points, I agree it all seems surprising, but it does make sense that speeding up prediction won't make things faster overall if the bottleneck is elsewhere. I'm sure float16 was being used; I made sure it was properly enabled in my code and I could see the memory usage being halved. Honestly I'm not sure if I'd even expect fp16 to speed things up, since to my knowledge gpus always operate on 32-bit values anyway and halves wouldn't magically speed things up. But I'm not really familiar with cuda and whether there's some magic going on there. I guess the speedup is supposed to come from moving less memory around in the first place, if you can halve it before moving it to the gpu, but it's all just 🥴 in my brain. Nothing about it makes any sense, because I'm seeing like a 35-50% speed reduction when enabling fp16, at least when not doing the prediction. When adding that back in it seems to be pretty similar with or without, perhaps due to some bottleneck there with the prediction and all that.

Just doing a quick test, it seems the video readback is pretty fast, at least fast enough to be doing what I want at like 235fps on a 1080p video. But that slows down dramatically when doing just about anything else. Adding in the cv2.COLOR_BGR2RGB conversion drops it to 52fps; just the transform without the color conversion: 45fps; both: 27fps :/ And adding in everything else besides the actual prediction step, we're down to 22fps before even predicting anything. So, that doesn't seem great. It would seem that getting that stuff off the cpu would be a good place to start.

I actually tried some stuff using multiple cpu threads, with the idea being to get around it a bit by doing as much transform stuff ahead of time on multiple threads and queueing things up to be predicted. I got this to at least 'work', and it seemed to help a bit with reading and transforming the frames, but when putting the prediction back into the mix it seems to slow down again. I'm not sure if this is hinting that, on top of the bottleneck with the image processing, the prediction is limited by something too. But even running the small model at like a resolution of 50px, the max I've been able to get is 30fps, and it doesn't really slow down until about 400px, which feels like there's performance on the table there. But it also feels like batching would be required at that point, if there's still just a straight-up bottleneck even when all the other steps are done ahead of time and it just needs to be predicted. It's just really hard to tell. If I set it to 50px and skip the prediction, I can get 100fps; adding prediction back in, only 25-30 :/ When predicting at 50px I see maybe 10% gpu usage if that, whereas 500px is closer to 100%, even though they run at about the same fps. I'm not sure how much clearer I can get that there's some weird back-and-forth slowing it down. Not sure exactly what or where, but batching is reverberating in my brain... I do worry about the implementation with my weird threads and buffering quickly enough for realtime... Perhaps I will report back with results.

Oh, and also I'm not sure tensorRT is really viable for me, as I need this to be widely distributable. I tried getting it set up at one point but just kinda ended up in a compatibility nightmare, and even when I got it to 'work' it was abysmally slow and sometimes produced glitched results I think? So idk, I just gave up trying to figure out what was going wrong with it, given how annoying all of it was and inference speed not really being my problem anyway. I'm not entirely sure, but I think tensorRT isn't easy to just package up and have someone else download and run, whereas what I currently have set up at least ports across systems rather well using pyinstaller, somehow. And I want people with AMD cards to have a good experience as well, which is why I've been using DirectML, which I like for its very easy drop-in, one-line usability. And honestly I've seen some really decent results running through it. Even if all this ends up with an enormous like 8gb install size with all the python bloat 😅 I like that it at least works, even if I'd probably get wayyy better results if I'd figured out using C# ML libraries directly in Unity and doing the whole onnx thing or whatever. But it's not worth it to me to redo everything at this point, and besides, last I checked it was a really weird process getting proper gpu support that route, and it might have distribution issues anyway.

I'm just rambling at this point but figured I'd share some of my weird experience in all this.

heyoeyo commented 2 weeks ago

I guess the speedup is supposed to come from moving less memory around in the first place

Ya I think that's a big part of it. There are several similar-ish youtube videos discussing this, and the consensus seems to be that GPUs tend to have more compute power than they can typically use, because they don't have the memory bandwidth to 'feed' the computing cores.

235fps on a 1080p video... Adding in the cv2.COLOR_BGR2RGB drops it to 52fps

That's definitely weird! Switching from BGR to RGB doesn't involve any computation, it's just swapping data around in memory. So that seems to hint at either the CPU and/or system RAM (i.e. not VRAM) as being a major bottleneck.

But even running the small model at like a resolution of 50px the max I've been able to get is 30fps, and doesn't really slow down until about 400px

Just as a sanity check, it might be worth trying this same experiment but running the model on the CPU. That way you can isolate the timings from the GPU weirdness, and confirm that it really does run slower when doing more computation (and if not, it would hint at there being a config/timing issue maybe?). Though on CPU, you may have to switch to measuring seconds per frame.
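Something like this would do for that check, reusing the model/image from the earlier scripts (sketch only; expect the larger models to take seconds per frame on CPU):

```python
import time
import torch

# Assumes `depth_anything` and `image` are already built as in the earlier scripts.
model_cpu = depth_anything.float().to("cpu")
image_cpu = image.float().to("cpu")

with torch.no_grad():
    for _ in range(5):
        start = time.perf_counter()
        depth = model_cpu(image_cpu)
        print(f"CPU inference: {time.perf_counter() - start:.2f} s/frame")
```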

Anyways, I think you're on the right track using threading + batching + moving transformations to GPU if it's a CPU issue.

Bolt-Scripts commented 2 weeks ago

@heyoeyo Yeah alright so, I've pretty much confirmed all my suspicions by implementing batching and GPU transforms. Thanks for linking to that other post for reference, and again for guiding me a bit on batching and stuff. Neither change on its own made a big difference, but combined they alleviate a good bit of the cpu bottlenecking, and now performance scales wayyyy better with resolution/model size, and fp16 seems to help more like it's supposed to as well. There are still bottlenecks in the whole setup, with the resolution of the source content slowing down video reading, and presumably all the memory moving and transform operations are still slowed down by resolution even if most of it is handed off to the gpu asap. But it performs well enough that I'm not toooo concerned about it anymore.

With a batch size of 8 I'm able to get a pretty good balance of performance and speed, I think. Higher batch sizes are technically faster, but that only matters as long as you can feed it data fast enough to keep up, which is the bottleneck there, so there's a crossover point where it balances out and stops helping. The lower the source resolution, the faster I can feed it and the more batching could help, but I don't really need it. If I set it to 50px and a 720p source video just to keep everything pretty minimal, and set batch size to 32, I can get like over 300fps on that, with prediction. Not very practical, but it shows that everything is at least about as good as I can get with this particular setup. With the small model at the default 518px prediction and a 720p source I can get around 120fps on my computer. Which isn't anything too crazy with an aging i7-8700k and 4070.

Unfortunatelyyy, I haven't really been able to realize the fruits of this labor yet, as it seems I now have some kind of bottleneck on the Unity side reading in the frames and putting them on the gpu. Honestly I shouldn't even have to do that, since it's already data on the gpu in the first place; it'd be great if I could take the tensor data directly from pytorch and shove it into a compute buffer I can use in graphics. But I couldn't figure that out last time I tried to research it; admittedly a niche issue, but I feel like it has to be possible somehow. Because currently moving it onto the cpu and sending it over a local websocket, just to convert it around back on the gpu for every frame, is far from ideal. I remember trying this: https://pytorch.org/docs/stable/generated/torch.Tensor.data_ptr.html But I don't really know how all the stuff works in the low-level api territory. Nor do I know if it's even possible to bind a resource on the gpu from a pointer. I feel like it should be, somehowwww. But I haven't found a way to do it, and researching this kind of obscure topic is just painful 🥲 I'll have to figure something out one way or another. I just have a bad feeling I'm going to have to write some native plugin just for something that should be theoretically very simple: take an existing block of data in vram and treat it as compute buffer data.

heyoeyo commented 2 weeks ago

Small model at default 518px prediction and 720p source I can get around 120fps on my computor

That's impressive! It's in line with the reported A100 timing, and the A100 is ~10x more expensive than a 4070.

moving it onto the cpu and sending it over a local websocket, just to convert it around back on the gpu for every frame is far from ideal

It might be worth checking out something like onnx (there's an existing depth-anything-onnx repo), which should allow for the model to be run directly inside of Unity. That way there's no need for all the back-and-forth communication (granted, onnx can be its own headache from my experience). Otherwise, I think trying to directly share memory across processes is (in general) intentionally very difficult, since it's a major security + data corruption concern.
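For what it's worth, the pytorch-side export looks roughly like this (a sketch only; the depth-anything-onnx repo already handles the details, and the filename and axis names here are just examples):

```python
import torch

# Assumes `depth_anything` is the loaded pytorch model from earlier in the thread.
dummy_input = torch.randn(1, 3, 518, 518)
torch.onnx.export(
    depth_anything.float().cpu().eval(),
    dummy_input,
    "depth_anything_vitl14.onnx",  # example output filename
    input_names=["image"],
    output_names=["depth"],
    dynamic_axes={"image": {0: "batch"}, "depth": {0: "batch"}},
)
```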