LiheYoung / Depth-Anything

[CVPR 2024] Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. Foundation Model for Monocular Depth Estimation
https://depth-anything.github.io
Apache License 2.0

On a powerful laptop it's very slow; how about a Jetson Nano? #128

Open damavand1 opened 5 months ago

damavand1 commented 5 months ago

Hi,

I tested the small model; inference takes 3 seconds per picture. My laptop config: Intel i9 11th gen, NVIDIA 3060 6 GB, 16 GB RAM, SSD, PyTorch + CUDA.

My picture size is 512x384. The question is: how can I improve the speed?

I want to run Depth-Anything on a Jetson Nano.

heyoeyo commented 5 months ago

If that time comes from running a single image through the model, then most of the 3 seconds may just be the model loading/setting up, which generally takes much longer than processing an image. Re-using the loaded model (as opposed to re-loading it every time an image is processed) should give a big speed up.

It's also worth making sure that the GPU is being used. You can check that by printing out the result of: torch.cuda.is_available(), which should print True. If not, you may need to upgrade the python env.
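
Roughly, the pattern looks like this (a minimal sketch; the class and checkpoint names are taken from this repo's run.py, so double-check them against the actual script):

```python
import torch
from depth_anything.dpt import DepthAnything  # as used in run.py

# Confirm the GPU is actually visible to PyTorch.
print(torch.cuda.is_available())  # should print True

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model ONCE and keep it around...
model = DepthAnything.from_pretrained("LiheYoung/depth_anything_vits14").to(DEVICE).eval()

def estimate_depth(image_tensor):
    # image_tensor: preprocessed (1, 3, H, W) float tensor already on DEVICE
    with torch.no_grad():
        return model(image_tensor)

# ...and call estimate_depth() repeatedly; only the first call pays the setup cost.
```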

damavand1 commented 5 months ago

Thanks!

This time I passed a folder containing 144 JPG images (512x384). This is the result for the small model with CUDA enabled:

3.66it/s

That's good, but I think my laptop is at least 50 times faster than a Jetson Nano.

What would happen if I ran Depth-Anything on a Jetson Nano?

I need at least 4 FPS of depth frames for my autonomous robot. What's the solution?

heyoeyo commented 5 months ago

That still seems slower than I would expect, at least for the gpu... (it's roughly in line with the speed I get on cpu at that resolution).

In any case, aside from finding a smaller/faster model, the main way to speed up execution is to run at a smaller image size. From my experience, vit-small seems to sort of work down to a size of 196x196, but really starts degrading below that (it probably depends on the image though). It's also more sensitive to the aspect ratio as you shrink the size; square sizing seems to hold up better for some reason (again, maybe depends on the images).

Beyond shrinking the image, one other (more involved) improvement is to try a more efficient runtime. I think pytorch is primarily designed for training models, but isn't necessarily the best for inference. As far as I know, onnx (for cpu) and tensorrt (for cuda) are both faster options, and there are links to versions of these for depth-anything on the main repo page.
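
As a rough illustration of the ONNX route (not the official export script; the linked ONNX/TensorRT ports on the repo page are probably the safer starting point), the export step looks roughly like this:

```python
import torch

# Assumes `model` is an already-loaded Depth Anything model in eval mode (see run.py).
# The (1, 3, 518, 518) shape matches the default resize settings; adjust it if you
# change the transform. The file/tensor names below are arbitrary choices.
dummy_input = torch.randn(1, 3, 518, 518)

torch.onnx.export(
    model.cpu(),
    dummy_input,
    "depth_anything_vits.onnx",
    input_names=["image"],
    output_names=["depth"],
    opset_version=17,
)

# The resulting .onnx file can then be run with onnxruntime on CPU, or converted
# into a TensorRT engine for CUDA/Jetson.
```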

damavand1 commented 5 months ago

The JPG size is 512x358, 72 DPI, bit depth 24.

It's a normal picture I took with my Android phone and resized down.

torch.cuda.is_available() returns True.

But I run my program on Windows 11 and it shows me a problem with 'triton'. Could that make the runtime slow?

```
PS C:\a\depth\Depth-Anything> python run.py --encoder vits --img-path C:\Users\a\Desktop\Test --outdir depth_vis
True
A matching Triton is not available, some optimizations will not be enabled
Traceback (most recent call last):
  File "C:\Users\a\AppData\Local\Programs\Python\Python311\Lib\site-packages\xformers\__init__.py", line 55, in _is_triton_available
    from xformers.triton.softmax import softmax as triton_softmax  # noqa
  File "C:\Users\sadegh\AppData\Local\Programs\Python\Python311\Lib\site-packages\xformers\triton\softmax.py", line 11, in <module>
    import triton
ModuleNotFoundError: No module named 'triton'
Total parameters: 24.79M
100%|██████████████████████████████████████████████████████| 144/144 [00:39<00:00, 3.67it/s]
```

heyoeyo commented 5 months ago

Oh I see, sorry I was comparing the running speed to a video script I have, which times just the model execution itself. The run.py script includes a bunch of other steps (notably saving the output images) which have a big impact on the overall speed. If you comment out just the very last line of the run.py script (which is where the results are saved as files), that should give a little bit more accurate representation of the model run speed.

In another issue, #49, a user posted a script that times the model run speed more directly, which may be helpful if you're trying to get more accurate timing info.
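
For reference, the basic pattern for timing just the forward pass looks something like this (a sketch that assumes the model and a preprocessed input tensor already exist on the GPU):

```python
import time
import torch

# Synchronize before and after, since GPU work is launched asynchronously.
torch.cuda.synchronize()
t0 = time.perf_counter()

with torch.no_grad():
    depth = model(image_tensor)

torch.cuda.synchronize()
print(f"Model-only time: {(time.perf_counter() - t0) * 1000:.1f} ms")
```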

For the sizing, you can change the image size that the model uses by adjusting the resize/transformation settings, which defaults to 518x518. Setting that to 196x196 gives me a factor of 2x speed up on GPU and about 10x on CPU.

As for the triton thing, apparently it's not available on Windows, but shouldn't be an issue...? I'm not really familiar with it, but if it is causing any issues, you can try uninstalling xformers which I assume would get rid of the error (although xformers normally helps speed up model execution, so it could actually hurt the run speed a bit).

damavand1 commented 5 months ago

As you say in #49, without any changes each image takes about 3 ms, which means your 3090 GPU can do roughly 333 images per second.

But these are my results:

1- Windows + last line commented:

```
PS C:\a\depth\Depth-Anything> python run.py --encoder vits --img-path C:\Users\a\Desktop\Test --outdir depth_vis
CUDA available: True
GPU being used: NVIDIA GeForce RTX 3060 Laptop GPU
Total parameters: 24.79M
100%|██████████████████████████████████████████████████████| 144/144 [00:37<00:00, 3.81it/s]
```

2- Windows + last line kept (write to disk): 3.67it/s

3- Ubuntu 22.04 + same laptop + CUDA installed (last line commented): 3.03it/s

But if I change 518x518 to 196x196, the result is: 17.40it/s

Another problem I found: the original image size matters. For example, if my original image is 512x384 and I resize it to 196x196, it takes much less time than if my original image is 4000x3000.

This means this part of the code is one of the main bottlenecks:

```python
transform = Compose([
    Resize(
        width=196,
        height=196,
        resize_target=False,
        keep_aspect_ratio=True,
        ensure_multiple_of=14,
        resize_method='lower_bound',
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    PrepareForNet(),
])
```
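
(For illustration, one way to cut that preprocessing cost is to downscale very large images with OpenCV before they reach the transform, so the cubic resize operates on far fewer pixels. A rough sketch; the 1024-pixel cap is an arbitrary choice:)

```python
import cv2

def preshrink(image, max_side=1024):
    """Downscale very large images before the model transform.

    This only reduces preprocessing cost; the transform above still
    resizes to its own target size afterwards.
    """
    h, w = image.shape[:2]
    scale = max_side / max(h, w)
    if scale >= 1.0:
        return image  # already small enough
    new_size = (int(w * scale), int(h * scale))
    return cv2.resize(image, new_size, interpolation=cv2.INTER_AREA)
```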

The important question is: is depth at 196x196 enough for a self-driving car (if I generate depth with metric_depth)?

heyoeyo commented 5 months ago

Yes, it's strange how slow the run.py script can be; on my machine I also get between 1.5-10 it/s (depending on the images). Most of the time seems to be spent loading & saving the images.

That being said, the vit-small model itself is quite fast. I've tested it with my old laptop (i7 8750H & GTX1060 on Win10) and it can do 50-60ms per frame (so 15-20 it/s or so) around the resolution you mentioned. The video script I use displays the results interactively, so you can easily confirm it runs faster than the ~3 frames per second that the run.py script reports. It should be usable with the depth-anything python env if you want to avoid re-installing everything. If you do end up using it, just make sure to call it with -sync to get more accurate timing, and you can use -b 196 to change the image sizing.

Lowering the resolution to 196x196 gives poorer results but it's a trade-off with speed. I just find 196x196 to be as low as the model can go before the output is completely useless, so it acts as an upper-bound on how fast the model can run practically.

LWQ2EDU commented 5 months ago

I ran Depth-Anything on a Jetson Orin Nano. Inference is significantly faster with TensorRT than with PyTorch, and TensorRT is very easy to use on Jetson because of the JetPack SDK. You can try the C++/Python demo depth-anything-tensorrt on your Jetson Nano.

damavand1 commented 5 months ago

So, we don't have any official version of Depth-Anything that runs on TensorRT?

LWQ2EDU commented 5 months ago

> So, we don't have any official version of Depth-Anything that runs on TensorRT?

It's better to generate your own TensorRT engine, because it depends on the TensorRT version, quantization settings, CUDA version, device, etc. See also DepthAnythingTensorrtDeploy.
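
For reference, generating an FP16 engine from an exported ONNX file with the TensorRT Python API looks roughly like this (a sketch only; the file names are placeholders, the API shown is TensorRT 8.x, and the linked repos have complete, tested versions):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX model exported from PyTorch.
with open("depth_anything_vits.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parsing failed")

# Build an FP16 engine and save it for later inference.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
engine_bytes = builder.build_serialized_network(network, config)
with open("depth_anything_vits.engine", "wb") as f:
    f.write(engine_bytes)
```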

Jayzhang2333 commented 1 week ago

I am also trying to run Depth Anything on an Orin Nano with TensorRT. How many FPS can you reach? Right now, I can reach almost 5 FPS with the Depth Anything V2 small model, quantized to FP16. Has anyone reached higher FPS?