DepthAnything / Depth-Anything-V2

Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
https://depth-anything-v2.github.io
Apache License 2.0

operate on a live video #116

Kafka157 opened this issue 1 month ago

Kafka157 commented 1 month ago

I'm really interested in your work! I just have one question: is it possible to run it in real time?

heyoeyo commented 1 month ago

On a new-ish GPU, the small and base models are fast enough to run in real time (i.e. <30ms per frame). The large model might run fast enough on something like a 4090 GPU; otherwise it would require dropping the resolution to reach real-time speeds.

On CPU it's not really practical to run in real time. For example, on my machine I have to drop the resolution down to around 196x196 to get the small model running at ~30ms per frame, but then the prediction quality is not great.

If you want to see how fast the models run, I have a script here that can run on video files or webcams. You can also adjust the resolution (by adding a flag when running the script, -b 196 for example) to see how that affects the processing speed.
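For reference, here's a rough per-frame timing sketch (not the linked script, just a minimal example assuming the DepthAnythingV2 class, its infer_image() helper, and the checkpoint layout from this repo's README); the input_size argument plays a role similar to the -b flag mentioned above:

```python
# Minimal per-frame timing sketch (assumptions: the repo's DepthAnythingV2 class,
# its infer_image() helper, and a checkpoint saved under checkpoints/).
import time

import cv2
import torch

from depth_anything_v2.dpt import DepthAnythingV2

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

# vit-small config (from the README); swap the encoder/weights for the base/large models
model = DepthAnythingV2(encoder='vits', features=64, out_channels=[48, 96, 192, 384])
model.load_state_dict(torch.load('checkpoints/depth_anything_v2_vits.pth', map_location='cpu'))
model = model.to(DEVICE).eval()

cap = cv2.VideoCapture(0)  # webcam; pass a file path to read a video file instead
while True:
    ok, frame = cap.read()
    if not ok:
        break
    t0 = time.perf_counter()
    depth = model.infer_image(frame, input_size=196)  # smaller input_size -> faster, coarser
    print(f'{(time.perf_counter() - t0) * 1000:.1f} ms/frame')
cap.release()
```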

Kafka157 commented 1 month ago

> On a new-ish GPU, the small and base models are fast enough to run in real time (i.e. <30ms per frame). The large model might run fast enough on something like a 4090 GPU; otherwise it would require dropping the resolution to reach real-time speeds.
>
> On CPU it's not really practical to run in real time. For example, on my machine I have to drop the resolution down to around 196x196 to get the small model running at ~30ms per frame, but then the prediction quality is not great.
>
> If you want to see how fast the models run, I have a script here that can run on video files or webcams. You can also adjust the resolution (by adding a flag when running the script, -b 196 for example) to see how that affects the processing speed.

GPU is not really a problem; I can borrow an A100 from my instructor. Thanks for your explanation, I'll try running the large model on live video soon!

heyoeyo commented 1 month ago

> borrow A100

Just as a heads up, the script I linked won't work on a server (it runs a local/non-web-based UI), but the run_video.py script that's included in this repo should work fine. You may need to adjust some of the default settings, most importantly switching to float16 and using the xformers library, since these provide a significant speed boost. There's some discussion of this back on a depth-anything v1 issue.
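In case it helps, here's a rough sketch of the float16 part of that (an assumption about how it could be wired up, not the exact run_video.py change): load the large model on the GPU and wrap inference in autocast so it runs in half precision. The xformers side is separate; the linked v1 discussion covers it.

```python
# Hedged float16 sketch (assumed checkpoint path; model config copied from the README).
import cv2
import torch

from depth_anything_v2.dpt import DepthAnythingV2

model = DepthAnythingV2(encoder='vitl', features=256, out_channels=[256, 512, 1024, 1024])
model.load_state_dict(torch.load('checkpoints/depth_anything_v2_vitl.pth', map_location='cpu'))
model = model.to('cuda').eval()

frame = cv2.imread('example_frame.jpg')  # stand-in for a decoded video frame
with torch.autocast(device_type='cuda', dtype=torch.float16):
    depth = model.infer_image(frame)  # runs the transformer layers in half precision
print(depth.shape, depth.dtype)
```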

jamesbaker1 commented 2 days ago

Hi @heyoeyo

Thank you for your script! On a mini PC like a Raspberry Pi, how many FPS do you think you could get on CPU alone?

heyoeyo commented 2 days ago

Hi @jamesbaker1

That's a good question! I've never tried running these sorts of models on a Pi, but I would expect it to take several seconds per frame. My CPU (Ryzen 7600) takes around 340ms per frame for the vit-small model and around 2600ms per frame for the large model (at the default resolution). It seems like the Pi 4 might be ~10x slower, so maybe 1 to 10 seconds per frame with vit-small and 30+ seconds per frame with vit-large (but these are just guesses).

That being said, these are PyTorch timings, and PyTorch isn't all that great at CPU execution (it doesn't support float16 on CPU, for example). Other runtimes, like ONNX Runtime or OpenVINO (if that even works on ARM...?), can use smaller datatypes and generally seem to make better use of the CPU, so there's probably potential for at least a 2x (or more) speedup on CPU.
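As a sketch of that ONNX route (just an illustration; the export may need tweaks such as a fixed input size or a different opset, and the file/tensor names below are made up): export the small model once, then run it on CPU with onnxruntime.

```python
# Hedged ONNX export/inference sketch; assumes the model traces cleanly at a
# fixed 518x518 input and that the onnxruntime package is installed.
import onnxruntime as ort
import torch

from depth_anything_v2.dpt import DepthAnythingV2

model = DepthAnythingV2(encoder='vits', features=64, out_channels=[48, 96, 192, 384])
model.load_state_dict(torch.load('checkpoints/depth_anything_v2_vits.pth', map_location='cpu'))
model.eval()

dummy = torch.randn(1, 3, 518, 518)  # preprocessed (resized + normalized) input
torch.onnx.export(model, dummy, 'depth_anything_v2_vits.onnx',
                  input_names=['image'], output_names=['depth'], opset_version=18)

session = ort.InferenceSession('depth_anything_v2_vits.onnx',
                               providers=['CPUExecutionProvider'])
depth = session.run(None, {'image': dummy.numpy()})[0]
print(depth.shape)  # relative depth map
```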

In any case, I wouldn't expect the Pi to manage real-time speeds. Other GPU-centric mini PCs (like the Jetson/Orin boards from NVIDIA) might be capable of it, though.