DepthAnything / Depth-Anything-V2

Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
https://depth-anything-v2.github.io
Apache License 2.0

Temporal consistency #85

Open vitacon opened 1 month ago

vitacon commented 1 month ago

I used your model on several videos (more precisely, on several sequences of images) and the quality and resolution are very impressive. The temporal consistency is usually also good, but not in all videos. Sometimes the overall brightness suddenly changes and the video of depth maps flickers.

It seems there is no option to fix this kind of inaccuracy yet. (Or is there?) I suppose a second pass that compares and adjusts the histograms of neighboring depth maps could help, but I have not tried to write it yet. Do you plan to add something like this to improve the temporal consistency?

https://github.com/user-attachments/assets/d89adda8-b907-47c6-af92-3b0807005583

LiheYoung commented 1 month ago

Thank you for sharing your test results. We are indeed working on improving the temporal consistency. Please stay tuned.

visonpon commented 1 month ago

For temporal consistency, it seems you could get some ideas from SAM2.

elvistheyo commented 4 weeks ago

@vitacon did you get a solution to this problem?

vitacon commented 4 weeks ago

@elvistheyo In my app that uses the depth maps, I implemented exactly what I mentioned in my previous post. (Well, I copied most of the code from my older project. =)

I.e. I work with just two neighboring frames at a time: I generate a histogram of each, find 16 values that divide the histogram area (integral) evenly, compute the 16 corresponding average values from the previous and following frames, and finally adjust the values of the current frame accordingly to get the "same" histogram.
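A minimal sketch of that quantile-matching pass (the function name, the NumPy-based implementation, and the piecewise-linear remap are my assumptions, not vitacon's actual code; only the 16-point idea comes from the post above):

```python
import numpy as np

def match_histogram(curr, prev, nxt=None, n_points=16):
    """Remap `curr` so its histogram roughly matches its neighbors'.

    Finds `n_points` evenly spaced quantiles of each depth map and
    remaps `curr` piecewise-linearly so its quantiles land on the
    reference quantiles (averaged over prev and next if both given).
    """
    qs = np.linspace(0.0, 1.0, n_points + 2)[1:-1]  # interior quantiles
    curr_q = np.quantile(curr, qs)
    ref_q = np.quantile(prev, qs)
    if nxt is not None:
        ref_q = (ref_q + np.quantile(nxt, qs)) / 2.0
    # np.interp clamps values outside the outermost quantiles to the
    # reference endpoints -- acceptable for a rough second pass.
    return np.interp(curr, curr_q, ref_q)
```

For example, if a frame's depth map has a sudden uniform brightness jump relative to its neighbor, the remap pulls its distribution back toward the neighbor's, which is exactly the flicker case described in the issue.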

It's definitely not an ideal solution, but it helps a bit. I believe a more reliable way would be to compare the source images too, find the unchanged areas, make sure the depth maps in these areas are the same as well, and use those values as "control points" to adjust the rest of the depth map.

elvistheyo commented 3 weeks ago

@vitacon thanks for the explanation. If you don't mind, could you explain your second solution in more detail? I did not quite understand the idea behind it.

vitacon commented 3 weeks ago

@elvistheyo The neighboring frames are usually very similar, so you can align them (they could be slightly offset because of camera movement) and compare them pixel by pixel (or square by square, e.g. 8x8 pixels). If the difference is small enough, we can assume the pixel (or square) usually captures the same object at the same distance. That means the depth values of these pixels should be the same too. If they are not, we should change the depth values somehow: replace them with an average (probably [previous + current + following]/3, or even [previous + following]/2) or something similar that would reduce the sudden changes.

The depths of areas that have changed must be adjusted by some other method (maybe the previous one, using histograms). We can probably also "calibrate" the range a bit, because we already know how much we changed (or did not change) the depth maps of the "unchanged" areas.
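The block-comparison part of this idea could be sketched roughly as follows (a hypothetical illustration under my own assumptions: grayscale source frames already aligned, so the camera-motion alignment step mentioned above is skipped, and a made-up per-block intensity threshold decides what counts as "unchanged"):

```python
import numpy as np

def stabilize_depth(prev_img, curr_img, prev_d, curr_d, next_d,
                    block=8, thresh=10.0):
    """Suppress depth flicker in areas where the source barely changed.

    Compares prev_img and curr_img block by block; where the mean
    absolute intensity difference is below `thresh`, the block is
    treated as "unchanged" and its current depth is replaced with the
    temporal average (prev + curr + next) / 3.
    """
    out = curr_d.astype(np.float64).copy()
    h, w = curr_img.shape[:2]
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            a = prev_img[y:y+block, x:x+block].astype(np.float64)
            b = curr_img[y:y+block, x:x+block].astype(np.float64)
            if np.abs(a - b).mean() < thresh:  # block looks static
                out[y:y+block, x:x+block] = (
                    prev_d[y:y+block, x:x+block]
                    + curr_d[y:y+block, x:x+block]
                    + next_d[y:y+block, x:x+block]) / 3.0
    return out
```

Blocks that fail the threshold keep their original depth and would then need the histogram-based pass (or the "calibration" against the stabilized areas) described above.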