DepthAnything / Depth-Anything-V2

[NeurIPS 2024] Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
https://depth-anything-v2.github.io
Apache License 2.0

Custom depth map output format #70

Closed ThreeDeeJay closed 2 months ago

ThreeDeeJay commented 3 months ago

It'd be nice to be able to use Depth Anything V2 in DepthViewer (a 2D-to-3D video converter/player for VR/3D displays), but it only supports V1 using this. I'm particularly interested in 2D-to-3D video/movie conversion, and since it might not be possible to get stable, realtime (e.g. 24 FPS) conversion using the highest-quality models, the best alternative is to pre-cache the depth map for every frame beforehand. Luckily, DepthViewer already has a cache format for videos, which is basically just a ZIP file containing every raw frame in PGM/PFM format plus some metadata.

So could you please add an option to run_video.py to output every depth frame as numbered PGM/PFM files, perhaps in a (probably solid) ZIP file? Then we could just manually add a metadata file from a template to use it in DepthViewer 👀👌

If you need a sample, here's Wildlife and the .depthviewer file generated by DepthViewer's built-in model.

heyoeyo commented 3 months ago

I'm not sure how likely it is that such a specific/custom change would be added to the codebase, but it's relatively simple to make the modifications on your own local copy.

First, the current video runner doesn't include a frame index, so you'd have to add that in. You'd also want to create a folder to store the pgm files for each separate video (in case you process multiple videos, you don't want them overwriting each other). This can be done by adding some extra code to create the save folder path and start a frame counter just above the existing while loop, something like:

```python
video_name = os.path.splitext(os.path.basename(filename))[0]
pgm_folder_path = os.path.join(args.outdir, video_name)
os.makedirs(pgm_folder_path, exist_ok=True)
frame_idx = 0
```

And then the only other change would be to save pgm files and update the frame counter. You can do this by adding some new lines just after the normalized depth image is created:

```python
pgm_save_path = os.path.join(pgm_folder_path, f"{frame_idx:06}.pgm")
cv2.imwrite(pgm_save_path, depth)
frame_idx += 1
```
ThreeDeeJay commented 2 months ago

Thanks, I'll give it a shot when I get a chance. I've added the changes to my fork: https://github.com/ThreeDeeJay/Depth-Anything-V2/commit/271db9f0f18bc625ab6bf5eed7921c51cba4a9d1

ThreeDeeJay commented 2 months ago

Follow-up: it works! I just had to remove the leading zeroes from the PGM filenames. Then I ran

```
python run_video.py --encoder vitl --video-path "path\to\Wildlife.wmv" --outdir Wildlife --pred-only --grayscale
```

and zipped the PGM files in "path\to\Wildlife\Wildlife" along with METADATA.txt into "%USERPROFILE%\AppData\LocalLow\parkchamchi\DepthViewer\depths\Wildlife.wmv.MidasV21Small.05aefc9502534cac696a25e5020ae40abbc2a2bc918a9902a60673849a0857b4.depthviewer", and voilà: run DepthViewer and open Wildlife.wmv (or whatever file you're using) 👌

Note: it says it's using the MiDaS model, but I'm just fooling DepthViewer into loading a pre-generated depth map from an unsupported model under a supported (built-in) model name, so that there's no need for extra configuration. @parkchamchi might wanna check this out 👀

Cross-eyed 3D screenshots of (realtime) playback in action: [6 screenshots omitted]

Now I just gotta figure out how to improve performance, because right now it's like 5 FPS tops and CPU/GPU usage is pretty low 🤔 Also, I think wrapping this into a script that zips and copies the file to the DepthViewer cache folder should be trivial.

heyoeyo commented 2 months ago

> figure out how to improve performance

I'm not sure if you mean the initial video recording or the 3D display. If it's the video recording, I recently added a frame-by-frame saving feature to my own video script (it just needs to be run with the --allow_recording flag), which tends to run faster (~2x) than the original depth-anything code, due to using float16 instead of float32. So maybe that can help if that's the slow part (my version also has the zero-padding that would need to be removed, and it saves .png instead of .pgm, but that's easy to modify).

> script that zips

Yes, this should be doable using the built-in zipfile library. It seems like you'd also need to fill out a bunch of fields to generate the metadata file. It looks like you already have several of them (e.g. width/height/framerate) in your script; you can also get a timestamp and frame count using:

```python
import time
timestamp = int(time.time())
framecount = int(raw_video.get(cv2.CAP_PROP_FRAME_COUNT))
```
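Putting the pieces together, a packaging step might look like the sketch below. This is only a guess at the layout based on this thread (a flat ZIP of numbered .pgm frames plus a METADATA.txt); the function name and the metadata contents are assumptions, not a documented DepthViewer format:

```python
import os
import zipfile

def pack_depthviewer(pgm_folder, metadata_text, out_path):
    """Bundle numbered .pgm depth frames and a METADATA.txt into one ZIP
    archive. ZIP_STORED (no compression) keeps it simple and fast; swap in
    ZIP_DEFLATED if smaller files matter more than packing speed."""
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_STORED) as zf:
        # Metadata first, written straight from a string (template-filled)
        zf.writestr("METADATA.txt", metadata_text)
        # Then every frame, in sorted (numeric) filename order
        for name in sorted(os.listdir(pgm_folder)):
            if name.endswith(".pgm"):
                zf.write(os.path.join(pgm_folder, name), arcname=name)
```

From there it's just a `shutil.copy` (or writing `out_path` directly) into the DepthViewer `depths` cache folder with the expected `.depthviewer` filename.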
ThreeDeeJay commented 2 months ago

Interesting, I'll check it out 👍 And yeah, I meant performance when generating the depth image/video. Ideally it'd use as much of the CPU/GPU as possible for faster conversion (maybe even realtime). DepthViewer gets a significant performance boost when using CUDA, but even then it seems to be bottlenecked somewhere, so it doesn't reach full hardware utilization 😔

By the way, do you think it would be feasible to write a tool that generates the left and right view on-the-fly to stream it in a local "server" so that other players can open it via URL? I like DepthViewer, but I'd love to be able to use motion interpolation like I do in PotPlayer with SVP to watch Blu-ray 3D movies to get 72FPS per eye on my 144hz active 3D monitor.

Also, any idea if there's a better format to store depth map videos? Come to think of it, it seems really inefficient to store uncompressed raw frames (other than the image-agnostic ZIP compression). I wonder if there's a video format that uses temporal frame differences/deltas to save space (especially if we eventually distribute depth maps online, to reduce the need to cook our hardware generating the same thing), while still decoding to lossless frames, because we obviously don't want compression artifacts messing up the depth map.

heyoeyo commented 2 months ago

> any idea if there's a better format to store depth map videos? ... really inefficient to store uncompressed ...

Yes, the .pgm format is very inefficient (and may contribute to the slow playback if disk access is the bottleneck). Using .png is better, but at the cost of needing to decompress it (i.e. more CPU work). Apparently ffv1 is a codec that can do lossless video; I just tried this command:

```
ffmpeg -framerate 30 -i %08d.png -c:v ffv1 output.mkv
```

Running this on the PNGs from my video script creates a video that's a bit smaller than the PNGs themselves (4.6 MB for the video vs. 6.7 MB for 180 separate PNGs vs. 25.3 MB for PGMs), so that might be worth trying for smaller storage plus faster disk access (which should actually be quite significant).

> a tool that generates the left and right view on-the-fly to stream it in a local "server" so that other players can open it via URL

In theory, yes... although I'm not sure about left/right views: I don't know if that's handled as two separate streams or as one combined stream following some layout constraints (I'm not familiar with VR standards at all). I've used the mediamtx server for creating 'live' video streams from files before, and I assume it would be possible to dynamically generate frames on the fly to feed that stream. But I don't know much more than that as far as how the VR formatting would/could be handled.

ThreeDeeJay commented 2 months ago

I think besides lossless encoding, it also needs higher bit depth for an accurate depth map that's not constrained to the regular RGB range. Like, the Zoe model required a PFM because apparently not even PGM preserved the full depth information.

But PFM is still an image format. I think ideally we'd want a lossless video codec that allows high enough bit depth and uses correlation between consecutive frames in case there's redundant data, to compress efficiently.

AVIF and HEIC both support depth maps and can be used to compress video (HEVC for the latter, I presume), so I wonder if we could get them working with DepthAnything 🤔 JXL might be worth a look too.

> although I'm not sure about left/right views. I don't know if that's handled as two separate streams or some combined stream following constraints on the layout (I'm not familiar with VR standards at all).

The standard-ish 3D format is just SBS (side-by-side, left eye first), so if we could somehow create a live stream containing both views in a single frame (displacing/warping the source video's pixels using the depth map), then PotPlayer (or any other 3D/VR player) could split the image and show each half to the corresponding eye.

But the main challenge would be using a method that generates the left and right views without losing too much quality or causing distortion/excessive haloing from stretched pixels. Probably something like SuperDepth3D, but processing the video stream before serving it to external players.
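For a sense of what the warping step involves, here is a deliberately naive depth-image-based-rendering sketch in NumPy (the function name and the linear depth-to-disparity mapping are my own assumptions; real implementations like SuperDepth3D add hole filling and edge handling, which this skips entirely):

```python
import numpy as np

def depth_to_sbs(frame_rgb, depth_u8, max_shift_px=12):
    """Naive DIBR sketch: displace each pixel horizontally by a disparity
    proportional to its depth, once per eye, then stack the two views
    side by side (left eye first). No hole filling, so expect black gaps
    and haloing at depth edges."""
    h, w = depth_u8.shape
    # Nearer (brighter) pixels get a larger disparity
    disparity = (depth_u8.astype(np.float32) / 255.0) * max_shift_px
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    x_left = np.clip(xs + disparity, 0, w - 1).astype(np.int32)
    x_right = np.clip(xs - disparity, 0, w - 1).astype(np.int32)
    left = np.zeros_like(frame_rgb)
    right = np.zeros_like(frame_rgb)
    left[rows, x_left] = frame_rgb    # forward-warp into each view
    right[rows, x_right] = frame_rgb
    return np.concatenate([left, right], axis=1)  # SBS, left eye first
```

The quality problems mentioned above show up exactly here: forward-warping leaves unfilled holes where disparity changes abruptly, which is why production methods do gap filling or inpainting afterwards.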

EDIT: stable-diffusion-webui-depthmap-script is able to generate SBS and added DepthAnythingV2 support. So I'm probably gonna bug them to see if we can set up an SBS stream server 🤞

heyoeyo commented 2 months ago

> lossless video codec that allows high enough bit depth and uses correlation between consecutive frames in case there's redundant data, to compress efficiently

Considering the overall customness of this sort of data, a simple-ish solution might be to just encode the higher bits into the green and blue channels of the frame, and then use that ffv1 codec (or any other lossless codec) on the resulting frame. That would allow for up to 24-bit depth per frame (8 per RGB channel), which should be more than enough (actually just using the red/green channels for 16-bits seems like it's probably good enough). Then it would just be a matter of properly un-packing the channel data on the other end.
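A minimal sketch of that channel-packing idea (16 bits split across the red/green channels, with blue left free; the function names are made up for illustration):

```python
import numpy as np

def pack_depth16(depth_u16):
    """Pack a 16-bit depth map into the red/green channels of an 8-bit RGB
    frame, so any lossless 8-bit-per-channel codec (e.g. ffv1) preserves it."""
    h, w = depth_u16.shape
    rgb = np.zeros((h, w, 3), dtype=np.uint8)
    rgb[..., 0] = (depth_u16 >> 8).astype(np.uint8)    # high byte -> red
    rgb[..., 1] = (depth_u16 & 0xFF).astype(np.uint8)  # low byte  -> green
    return rgb

def unpack_depth16(rgb):
    """Inverse operation: recover the 16-bit depth map on the playback end."""
    return (rgb[..., 0].astype(np.uint16) << 8) | rgb[..., 1].astype(np.uint16)
```

The one caveat is that the codec really must be lossless and must not do chroma subsampling, since even tiny errors in the "green" channel become visible steps in the reconstructed depth.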

ThreeDeeJay commented 2 months ago

That sounds like a good idea. I wonder if we could take it a step further and reuse the source 2D video's compression/edges/motion vectors for the depth map. So, a sort of 2D-plus-depth-and-delta, stored separately on a different track, like the MVC/second view of Blu-ray 3Ds, which allows playing the main/AVC view in 2D for backwards compatibility.

That way maybe someone could eventually make a 3D player that auto-downloads depth maps generated by other people, using the video track hash, and it'd hopefully make it more legal to distribute movie depth maps to begin with, since you'd need to own the video file to decode the depth map. That'd certainly be way better than sharing the whole movie in SBS online, not to mention having a separate depth map could allow 3D strength customization and keep the door open for updated features like better haloing/gap fill algorithms or something.

heyoeyo commented 2 months ago

I completely forgot 3D blu-rays were a thing! It makes sense that there are already some standards for merging 2D/depth data together into video formats. Though maybe if built-in NPUs become commonplace, real-time conversion of 2D video to 3D depth maps will eventually just be a standard part of every video player, that'd be pretty cool.

ThreeDeeJay commented 2 months ago

Yeah, Blu-ray 3D (MVC) is reasonably efficient for full HD 3D. Here's the main/AVC track's file size compared to the second view (MVC): [screenshot omitted]. Decoding on PC can be tedious though (only a handful of players support it), so I even had to put together a script here: https://github.com/ThreeDeeJay/PotPlayer3D/releases/latest