Tencent / DepthCrafter

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
https://depthcrafter.github.io

varying depth values #30

Open arlo-ml opened 1 month ago

arlo-ml commented 1 month ago

Hi, thanks for your amazing work!

I've been testing your code with long videos (300 to 800 frames), and I often get varying background depth values over time. For example, with this video (383 frames) I get different background values from frame 231 to frame 315, using these parameters:

- Output frame rate: 24
- Inference steps: 25
- Guidance scale: 1.2
- Dataset: kitti

Is it expected with longer videos?

[attached: two depth-map screenshots]

arlo-ml commented 1 month ago

And these are other videos. As you can see, the values of the background change over time:

[attached: depth-map screenshots from two more videos]

juntaosun commented 1 month ago

The project is mainly built on Stable Video Diffusion. The prediction changes every time you run inference because of the randomness, so I don't think it is truly temporally consistent.

acgourley commented 1 month ago

My understanding of the paper is that they use a sliding context window of around 1.5 s for inference, so it makes sense that the depth would drift over periods longer than a couple of seconds. I doubt there is a simple fix, but I'd love to hear ideas if people have them.

wbhu commented 1 month ago

Hi, thank you for your feedback. Due to memory restrictions, the maximum number of frames processed at one time is 110. Videos longer than that are processed in overlapping segments.

Temporal consistency within a single segment is very good, I think. As for temporal consistency across segments, our inference strategy (including noise initialization and latent interpolation) works for most cases, but it is hard to always guarantee consistency across segments, because the temporal context is limited to one segment at a time.
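Conceptually, the cross-segment stitching works like the following simplified sketch (a hypothetical helper operating on decoded depth maps rather than latents; not our actual implementation):

```python
import numpy as np

def blend_segments(segments, overlap):
    """Stitch per-segment depth predictions, each of shape (T, H, W).

    Consecutive segments share `overlap` frames; over that region we
    cross-fade linearly from the previous segment to the next one.
    """
    out = segments[0].astype(np.float32)
    for seg in segments[1:]:
        seg = seg.astype(np.float32)
        # weights ramp from 1 (keep previous) down to 0 (use next)
        w = np.linspace(1.0, 0.0, overlap)[:, None, None]
        blended = w * out[-overlap:] + (1.0 - w) * seg[:overlap]
        out = np.concatenate([out[:-overlap], blended, seg[overlap:]], axis=0)
    return out
```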

Best, Wenbo

arlo-ml commented 1 month ago

Hi Wenbo, thank you for the explanation. I've already tested around 40 videos (all black-and-white film sequences), and I can confirm that your inference strategy works for most cases, even with videos longer than 110 frames. I was wondering whether using different values for the noise initialization and latent interpolation might help adapt to different scenarios. I did a brief search, but I could not find any command-line options that would help me solve those isolated cases. Would it require modifying your original code?

wbhu commented 1 month ago

Hi, the noise initialization for overlapped segments is already included in the code. For the failure cases, you may try setting a different random seed (the default is 42) by adding the argument "--seed xxx". I'm not sure whether this will help or not ...

What will influence the result for sure is where the video is segmented; you may tune this for the failure cases.
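For example, here is a hypothetical sketch (not the actual scheduling code, and the overlap default is only illustrative) of how segment boundaries could be computed, so you can see where the cuts fall and shift them away from a failure region by changing the window or overlap:

```python
def segment_starts(num_frames, window=110, overlap=25):
    """Return the start frame of each overlapping segment."""
    stride = window - overlap
    starts = list(range(0, max(num_frames - window, 0) + 1, stride))
    # ensure the tail of the video is covered by a final segment
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return starts

print(segment_starts(383))  # [0, 85, 170, 255, 273]
```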

arlo-ml commented 1 month ago

Thank you, I'll run more tests following your suggestions.

STUDYHARD2113 commented 1 month ago

> Hi, thank you for your feedback. Due to memory restrictions, the maximum number of frames processed at one time is 110. Videos longer than that are processed in overlapping segments.
>
> Temporal consistency within a single segment is very good, I think. As for temporal consistency across segments, our inference strategy (including noise initialization and latent interpolation) works for most cases, but it is hard to always guarantee consistency across segments, because the temporal context is limited to one segment at a time.

Hi Wenbo, I found the code that normalizes the depth over the whole sequence. If I need to split a very long sequence (>150 frames) into separate parts for inference and want to keep the different segments consistent, do I need to remove this part? Even though different parts may have overlapping scene content, judging from the depth ground truth, the depth ranges of the different segments will surely not be the same?

```python
# normalize the depth map to [0, 1] across the whole video
res = (res - res.min()) / (res.max() - res.min())
```
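One common way to reconcile differing per-segment ranges (sketched below with hypothetical names; not the repository's code) is to fit a scale and shift on the overlapping frames so that each new segment matches the previous one, and then apply the global [0, 1] normalization only once over the stitched result:

```python
import numpy as np

def align_to_previous(prev_overlap, next_overlap, next_segment):
    """Fit scale s and shift t so that s * next + t matches the previous
    segment on the overlapping frames, then apply them to the whole segment."""
    x = next_overlap.ravel()
    y = prev_overlap.ravel()
    # least-squares fit of y ~ s * x + t over the overlap region
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * next_segment + t
```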