telamon opened this issue 1 month ago
Hi, thanks for your interesting insight!
Can you provide the input image as well as the crop code?
I'll respond to this issue within one week after the ICLR submission deadline.
The image above is an asset, but I think any high-resolution image should do.
(e.g. we use images of 1024x1280 for the x,y dimensions, but the depth z-axis is limited to 256 levels...)
There's currently no crop code; I used Gimp and Blender to quickly test the idea before hacking together something like an "InpaintDepth" ComfyUI custom node. I'm still unsure if it's possible to blend the two depth maps together with decent results, which is why I'm asking for feedback.
No stress and good luck at the conference.
(1) For the first query: The initial depth map tends to be flat due to the large invalid background area. A better approach might be to crop the image while maintaining a balanced ratio between the foreground and background. Your proposed two-stage depth fusion is intriguing, but it might encounter challenges in selecting the optimal cropping region, as both depth1 and depth2 could still struggle with incorrect scaling. I have tried several cases; the two-stage fusion is quite unstable, and my proposal is more convenient.
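For illustration, a balanced crop of that kind could look roughly like this (just a sketch; `fg_box` would come from whatever foreground/face detector you already use, and the margin factor is only an example value):

```python
def balanced_crop(image, fg_box, margin=0.3):
    """Crop around a detected foreground box while keeping some background.

    image: H x W x C numpy array; fg_box: (x0, y0, x1, y1) of the foreground
    (e.g. a face detection); margin: fraction of the box size kept as context.
    """
    h, w = image.shape[:2]
    x0, y0, x1, y1 = fg_box
    dx, dy = int((x1 - x0) * margin), int((y1 - y0) * margin)
    x0, y0 = max(0, x0 - dx), max(0, y0 - dy)
    x1, y1 = min(w, x1 + dx), min(h, y1 + dy)
    return image[y0:y1, x0:x1], (x0, y0, x1, y1)
```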
(2) For the second query: Yes, that makes sense! Storing higher precision depth maps could offer significant advantages. It's possible that researchers often opt for 8-bit precision because it's sufficient for most applications, while 16-bit or 32-bit data might introduce additional complexity, making the training process more difficult or less stable.
Thank you for the reply. (1) Yes. I realized that depth fusion is quite complex.... I tried to gain insight into the "flatness" using the following test:
| Sample | Background Included? | Eye Inner | Chin | Neck |
|---|---|---|---|---|
| A) Full | yes | 0.049 | 0.043 | 0.139 |
| B) Head | upper corners | 0.092 | 0.058 | 0.192 |
| C) Face | none | 0.278 | 0.190 | 0.505 |
(measurements taken by positioning a parallel plane at the nearest vertex, the tip of the nose)
It is as you say: the model is very good at separating the background from objects in the foreground.
But I failed to find a balance; I expected greater separation of detail between samples A and B, but the flatness is quite similar.
It wasn't until I cropped away all background pixels (C) that attention was diverted to detail and facial structure emerged.
End note: Seeing the difficulty of this problem, I am intrigued but also quite demotivated.
When I opened this issue I assumed that enriching detail iteratively would be as simple as inpainting depth the same way we inpaint color.
Problem 1.a) Selecting the crop region. As an artist I would manually select a crop region where the inferred depth does not express the detail. As an algorithm, I'm not sure; searching for flatness could maybe be done by looking for regions with low local variance in depth and comparing that against an edge-detection measure of color/detail.
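To make that concrete, a rough sketch of the idea (the window size and thresholds are made-up values; `depth` and `gray` are assumed to be float arrays of the same shape):

```python
import numpy as np
from scipy import ndimage

def flat_but_detailed_mask(depth, gray, win=31, depth_var_thr=1e-4, edge_thr=0.05):
    """Flag regions where the depth map is locally flat but the image has detail.

    depth: float depth map in [0, 1]; gray: float grayscale image in [0, 1].
    Returns a boolean mask of candidate regions to re-crop and re-infer.
    """
    # local variance of depth: E[d^2] - E[d]^2 over a sliding window
    mean = ndimage.uniform_filter(depth, win)
    mean_sq = ndimage.uniform_filter(depth ** 2, win)
    depth_var = mean_sq - mean ** 2

    # simple edge strength in the intensity image (Sobel magnitude), smoothed
    gx = ndimage.sobel(gray, axis=1)
    gy = ndimage.sobel(gray, axis=0)
    edge = ndimage.uniform_filter(np.hypot(gx, gy), win)

    return (depth_var < depth_var_thr) & (edge > edge_thr)
```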
Problem 2.a) Depth fusion. The above test showed me that even if I know the `depth1` bounds of a cropped area, scaling `depth2` into that box would just reproduce the same flatness. If I attempted to fix it manually with Blender's sculpting tools, I'd have to push some of the surrounding vertices backwards and then pull some detail forward, meaning that `depth1` would have to be partially invalidated.
Really starting to regret I asked.. haha
TL;DR: having compared quite a few depth maps, the flatness does not occur on every input; some images render with great detail and captivating stereo, others not so much. Flat compositions stay flat, deep compositions get deeper.
(2) Sorry, I barked up the wrong tree, and thank you (I found my 8-bit problem). The encoding scheme I proposed is not as visually appealing as the grayscale map when viewed by itself, but it's compatible with software that expects 8-bit maps, and I can confirm that GeoWizard infers higher resolution than what can be represented by 8 bits. However, I saw no difference between RGB24 and the obscure PNG GrayScale16bit mode in Blender, though other software might fail to decode GS16.
Leaving breadcrumbs to my encoding tests and visualizations.
Hi, thank you for your insightful input! We've also observed that the cropped region can impact flatness. A straightforward way to mitigate this is to set the scale to 1.4 and the shift to 0.4 (these values can be adjusted based on visualization needs), and then convert the relative depth back to metric depth. We've found that an appropriate pair of scale and shift values works well for most human face cases. Best regards!
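(For concreteness, the conversion described above is roughly the affine mapping below; the exact formula used in the repository may differ.)

```python
import numpy as np

def relative_to_metric(depth_rel, scale=1.4, shift=0.4):
    """Map a relative (affine-invariant) depth prediction to metric-like depth.

    depth_rel: float array; it is first normalized to [0, 1], then the
    scale/shift mentioned above are applied.
    """
    d = (depth_rel - depth_rel.min()) / (depth_rel.max() - depth_rel.min() + 1e-8)
    return d * scale + shift
```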
Thank you for the lead, but do you mind reopening the issue and maybe changing the title to "Expose scale & shift as options"? I will take a look at this myself, but it'll take some time because I just switched assignments. Regards!
Sure
edit 2: In order to focus the model onto different details, as an inference user I would like to expose two additional variables to tweak: scale and shift.
original text:
Hello, first of all I'd like to say thank you for this model; it's really fun to use and produces convincing results.
But I have an issue: it seems to me that a lot of detail is being lost/misplaced due to the low resolution of the depth map output (8 bits).
So I tried running inference on an image twice:
First pass: the original image.
Second pass: a cropped-out region with high detail (the face).
Then I stitched both depth maps together.
The results were surprisingly refreshing:
left) Single pass: most of the foreground detail had been pushed into the upper boundary, and the character appears quite flat.
right) Second pass: reveals new geometry and smoother surfaces, but also some more noise.
So my first question is: if geowizard inference is used recursively to refine and inpaint regions with higher detail, what would be the most correct way to merge the two depth maps? My naive approach would be to rescale the cropped depth into the bounds of the corresponding region in the full-image depth, roughly like the sketch below:
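(A minimal sketch of that idea, assuming `depth_full` and `depth_crop` are float arrays from the two passes and `box` is the crop's pixel coordinates in the full image; `depth_crop` is assumed to already be resized to the box size.)

```python
import numpy as np

def naive_merge(depth_full, depth_crop, box):
    """Rescale the cropped depth into the depth range of the same region in the
    full-image depth map, then paste it back (naive: no blending at the seams).
    """
    x0, y0, x1, y1 = box
    region = depth_full[y0:y1, x0:x1]
    lo, hi = region.min(), region.max()

    # normalize the crop's depth to [0, 1], then stretch it into [lo, hi]
    d = (depth_crop - depth_crop.min()) / (depth_crop.max() - depth_crop.min() + 1e-8)
    merged = depth_full.copy()
    merged[y0:y1, x0:x1] = lo + d * (hi - lo)
    return merged
```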
Second inquiry: most current depth maps are grayscale PNG files. Wouldn't it make sense to break the norm and start using the green and blue channels to store higher-resolution depth information? I guess it's possible to keep the red component as-is, but I don't understand why nobody is storing additional fraction bits in the other two bytes.
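For what it's worth, the kind of packing I have in mind looks roughly like this (a sketch, not tied to any particular viewer; R stays close to the usual 8-bit depth so existing tools still work, while G and B carry the extra fraction bits):

```python
import numpy as np

def pack_depth_rgb(depth):
    """Pack a normalized float depth map into a 24-bit RGB image.

    R carries (approximately) the usual 8-bit depth for backwards
    compatibility; G and B hold the next 16 bits of the fraction.
    depth is assumed to be a float array in [0, 1].
    """
    q = (np.clip(depth, 0.0, 1.0) * (2 ** 24 - 1)).astype(np.uint32)
    r = (q >> 16) & 0xFF
    g = (q >> 8) & 0xFF
    b = q & 0xFF
    return np.stack([r, g, b], axis=-1).astype(np.uint8)

def unpack_depth_rgb(rgb):
    """Inverse of pack_depth_rgb: recover float depth in [0, 1]."""
    r = rgb[..., 0].astype(np.uint32)
    g = rgb[..., 1].astype(np.uint32)
    b = rgb[..., 2].astype(np.uint32)
    q = (r << 16) | (g << 8) | b
    return q.astype(np.float64) / (2 ** 24 - 1)
```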
~~Lastly, I know little about training, but I wonder if the model was trained to produce 8-bit output, or if by any chance there's an 8-bit bottleneck somewhere in the training pipeline that would prevent the produced depth maps from being as smooth as their sibling normal maps.~~ (output is 32-bit)
Thanks for hearing me out!
edit: fixed late night code snippets :brain: :gun: