DepthAnything / Depth-Anything-V2

[NeurIPS 2024] Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
https://depth-anything-v2.github.io
Apache License 2.0

question: is there any way to use the output as a mask to segment or remove a component in the source image? #25

Open wanghaisheng opened 4 months ago

wanghaisheng commented 4 months ago

If so, can you show some link or example?

XiaoLuoLYG commented 4 months ago

+1 desperately needed

xuan-cloud commented 4 months ago

+1

heyoeyo commented 4 months ago

The most user-friendly way to do this is probably to export the depth prediction as a grayscale image and then load that into a photo-editor and threshold the parts of the mask you want. Then you can use the thresholded mask as an alpha channel with your original image to get segmentations. Alternatively, there are existing models (like Segment-Anything) that can do user-guided segmentation as well.

However, if you want to do it in code with Depth-Anything, you can add the following lines to the run.py script, just after the uint8 depth image is created:

# Use depth prediction to mask out parts of the input color image
# low, high = 0.75, 1.0 # Keep 'close' parts of the image
low, high = 0.0, 0.25 # Keep 'far' parts of the image
low_mask, high_mask = depth >= int(low*255), depth <= int(high*255)
out_bgra = cv2.cvtColor(raw_image, cv2.COLOR_BGR2BGRA)
out_bgra[:,:,-1] = (np.bitwise_and(low_mask, high_mask) * 255).astype(np.uint8)
cv2.imwrite(os.path.join(args.outdir, os.path.splitext(os.path.basename(filename))[0] + "_masked.png"), out_bgra)

The low and high values control which part of the depth map is used for segmenting.
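
If you want to actually remove the masked-out parts (rather than keep a transparent png), you can also composite the result over a plain background. Something like this (just a rough, untested sketch that reuses the out_bgra array from above):

# Alpha-blend the masked BGRA image over a white background, so excluded parts are 'removed'
alpha = out_bgra[:, :, 3:4].astype(np.float32) / 255.0
over_white = (out_bgra[:, :, :3] * alpha + 255.0 * (1.0 - alpha)).astype(np.uint8)
cv2.imwrite(os.path.join(args.outdir, os.path.splitext(os.path.basename(filename))[0] + "_removed.png"), over_white)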

xuan-cloud commented 4 months ago

The most user-friendly way to do this is probably to export the depth prediction as a grayscale image and then load that into a photo-editor...

This method looks great. However, the low and high values need to be manually specified. What methods can I use to automatically determine the thresholds for an image?

heyoeyo commented 4 months ago

Determining the thresholds automatically will probably depend a lot on what sort of images you're working with and what part of the image you want to segment. A very simple automated approach would be to take the median depth value and then take everything either above that or below that, depending on whether you want near or far parts of the scene. Something like:

# Take all parts of image based on percentile of depth data
take_far_parts = False
threshold_point = int(np.percentile(depth, 50))
mask = depth < threshold_point if take_far_parts else depth >= threshold_point
out_bgra = cv2.cvtColor(raw_image, cv2.COLOR_BGR2BGRA)
out_bgra[:,:,-1] = (255 * mask).astype(np.uint8)
cv2.imwrite(os.path.join(args.outdir, os.path.splitext(os.path.basename(filename))[0] + "_masked.png"), out_bgra)

This example uses the 50th percentile (i.e. the median) but it might make sense to change depending on how the depth data tends to be distributed for the images you're working with, and what you're trying to segment.

More generally, you can use a photo editor to check the histogram of the depth image vs. the part of the image you want. Here's an example using the online pixlr editor (there are others, like photopea) to show the histogram of the depth values: [image: depth_histogram_example]. If you did this for a bunch of the images you're working with, you might find that there's some pattern to the part you want to segment vs. the distribution of depth values, and you could either hard-code those thresholds or use some automated approach to pick off a recurring pattern in the histogram. But it all really depends on what you're trying to segment and the distribution of the depth image.
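
If you don't want to hand-pick thresholds at all, one common automated option worth trying (just a sketch, assuming the uint8 depth image from run.py) is Otsu's method, which picks the threshold that best splits the depth histogram into two groups:

# Let Otsu's method pick a threshold from the uint8 depth map
otsu_value, close_mask = cv2.threshold(depth, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
far_mask = cv2.bitwise_not(close_mask)  # invert if you want the 'far' parts instead

This only works well when the depth histogram is roughly bimodal though, so it's worth checking against the histograms you see in the editor.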

xuan-cloud commented 4 months ago

Determining the thresholds automatically will probably depend a lot on what sort of images you're working with and what part of the image you want to segment...

Thank you again for your response. Yes, when processing a batch of images, the threshold distribution varies for each image, so I may not be able to uniformly apply a fixed threshold. However, for example, I always hope to obtain the second-ranked segmented object in terms of depth values in the image, but the depth values of the second-ranked object vary across different images. How can this be addressed?

heyoeyo commented 4 months ago

I always hope to obtain the second-ranked segmented object in terms of depth values in the image

There's probably a number of ways to approach this, but one straightforward way that might work is to calculate the histogram of the depth map and search for the second peak. You can get the histogram using:

hist_counts, hist_bins = np.histogram(depth, np.linspace(0, 255, 256))

As is, the histogram is likely to be very noisy which would make it hard to find the 'second peak'. You can smooth it out to make it easier, with something like:

# Smooth out histogram counts with a gaussian 'blur'
gsize, gsigma = 15, 5
gx = np.linspace(-gsize, gsize, 1 + 2 * gsize)
gaussian = np.exp(-(gx**2) / (2 * gsigma**2))
smooth_hist = np.convolve(hist_counts, gaussian, "same") / sum(gaussian)

# Plot original & smoothed counts for comparison
import matplotlib.pyplot
matplotlib.pyplot.plot(hist_counts)
matplotlib.pyplot.plot(smooth_hist, linewidth=4)
matplotlib.pyplot.show()

This will generate a plot that compares the original vs. smoothed histogram. For the image I posted before, I get the following: [image: example_histograms]

You may have to adjust the smoothing parameters (gsize & gsigma) depending on how noisy the depth histograms tend to be, for this example the settings seem to work alright. Then from here you can 'search' for whatever peak you want (for example just by iterating through the smoothed histogram counts and looking for cases where the values change from increasing to decreasing) and then threshold the depth based on a range around that peak (or maybe even search for the valleys surrounding the peak and use those as the masking thresholds, if that makes sense for your depth distributions). If there still ends up being too much noise to cleanly find peaks, you can always repeatedly apply the same smoothing trick to further simplify the histogram.
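
As a rough illustration of that peak search (this reuses the smooth_hist and depth variables from the snippets above, and the band half-width is just a made-up starting value):

# Find local maxima: points where the smoothed counts switch from rising to falling
diffs = np.diff(smooth_hist)
peak_idxs = np.nonzero((diffs[:-1] > 0) & (diffs[1:] <= 0))[0] + 1

# Threshold a band of depth values around the second peak (if there is one)
if len(peak_idxs) >= 2:
    second_peak = peak_idxs[1]
    band = 20
    mask = (depth >= second_peak - band) & (depth <= second_peak + band)

The resulting mask can then be used as the alpha channel, the same as in the earlier snippets.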

xuan-cloud commented 4 months ago

There's probably a number of ways to approach this, but one straightforward way that might work is to calculate the histogram of the depth map and search for the second peak...

Thank you very much for your response. Smoothing is indeed effective! However, I sometimes find that within the generated depth maps, the depth distribution within the same object can be uneven, resulting in minimal differences in depth between objects. This significantly impacts the segmentation performance. For instance, in your example image, the turtle as a whole exhibits uneven depth values, making it challenging to distinguish from the background foliage when using depth information for segmentation. This issue becomes more pronounced when two objects are close together. Do you have any suggestions for addressing this? Thank you very much!

heyoeyo commented 4 months ago

Many images have a ground-plane (like the turtle image) that makes segmenting especially difficult. It's possible to partially remove this using a plane-of-best-fit, which can help to isolate objects if you have images like this. The code for that is a bit more involved, but I have a script here that can do it, using the function estimate_plane_of_best_fit, you can do something like:

# Remove plane-of-best-fit from depth data
plane_fit = estimate_plane_of_best_fit(depth)
depth_no_plane = depth - plane_fit

Removing the ground plane can help to isolate subjects more clearly and maybe help with segmentation. For example, here's how it looks with the turtle image (left is original, right is with plane-of-best-fit removed): [image: normal_vs_planeremoval]

However, like you said, some images are arranged in such a way that things aren't neatly separated in depth (like the foliage vs. turtle here). In cases where objects aren't arranged at distinct depths, I think depth-based segmentation isn't going to work, at least not on its own (it may still be useful as part of a larger process). For example, the turtle is much easier to segment from the original color image than from the depth map using something like Segment-Anything:

[image: segment_anim]

If you need something more automated, I would consider something like Grounded-SAM, which can segment based on text prompts, assuming the things you're segmenting can be targeted using text. If you have a use case where the depth information is specifically important for the segmentation, then it may be possible to combine the depth-based segmentation mask with the mask from something like Grounded-SAM (for example, using Grounded-SAM to get multiple object segmentations and then using those to get the depth map of each object and excluding objects whose mean/min/max/etc. depth is outside some threshold).
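
As a rough sketch of that last idea (the function and thresholds here are made up, and the per-object masks would come from whatever segmentation model you use):

# Keep only the objects whose median depth falls inside a target range
def filter_masks_by_depth(object_masks, depth, depth_min=100, depth_max=255):
    kept_masks = []
    for obj_mask in object_masks:  # each mask: boolean HxW array
        if obj_mask.any() and depth_min <= np.median(depth[obj_mask]) <= depth_max:
            kept_masks.append(obj_mask)
    return kept_masks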

xuan-cloud commented 3 months ago

Grounded-SAM

Thank you so much for your code! It's incredibly practical! The current issue remains the inconsistency in depth values within the same object, such as the significant difference between the turtle's shell and body in the image. This easily leads to incorrectly segmenting different regions of the same object during segmentation. It continues to impact segmentation performance, and segmentation is just one step in my project. Introducing more additional models might increase complexity and time overhead. Therefore, I'm still looking for simpler and more efficient methods at the moment.

xuan-cloud commented 3 months ago

Many images have a ground-plane (like the turtle image) that makes segmenting especially difficult...

Furthermore, when applying your code for removing the background, I found that it doesn't work on some images. After applying it, my pixel values undergo significant changes. The left image in the figure below shows the histogram before background removal, and the right image shows it after removal. The part around 30 in the image represents some noise generated by the background, while the part around 200 represents the object. Ideally, after removing the background, only the peak around 200 should remain. However, the result is not what I expected. Do you know the reason for this? [images: before, after]

heyoeyo commented 3 months ago

However, the result is not what I expected. Do you know the reason for this?

Sorry I should have clarified, for the plane removal, you'll need to convert the depth data into a floating point format (e.g. np.float32). The depth data from the depth-anything result (and used in the histogram) will likely be in uint8 format, which can only take on values between 0 and 255. When the subtraction with the plane occurs, any pixel value below the plane value will 'wrap around' due to the uint8 format (e.g. 50 - 100 = 206), and that can cause the values to re-distribute in a strange way.
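
For example, something like this (just a sketch; whether to clip or re-normalize the result back to 0-255 afterwards depends on what you're doing with it):

# Do the plane subtraction in float32 to avoid the uint8 wrap-around
plane_fit = estimate_plane_of_best_fit(depth)
depth_no_plane = depth.astype(np.float32) - plane_fit.astype(np.float32)
depth_no_plane_uint8 = np.clip(depth_no_plane, 0, 255).astype(np.uint8)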

The other problem is if there is no obvious plane (or it doesn't make up a significant portion of the image), in which case the plane-fit will be influenced by other objects and the removal will give a strange result.

That all being said, the histogram will (generally) be heavily distorted by the plane removal. It should usually push pixels towards 'black' which will give a histogram that is more heavily distributed towards 0. It'll also move around the values of the object you're trying to segment, depending on where the object is in the image (since all the pixels of the object also have the plane values subtracted from them, which will shift them around in the histogram). Though the idea is that for some images it can push all the background/floor close to zero, and hopefully leave the object at higher values to more easily segment/threshold out. However this is very image dependent, for example it doesn't work great with the turtle image, due to the foliage.

The current issue remains the inconsistency in depth values within the same object...

There may be ways of mitigating this, though it's always going to be very image-dependent. One simple-ish thing to try is mixing together similar regions before any processing. This can be done using morphological filtering applied to grayscale data (like the depth image). Assuming you have the uint8 version of the depth data, you can do something like:

# Apply morphological filtering to reduce depth variations in nearby areas
effect_strength = 51
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, [effect_strength]*2)
depth_morph = cv2.morphologyEx(depth_uint8, cv2.MORPH_OPEN, kernel)

That may help 'flatten' the object you're trying to segment, so that the histogram-based techniques can better extract it. Morphological filtering can also be used on the (binary) segmentation mask itself to expand the segmented region (in the code above this is done by changing cv2.MORPH_OPEN to cv2.MORPH_DILATE or possibly cv2.MORPH_CLOSE in some cases), so that might also help to fix a segmentation mask that's incomplete.
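
For example (another sketch; mask_uint8 here just stands in for whatever 0/255 mask you produced earlier):

# Dilate a binary (0/255 uint8) segmentation mask to expand an incomplete segmentation
mask_dilated = cv2.morphologyEx(mask_uint8, cv2.MORPH_DILATE, kernel)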

xuan-cloud commented 3 months ago

Sorry I should have clarified, for the plane removal, you'll need to convert the depth data into a floating point format (e.g. np.float32)...

Thank you very much for your assistance! I've also encountered an issue where inconsistent depths within the same object make segmentation difficult. The following lines of code scale the depth to the range of 0-255:

depth = (depth - depth.min()) / (depth.max() - depth.min()) * 255.0
depth = depth.astype(np.uint8)

However, I've noticed that methods based on depth segmentation tend to segment more accurately when there are two or more objects in the scene. But if there's only one object in the scene, I often find the results are not as precise. Could this be due to the lack of depth contrast between two objects, or some other reason? What do you think? Thank you!

heyoeyo commented 3 months ago

Could this be due to the lack of depth contrast between two objects

Yes, that makes sense. If there's only one object, it's likely taking up more of the screen and more of the range of depth values. An object with a wide range of depth values can be hard to separate from the background (if using thresholding methods), since there's a higher chance of overlapping values with other parts of the image. It may be possible to use more sophisticated segmentation algorithms if simple thresholding doesn't work (like the Felzenszwalb or Chan-Vese algorithms available from scikit-image), though in some cases these may work better using the original color image instead of the depth output.
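
If you want to experiment with that, a rough sketch using scikit-image's Felzenszwalb implementation on the original color image would look something like this (the parameters are just starting points to tune):

# Cluster the color image into regions with Felzenszwalb's graph-based segmentation
from skimage.segmentation import felzenszwalb
labels = felzenszwalb(raw_image, scale=100, sigma=0.8, min_size=50)
region_mask = (labels == 0)  # each integer label can be turned into a per-region mask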

Kafka157 commented 3 months ago

If so, can you show some link or example?

Well, have you ever tried an existing model called 'Segment and Track Anything'? You can simply mask something just by clicking on it.

xuan-cloud commented 1 month ago

There may be ways of mitigating this, though it's always going to be very image-dependent...

I have thought about it for a long time. Do you think it is possible for me to use clustering methods to identify regions of the same object? Would this help with segmentation?

heyoeyo commented 1 month ago

Do you think it is possible for me to use clustering methods to identify regions of the same object?

Yes, clustering should work for certain images; the Felzenszwalb algorithm is a good example that is based on clustering. Even simpler clustering approaches have long been the standard way photo editors handle segmentation (e.g. 'fuzzy selection', often called the magic wand tool), though I think newer AI-based models (like SAM or YOLO) will tend to outperform other techniques when trying to segment 'typical' looking objects (e.g. a person in a photograph).
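
If you want to try a very simple depth-only clustering first, something like k-means over the depth values could be a starting point (just a sketch that reuses the cv2/np imports from run.py; it ignores spatial layout, so it's effectively thresholding with automatically chosen levels):

# Cluster the uint8 depth values into a few groups with k-means
samples = depth.reshape(-1, 1).astype(np.float32)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
num_clusters = 3  # made-up value, tune per image
_, labels, centers = cv2.kmeans(samples, num_clusters, None, criteria, 5, cv2.KMEANS_PP_CENTERS)
cluster_map = labels.reshape(depth.shape)  # per-pixel cluster index (0..num_clusters-1)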