facebookresearch / segment-anything

The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

Why do the results of the same image differ? #185

Open mulinhu opened 1 year ago

mulinhu commented 1 year ago

Hi, I'm trying to find all the objects in an image automatically, using the code below.

import numpy as np
import torch
import matplotlib.pyplot as plt
import cv2
import glob
def show_anns(anns, save_path):
    # Overlay each mask on the current axes with a random translucent color,
    # drawing larger masks first, then save the figure.
    if len(anns) == 0:
        print(save_path)
        return
    sorted_anns = sorted(anns, key=(lambda x: x['area']), reverse=True)
    ax = plt.gca()
    ax.set_autoscale_on(False)
    for ann in sorted_anns:
        m = ann['segmentation']
        img = np.ones((m.shape[0], m.shape[1], 3))
        color_mask = np.random.random((1, 3)).tolist()[0]
        for i in range(3):
            img[:, :, i] = color_mask[i]
        ax.imshow(np.dstack((img, m * 0.35)))
    plt.savefig(save_path)

import sys
sys.path.append("..")
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator, SamPredictor

sam_checkpoint = "../sam_vit_h_4b8939.pth"
model_type = "vit_h"

device = "cuda"

sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device=device)

mask_generator = SamAutomaticMaskGenerator(sam)

files = glob.glob(fr"./*.jpg")
idx = 0
for file in files:
    image = cv2.imread(file)

    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    print(f"image.shape:{image.shape}")
    plt.clf()
    plt.subplot(1,2,1)
    plt.imshow(image)

    plt.subplot(1,2,2)
    plt.imshow(image)
    masks = mask_generator.generate(image)
    print(fr"masks:{len(masks)}")
    show_anns(masks,fr"{idx}.png")
    idx += 1

However, I got this result (see the attached image):

But the demo's results are very good, as you can see in the attached screenshot.

My biggest question is why the online demo's results are so good. Are any other methods being used?

AnasCHARROUD commented 1 year ago

I noticed that as well.

I think there is some image pre-processing that we are not performing.

JordanMakesMaps commented 1 year ago

There are also a number of parameters when initializing the model; you're currently using all of the default values. I agree that it would be nice to know which parameters are used in the online demo @HannaMao

LedKashmir commented 1 year ago

Pay attention to the show_anns() function: the line 'color_mask = np.random.random((1, 3)).tolist()[0]' can produce different visualizations for the same input, but I also don't know how to handle this problem.
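For what it's worth, the overlay colors can at least be made reproducible by seeding NumPy's random number generator before drawing; a minimal sketch, reusing masks from the loop above (the seed value is arbitrary):

import numpy as np

np.random.seed(0)          # arbitrary fixed seed so the colors repeat across runs
show_anns(masks, "0.png")  # the same masks now get the same overlay colors every time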

mulinhu commented 1 year ago

The situation you mentioned may arise, but I think the probability of drawing the same numerical color values is very low. @LedKashmir

Glisten-5481 commented 1 year ago

I would also like to know what parameters are used in the online demo @HannaMao.

Jordan-Pierce commented 1 year ago

Pay attention to the show_anns() function: the line 'color_mask = np.random.random((1, 3)).tolist()[0]' can produce different visualizations for the same input, but I also don't know how to handle this problem.

That's definitely a possibility, but you can see in the example image provided above that the parent bear's ears are segmented differently than in the online demo, so the model really is giving different results (likely different parameters than the API's defaults).

huxycn commented 1 year ago

There are a lot of parameters that can be tuned in SamAutomaticMaskGenerator:

mask_generator = SamAutomaticMaskGenerator(
    # model: Sam,
    # points_per_side: Optional[int] = 32,
    # points_per_batch: int = 64,
    # pred_iou_thresh: float = 0.88,
    # stability_score_thresh: float = 0.95,
    # stability_score_offset: float = 1.0,
    # box_nms_thresh: float = 0.7,
    # crop_n_layers: int = 0,
    # crop_nms_thresh: float = 0.7,
    # crop_overlap_ratio: float = 512 / 1500,
    # crop_n_points_downscale_factor: int = 1,
    # point_grids: Optional[List[np.ndarray]] = None,
    # min_mask_region_area: int = 0,
    # output_mode: str = "binary_mask",
    model=sam,

    points_per_side=32,
    points_per_batch=64,
    pred_iou_thresh=0.86,
    stability_score_thresh=0.92,
    box_nms_thresh=0.5,
    # crop_n_layers=1,
    # crop_n_points_downscale_factor=2,

    min_mask_region_area=500,
)

box_nms_thresh: removes duplicate masks based on the IoU of their bounding boxes.
crop_n_layers=1 and crop_n_points_downscale_factor=2: can give you finer results, because the generator extracts features and decodes masks on multiple crops of the image.
min_mask_region_area: removes small "holes" and "islands" attached to each mask.
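As a concrete illustration of the crop settings above, a generator tuned for finer masks could look roughly like this (reusing sam and image from the earlier snippets; the values are illustrative, not the demo's settings, and min_mask_region_area needs opencv installed for its post-processing):

fine_mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    pred_iou_thresh=0.86,
    stability_score_thresh=0.92,
    box_nms_thresh=0.5,                # suppress masks whose boxes overlap heavily
    crop_n_layers=1,                   # also run the point grid on crops of the image
    crop_n_points_downscale_factor=2,  # fewer points per side on each crop layer
    min_mask_region_area=500,          # remove small holes and islands in each mask
)
masks = fine_mask_generator.generate(image)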

Jordan-Pierce commented 1 year ago

@huxycn have you seen any improvements from doing image pre-processing? Resizing the image obviously helps with speed, but I've also tried sharpening the image, and that seems to help a little.

Jack-bo1220 commented 1 year ago

Same question! Is there any solution?

Clear-3d commented 1 year ago

Same question! The results can vary wildly (see the attached screenshots).

nudlesoup commented 1 year ago

following

maheshs11 commented 1 year ago

@huxycn How should box_nms_thresh and crop_nms_thresh be set so that I get only one mask and one bounding box (no duplicates)? If I set them to 0.5, does that remove any overlap of more than 50 percent, or less than 50?

yong2khoo-lm commented 1 year ago

+1 on this post. I am getting different results too; the web demo's results are much better. It would be great to know the parameters it uses (or any additional processing).

Additionally, generation can be automatically run on crops of the image to get improved performance on smaller objects, and post-processing can remove stray pixels and holes.

Wondering what is being done there...

Akhp888 commented 1 year ago

+1 following

dongjielie commented 1 year ago

I tested 200 images, of which 15 had poor segmentation results, but the online demo's results on the same images were excellent. I really want to know why.

reconlabs-young commented 1 year ago

Same here

helen1c commented 1 year ago

+1 following

HettyPatel commented 1 year ago

+1 following

chava100 commented 1 year ago

+1 following

chava100 commented 1 year ago

I believe the inference in the demo is done with the quantized ONNX model. When I ran examples with the quantized ONNX model, the results improved significantly. I don't know why that is, but maybe this can help you.
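For anyone who wants to try this: the repository README exports only the prompt encoder and mask decoder to ONNX (the vit-h image encoder still runs in PyTorch), and the exported model can then be dynamically quantized with onnxruntime. A rough sketch with placeholder paths; check scripts/export_onnx_model.py for the exact export options:

# Export first (run from the repository root), as described in the README:
#   python scripts/export_onnx_model.py --checkpoint sam_vit_h_4b8939.pth \
#       --model-type vit_h --output sam_onnx.onnx
# Then dynamically quantize the exported model (paths are placeholders).
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="sam_onnx.onnx",
    model_output="sam_onnx_quantized.onnx",
    per_channel=False,
    reduce_range=False,
    weight_type=QuantType.QUInt8,
)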

Jordan-Pierce commented 1 year ago

Hi @chava100, would you mind posting some screenshots? I know a lot of people would be interested.

theodu commented 1 year ago

I agree. Even with the "basic" SAM prediction with clicks to segment a single object, the demo shows much better results than running it locally with default values. It would be great to have the parameters used in the demo! Thanks
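For reference, the click-style prediction can be run locally with SamPredictor roughly like this (a sketch only; the checkpoint path, device, image path, and click location are placeholders):

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

point_coords = np.array([[500, 375]])  # one foreground click (placeholder pixel coords)
point_labels = np.array([1])           # 1 = foreground, 0 = background
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,             # return several candidate masks, as the demo does
)
best_mask = masks[np.argmax(scores)]   # keep the highest-scoring candidate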

chava100 commented 1 year ago

Hi @chava100, would you mind posting some screenshots? I know a lot of people would be interested.

Unfortunately I cannot share images from the dataset I tested on, so I tried to reproduce the results on a different example. The example I have uses a bounding-box prompt, because I couldn't figure out how to run "segment anything" (automatic mask generation) when the model is exported to ONNX. A capture showing results from the SAM PyTorch model:

image

A capture showing results from SAM exported to a quantized ONNX model:

image

This is a capture from the demo. I cannot guarantee that the bounding-box values are exactly the same as in the other two images, because I don't see a way to enter numerical values, but I tried to make it as close as possible:

image

Hope it helps.
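For comparison, a box prompt on the PyTorch side looks roughly like this (a sketch only; the checkpoint path, device, image path, and box coordinates are placeholders):

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)
predictor.set_image(cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB))

input_box = np.array([100, 150, 400, 500])  # placeholder box in XYXY pixel coordinates
masks, scores, _ = predictor.predict(
    box=input_box,
    multimask_output=False,  # a single mask, to mirror the box-prompt comparison above
)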

sliftist commented 1 year ago

I've been grappling with the same issue for the past few days, and while I don't have a solution, I made some progress on identifying the issue.

I believe the SAM model in the repo is the same as the web model, but the vit-h image encoder has slightly different weights.

Here is a mask created from an embedding taken from the web example (copied out of the console), using the web SAM ONNX model: (attached image)

And now the EXACT SAME SAM model, but with an image embedding created from the vit-h model specified in the repo: (attached image)

The strange part is that the odd 4x4 repeating grid pattern DOES appear in the mask from the web embedding, but only in the middle of the mask (near the bottom), never at the edges.

Directly comparing the image embeddings is strange too. This is from the web model (mapping values from -1..1 to 0..255 RGB): (attached image)

And from the vit-h model provided in the repo: (attached image)

At first it looks like the difference is just the scaling (the web model has values closer to 0), but this isn't true in all cases. In one section the padding becomes entirely black, which I could not replicate no matter what color of padding I used (I tested white, gray, and black). I spent a while trying to make the embeddings match via scaling, offsets, normalization, etc., but I couldn't get it to work.

Given the superb quality of the mask created from the web embedding (which is literally pixel perfect, in contrast to the other very messy mask), I assume there isn't a trivial fix, and the web demo is simply using a heavily retrained vit-h model.
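For anyone who wants to reproduce the repo-side half of this comparison, the image embedding can be pulled out of SamPredictor and mapped to an 8-bit image roughly like this (a sketch only; the paths and the channel-averaging choice are mine, not necessarily what was used above):

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)
predictor.set_image(cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB))

# get_image_embedding() returns the encoder output with shape (1, 256, 64, 64).
embedding = predictor.get_image_embedding().cpu().numpy()[0]

# Average over channels and map roughly [-1, 1] -> [0, 255] for a quick visual check.
vis = np.clip((embedding.mean(axis=0) + 1.0) * 127.5, 0, 255).astype(np.uint8)
cv2.imwrite("embedding_vis.png", vis)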

sliftist commented 1 year ago

Also of note, quantizing the vit-h model gives pretty much the same mask result:

image

YutingZhang commented 1 year ago

@sliftist I also often notice similar artifacts at the edges produced by the vit-h model. When using non-natural images (like 3D anime snapshots), these artifacts can sometimes become quite messy, even extending far from the edge with the vit-h model. However, the demo results look much cleaner in comparison.

Thank you also for the in-depth analysis using different feature maps with the same decoder model. It convincingly shows that either the model used for the demo (which appears to be better) has not been released, or the input image was preprocessed somehow.

idonahum1 commented 1 year ago

following +1

sssmallmonster commented 1 year ago

Why do I get this result using the quantized ONNX model? @chava100 (see attached images)

liren2515 commented 1 year ago

following +1

jiangwei221 commented 1 year ago

Why do I get this result using the quantized ONNX model? @chava100 (see attached images)

Hi there! I've run into the same problem. Do you know what causes the shifting/offset?

(screenshot attached)

Update: it's because I used an ONNX model that was traced at a 3:2 resolution and applied it to 16:9 images.

phucbienvan commented 1 year ago

Why do I get this result using the quantized ONNX model? @chava100 (see attached images)

Hi there! I've run into the same problem. Do you know what causes the shifting/offset? (screenshot attached) Update: it's because I used an ONNX model that was traced at a 3:2 resolution and applied it to 16:9 images.

Hi @jiangwei221, how can I update a 3:2 model to 16:9? Thank you.

dddraxxx commented 10 months ago

Has anyone come up with a better choice of parameters than the defaults? How can we get the same results as the demo?

JI4JUN commented 10 months ago

following

WangBin0x5d commented 9 months ago

follow

smandava98 commented 9 months ago

following

yinwu33 commented 8 months ago

follow

Dargonxzy commented 8 months ago

following

scchess commented 3 weeks ago

+1