facebookresearch / segment-anything

The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

Why does SAM invert the pixel values of the output? #675

Open aleemsidra opened 7 months ago

aleemsidra commented 7 months ago

I have a grayscale skull image. Since SAM expects RGB input, I converted my image to a 3-channel format and then scaled it as:

```python
import cv2
import numpy as np

bgr_img = cv2.cvtColor(image, cv2.COLOR_GRAY2BGR)  # replicate the gray channel into 3 channels
scaled = 255 * (bgr_img - bgr_img.min()) / (bgr_img.max() - bgr_img.min())  # min-max scale to [0, 255]
image = scaled.astype(np.uint8)
```

SAM's output has the same spatial dimensions as the original input, i.e. it is a single-channel mask, so I do not convert the ground truth to RGB for the Dice calculation. In the ground truth the skull is white and the background is black, but SAM's output has the pixels reversed, as shown below. It should be white for the skull and black for the background.

SAM's segmentation

Because of this inversion, the surface Dice score I compute comes out as zero. Can someone please pinpoint where in SAM's code this inversion is happening?

heyoeyo commented 7 months ago

There shouldn't be an inversion step normally. I would guess that the prompt given to the model is pointing to a part of the background, and if the background is all a similar color/appearance, then the model segments it as the 'answer' for the given prompt.

If the prompt is selecting the skull, then there may be a normalization issue with the prompt coordinates. For example if a single (0.5, 0.5) point prompt is given to select the skull, but the model is expecting pixel units, then (0.5, 0.5) points at the top-left corner of the image, which would select the background in this case. The solution may just be a matter of scaling the coordinates differently (or using .predict(...) vs. .predict_torch(...) which assume different units).
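For reference, here's a rough sketch of a pixel-unit point prompt through the predictor interface (the checkpoint path and the `image` array are placeholders, not taken from your code):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Placeholder checkpoint path; image is assumed to be an HxWx3 uint8 RGB array.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

h, w = image.shape[:2]
# .predict() expects point coordinates in pixel units of the original image,
# so the image center is (w/2, h/2) rather than (0.5, 0.5).
masks, scores, logits = predictor.predict(
    point_coords=np.array([[w / 2, h / 2]]),
    point_labels=np.array([1]),  # 1 = foreground point
    multimask_output=False,
)
```

By contrast, `.predict_torch(...)` expects coordinates that have already been mapped to the resized input frame, e.g. via `predictor.transform.apply_coords_torch(...)`, which is where the unit mismatch usually sneaks in.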

aleemsidra commented 7 months ago

@heyoeyo, I did not pass any prompts. I am just passing the input image to the model as:

```python
for idx in tqdm(range(len(dataset)), desc="Processing images", unit="image"):
    input_samples, gt_samples, voxel = dataset[idx]

    slices = []

    # looping over the slices of a single image
    for slice_id, img_slice in tqdm(enumerate(input_samples), total=len(input_samples), desc="Processing slices"):
        batched_input = [{'image': prepare_image(img_slice, resize_transform, model),
                          'original_size': img_slice[0, :, :].shape}]

        preds = model(batched_input)
        slices.append(preds.squeeze().detach().cpu())

    segmented_volume = torch.stack(slices, axis=0)  # stacking 2D slices into a volume
    mask = torch.zeros(segmented_volume.shape)
    mask[torch.sigmoid(segmented_volume) > 0.5] = 1  # thresholding
```

I have commented out L123 in sam.py, since I want the logits to compute the Dice score, so the model returns only the masks. The rest of the code is exactly the same. Even if I keep L123 in sam.py, I get the inverted result.

heyoeyo commented 7 months ago

If no prompts are provided, it looks like the output masks will be based on the learned no-mask embedding only. It's not really clear what parts of the image the no-mask embedding would tend to select, but I'd guess it doesn't specifically favor the center of the image and that's why the background gets selected.

A simple way to check if things are inverted would be to provide a point prompt near the center and see if the resulting mask includes/excludes that point.
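Roughly (untested), reusing the names from your loop, that check could look like this:

```python
import torch

h, w = img_slice[0, :, :].shape
# Single (x, y) point at the image center, in pixel units of the original slice,
# mapped to the model's input frame with the same resize_transform as the image.
center = torch.tensor([[[w / 2, h / 2]]], dtype=torch.float)
center = resize_transform.apply_coords_torch(center, (h, w))

batched_input = [{
    'image': prepare_image(img_slice, resize_transform, model),
    'point_coords': center.to(model.device),
    'point_labels': torch.ones((1, 1), dtype=torch.int, device=model.device),  # 1 = foreground
    'original_size': (h, w),
}]
```

If the resulting mask contains that center point, nothing is being inverted; the no-prompt run was just segmenting the background.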

aleemsidra commented 7 months ago

I passed the box prompts, and it solved the issue.
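In case it helps someone else, the box prompt in the batched-input format looks roughly like this (the box coordinates below are placeholders, not values from my data):

```python
import torch

h, w = img_slice[0, :, :].shape
# Placeholder XYXY box in pixel units of the original slice, mapped to the
# model's input frame with the same resize_transform as the image.
box = torch.tensor([[20.0, 30.0, 200.0, 220.0]])
box = resize_transform.apply_boxes_torch(box, (h, w))

batched_input = [{
    'image': prepare_image(img_slice, resize_transform, model),
    'boxes': box.to(model.device),
    'original_size': (h, w),
}]
```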