facebookresearch / segment-anything

The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0
45.43k stars 5.37k forks

SAM for dynamic aspect ratio images #760

Open mariiak2021 opened 3 weeks ago

mariiak2021 commented 3 weeks ago

Hi @heyoeyo , I'm trying to fine-tune SAM on panorama images of indoor scenes, which all have different aspect ratios, but since they are panoramas their width is always greater than their height.

  1. What would you recommend as the best approach for dealing with different-resolution images, given that SAM requires 1024x1024 images as input?
  2. I implemented a custom point grid for SAM's automatic mode. By custom I mean non-square: it has more points along the X axis. I'm wondering whether I need to change anything in the initialization of the point embeddings in prompt_encoder.py? Since my point grid's density changed, I'm not sure whether I need to do something with the point weights.

To provide better context, here is an example of a panorama image I use as input and the mask output I got after fine-tuning. The masks target the objects of interest, but they are clearly over-segmented. I'm trying to understand whether this is caused by the image pre-processing or by the changes to the point grid. vis_mask_FloorPlan418__train__25-bef0 Thank you!

heyoeyo commented 2 weeks ago

What would you recommend as the best approach for dealing with different-resolution images, given that SAM requires 1024x1024 images as input?

As-is, there aren't really any alternatives for handling non-square images other than the padding that the SAM model already uses. I have a modified copy of the model that can directly process non-square images, at sizes higher (or lower) than 1024, and it seems to work just fine. It's currently a messy work-in-progress, but if I remember, I'll post a link to it here once I've uploaded it. In the meantime, I think there isn't really any option other than to scale/pad the image to 1024x1024 as the model already does. If you're really worried about it, you can try manually stretching your image vertically so that it is square, then running the SAM model on that stretched image. This should give the model much more 'resolution' to work with, though the masks will also be vertically stretched, so you'll have to un-stretch them afterwards.
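The stretch/un-stretch idea can be sketched roughly like this. The helper names are mine (not part of SAM), and this uses a plain nearest-neighbor index mapping in NumPy to keep it dependency-free; in practice you'd probably use cv2.resize with INTER_LINEAR for the image:

```python
import numpy as np

def stretch_to_square(image, side=1024):
    # Anisotropically resize to side x side via nearest-neighbor index lookup;
    # a wide panorama ends up stretched vertically.
    h, w = image.shape[:2]
    ys = (np.arange(side) * h) // side
    xs = (np.arange(side) * w) // side
    return image[ys][:, xs]

def unstretch_mask(mask, orig_hw):
    # Map a square binary mask back to the original (height, width).
    h, w = orig_hw
    side_h, side_w = mask.shape[:2]
    ys = (np.arange(h) * side_h) // h
    xs = (np.arange(w) * side_w) // w
    return mask[ys][:, xs]
```

The idea would be to run SAM on stretch_to_square(panorama), then map each output mask back with unstretch_mask(mask, panorama.shape[:2]).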

My point grid's density changed, so I'm not sure whether I need to do something with the point weights

The automatic mask generator just runs the normal 'single point prompt' version of SAM repeatedly, once for every point in your point grid (and then does some work to clean up/merge all of the separate mask results). So adjusting the point grid shouldn't require any changes to the prompt-encoder weights.
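For reference, a non-square grid can be built the same way SAM's own build_point_grid helper builds its square one (normalized [0, 1] coordinates, evenly spaced with a half-cell offset); the function name here is mine:

```python
import numpy as np

def build_rect_point_grid(n_x, n_y):
    # Evenly spaced (x, y) points in [0, 1], with independent counts per axis.
    # Mirrors segment_anything.utils.amg.build_point_grid, which assumes a
    # square grid (n_x == n_y).
    xs = np.linspace(1 / (2 * n_x), 1 - 1 / (2 * n_x), n_x)
    ys = np.linspace(1 / (2 * n_y), 1 - 1 / (2 * n_y), n_y)
    gx, gy = np.meshgrid(xs, ys)
    return np.stack([gx.ravel(), gy.ravel()], axis=-1)  # shape: (n_x * n_y, 2)
```

If I'm reading the code right, you can pass something like point_grids=[build_rect_point_grid(64, 16)] to SamAutomaticMaskGenerator; the normalized coordinates get scaled to pixel coordinates internally, so no prompt-encoder changes are needed.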

If you're running the automatic mask generator script directly on your computer (i.e. not through a notebook/Colab), then you can visualize the points on your image by adding a few lines of code. In the segment_anything/automatic_mask_generator.py script, after line 240, you can add:

# Convert the RGB crop back to BGR for OpenCV display
debug_img = cv2.cvtColor(cropped_im, cv2.COLOR_RGB2BGR)
# Draw each grid point as a small magenta dot
for xy in points_for_image:
    pt_xy = xy.astype(np.int32).tolist()
    cv2.circle(debug_img, pt_xy, 2, (255, 0, 255), -1)
cv2.imshow("POINTS", debug_img)
cv2.waitKey(0)  # wait for any keypress before continuing
cv2.destroyWindow("POINTS")

You'll also need to add import cv2 at the top of the script for this to work. This will draw the point grid overlaid on the image, which might help diagnose any problems with the points. Once you press any key, the window closes and processing continues.

Another thing worth trying is the crop_n_layers argument of the automatic mask generator. Based on the image you posted, it sort of looks like the masking error might be a 'low resolution' issue: when the image is scaled/padded to fit the square 1024 resolution, that lamp may end up very small/blurry, and that's what the mask is fitting to. The cropping argument causes the image to be zoomed/cropped before processing (if you add the visualization code above, you can see what it's doing), and that might give the lamp more resolution for masking?
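Roughly, setting crop_n_layers=n on the SamAutomaticMaskGenerator constructor makes it run on the full image plus grids of overlapping crops. Ignoring the overlap that the real generate_crop_boxes helper adds, the crop layout looks something like this simplified sketch (function name is mine):

```python
import math

def crop_boxes(h, w, n_layers):
    # Simplified sketch of SAM's crop layers: layer i tiles the image with a
    # (2**i x 2**i) grid of crops. The real helper in
    # segment_anything/utils/amg.py also adds overlap between neighboring crops.
    boxes = [(0, 0, w, h)]  # layer 0: the full image
    for layer in range(1, n_layers + 1):
        n = 2 ** layer
        cw, ch = math.ceil(w / n), math.ceil(h / n)
        for iy in range(n):
            for ix in range(n):
                x0, y0 = ix * cw, iy * ch
                boxes.append((x0, y0, min(x0 + cw, w), min(y0 + ch, h)))
    return boxes
```

So with crop_n_layers=1, a 2048x512 panorama gets 4 extra ~1024x256 crops on top of the full image, and each crop gets scaled up to the 1024x1024 input size on its own, giving small objects like that lamp more effective resolution.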