facebookresearch / segment-anything

The repository provides code for running inference with the SegmentAnything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

Running SAM on Large Satellite Images and Implementing Encoder-Decoder Workflow #497

Open Ankit-Vohra opened 1 year ago

Ankit-Vohra commented 1 year ago

I am currently facing challenges while trying to load and run the Segment Anything model on a large image, which is approximately 8-10 GB in size. I would like to request your assistance in understanding the necessary steps to achieve this successfully.

Implementing an Encoder-Decoder Workflow: I would like to explore the implementation of an encoder-decoder workflow using the provided SAM model. I am particularly interested in learning how to pass a large image through the encoder, store the embeddings, and then perform real-time inference using the decoder model based on the embeddings generated by the encoder.

Guidance Request: To achieve the encoder-decoder workflow, I request detailed steps or guidelines on how to:

1. Input a large image into the encoder of the SAM model.
2. Store the embedding vector generated by the encoder.
3. Run the decoder model on top of the embedding vector in real time for semantic segmentation tasks.

heyoeyo commented 1 year ago

The model requires a 1024x1024 RGB image as an input, so a (very!) large 8-10 GB image would be automatically downscaled by the model before the encoding/segmentation steps.

Assuming you wanted to use the full resolution of such a large input image, you could try cropping the image into a bunch of 1024x1024 'tiles' and processing each one separately. This would require a lot of computation, and there would be issues with segmenting along the boundaries of each of those 1024x1024 tiles (i.e. getting a proper segmentation of objects that span multiple tiles), but I think that's a good starting point if you wanted to use the input image at its full resolution.
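For example, a minimal non-overlapping tiling sketch in numpy might look like the one below (the function name is just illustrative). Edge tiles come out smaller here, and a real pipeline would likely use overlapping tiles to soften the boundary issue:

import numpy as np

# Minimal sketch: split a large HxWx3 image (numpy array) into 1024x1024 tiles.
# Edge tiles are simply smaller; overlapping tiles would help with objects
# that get cut at tile boundaries.
def iter_tiles(image: np.ndarray, tile_size: int = 1024):
    img_h, img_w = image.shape[:2]
    for y in range(0, img_h, tile_size):
        for x in range(0, img_w, tile_size):
            yield x, y, image[y:y + tile_size, x:x + tile_size]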

As for the actual steps to process the image, the predictor example notebook is a good reference for how to do this.

The following line performs the image encoding and also stores the embedding vector for re-use: predictor.set_image(image). The encoded image data is stored in predictor.features after you run the .set_image(...) function, though you normally don't need to access it directly.
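For example, the encoding and caching part might look roughly like the following (the checkpoint/image paths are placeholders, and the caching relies on the predictor's internal attributes, so treat it as a sketch rather than a supported API):

import cv2
import torch
from segment_anything import sam_model_registry, SamPredictor

# Placeholder paths; adjust to your checkpoint and image
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("tile.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the (slow) image encoder once

# Cache the embedding (plus the sizes needed to reuse it later)
torch.save(
    {
        "features": predictor.features,
        "original_size": predictor.original_size,
        "input_size": predictor.input_size,
    },
    "tile_embedding.pt",
)

# Later / elsewhere: restore the cached encoding instead of re-running set_image
cache = torch.load("tile_embedding.pt")
predictor.features = cache["features"]
predictor.original_size = cache["original_size"]
predictor.input_size = cache["input_size"]
predictor.is_image_set = True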

To run the decoder model on the encoded image, you can use the following function: masks, scores, logits = predictor.predict(...). So if you wanted to do segmentation in real time, you'd need to repeatedly call this function with whatever 'real time input' you want to use (points or bounding boxes; see the example notebook for how the input to the function is formatted). This function re-uses the image encoding from the earlier (set_image) step, so it's quite fast.
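For example, a single foreground point prompt (the coordinate is just a placeholder) would look like:

import numpy as np

# Decoder step: cheap enough to call repeatedly as the user interacts
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),   # (x, y) in the coordinates of the image passed to set_image
    point_labels=np.array([1]),            # 1 = foreground point
    multimask_output=True,                 # returns 3 candidate masks
)
best_mask = masks[np.argmax(scores)]       # pick the highest-scoring candidate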

Ankit-Vohra commented 1 year ago

Thanks for your quick response @heyoeyo. But I have a couple of queries: if I use a tiled/batch system, a tile will not have information about its neighbouring tiles, so an input prompt might not mark the complete object when part of the object lies in another tile. I would also like to share the two links below, in which an organisation has worked with SAM and integrated it as an annotation tool for annotating large satellite imagery. Their response time is in milliseconds, so I would say they are not tiling the images. Please check the resources: https://picterra.ch/blog/faster-ml-production-meta-ai-segment-anything-picterra/ https://www.youtube.com/watch?v=usN-5zBm_E0

heyoeyo commented 1 year ago

As far as I can tell, they are at least loading the image as tiles. For example, you can see the tiles loading in when they zoom out at the 0:35 mark of that video. Though most likely that has nothing to do with the SAM model; it's just to avoid loading the entire image dataset all at once (which definitely makes sense if it's GBs worth of data!)

Based on the demo, I would instead guess that they encode the visible area once the user selects the 'magic wand' tool. There's a good hint this is happening at 0:42 into the video, where there's a delay with a message saying 'Initializing magic wand...', which would be the (slow) image encoding step. They probably have to deal with re-encoding as the user zooms in/out and pans around as well.

They also have an 'image-boundary' issue at 2:45 into the video, when they zoom out of their road selection and the masking doesn't extend outside of the area that was visible when they made the mask (i.e. it doesn't follow further down the road they selected). I think this kind of behavior is preferable for this use case though, since you wouldn't want to automatically accept the SAM masking on areas that aren't being viewed; it could be wrong, like it was on their first masking attempt of the road.

So I think the steps are still mostly the same: you'd want to use predictor.set_image(visible_image) on the part of the map the user is viewing to encode the image, and then use predictor.predict(...) to generate the segmentation mask as the user provides input.
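As a rough sketch (full_image, viewport and the click coordinates here are hypothetical variables you'd get from your viewer):

import numpy as np

# Hypothetical viewport in full-image pixel coordinates: (x0, y0, x1, y1)
x0, y0, x1, y1 = viewport
visible_image = full_image[y0:y1, x0:x1]
predictor.set_image(visible_image)  # re-encode whenever the user pans/zooms

# User clicks arrive in full-image coordinates, so shift them into the crop
point = np.array([[click_x - x0, click_y - y0]])
masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=np.array([1]),
    multimask_output=True,
)

# Paste the best mask back into a full-resolution mask canvas
full_mask = np.zeros(full_image.shape[:2], dtype=bool)
full_mask[y0:y1, x0:x1] = masks[np.argmax(scores)]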

Also, in case it's relevant, the predictor example notebook has a section called 'Specifying a specific object with additional points' which explains how to do the foreground vs. background point selection (the 'Add to outline' / 'Remove from outline' toggle feature in the video).
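In terms of the predict call, that toggle maps onto the point_labels argument (the coordinates below are placeholders), with 1 for foreground ('add') points and 0 for background ('remove') points:

import numpy as np

point_coords = np.array([[400, 300], [450, 320], [600, 500]])
point_labels = np.array([1, 1, 0])  # 1 = add to outline, 0 = remove from outline

masks, scores, logits = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=False,  # a single mask tends to work better with multiple points
)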

abhinav2712 commented 6 months ago

What if we want to do the same but with point prompting, and let's say with smaller images? How can I approach doing that? Can you also explain what the workflow in this link might have been?

heyoeyo commented 6 months ago

What if we want to do the same but with point prompting, and let's say with smaller images?

I may be misunderstanding your question, but I would say the same approach described above (and also in the predictor notebook, for example in the "Specifying a specific object with additional points" section) should work, since it already uses point prompts. Bounding box inputs would work too, and shouldn't really change anything.

The images being smaller also shouldn't really affect anything about the approach, as long as the image is still big enough that the section being viewed needs to be cropped/scaled before processing with the SAM model. If the image is small enough (e.g. less than 1024x1024), then it can be directly processed with SAM and the cropping step isn't needed.
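In other words, something like the following (visible_region here is a hypothetical viewport you'd track yourself):

# Small images can go straight into the predictor; larger ones still benefit
# from the crop-the-visible-region step described earlier (otherwise they get downscaled).
img_h, img_w = image.shape[:2]
if max(img_h, img_w) <= 1024:
    predictor.set_image(image)
else:
    x0, y0, x1, y1 = visible_region
    predictor.set_image(image[y0:y1, x0:x1])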

imneonizer commented 3 months ago

I am currently segmenting large satellite images (~20 GB TIFF files). The way I implemented my pipeline is: I have a proxy which serves requests via multiple SAM workers.

I load the whole 20 GB TIFF file into memory, take overlapping slices at 1024x1024 resolution, get predictions for them, do some post-processing to make the masks look good, and merge them back at their specific locations on a dummy image of the original resolution (80000x80000).

Then I write the resulting image to disk as a TIFF and load it into QGIS to visualise the masks. The results so far look pretty interesting.

Here is the method I use for calculating slices.

def calculate_slice_bboxes(
    image_height: int,
    image_width: int,
    slice_height: int = 512,
    slice_width: int = 512,
    overlap_height_ratio: float = 0,
    overlap_width_ratio: float = 0
):
    """Return [x_min, y_min, x_max, y_max] boxes that tile an image of the
    given size with optionally overlapping slices. Edge slices are shifted
    back inside the image so every box keeps the full slice size."""
    slice_bboxes = []
    y_max = y_min = 0
    # Overlap (in pixels) between neighbouring slices
    y_overlap = int(overlap_height_ratio * slice_height)
    x_overlap = int(overlap_width_ratio * slice_width)
    while y_max < image_height:
        x_min = x_max = 0
        y_max = y_min + slice_height
        while x_max < image_width:
            x_max = x_min + slice_width
            if y_max > image_height or x_max > image_width:
                # Slice would spill past the image border: clamp it to the
                # border and shift it back so it still covers a full slice
                xmax = min(image_width, x_max)
                ymax = min(image_height, y_max)
                xmin = max(0, xmax - slice_width)
                ymin = max(0, ymax - slice_height)
                slice_bboxes.append([xmin, ymin, xmax, ymax])
            else:
                slice_bboxes.append([x_min, y_min, x_max, y_max])
            x_min = x_max - x_overlap
        y_min = y_max - y_overlap
    return slice_bboxes
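For reference, a simplified usage sketch could look like the following. It uses the automatic mask generator per tile just for illustration (my actual prompting, post-processing, and mask merging are more involved), and the checkpoint path, overlap ratios, and large_image array are placeholders:

import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
mask_generator = SamAutomaticMaskGenerator(sam)

img_h, img_w = large_image.shape[:2]
full_mask = np.zeros((img_h, img_w), dtype=np.uint8)  # merged mask at original resolution

for x_min, y_min, x_max, y_max in calculate_slice_bboxes(
    img_h, img_w,
    slice_height=1024, slice_width=1024,
    overlap_height_ratio=0.2, overlap_width_ratio=0.2,
):
    tile = large_image[y_min:y_max, x_min:x_max]
    for ann in mask_generator.generate(tile):
        # 'segmentation' is a boolean HxW array in tile coordinates
        full_mask[y_min:y_max, x_min:x_max] |= ann["segmentation"].astype(np.uint8)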
