facebookresearch / segment-anything

The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

Detailed architecture of the SAM (Segment anything model) #684

Open jetsonwork opened 4 months ago

jetsonwork commented 4 months ago

Hello everyone,

I am trying to draw the architecture of SAM (not the figure available in the original paper), including all the details. Could you please guide me on how to proceed? If someone has already done this, I would appreciate it if you could share it with me.

heyoeyo commented 4 months ago

For the image encoder specifically, the SAM model uses the 'plain' architecture described in the paper: "Exploring Plain Vision Transformer Backbones for Object Detection"

However, if you're looking for 'all the details', then the code is definitely the best place to look. All of the model code is under the segment_anything > modeling folder, and it's very well organized/straightforward compared to most other model implementations that I've seen.
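For example, here's a minimal sketch of how you could list those components and their constructor arguments from Python (assuming the segment_anything package is installed), which is a reasonable starting point for drawing a detailed diagram:

```python
import inspect

# The main building blocks exported from segment_anything/modeling
from segment_anything.modeling import (
    Sam,
    ImageEncoderViT,
    PromptEncoder,
    MaskDecoder,
    TwoWayTransformer,
)

for cls in (Sam, ImageEncoderViT, PromptEncoder, MaskDecoder, TwoWayTransformer):
    # Print where each component is defined and what parameters it takes
    print(cls.__name__)
    print("  file:", inspect.getfile(cls))
    print("  init:", inspect.signature(cls.__init__))
```

The default hyperparameters for each variant (vit_b / vit_l / vit_h) are set in segment_anything/build_sam.py, so that file is worth reading alongside the modeling folder.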

jetsonwork commented 4 months ago

Thanks. I have studied some blogs, and they mention that the encoder architecture is based on a MAE-pretrained ViT, as shown below. Could you please confirm this?

[image: encoder architecture diagram from the blog post]
heyoeyo commented 4 months ago

Yes, the paper mentions they started from a model pre-trained as an MAE (masked autoencoder). However, the actual model in SAM isn't an autoencoder: the output is 64x64x256, compared to the 1024x1024x3 input, so I assume they just removed the decoder part.
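If it helps, a quick way to confirm that shape is to build the model with random weights and run a dummy image through just the image encoder (a rough sketch, assuming PyTorch and segment_anything are installed; no checkpoint is needed for a shape check):

```python
import torch
from segment_anything import sam_model_registry

# Build the ViT-B variant without loading a checkpoint
# (random weights are fine for checking tensor shapes)
sam = sam_model_registry["vit_b"](checkpoint=None)
sam.eval()

# SAM's image encoder expects a 1024x1024 RGB image in (batch, channel, height, width) order
dummy_image = torch.zeros(1, 3, 1024, 1024)

with torch.no_grad():
    features = sam.image_encoder(dummy_image)

# Prints torch.Size([1, 256, 64, 64]): a 64x64 grid of 256-dimensional
# embeddings, not a reconstruction of the 1024x1024x3 input
print(features.shape)
```

So the MAE part only describes how the encoder weights were pre-trained; the encoder that ships with SAM just produces that 64x64x256 feature map for the mask decoder to consume.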