Open FangehaPlus opened 7 months ago
As far as I know, an image encoder encodes an image into a one-dimensional vector. Why is the output of the image encoder here a tensor with shape (64, 64, 256) instead of a one-dimensional vector? Does this output represent a feature map of the input image?

Generally, vision transformers (which is what the SAM model uses) output a vector for each input patch and often a global vector (often called a 'cls token'), which would be similar to the 1D vector you're describing (although the SAM model doesn't include one).

The SAM encoder uses a patch size of 16 pixels and an input resolution of 1024x1024, which divides into 64x64 patches. Each of these patches ends up with 256 feature values (regardless of the model size), which is where the 64x64x256 shape comes from.
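For reference, here is a minimal sketch showing how to check that shape yourself, assuming the official `segment_anything` package is installed and the ViT-B checkpoint (`sam_vit_b_01ec64.pth`) has been downloaded; adjust the path to wherever you saved it:

```python
# Minimal sketch: inspect the SAM image encoder's output shape.
import torch
from segment_anything import sam_model_registry

# Assumes the ViT-B checkpoint has been downloaded locally.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.eval()

# SAM's image encoder expects a fixed 1024x1024 input (batch, channels, H, W).
dummy_image = torch.zeros(1, 3, 1024, 1024)

with torch.no_grad():
    embedding = sam.image_encoder(dummy_image)

# PyTorch stores this channels-first, so the 64x64x256 feature map prints as
# torch.Size([1, 256, 64, 64]): one 256-dim vector per 16x16 patch of the
# 1024x1024 input (1024 / 16 = 64 patches per side).
print(embedding.shape)
```

So yes, the (64, 64, 256) output is a spatial feature map rather than a single pooled vector: one 256-dimensional embedding per 16x16 patch of the input image.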
Thank you!