Open FangehaPlus opened 7 months ago
As far as I know, an image encoder encodes an image into a one-dimensional vector. Why is the output of the image encoder here a tensor with shape (64, 64, 256) instead of a one-dimensional vector? Does this output represent a feature map of the input image?

Generally, vision transformers (which is what the SAM model uses) output a vector for each input patch and often a global vector (often called a 'cls token'), which would be similar to the 1D vector you're describing (although the SAM model doesn't include one).

The SAM encoder uses a patch size of 16 pixels and an input resolution of 1024x1024, which divides into 64x64 patches. Each of these patches ends up with 256 feature values (regardless of the model size), which is where the 64x64x256 shape comes from.
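For reference, here is a minimal sketch showing how to check that shape yourself, assuming the official `segment_anything` package is installed and the ViT-B checkpoint (`sam_vit_b_01ec64.pth`) has been downloaded; adjust the path to wherever you saved it:

```python
# Minimal sketch: inspect the SAM image encoder's output shape.
import torch
from segment_anything import sam_model_registry

# Assumes the ViT-B checkpoint has been downloaded locally.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.eval()

# SAM's image encoder expects a fixed 1024x1024 input (batch, channels, H, W).
dummy_image = torch.zeros(1, 3, 1024, 1024)

with torch.no_grad():
    embedding = sam.image_encoder(dummy_image)

# PyTorch stores this channels-first, so the 64x64x256 feature map prints as
# torch.Size([1, 256, 64, 64]): one 256-dim vector per 16x16 patch of the
# 1024x1024 input (1024 / 16 = 64 patches per side).
print(embedding.shape)
```

So yes, the (64, 64, 256) output is a spatial feature map rather than a single pooled vector: one 256-dimensional embedding per 16x16 patch of the input image.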
Thank you!