idilsulo opened 2 months ago
The first SAM model is composed of an image encoder, a prompt encoder & a mask decoder. The new version adds some memory encoding/attention components as well. The mask + prompt components are only about 16MB (across all model sizes for both SAM v1 & v2), and the new memory components add about 30MB (across all SAMv2 sizes). The remaining model size comes from the image encoder. For example, the image encoder in the base SAM v1 model is about 360MB, whereas the base SAM v2 image encoder seems to be about 110MB.
So even though SAMv2 has extra model components, its smaller overall size comes from a change to the image encoder. V2 uses a very different image encoder called Hiera (the original SAM used a model based on this paper), which seems to be much smaller for the same or better image-encoding performance.
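For reference, this kind of per-component breakdown can be computed by grouping a checkpoint's parameter sizes by their top-level key prefix. The sketch below is a minimal illustration using a toy state dict; the prefix names (`image_encoder`, `prompt_encoder`, `mask_decoder`) are assumptions for illustration, not the actual SAM/SAM2 key names.

```python
import numpy as np

def component_sizes_mb(state_dict):
    """Sum parameter sizes (in MB) grouped by top-level key prefix.

    Works on any dict mapping dotted parameter names to arrays,
    e.g. the flat state dicts that PyTorch checkpoints expose.
    """
    sizes = {}
    for name, tensor in state_dict.items():
        component = name.split(".")[0]
        sizes[component] = sizes.get(component, 0.0) + tensor.nbytes / 1e6
    return sizes

# Toy stand-in for a real checkpoint (float32 weights, made-up shapes).
toy = {
    "image_encoder.layer0.weight": np.zeros((1024, 1024), dtype=np.float32),
    "prompt_encoder.embed.weight": np.zeros((256, 256), dtype=np.float32),
    "mask_decoder.head.weight":    np.zeros((256, 256), dtype=np.float32),
}
print(component_sizes_mb(toy))
```

Running the same grouping over a real SAM checkpoint's state dict is how you would verify which component dominates the file size.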
Ah, this makes sense! Thanks a lot!
Hello, I would like to ask: will there be a Huge version of SAM2 as well?
If not, why didn't you train a large base model the way SAM1 did (Huge)?
Hello all! I have a question regarding the comparison of SAM vs. SAM2. Table 6 of the paper compares both models across 37 datasets.
Does this comparison pit the SAM-H checkpoint against the SAM2 large checkpoint? It is interesting to me that the biggest checkpoint for SAM is ~2.4GB while the biggest one for SAM2 is much smaller, yet shows superior results.
What is the main reason behind this difference? Will there be another huge checkpoint released for SAM2?
Thanks in advance!