The innovation of the method does not seem sufficient: previous work [1] has already combined SAM and XMem.
[1] Track Anything: Segment Anything Meets Videos. (https://arxiv.org/pdf/2304.11968)
The differences between our MemSAM and TAM are as follows.
(1) Motivation. TAM is a downstream application that combines SAM and XMem, while we target extending SAM to the medical video domain.
(2) Fine-tuning. TAM directly employs the pretrained SAM and XMem without fine-tuning, so its segmentation quality depends heavily on the initial mask generated by SAM from weak prompts and on XMem's segmentation capability; TAM compensates by relying on additional human involvement to refine the segmentation interactively. In contrast, our method fine-tunes SAM with sparsely annotated video data.
(3) Feature propagation. When TAM's segmentation is poor, it projects XMem's predicted probability and affinity matrices into point prompts and coarse masks as prompts for SAM, which discards the rich semantic information in XMem's features. Our memory prompt instead passes the memory embeddings produced by memory reading directly to SAM as dense prompts, preserving better semantic consistency (see the sketch below).
There is no doubt that TAM is excellent work; we simply build with different motivations.
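To make point (3) concrete, here is a minimal sketch, not the authors' code: the module names and shapes below are illustrative assumptions, showing how a dense memory embedding from memory reading could be fed straight into a SAM-style mask decoder instead of first being projected to point prompts or a coarse mask.

```python
# Illustrative sketch only (assumed shapes and a toy decoder, not MemSAM's implementation).
import torch
import torch.nn as nn

B, C, H, W = 1, 256, 32, 32  # assumed memory-embedding shape (B, 256, 32, 32)

class ToyMaskDecoder(nn.Module):
    """Stand-in for a SAM-style mask decoder: fuses image and dense prompt embeddings."""
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)
        self.head = nn.Conv2d(dim, 1, kernel_size=1)  # predicts low-resolution mask logits

    def forward(self, image_embedding, dense_prompt_embedding):
        x = torch.cat([image_embedding, dense_prompt_embedding], dim=1)
        return self.head(self.fuse(x))  # (B, 1, H, W) mask logits

image_embedding = torch.randn(B, C, H, W)   # from the SAM image encoder
memory_embedding = torch.randn(B, C, H, W)  # output of memory reading, used as a dense prompt

decoder = ToyMaskDecoder()
mask_logits = decoder(image_embedding, memory_embedding)
print(mask_logits.shape)  # torch.Size([1, 1, 32, 32])
```

The point of the sketch is that all 256 feature channels reach the decoder unchanged, whereas projecting to points or a single-channel coarse mask would collapse that information before SAM ever sees it.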
Thanks for your reply! Regarding point (3): do you mean that in MemSAM the Memory Reading module's output is not a mask (1, H, W) but a feature map (C, H, W)?
Yes! The Memory Reading output is a dense prompt embedding, which we call the Memory Embedding, with shape (B, 256, 32, 32).
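For illustration only (assumed shapes, not the repository's code), the difference between a mask-style prompt and this feature-style memory embedding is just the channel dimension:

```python
import torch

B, H, W = 1, 32, 32
coarse_mask_prompt = torch.zeros(B, 1, H, W)   # a TAM-style coarse mask keeps 1 channel per pixel
memory_embedding = torch.randn(B, 256, H, W)   # the dense memory embedding keeps 256 feature channels

print(coarse_mask_prompt.shape)  # torch.Size([1, 1, 32, 32])
print(memory_embedding.shape)    # torch.Size([1, 256, 32, 32])
```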
Got it! Thanks!