SysCV / sam-hq

Segment Anything in High Quality [NeurIPS 2023]
https://arxiv.org/abs/2306.01567
Apache License 2.0

Question about Ablation study #120

mcshih closed 8 months ago

mcshih commented 8 months ago

While reading your paper, I encountered a question. The row annotated in Table 2 appears to correspond to the following passage in the text: "computing the scaled dot product [18] between the original SAM’s output token and our HQ-Output token." A dot product between two tokens yields a scalar, and I am curious how this scalar is used to generate the final output mask. Additionally, citation [18] refers to "Visual Prompt Tuning," and that paper does not seem to mention the "scaled dot product," which causes some confusion. Could you please provide specific details on how this experiment was conducted?

lkeab commented 8 months ago

Hi, this experiment takes the newly initialized HQ-token and computes its dot product with the original pre-trained output token in SAM. This product gives us a new output token, and we then tune the model only on this new output token to predict the high-quality masks.

mcshih commented 8 months ago

Thank you very much for your response, but unfortunately my question remains unresolved. A token is a vector representation, and the dot product of two vectors is a scalar, not a new output token. If possible, could you please provide further clarification or share some PyTorch code snippets to help me better understand? Thank you very much.

lkeab commented 8 months ago

Thanks for pointing out the typo. It should be "element-wise product" (or Hadamard product) as here. There is no addition.
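
For readers following the thread, the distinction can be illustrated with a minimal PyTorch sketch. This is not the authors' code: the names `sam_token`, `hq_token`, and `fused_token`, the random initialization, and the token width of 256 are all assumptions chosen for illustration. It only shows that a dot product collapses two tokens to a scalar, while the element-wise (Hadamard) product keeps the token shape because the final summation is skipped.

```python
import torch

torch.manual_seed(0)
dim = 256  # illustrative token width (assumption, not taken from the thread)

# Stand-in for SAM's pre-trained output token (hypothetical values).
sam_token = torch.randn(1, dim)
# Stand-in for the newly initialized, learnable HQ token (hypothetical values).
hq_token = torch.nn.Parameter(torch.randn(1, dim))

# Dot product: multiply element-wise, then sum everything -> a single scalar.
dot = (sam_token * hq_token).sum()

# Element-wise (Hadamard) product: same multiplication, but no sum,
# so the result is still a token of width `dim`.
fused_token = sam_token * hq_token

print(dot.shape)          # torch.Size([])
print(fused_token.shape)  # torch.Size([1, 256])
```

Because `fused_token` has the same shape as an ordinary output token, it can stand in wherever a single output token is expected, and gradients still flow to `hq_token` through the product.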

mcshih commented 8 months ago

Thank you for your response. I have no further questions, and I will proceed to close the issue.