Open ymzx opened 1 year ago
Good question, I have the same concern.
Hello,
The authors of the paper cited a paper talking about positional encoding - Tancik, Matthew, et al. "Fourier features let networks learn high-frequency functions in low dimensional domains." Advances in neural information processing systems 33 (2020): 7537-7547. I checked the code for the above-mentioned paper - Code and Colab Notebook ipynb.
The colab notebook used the same method of generating the positional encoding. The main idea of doing so was that the network tries to overcome the spectral bias.
Spectral bias is the phenomenon where neural networks, particularly those trained with gradient-based methods, learn the lower-frequency components of a target function (smooth, slowly varying patterns) more quickly than the higher-frequency components (rapidly oscillating, complex patterns).
Tancik et al. ran an experiment predicting an image's RGB intensity values from the corresponding pixel coordinates, i.e. a regression task. Using a random Gaussian matrix for the feature mapping, they achieved better results than with the classical sinusoidal positional encoding.
You can read the paper if you want to know more.
I hope this helps!
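The Gaussian Fourier-feature mapping from Tancik et al. can be sketched in a few lines of numpy. Note that `sigma`, the feature width of 128, and the function name here are illustrative choices, not taken from either codebase:

```python
import numpy as np

def fourier_features(v, B):
    """Map coordinates v of shape (n, d) through a random matrix B
    of shape (d, m) to features [sin(2*pi*vB), cos(2*pi*vB)] of shape (n, 2m)."""
    proj = 2.0 * np.pi * (v @ B)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

rng = np.random.default_rng(0)
sigma = 10.0                                 # bandwidth: larger sigma -> higher frequencies
B = sigma * rng.standard_normal((2, 128))    # fixed random Gaussian matrix
coords = rng.uniform(0.0, 1.0, size=(5, 2))  # five (x, y) coordinates in [0, 1]^2
feats = fourier_features(coords, B)          # shape (5, 256)
```

The entries of `B` act as randomly sampled frequencies: tuning `sigma` controls how high-frequency the resulting features are, which is exactly the knob Tancik et al. use to counter spectral bias.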
The positional encoding in prompt_encoder.py (for point prompts) looks like this:

```python
def _pe_encoding(self, coords: torch.Tensor) -> torch.Tensor:
    """Positionally encode points that are normalized to [0,1]."""
    # assuming coords are in [0, 1]^2 square and have d_1 x ... x d_n x 2 shape
    ...
    coords = coords @ self.positional_encoding_gaussian_matrix
    ...
```

How should I understand `coords = coords @ self.positional_encoding_gaussian_matrix`? Why multiply the coordinates by random Gaussian noise?
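One point worth stressing: the multiplication is not injecting noise. The Gaussian matrix is sampled once at module initialization and then kept fixed, so it acts as a deterministic random projection onto a set of frequencies, and the same coordinates always produce the same encoding. A numpy sketch of the same sequence of steps (the sizes and names here are illustrative, not the module's actual attributes):

```python
import numpy as np

rng = np.random.default_rng(42)
num_pos_feats = 64
# Sampled once when the module is built, then frozen: a fixed random
# projection, not fresh noise on every forward pass.
gaussian_matrix = rng.standard_normal((2, num_pos_feats))

def pe_encoding(coords):
    """coords: array of shape (..., 2) with points normalized to [0, 1]."""
    coords = 2.0 * coords - 1.0            # rescale to [-1, 1]
    coords = coords @ gaussian_matrix      # project onto random frequencies
    coords = 2.0 * np.pi * coords
    return np.concatenate([np.sin(coords), np.cos(coords)], axis=-1)

pts = rng.uniform(0.0, 1.0, size=(3, 2))
pe = pe_encoding(pts)                      # shape (3, 2 * num_pos_feats)
```

Because the matrix is frozen, encoding the same points twice gives identical output, which is what a positional encoding requires.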