facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Some misalignment of data2vec v2 between code and paper #5038

Open HuangChiEn opened 1 year ago

HuangChiEn commented 1 year ago

❓ Questions and Help

Before asking:

This point should be addressed explicitly in the data2vec v2 paper, instead of being roughly explained in a few phrases. As it stands, the documentation (the paper) does not provide sufficient information.

What is your question?

Why does the inverse mask trick "enable the student model to build semantically rich representations over local regions of the sample"? The masking ratio (MR) and preserving ratio (PR) are fixed (1 - MR = PR), so no matter how you implement it, shouldn't the result be the same? Then why does the inverse mask trick work?
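One way to see why a fixed ratio does not make the two schemes equivalent: block sampling shapes whichever set it is applied to. Below is a toy, self-contained sketch (not the fairseq implementation; the grid size and block centers are made up for illustration) comparing the geometry of the *kept* region when blocks are sampled to mask versus sampled to keep, at a similar kept ratio:

```python
# Toy sketch (NOT fairseq code): with a similar kept ratio, sampling blocks
# to MASK vs. sampling blocks to KEEP gives the kept region a very
# different geometry. Grid size and centers below are made up.

def block_cells(centers, d, mask_length=3):
    """Union of mask_length x mask_length blocks around each (row, col) center."""
    off = mask_length // 2
    cells = set()
    for r, c in centers:
        for i in range(-off, off + 1):
            for j in range(-off, off + 1):
                cells.add((min(max(r + i, 0), d - 1), min(max(c + j, 0), d - 1)))
    return cells

def kept_region(d, centers, inverse):
    blocks = block_cells(centers, d)
    everything = {(r, c) for r in range(d) for c in range(d)}
    # direct: the blocks are masked, the remainder is kept
    # inverse: the blocks themselves are kept, the remainder is masked
    return blocks if inverse else everything - blocks

def contiguity(kept):
    """Fraction of kept cells whose 4-neighbours are all kept as well."""
    good = sum(1 for (r, c) in kept
               if all(n in kept for n in [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]))
    return good / max(len(kept), 1)

d = 10
# inverse masking: keep 3 compact 3x3 blocks (~27% of the grid)
kept_inv = kept_region(d, [(1, 1), (1, 7), (7, 4)], inverse=True)
# direct masking at a comparable kept ratio: mask 9 blocks, keep the leftover strip
kept_dir = kept_region(d, [(r, c) for r in (1, 4, 7) for c in (1, 4, 7)], inverse=False)

print(len(kept_inv), contiguity(kept_inv))  # compact blocks: some fully surrounded cells
print(len(kept_dir), contiguity(kept_dir))  # thin leftover strip: none
```

With inverse masking the student's visible patches form compact local blocks, which is plausibly what the paper means by building representations "over local regions"; with direct block masking at a high ratio, the visible patches are the scattered complement of the masked blocks, so the ratio alone does not determine what the student sees.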

Code

Besides, only the vision config has an inverse_mask option, although the other modalities could potentially support it too (I guess). For example, the text modality just directly keeps the preserved part. So, here is a quick review of the code:

# mask_length=3, so each block covers 9 masked patches (mask_length x mask_length)
import torch

def compute_block_mask_2d(shape, mask_prob=0.8, mask_length=3,
                          mask_prob_adjust=0.07, inverse_mask=True,
                          mask_dropout=0.0):
    B, L = shape
    d = int(L**0.5)
    if inverse_mask:
        # what is the point if I set mask_prob = 0.2 without enabling inverse mask?
        mask_prob = 1 - mask_prob

    # default path: overlapping blocks
    mask = torch.zeros((B, d, d))
    mask_inds = torch.randint(
        0,
        L,
        size=(  # paper formula = L * ((1-R)+A) / B; note the notation differs
            B,
            int(
                L
                * ((mask_prob + mask_prob_adjust) / mask_length**2)
                * (1 + mask_dropout)
            ),
        ),
    )
    # scatter the starting points (block centers)
    mask.view(B, -1).scatter_(1, mask_inds, 1)
    centers = mask.nonzero(as_tuple=True)

    inds = ([], [], [])

    # fill the mask_length x mask_length neighbourhood of each center with 1
    offset = mask_length // 2
    for i in range(mask_length):
        for j in range(mask_length):
            k1 = i - offset
            k2 = j - offset
            # batch dim
            inds[0].append(centers[0])
            # row coordinates
            inds[1].append(centers[1] + k1)
            # column coordinates
            inds[2].append(centers[2] + k2)

    i0 = torch.cat(inds[0])
    i1 = torch.cat(inds[1]).clamp_(min=0, max=d - 1)
    i2 = torch.cat(inds[2]).clamp_(min=0, max=d - 1)
    # apply the block mask
    mask[(i0, i1, i2)] = 1
    # (excerpt truncated here; the original function continues)

What have you tried?

Read the code and the paper.

What's your environment?

not important..

lazerliu commented 10 months ago

I have the same question for audio. While inverse_mask plays an important role in the paper, "model.modalities.audio.inverse_mask" in "example/data2vec/config/v2/base&large_audio_only_task.yaml" defaults to false in the official code.
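For reference, the dotted key mentioned above corresponds to a nested YAML entry roughly like the following (a sketch inferred from the key path, not verified against the released config files):

```yaml
# sketch of the audio_only_task configs referenced above (not verified verbatim)
model:
  modalities:
    audio:
      inverse_mask: false  # reported default; set to true to enable inverse masking for audio
```

If the observation is correct, the released audio configs would train without the inverse mask trick even though the paper highlights it, which is worth confirming with the authors.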