facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Some misalignment of data2vec v2 between code and paper #5038

Open HuangChiEn opened 1 year ago

HuangChiEn commented 1 year ago

❓ Questions and Help

Before asking:

This point should be addressed explicitly in the data2vec v2 paper, instead of being roughly explained in a few phrases. As it stands, the documentation (the paper) does not provide sufficient information.

What is your question?

Why does the inverse mask trick "enable the student model to build semantically rich representations over local regions of the sample"? The masking ratio (MR) and preserving ratio (PR) are fixed (1 - MR = PR), so no matter how you implement it, shouldn't the result be the same? Then why does the inverse mask trick work?
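One way to see why a fixed ratio does not make the two schemes equivalent: block sampling shapes whichever set it is applied to. Below is a toy, self-contained sketch (not the fairseq implementation; the grid size and block centers are made up for illustration) comparing the geometry of the *kept* region when blocks are sampled to mask versus sampled to keep, at a similar kept ratio:

```python
# Toy sketch (NOT fairseq code): with a similar kept ratio, sampling blocks
# to MASK vs. sampling blocks to KEEP gives the kept region a very
# different geometry. Grid size and centers below are made up.

def block_cells(centers, d, mask_length=3):
    """Union of mask_length x mask_length blocks around each (row, col) center."""
    off = mask_length // 2
    cells = set()
    for r, c in centers:
        for i in range(-off, off + 1):
            for j in range(-off, off + 1):
                cells.add((min(max(r + i, 0), d - 1), min(max(c + j, 0), d - 1)))
    return cells

def kept_region(d, centers, inverse):
    blocks = block_cells(centers, d)
    everything = {(r, c) for r in range(d) for c in range(d)}
    # direct: the blocks are masked, the remainder is kept
    # inverse: the blocks themselves are kept, the remainder is masked
    return blocks if inverse else everything - blocks

def contiguity(kept):
    """Fraction of kept cells whose 4-neighbours are all kept as well."""
    good = sum(1 for (r, c) in kept
               if all(n in kept for n in [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]))
    return good / max(len(kept), 1)

d = 10
# inverse masking: keep 3 compact 3x3 blocks (~27% of the grid)
kept_inv = kept_region(d, [(1, 1), (1, 7), (7, 4)], inverse=True)
# direct masking at a comparable kept ratio: mask 9 blocks, keep the leftover strip
kept_dir = kept_region(d, [(r, c) for r in (1, 4, 7) for c in (1, 4, 7)], inverse=False)

print(len(kept_inv), contiguity(kept_inv))  # compact blocks: some fully surrounded cells
print(len(kept_dir), contiguity(kept_dir))  # thin leftover strip: none
```

With inverse masking the student's visible patches form compact local blocks, which is plausibly what the paper means by building representations "over local regions"; with direct block masking at a high ratio, the visible patches are the scattered complement of the masked blocks, so the ratio alone does not determine what the student sees.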

Code

Besides, only the vision config has an inverse_mask option, although the other modalities could potentially support it too (I guess). For example, the text modality just directly keeps the preserved part. So, here is a quick review of the code:

# mask_length=3, so each block covers 9 masked patches (mask_length x mask_length)
import torch

def compute_block_mask_2d(shape, mask_prob=0.8, mask_length=3,
                          mask_prob_adjust=0.07, inverse_mask=True,
                          mask_dropout=0.0):
    B, L = shape
    d = int(L**0.5)
    if inverse_mask:
        # what is the point if I set mask_prob = 0.2 without enabling inverse mask?
        mask_prob = 1 - mask_prob

    # default path: overlapping blocks
    mask = torch.zeros((B, d, d))
    mask_inds = torch.randint(
        0,
        L,
        size=(  # paper formula = L * ((1-R)+A) / B; note the notation differs
            B,
            int(
                L
                * ((mask_prob + mask_prob_adjust) / mask_length**2)
                * (1 + mask_dropout)
            ),
        ),
    )
    # scatter the starting points (block centers)
    mask.view(B, -1).scatter_(1, mask_inds, 1)
    centers = mask.nonzero(as_tuple=True)

    inds = ([], [], [])

    # fill the mask_length x mask_length neighbourhood of each center with 1
    offset = mask_length // 2
    for i in range(mask_length):
        for j in range(mask_length):
            k1 = i - offset
            k2 = j - offset
            # batch dim
            inds[0].append(centers[0])
            # row coordinates
            inds[1].append(centers[1] + k1)
            # column coordinates
            inds[2].append(centers[2] + k2)

    i0 = torch.cat(inds[0])
    i1 = torch.cat(inds[1]).clamp_(min=0, max=d - 1)
    i2 = torch.cat(inds[2]).clamp_(min=0, max=d - 1)
    # apply the block mask
    mask[(i0, i1, i2)] = 1
    # (excerpt truncated here; the original function continues)

What have you tried?

Read the code and the paper.

What's your environment?

not important..

lazerliu commented 10 months ago

I have the same question for audio. While inverse_mask plays an important role in the paper, "model.modalities.audio.inverse_mask" in "example/data2vec/config/v2/base&large_audio_only_task.yaml" defaults to false in the official code.
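For reference, the dotted key mentioned above corresponds to a nested YAML entry roughly like the following (a sketch inferred from the key path, not verified against the released config files):

```yaml
# sketch of the audio_only_task configs referenced above (not verified verbatim)
model:
  modalities:
    audio:
      inverse_mask: false  # reported default; set to true to enable inverse masking for audio
```

If the observation is correct, the released audio configs would train without the inverse mask trick even though the paper highlights it, which is worth confirming with the authors.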