arxyzan / data2vec-pytorch

PyTorch implementation of "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" from Meta AI

In data2vec.py #20

Closed · HarshavardhanaTG closed this issue 6 months ago

HarshavardhanaTG commented 6 months ago

In data2vec.py, on line 90:

    y = self.ema.model(trg, ~mask, **kwargs)['encoder_states']

Shouldn't it have been

    y = self.ema.model(trg, None, **kwargs)['encoder_states']

(going by the training strategy in the paper)?
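
Roughly, the strategy in the paper is that the teacher encodes the unmasked input and the target is the average of its top-K layer outputs at the masked time steps, something like this sketch (hypothetical names; the paper's per-layer normalization is omitted):

    import torch

    # Rough sketch of the paper's target construction (hypothetical names,
    # per-layer normalization omitted). teacher_layers is a list of
    # [batch, time, dim] outputs from the EMA teacher run on the unmasked
    # input; mask is a [batch, time] boolean marking the student's masked steps.
    def build_targets(teacher_layers, k, mask):
        top_k = teacher_layers[-k:]          # top-K transformer blocks
        y = torch.stack(top_k).mean(dim=0)   # average over those layers
        return y[mask]                       # regress only the masked positions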

arxyzan commented 6 months ago

Hello @HarshavardhanaTG, as far as I remember, the EMA model must take ~mask. You can also verify this in the original fairseq implementation (V1 only).
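
(For reference, ~ on a boolean tensor is element-wise negation, so ~mask is the exact complement of the student's mask:)

    import torch

    # Illustrative only: True marks a time step masked for the student.
    mask = torch.tensor([True, False, True, False])
    print(~mask)  # tensor([False,  True, False,  True])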

HarshavardhanaTG commented 6 months ago

Hey @arxyzan, thank you so much for replying so quickly. Your repository has been a huge help! Here is the relevant part of the teacher forward pass in the fairseq implementation:

    with torch.no_grad():
        self.ema.model.eval()

        if self.cfg.ema_transformer_only:
            y, layer_results = self.ema.model.extract_features(
                pre_encoder_features,
                padding_mask=padding_mask,
                min_layer=self.cfg.encoder_layers - self.average_top_k_layers,
            )
            y = {
                "x": y,
                "padding_mask": padding_mask,
                "layer_results": layer_results,
            }
        else:
            y = self.ema.model.extract_features(
                source=source,
                padding_mask=orig_padding_mask,
                mask=False,
            )

        target_layer_results = [l[2] for l in y["layer_results"]]

I think they did fix that issue: in both branches the teacher appears to run on the unmasked input (the else branch passes mask=False explicitly). It's entirely possible that I am mistaken; please let me know if I am wrong. I am a bit confused about this part, but the rest of your repo seemed absolutely fine. Thanks again!

arxyzan commented 6 months ago

@HarshavardhanaTG Sorry for the late response. The original implementation computed the mask inside the forward method, whereas I decided to build it in the dataset and pass it as a parameter to the forward method. Either way is correct. The main thing to know here is that the mask fed to the student model must be inverted and fed to the EMA (teacher) model.
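
To make that concrete, a minimal sketch of the flow (hypothetical names echoing the line quoted above, not the exact API of this repo):

    import torch

    def training_step(student, teacher, src, trg, mask, **kwargs):
        # mask: boolean tensor marking the time steps masked for the student
        x = student(src, mask, **kwargs)['encoder_states']
        with torch.no_grad():
            # The EMA teacher receives the inverted mask, as described above,
            # and no gradients flow through it.
            y = teacher(trg, ~mask, **kwargs)['encoder_states']
        return x, y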

HarshavardhanaTG commented 6 months ago

Thank you so much! That helps!