arxyzan / data2vec-pytorch

PyTorch implementation of "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" from Meta AI
MIT License

Question on the visual modality mask method #18

Closed Ruizhuo-Xu closed 1 year ago

Ruizhuo-Xu commented 1 year ago

I am a beginner in deep learning. While reading your code, I noticed the following snippet used in data processing for the visual modality (vision/dataset.py, line 44):

masked_image = (image * mask).reshape(-1, self.input_size, self.input_size)

I have a question regarding this. Based on my understanding, in the mask matrix, 0 represents the areas that do not need to be masked, and 1 represents the areas to be masked. After examining the code for MaskingGenerator and the subsequent use of mask in the Data2Vec model, it seems like my understanding is correct.

Should the above code be modified to:

masked_image = (image * (1 - mask)).reshape(-1, self.input_size, self.input_size)

Or is my understanding incorrect? Please let me know.
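
To make the difference between the two expressions concrete, here is a minimal sketch. It is not taken from the repository: the toy image, the pixel-level mask, and the variable names are hypothetical, and NumPy is used in place of PyTorch tensors because elementwise multiplication behaves the same way here. It assumes the convention described in the question, where 1 marks a masked position.

```python
import numpy as np

# Toy 4x4 single-channel "image" with no zero pixels,
# so zeroed-out positions are easy to spot.
image = np.arange(1, 17, dtype=np.float32).reshape(1, 4, 4)

# Hypothetical pixel-level mask, assuming 1 = masked and 0 = visible.
mask = np.zeros((4, 4), dtype=np.float32)
mask[:2, :2] = 1  # mark the top-left 2x2 block as masked

# image * mask keeps ONLY the positions where mask == 1.
kept_masked_region = image * mask

# image * (1 - mask) zeroes ONLY the positions where mask == 1.
hid_masked_region = image * (1 - mask)

print(kept_masked_region[0])
print(hid_masked_region[0])
```

Under that assumed convention, only the second expression hides the masked block from the model. If 0 marked the masked positions instead, the first expression would be the one that hides them.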

arxyzan commented 1 year ago

Hi, I don't currently recall the details and don't have the proper requirements installed to test the code, but as far as I remember, the mask parameter is only used in audio models because the model's feature extractor takes care of the masking, as opposed to vision/text models, which already have the data masked (this happens in the dataset class). Also, the common practice is that in the mask tensor, the 0's represent the parts that need to be masked, not the other way around. I couldn't validate this by reading the code, but I don't think it's any different for data2vec.

Ruizhuo-Xu commented 1 year ago

Thank you for your response. If I understand correctly, according to your explanation, in the mask matrix, 0 represents the areas that need to be masked. In that case, in lines 107 and 108 of data2vec/data2vec.py,

x = x[mask]
y = y[mask]

the features of tokens that are not masked (mask == 1) are extracted for both the student and teacher models (where x and y represent the outputs of the student and teacher models, respectively). Does this practice not contradict the approach of predicting the masked region features and calculating the loss as mentioned in the paper? I apologize for my unclear English expression. Looking forward to your reply.
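
The indexing behaviour this question relies on can be checked in isolation. The sketch below is illustrative only: the shapes and values are hypothetical, and NumPy stands in for PyTorch, whose boolean indexing selects rows the same way. It shows that `x[mask]` keeps the positions where the mask is True (1), which says nothing by itself about which convention the repository intends.

```python
import numpy as np

# Hypothetical student/teacher outputs: 5 tokens, 3 features each.
x = np.arange(15, dtype=np.float32).reshape(5, 3)
y = x + 100.0

# Boolean mask over tokens. Indexing with it selects the True positions.
mask = np.array([True, False, True, False, False])

x_sel = x[mask]  # rows 0 and 2 of x
y_sel = y[mask]  # rows 0 and 2 of y

print(x_sel.shape)
print(x_sel)
```

Note that this selection only works as described when the mask has a boolean dtype. An integer tensor of 0s and 1s would be interpreted as row indices instead.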
