arxyzan / data2vec-pytorch

PyTorch implementation of "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" from Meta AI
MIT License

Mask value overflowed in audio pre-training #14

Closed LuJunru closed 1 year ago

LuJunru commented 1 year ago

Hi @arxyzan,

I ran into quite a strange bug: the mask values overflowed during audio pretraining.

Here: https://github.com/arxyzan/data2vec-pytorch/blob/main/audio/encoder.py#L35, the mask is passed as mask_time_indices during the computation of the output hidden states.

The input mask is fine, a binary matrix (B, L): 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

However, after the computation, the mask value overflowed, like this: 1421638320, 1421638576, 1421638832, 1421639088, 1421639344, 1421639600, 1421639856, 1421640112, 1421640368, 1421640624, 1421640880, 1421641136, 1421641392, 1421641648, 1421641904, 1421642160...

Have you ever met such an issue? By the way, this only happens when running train.py; debugging audio/encoder.py on its own does not trigger this bug.
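For reference, this is roughly what such a mask looks like before it reaches the encoder (a minimal sketch with placeholder shapes, not the actual code in audio/dataset.py); note the dtype is torch.int64, not torch.bool:

import torch

# Placeholder shapes; the real mask comes from audio/dataset.py.
B, L = 4, 48
mask = torch.zeros(B, L, dtype=torch.long)
mask[:, :4] = 1          # 1, 1, 1, 1, 0, 0, ... as printed above
print(mask.dtype)        # torch.int64 -- an integer mask, not a boolean one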

My env is: torch 1.13.1, torchaudio 0.13.1, transformers 4.26.0, python 3.8.16

thanks, Junru

arxyzan commented 1 year ago

Hello @LuJunru, thanks for your feedback. This is indeed a strange bug. I remember facing this problem once in another project; I changed the mask tensor from torch.int64 to torch.bool in the encoder's forward, like so:

# model forward in audio/encoder.py
def forward(self, inputs, mask=None, **kwargs):
    """
    Forward inputs through the encoder and extract transformer/attention layers outputs
    Args:
        inputs: raw audio array
        mask: bool masked indices
        **kwargs: keyword args specific to the encoder's forward method
    Returns:
        A dictionary of the encoder outputs including transformer layers outputs and attentions outputs
    """
    if mask is not None:
        mask = mask.bool()  # << CHANGE DTYPE LIKE THIS >>
    outputs = self.encoder(inputs, mask_time_indices=mask, output_hidden_states=True,
                           output_attentions=True, **kwargs)
    encoder_states = outputs['hidden_states'][:-1]  # encoder layers outputs separately
    encoder_out = outputs['hidden_states'][-1]  # last encoder output (accumulated)
    attentions = outputs['attentions']
    return {
        'encoder_states': encoder_states,
        'encoder_out': encoder_out,
        'attentions': attentions
    }

Please try this and let me know if it's resolved. Best, Aryan
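For context, here is a minimal illustration of why the dtype matters (this is generic PyTorch indexing behaviour, not the actual code path inside transformers): an integer tensor used as an index selects rows by value, while a boolean tensor of the same shape masks elements, so an int64 mask where a bool one is expected can silently do something very different.

import torch

x = torch.arange(6.0).reshape(2, 3)
mask_long = torch.tensor([[1, 0, 0], [0, 1, 0]])  # dtype torch.int64
mask_bool = mask_long.bool()

print(x[mask_bool])  # boolean masking: tensor([0., 4.])
print(x[mask_long])  # integer (fancy) indexing: shape (2, 3, 3), rows picked by index value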

LuJunru commented 1 year ago


Hi @arxyzan,

Thank you for the quick response. I followed your advice and directly added .bool() in audio/dataset.py. It works.
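For anyone else hitting this, the change is essentially the following (just a sketch; the surrounding code in audio/dataset.py looks different, the point is only to cast the mask to bool before it is returned):

import torch

# Sketch of the edit in audio/dataset.py (shapes and names here are illustrative).
batch_size, seq_len = 4, 48
mask = torch.zeros(batch_size, seq_len, dtype=torch.long)
mask[:, :4] = 1
mask = mask.bool()  # the added .bool() call fixes the dtype at the source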

best, Junru