Closed: LuJunru closed this issue 1 year ago.
Hello @LuJunru, thanks for your feedback.
This is indeed a strange bug. I remember facing this problem once in another project; I fixed it by changing the mask tensor from torch.int64 to torch.bool in the encoder forward, like so:
# model forward in audio/encoder.py
def forward(self, inputs, mask=None, **kwargs):
    """
    Forward inputs through the encoder and extract transformer/attention layers outputs

    Args:
        inputs: raw audio array
        mask: bool masked indices
        **kwargs: keyword args specific to the encoder's forward method

    Returns:
        A dictionary of the encoder outputs including transformer layers outputs and attentions outputs
    """
    if mask is not None:
        mask = mask.bool()  # << CHANGE DTYPE LIKE THIS >>
    outputs = self.encoder(inputs, mask_time_indices=mask, output_hidden_states=True,
                           output_attentions=True, **kwargs)
    encoder_states = outputs['hidden_states'][:-1]  # encoder layers outputs separately
    encoder_out = outputs['hidden_states'][-1]      # last encoder output (accumulated)
    attentions = outputs['attentions']
    return {
        'encoder_states': encoder_states,
        'encoder_out': encoder_out,
        'attentions': attentions
    }
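For context (this is just my understanding of plain PyTorch indexing, not code from this repo): a boolean tensor acts as a mask, while an integer tensor is interpreted as indices, so the same indexing expression selects completely different elements when the dtype is int64. A minimal sketch:

import torch

hidden_states = torch.arange(24.0).reshape(2, 4, 3)        # (batch, time, features)
mask_int = torch.tensor([[1, 1, 0, 0],
                         [0, 1, 1, 0]], dtype=torch.int64)  # binary values, but int64 dtype
mask_bool = mask_int.bool()

# Boolean tensor -> mask semantics: keeps only the positions where the mask is True.
print(hidden_states[mask_bool].shape)   # torch.Size([4, 3])

# Integer tensor -> index semantics: the 0/1 values are used as batch indices,
# so the expression gathers whole examples instead of masking time steps.
print(hidden_states[mask_int].shape)    # torch.Size([2, 4, 4, 3])

I believe transformers applies mask_time_indices with boolean indexing internally, which is why the dtype matters here.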
Please try this and let me know if it's resolved. Best, Aryan
Hi @arxyzan,
Thank you for the quick response. I followed your advice and directly added .bool() in audio/dataset.py, and it works.
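For anyone hitting the same thing, the change is essentially just casting the mask to bool where the batch is built, something along these lines (a simplified sketch; the field and function names are illustrative, not the exact code in audio/dataset.py):

import torch

# Simplified sketch of a collate step; the real audio/dataset.py uses different names.
def collate_batch(features):
    inputs = torch.stack([f["input_values"] for f in features])
    # Cast the binary 0/1 mask to torch.bool here so mask_time_indices downstream
    # receives the dtype the encoder expects.
    mask = torch.stack([f["mask"] for f in features]).bool()
    return {"inputs": inputs, "mask": mask}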
best, Junru
Hi @arxyzan,
I ran into a rather strange bug: the mask values overflow during audio pretraining.
Here: https://github.com/arxyzan/data2vec-pytorch/blob/main/audio/encoder.py#L35. The mask is passed as mask_time_indices when computing the output hidden states.
The input mask is fine, a binary matrix (B, L): 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
However, after the computation, the mask value overflowed, like this: 1421638320, 1421638576, 1421638832, 1421639088, 1421639344, 1421639600, 1421639856, 1421640112, 1421640368, 1421640624, 1421640880, 1421641136, 1421641392, 1421641648, 1421641904, 1421642160...
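For reference, I saw this just by printing the mask right before and after the encoder call inside forward (the prints are my own debugging additions, not repo code):

# Debug prints added around the encoder call in audio/encoder.py (not part of the repo code)
print(mask.dtype, mask[0, :8])   # before the call: torch.int64, clean 0/1 values
outputs = self.encoder(inputs, mask_time_indices=mask, output_hidden_states=True,
                       output_attentions=True, **kwargs)
print(mask.dtype, mask[0, :8])   # after the call: the huge values shown above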
Have you ever met such an issue? By the way, this only happens when running train.py; debugging audio/encoder.py alone does not reproduce the bug.
My env is: torch 1.13.1, torchaudio 0.13.1, transformers 4.26.0, Python 3.8.16.
thanks, Junru