[ACL 2024] Official PyTorch code for extracting features and training downstream models with emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation
Thank you for the great work you've done on this model! Is there a supported way to run batched inference through funasr? I've been batching with padding and setting `padding_mask` to mask out the unused frames, but I don't get the same results as when I run inference one utterance at a time.
Here's a sample of the code I'm using. I've tried a number of different argument configurations - there are several mask-related parameters, and it seems like `mask` controls the masked-prediction (MLM-style) pretraining behavior, while `padding_mask` acts as the attention mask? I'm not sure, though, because there's no documentation. Any guidance would be appreciated.
import torch
from funasr.utils.load_utils import load_audio_text_image_video
from funasr import AutoModel
from torch.nn.utils.rnn import pad_sequence
model = AutoModel(model="iic/emotion2vec_plus_large").model
model.eval()
model.to("cuda")
padding_value = -1
# audios is a list of audio tensors resampled to 16 kHz
x = load_audio_text_image_video(audios)
x = [torch.nn.functional.layer_norm(x_, x_.shape).squeeze() for x_ in x]
# move the padded batch to the same device as the model
masked_x = pad_sequence(x, batch_first=True, padding_value=padding_value).to("cuda")
mask = masked_x == padding_value
out = model.extract_features(masked_x, mask=False, padding_mask=mask, remove_extra_tokens=True)
out_mask = out["padding_mask"]
feats = out["x"]
feats[out_mask] = 0
print(feats.sum(dim=1) / (~out_mask).sum(dim=1).unsqueeze(-1))
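One possible pitfall in the snippet above, independent of how funasr interprets the mask internally: building the padding mask with `masked_x == padding_value` will also flag any real sample that happens to equal exactly -1.0 after layer norm. A sketch of a length-based alternative, using hypothetical helper names (`pad_with_length_mask`, `masked_mean`) that are not part of the funasr API:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_with_length_mask(tensors):
    """Pad a list of 1-D tensors and return (batch, padding_mask).

    The mask is True at padded positions, matching the convention the
    snippet above assumes for `padding_mask`, but it is derived from the
    true lengths rather than from a sentinel value.
    """
    lengths = torch.tensor([t.shape[0] for t in tensors])
    batch = pad_sequence(tensors, batch_first=True, padding_value=0.0)
    # a position is padding iff its index is >= the utterance length
    idx = torch.arange(batch.shape[1]).unsqueeze(0)
    mask = idx >= lengths.unsqueeze(1)
    return batch, mask

def masked_mean(feats, mask):
    """Mean-pool (batch, time, dim) features, ignoring padded frames."""
    feats = feats.masked_fill(mask.unsqueeze(-1), 0.0)
    return feats.sum(dim=1) / (~mask).sum(dim=1).clamp(min=1).unsqueeze(-1)
```

With a length-derived mask the padding value itself no longer matters, so padding with 0.0 is safe even if some real samples are zero.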