facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

How do I get the embeddings in each layer of a speech-to-text model for a given LibriSpeech input? #4357

Open hacobe opened 2 years ago

hacobe commented 2 years ago

❓ Questions and Help

What is your question?

How do I get the embeddings in each layer of a speech-to-text model for a given LibriSpeech input?

What have you tried?

I followed the instructions here (https://github.com/pytorch/fairseq/blob/main/examples/speech_to_text/docs/librispeech_example.md) to prepare the LibriSpeech dataset and train a speech-to-text model.

I then load the model:

from fairseq.models.speech_to_text import S2TTransformerModel

model = S2TTransformerModel.from_pretrained(
    'path/to/my/model/dir',
    checkpoint_file='avg_last_10_checkpoint.pt',
    config_yaml='path/to/config'
)

model.eval()
model.cuda(0)

At this point, I'm not sure what to do. As a preliminary step, I tried to figure out how to run prediction with the model, following https://github.com/pytorch/fairseq/issues/3069, but when I run predict on the first input of dev-clean.tsv:

model.predict(".../fbank80.zip:85409674480:287808")

I get the error: ValueError: Unknown value: input_type = fbank80.

What's your environment?

hacobe commented 2 years ago

I still don't know how to extract embeddings.
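One approach that usually works for this in PyTorch is to register forward hooks on each encoder layer and collect their outputs during a forward pass. Below is a toy sketch on a plain nn.TransformerEncoder standing in for the speech encoder; the corresponding fairseq attribute path (e.g. model.models[0].encoder.transformer_layers) is an assumption and should be checked against the S2T model code:

```python
import torch
import torch.nn as nn

# Toy stand-in for the S2T encoder: 3 transformer layers, 16-dim features.
layer = nn.TransformerEncoderLayer(d_model=16, nhead=2, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=3)

captured = []  # will hold (layer_index, output_tensor) pairs

def make_hook(idx):
    # register_forward_hook passes (module, inputs, output); we keep the output.
    def hook(module, inputs, output):
        captured.append((idx, output.detach()))
    return hook

# For a real fairseq model, iterate over the encoder's layer list instead
# of encoder.layers (exact attribute name is an assumption).
handles = [l.register_forward_hook(make_hook(i))
           for i, l in enumerate(encoder.layers)]

x = torch.randn(1, 10, 16)  # (batch, time, feature_dim)
with torch.no_grad():
    encoder(x)

for h in handles:
    h.remove()  # always detach hooks when done

print(len(captured))  # one captured embedding per layer
```

Each entry in `captured` then has the same (batch, time, feature) shape as the layer's output, so the per-layer embeddings for a given input can be read off directly.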

However, I can make model.predict work in the example above by doing the following:

1) Add these lines to the config.yaml:

hub:
  input_type: fbank80_w_utt_cmvn

2) Replace "feat.unsqueeze(0)" with "np.expand_dims(feat, 0)" in speech_to_text/hub_interface.py (feat is a NumPy array, not a Torch tensor)

3) Run model.predict(".../ARCTIC/cmu_us_aew_arctic/wav/arctic_a0001.wav")
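The substitution in step 2 works because np.expand_dims is the NumPy counterpart of torch's unsqueeze: both prepend a batch dimension. A quick check (the 100×80 filterbank shape here is illustrative, not taken from the dataset):

```python
import numpy as np

# A fake 80-dim filterbank feature matrix: (num_frames, num_mel_bins).
feat = np.zeros((100, 80), dtype=np.float32)

# NumPy equivalent of feat.unsqueeze(0): add a leading batch dimension.
batched = np.expand_dims(feat, 0)
print(batched.shape)  # (1, 100, 80)
```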