facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

How do I get the embeddings in each layer of a speech-to-text model for a given LibriSpeech input? #4357

Open hacobe opened 2 years ago

hacobe commented 2 years ago

❓ Questions and Help

What is your question?

How do I get the embeddings in each layer of a speech-to-text model for a given LibriSpeech input?

What have you tried?

I followed the instructions here (https://github.com/pytorch/fairseq/blob/main/examples/speech_to_text/docs/librispeech_example.md) to prepare the LibriSpeech dataset and train a speech-to-text model.

I then load the model:

from fairseq.models.speech_to_text import S2TTransformerModel

model = S2TTransformerModel.from_pretrained(
    'path/to/my/model/dir',
    checkpoint_file='avg_last_10_checkpoint.pt',
    config_yaml='path/to/config'
)

model.eval()
model.cuda(0)

At this point, I'm not sure what to do. As a preliminary step, I tried to figure out how to run prediction with the model, following https://github.com/pytorch/fairseq/issues/3069, but when I run predict on the first input of dev-clean.tsv:

model.predict(".../fbank80.zip:85409674480:287808")

I get the error: ValueError: Unknown value: input_type = fbank80.

What's your environment?

hacobe commented 2 years ago

I still don't know how to extract embeddings.
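One approach that usually works for this in PyTorch is to register forward hooks on each encoder layer and collect their outputs during a forward pass. Below is a toy sketch on a plain nn.TransformerEncoder standing in for the speech encoder; the corresponding fairseq attribute path (e.g. model.models[0].encoder.transformer_layers) is an assumption and should be checked against the S2T model code:

```python
import torch
import torch.nn as nn

# Toy stand-in for the S2T encoder: 3 transformer layers, 16-dim features.
layer = nn.TransformerEncoderLayer(d_model=16, nhead=2, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=3)

captured = []  # will hold (layer_index, output_tensor) pairs

def make_hook(idx):
    # register_forward_hook passes (module, inputs, output); we keep the output.
    def hook(module, inputs, output):
        captured.append((idx, output.detach()))
    return hook

# For a real fairseq model, iterate over the encoder's layer list instead
# of encoder.layers (exact attribute name is an assumption).
handles = [l.register_forward_hook(make_hook(i))
           for i, l in enumerate(encoder.layers)]

x = torch.randn(1, 10, 16)  # (batch, time, feature_dim)
with torch.no_grad():
    encoder(x)

for h in handles:
    h.remove()  # always detach hooks when done

print(len(captured))  # one captured embedding per layer
```

Each entry in `captured` then has the same (batch, time, feature) shape as the layer's output, so the per-layer embeddings for a given input can be read off directly.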

However, I can make model.predict work in the example above by doing the following:

1) Add these lines to the config.yaml:

hub:
  input_type: fbank80_w_utt_cmvn

2) Replace "feat.unsqueeze(0)" with "np.expand_dims(feat, 0)" in speech_to_text/hub_interface.py (feat is a NumPy array, not a Torch tensor)

3) Run model.predict(".../ARCTIC/cmu_us_aew_arctic/wav/arctic_a0001.wav")
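The substitution in step 2 works because np.expand_dims is the NumPy counterpart of torch's unsqueeze: both prepend a batch dimension. A quick check (the 100×80 filterbank shape here is illustrative, not taken from the dataset):

```python
import numpy as np

# A fake 80-dim filterbank feature matrix: (num_frames, num_mel_bins).
feat = np.zeros((100, 80), dtype=np.float32)

# NumPy equivalent of feat.unsqueeze(0): add a leading batch dimension.
batched = np.expand_dims(feat, 0)
print(batched.shape)  # (1, 100, 80)
```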