NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

How to view raw and augmented spectrograms?[Question] #1645

Closed rbracco closed 3 years ago

rbracco commented 3 years ago

Describe your question

I would like to display the spectrograms generated by QuartzNet and SpecAugment, but I'm having trouble doing so.

Things I tried:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-198-408d2f6bb357> in <module>()
----> 1 empty_model.preprocessor(input_signal=y_resampled, input_signal_length=y_resampled.size(-1))

2 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/usr/local/lib/python3.6/dist-packages/nemo/core/classes/common.py in __call__(self, wrapped, instance, args, kwargs)
    506 
    507         # Perform rudimentary input checks here
--> 508         instance._validate_input_types(input_types=input_types, **kwargs)
    509 
    510         # Call the method - this can be forward, or any other callable method

/usr/local/lib/python3.6/dist-packages/nemo/core/classes/common.py in _validate_input_types(self, input_types, **kwargs)
     96                 if key not in input_types:
     97                     raise TypeError(
---> 98                         f"Input argument {key} has no corresponding input_type match. "
     99                         f"Existing input_types = {input_types.keys()}"
    100                     )

TypeError: Input argument input_signal_length has no corresponding input_type match. Existing input_types = dict_keys(['input_signal', 'length'])


titu1994 commented 3 years ago

The TypeError says that "input_signal_length" has no corresponding input type, and that the two valid input names are input_signal and length.

So if you rename the keyword argument in the empty_model.preprocessor() call from input_signal_length to just length, it should work.
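To make the fix concrete, here is a minimal sketch of the corrected call. The argument names come from the error message above; `empty_model` stands in for a loaded QuartzNet instance, and the model call itself is left commented since it needs a NeMo checkpoint.

```python
import torch

# Batch of one 1-second clip at 16 kHz (random data, for illustration only)
signal = torch.randn(1, 16000)

# `length` is a 1D tensor: one entry per batch item, holding that clip's
# number of samples (NOT the old keyword `input_signal_length`)
length = torch.tensor([signal.shape[-1]])

# processed, processed_len = empty_model.preprocessor(
#     input_signal=signal, length=length)
```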

rbracco commented 3 years ago

Thank you, I was able to get this working with your help, and I'll share the code in case anyone else needs to replicate it. That said, I'm still hoping there is a better way to do this. I also had to change normalization in the config from per_feature to all_features to avoid a NaN issue.

Any insight into why the preprocessor returns two channels, one of which is empty? Is it because I pass the length as a 2D tensor (shape: [1, num_samples]) instead of a 1D tensor? Thank you.

import torch
import torchaudio
import librosa.display
import matplotlib.pyplot as plt
from IPython.display import Audio, display

plt.rcParams["figure.figsize"] = (12, 9)

def display_specs(model, audio_file):
    display(Audio(audio_file))
    model.to('cpu')
    y0, sr = torchaudio.load(audio_file)
    # Resample to the 16 kHz rate the model expects
    y0r = torchaudio.transforms.Resample(sr, 16000)(y0)
    # `length` must be a 1D tensor: one entry per batch item (here, per channel)
    y0_len = torch.full((y0r.shape[0],), y0r.shape[-1])
    with torch.no_grad():
        spec_result = model.preprocessor(input_signal=y0r, length=y0_len)
        aug_result = model.spec_augmentation(input_spec=spec_result[0][1].unsqueeze(0))
    fig, ax = plt.subplots(nrows=3, ncols=1, sharex=True)
    ax[0].set(title="First channel of spec")
    librosa.display.specshow(spec_result[0][0].numpy(), ax=ax[0])
    ax[1].set(title="Second channel of spec")
    librosa.display.specshow(spec_result[0][1].numpy(), ax=ax[1])
    ax[2].set(title="Augmented second channel of spec")
    librosa.display.specshow(aug_result[0].numpy(), ax=ax[2])

Use it by calling display_specs(<your_model_instance>, <path_to_audiofile>)

titu1994 commented 3 years ago

> I also had to change normalization in the config from per_feature to all_features to avoid a NaN issue.

This should not be needed

> Any insight into why the preprocessor returns two channels, one of which is empty? Is it because I pass the length as a 2D tensor (shape: [1, num_samples]) instead of a 1D tensor?

Length is supposed to be a 1D tensor with each element representing the duration per sample. So len(length) == batch size; length[0] = len(sample_0).
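A small sketch of that contract, with two padded clips of different durations (the shapes here are illustrative, not from the thread):

```python
import torch

# Two clips zero-padded to the longest one; `length` records the true
# number of samples per batch item before padding
batch = torch.zeros(2, 16000)
length = torch.tensor([16000, 12000])   # clip 0 is full length, clip 1 is shorter

assert len(length) == batch.shape[0]    # len(length) == batch size
assert length[0].item() == 16000        # length[0] = len(sample_0)
```

Note also that torchaudio loads a stereo file as a [2, num_samples] tensor, so the preprocessor would treat each channel as a separate batch item, which may explain the two "channels" in the output above.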

You can take a look at FilterbankFeatures to see what's happening inside the preprocessor module.