NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Weird inference #281

Closed dmitrytyrin closed 3 years ago

dmitrytyrin commented 4 years ago

Hello! I trained quartznet15x5 and wanted to perform inference on some audio. I used this script (from examples/applications/asr_service):

import os
from ruamel.yaml import YAML
import nemo
import nemo_asr
from nemo_asr.helpers import post_process_predictions

MODEL_YAML = 'path/to/quartznet15x5.yaml'
CHECKPOINT_ENCODER = 'path/to/JasperEncoder-STEP-200000.pt'
CHECKPOINT_DECODER = 'path/to/JasperDecoderForCTC-STEP-200000.pt'
manifest = 'path/to/manifest.json'

yaml = YAML(typ="safe")
with open(MODEL_YAML) as f:
    jasper_model_definition = yaml.load(f)
labels = jasper_model_definition['labels']

neural_factory = nemo.core.NeuralModuleFactory(
    placement=nemo.core.DeviceType.CPU,
    backend=nemo.core.Backend.PyTorch)

data_preprocessor = nemo_asr.AudioToMelSpectrogramPreprocessor(
    factory=neural_factory)

jasper_encoder = nemo_asr.JasperEncoder(
    jasper=jasper_model_definition['JasperEncoder']['jasper'],
    activation=jasper_model_definition['JasperEncoder']['activation'],
    feat_in=jasper_model_definition['AudioToMelSpectrogramPreprocessor']['features'])
jasper_encoder.restore_from(CHECKPOINT_ENCODER, local_rank=0)

jasper_decoder = nemo_asr.JasperDecoderForCTC(
    feat_in=1024,
    num_classes=len(labels))
jasper_decoder.restore_from(CHECKPOINT_DECODER, local_rank=0)

greedy_decoder = nemo_asr.GreedyCTCDecoder()

data_layer = nemo_asr.AudioToTextDataLayer(
    shuffle=False,
    manifest_filepath=manifest,
    labels=labels,
    batch_size=1)

# Define inference DAG
audio_signal, audio_signal_len, _, _ = data_layer()
processed_signal, processed_signal_len = data_preprocessor(
    input_signal=audio_signal,
    length=audio_signal_len)
encoded, encoded_len = jasper_encoder(
    audio_signal=processed_signal,
    length=processed_signal_len)
log_probs = jasper_decoder(encoder_output=encoded)
predictions = greedy_decoder(log_probs=log_probs)
eval_tensors = [predictions]
tensors = neural_factory.infer(tensors=eval_tensors)
prediction = post_process_predictions(tensors[0], labels)

But on every audio file I got a constant prediction (just the first letter from labels). I tried inference on different audio files, including training audio. I tried shuffling the labels in the config file and then got a different constant prediction (again, the first letter from the list of labels).

During training I got some NaN/inf warnings, but training continued. Also, I got correct predictions during training (not just a constant letter).

How is it possible? How can I fix my problem?

okuchaiev commented 4 years ago

Can you share more details on your dataset? What is the sampling rate? (e.g., the output of the play or aplay Linux tools)
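
If play/aplay aren't handy, a quick Python check works too (a minimal sketch, assuming a plain WAV file; the path is a placeholder):

import scipy.io.wavfile as wave

# Print the sample rate and basic shape/dtype of the file
rate, signal = wave.read('path/to/audio.wav')
print('sample rate:', rate, '| shape:', signal.shape, '| dtype:', signal.dtype)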

dmitrytyrin commented 4 years ago

@okuchaiev, the output from sox info (soxi) for an audio file from the dataset:

Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:09.88 = 158000 samples ~ 740.625 CDDA sectors
File Size      : 316k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM

I tried to perform inference on these audio files with another model (quartznet5x3) trained for 8 kHz audio, and even then I got non-constant predictions.

Can you please help me construct inference exactly as in the training process (quartznet.py)?

vsl9 commented 4 years ago

@dmitrytyrin, can you please try the following script:

import nemo, nemo_asr
from nemo_asr.helpers import post_process_predictions
from ruamel.yaml import YAML
from nemo.backends.pytorch.nm import DataLayerNM
from nemo.core.neural_types import NeuralType, BatchTag, TimeTag, AxisType
import torch
import numpy as np
import scipy.io.wavfile as wave

MODEL_YAML = 'examples/asr/configs/quartznet15x5.yaml'

# TODO: update to your checkpoints
CHECKPOINT_ENCODER = 'quartznet15x5/JasperEncoder-STEP-247400.pt'
CHECKPOINT_DECODER = 'quartznet15x5/JasperDecoderForCTC-STEP-247400.pt'

# TODO: update to your audio file
AUDIO_FILE = "./input.wav"

class AudioDataLayer(DataLayerNM):
    @property
    def output_ports(self):
        return {
            "audio_signal": NeuralType({0: AxisType(BatchTag),
                                        1: AxisType(TimeTag)}),

            "a_sig_length": NeuralType({0: AxisType(BatchTag)}),
        }

    def __init__(self, **kwargs):
        DataLayerNM.__init__(self, **kwargs)
        # Emit the currently set signal once, then stop iteration
        self.output = True

    def __iter__(self):
        return self

    def __next__(self):
        if not self.output:
            raise StopIteration
        self.output = False
        return torch.as_tensor(self.signal, dtype=torch.float32), \
               torch.as_tensor(self.signal_shape, dtype=torch.int64)

    def set_signal(self, signal):
        # Normalize 16-bit PCM to [-1, 1] and add a batch dimension
        self.signal = np.reshape(signal.astype(np.float32)/32768., [1, -1])
        self.signal_shape = np.expand_dims(self.signal.size, 0).astype(np.int64)
        self.output = True

    def __len__(self):
        return 1

    @property
    def dataset(self):
        return None

    @property
    def data_iterator(self):
        return self

yaml = YAML(typ="safe")
with open(MODEL_YAML) as f:
    model_definition = yaml.load(f)
labels = model_definition['labels']
# Disable dithering so preprocessing is deterministic at inference time
model_definition['AudioToMelSpectrogramPreprocessor']['dither'] = 0

neural_factory = nemo.core.NeuralModuleFactory(
    placement=nemo.core.DeviceType.GPU,
    backend=nemo.core.Backend.PyTorch)
data_preprocessor = nemo_asr.AudioToMelSpectrogramPreprocessor(
    factory=neural_factory,
    **model_definition["AudioToMelSpectrogramPreprocessor"])
jasper_encoder = nemo_asr.JasperEncoder(
    feat_in=model_definition["AudioToMelSpectrogramPreprocessor"]["features"],
    **model_definition["JasperEncoder"])
jasper_decoder = nemo_asr.JasperDecoderForCTC(
    feat_in=model_definition["JasperEncoder"]["jasper"][-1]["filters"],
    num_classes=len(labels))
greedy_decoder = nemo_asr.GreedyCTCDecoder()

jasper_encoder.restore_from(CHECKPOINT_ENCODER)
jasper_decoder.restore_from(CHECKPOINT_DECODER)

# Instantiate necessary neural modules
data_layer = AudioDataLayer()

# Define inference DAG
audio_signal, audio_signal_len = data_layer()
processed_signal, processed_signal_len = data_preprocessor(
    input_signal=audio_signal,
    length=audio_signal_len)
encoded, encoded_len = jasper_encoder(audio_signal=processed_signal,
                                      length=processed_signal_len)
log_probs = jasper_decoder(encoder_output=encoded)
predictions = greedy_decoder(log_probs=log_probs)

_, signal = wave.read(AUDIO_FILE)
data_layer.set_signal(signal)
tensors = neural_factory.infer([predictions], verbose=False)
preds = tensors[0][0]
transcript = post_process_predictions([preds], labels)[0]

print('Transcript: "{}"'.format(transcript))

dmitrytyrin commented 4 years ago

@vsl9, thank you very much for your script! Unfortunately, I still get a constant prediction (the first letter of my labels)...

Same audio during training:

[screenshot: predictions during training]

and on inference:

[screenshot: predictions at inference]

I tried changing the labels in the config file to English, but again I got just the first letter...

Some audio representations. Waveform:

[screenshot: waveform]

Mel spectrogram:

[screenshot: mel spectrogram]

Array from wave.read:

[screenshot: array from wave.read]

Is it possible that my checkpoints weren't restored correctly or are corrupted?

vsl9 commented 4 years ago

The signal looks fine. Some things to check:

dmitrytyrin commented 4 years ago

So, for each letter the model gives NaN log probs, and they are converted to nulls. But why NaNs? And why can I see predictions during training? Do you have any ideas?
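
For reference, this is roughly how the log probs can be inspected (a minimal sketch reusing the DAG from vsl9's script above; it assumes the same list-based neural_factory.infer call used there):

import torch

# Request the raw log probs alongside the greedy predictions
# (log_probs and predictions are the DAG outputs defined in the script above)
log_probs_out, preds_out = neural_factory.infer([log_probs, predictions], verbose=False)

lp = log_probs_out[0]  # first (and only) batch
print('log_probs shape:', lp.shape)
print('contains NaN:', torch.isnan(lp).any().item())
print('contains inf:', torch.isinf(lp).any().item())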

wjyfelicity commented 4 years ago

@dmitrytyrin, I have met the same problem as you. Did you solve it?

dmitrytyrin commented 4 years ago

@wjyfelicity, I did not solve it. Maybe some of my audio files were corrupted. Try running one epoch with batch_size=1 to find the broken audio file (it gives NaN during training). Also, try rebuilding your venv with the latest torch and NeMo.
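
Something along these lines can help spot bad files up front (a rough sketch; 'path/to/manifest.json' is a placeholder, and it assumes the standard NeMo ASR manifest format with one JSON object per line containing an 'audio_filepath' key):

import json
import numpy as np
import scipy.io.wavfile as wave

# Scan every file referenced by the manifest for obvious problems
with open('path/to/manifest.json') as f:
    for line in f:
        entry = json.loads(line)
        path = entry['audio_filepath']
        rate, signal = wave.read(path)
        signal = signal.astype(np.float32)
        if signal.size == 0 or not np.isfinite(signal).all():
            print('suspicious file:', path)
        if rate != 16000:
            print('unexpected sample rate ({}): {}'.format(rate, path))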

wjyfelicity commented 4 years ago

@dmitrytyrin, I am sure my audio files are good, because they can be transcribed by another trained model. But with this trained model (I just adjusted some parameters), all the inference results are " ", and there were no problems when training these models. [screenshot]

vsl9 commented 4 years ago

NaNs, zeros, and empty strings during inference on valid audio files certainly don't look good. Can you please provide more information on your experiment? What version of NeMo do you use? How do you run training and inference? Which scripts/models do you use? Is there a small example to reproduce the issue?