NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Weird inference #281

Closed dmitrytyrin closed 3 years ago

dmitrytyrin commented 4 years ago

Hello! I trained quartznet15x5 and wanted to perform inference on some audio. I used this script (from examples/applications/asr_service):

import os
from ruamel.yaml import YAML
import nemo
import nemo_asr
from nemo_asr.helpers import post_process_predictions

MODEL_YAML = 'path/to/quartznet15x5.yaml'
CHECKPOINT_ENCODER = 'path/to/JasperEncoder-STEP-200000.pt'
CHECKPOINT_DECODER = 'path/to/JasperDecoderForCTC-STEP-200000.pt'
manifest = 'path/to/manifest.json'

yaml = YAML(typ="safe")
with open(MODEL_YAML) as f:
    jasper_model_definition = yaml.load(f)
labels = jasper_model_definition['labels']

neural_factory = nemo.core.NeuralModuleFactory(
    placement=nemo.core.DeviceType.CPU,
    backend=nemo.core.Backend.PyTorch)

data_preprocessor = nemo_asr.AudioToMelSpectrogramPreprocessor(
    factory=neural_factory)

jasper_encoder = nemo_asr.JasperEncoder(
    jasper=jasper_model_definition['JasperEncoder']['jasper'],
    activation=jasper_model_definition['JasperEncoder']['activation'],
    feat_in=jasper_model_definition['AudioToMelSpectrogramPreprocessor']['features'])
jasper_encoder.restore_from(CHECKPOINT_ENCODER, local_rank=0)

jasper_decoder = nemo_asr.JasperDecoderForCTC(
    feat_in=1024,
    num_classes=len(labels))
jasper_decoder.restore_from(CHECKPOINT_DECODER, local_rank=0)

greedy_decoder = nemo_asr.GreedyCTCDecoder()

data_layer = nemo_asr.AudioToTextDataLayer(
    shuffle=False,
    manifest_filepath=manifest,
    labels=labels,
    batch_size=1)

# Define inference DAG
audio_signal, audio_signal_len, _, _ = data_layer()
processed_signal, processed_signal_len = data_preprocessor(
    input_signal=audio_signal,
    length=audio_signal_len)
encoded, encoded_len = jasper_encoder(
    audio_signal=processed_signal,
    length=processed_signal_len)
log_probs = jasper_decoder(encoder_output=encoded)
predictions = greedy_decoder(log_probs=log_probs)
eval_tensors = [predictions]
tensors = neural_factory.infer(tensors=eval_tensors)
prediction = post_process_predictions(tensors[0], labels)

But on every audio file I got a constant prediction (just the first letter from labels). I tried inference on different audio files, including training audio. I tried shuffling the labels in the config file and then got a different constant prediction (again, the first letter from the list of labels).

During training I got some NaN/inf warnings, but training continued. Also, I got correct predictions during training (not just a constant letter).

How is it possible? How can I fix my problem?

okuchaiev commented 4 years ago

Can you share more details on your dataset? What is the sampling rate? (e.g., the output of the play or aplay Linux tools)
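
If play/aplay aren't handy, a quick Python check works too (a minimal sketch, assuming a plain WAV file; the path is a placeholder):

import scipy.io.wavfile as wave

# Print the sample rate and basic shape/dtype of the file
rate, signal = wave.read('path/to/audio.wav')
print('sample rate:', rate, '| shape:', signal.shape, '| dtype:', signal.dtype)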

dmitrytyrin commented 4 years ago

@okuchaiev, the output from sox info (soxi) for an audio file from the dataset:

Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:09.88 = 158000 samples ~ 740.625 CDDA sectors
File Size      : 316k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM

I tried to perform inference on these audio files with another model (quartznet5x3) trained for 8 kHz audio, and even then I got non-constant predictions.

Can you please help me construct inference exactly as in the training process (quartznet.py)?

vsl9 commented 4 years ago

@dmitrytyrin, can you please try the following script:

import nemo, nemo_asr
from nemo_asr.helpers import post_process_predictions
from ruamel.yaml import YAML
from nemo.backends.pytorch.nm import DataLayerNM
from nemo.core.neural_types import NeuralType, BatchTag, TimeTag, AxisType
import torch
import numpy as np
import scipy.io.wavfile as wave

MODEL_YAML = 'examples/asr/configs/quartznet15x5.yaml'

# TODO: update to your checkpoints
CHECKPOINT_ENCODER = 'quartznet15x5/JasperEncoder-STEP-247400.pt'
CHECKPOINT_DECODER = 'quartznet15x5/JasperDecoderForCTC-STEP-247400.pt'

# TODO: update to your audio file
AUDIO_FILE = "./input.wav"

class AudioDataLayer(DataLayerNM):
    @property
    def output_ports(self):
        return {
            "audio_signal": NeuralType({0: AxisType(BatchTag),
                                        1: AxisType(TimeTag)}),

            "a_sig_length": NeuralType({0: AxisType(BatchTag)}),
        }

    def __init__(self, **kwargs):
        DataLayerNM.__init__(self, **kwargs)
        # Emit the currently set signal once, then stop iteration
        self.output = True

    def __iter__(self):
        return self

    def __next__(self):
        if not self.output:
            raise StopIteration
        self.output = False
        return torch.as_tensor(self.signal, dtype=torch.float32), \
               torch.as_tensor(self.signal_shape, dtype=torch.int64)

    def set_signal(self, signal):
        # Normalize 16-bit PCM to [-1, 1] and add a batch dimension
        self.signal = np.reshape(signal.astype(np.float32)/32768., [1, -1])
        self.signal_shape = np.expand_dims(self.signal.size, 0).astype(np.int64)
        self.output = True

    def __len__(self):
        return 1

    @property
    def dataset(self):
        return None

    @property
    def data_iterator(self):
        return self

yaml = YAML(typ="safe")
with open(MODEL_YAML) as f:
    model_definition = yaml.load(f)
labels = model_definition['labels']
# Disable dithering so preprocessing is deterministic at inference time
model_definition['AudioToMelSpectrogramPreprocessor']['dither'] = 0

neural_factory = nemo.core.NeuralModuleFactory(
    placement=nemo.core.DeviceType.GPU,
    backend=nemo.core.Backend.PyTorch)
data_preprocessor = nemo_asr.AudioToMelSpectrogramPreprocessor(
    factory=neural_factory,
    **model_definition["AudioToMelSpectrogramPreprocessor"])
jasper_encoder = nemo_asr.JasperEncoder(
    feat_in=model_definition["AudioToMelSpectrogramPreprocessor"]["features"],
    **model_definition["JasperEncoder"])
jasper_decoder = nemo_asr.JasperDecoderForCTC(
    feat_in=model_definition["JasperEncoder"]["jasper"][-1]["filters"],
    num_classes=len(labels))
greedy_decoder = nemo_asr.GreedyCTCDecoder()

jasper_encoder.restore_from(CHECKPOINT_ENCODER)
jasper_decoder.restore_from(CHECKPOINT_DECODER)

# Instantiate necessary neural modules
data_layer = AudioDataLayer()

# Define inference DAG
audio_signal, audio_signal_len = data_layer()
processed_signal, processed_signal_len = data_preprocessor(
    input_signal=audio_signal,
    length=audio_signal_len)
encoded, encoded_len = jasper_encoder(audio_signal=processed_signal,
                                      length=processed_signal_len)
log_probs = jasper_decoder(encoder_output=encoded)
predictions = greedy_decoder(log_probs=log_probs)

_, signal = wave.read(AUDIO_FILE)
data_layer.set_signal(signal)
tensors = neural_factory.infer([predictions], verbose=False)
preds = tensors[0][0]
transcript = post_process_predictions([preds], labels)[0]

print('Transcript: "{}"'.format(transcript))

dmitrytyrin commented 4 years ago

@vsl9, thank you very much for your script! Unfortunately, I still get a constant prediction (the first letter of my labels)...

Same audio during training:

[screenshot: predictions during training]

and on inference:

[screenshot: predictions at inference]

I tried changing the labels in the config file to English, but again I got just the first letter...

Some audio representations. Waveform:

[screenshot: waveform]

Mel spectrogram:

[screenshot: mel spectrogram]

Array from wave.read:

[screenshot: array from wave.read]

Is it possible that my checkpoints weren't restored correctly or are corrupted?

vsl9 commented 4 years ago

The signal looks fine. Some things to check:

dmitrytyrin commented 4 years ago

So, for each letter the model gives NaN log probs, and they are converted to nulls. But why NaNs? And why can I see predictions during training? Do you have any ideas?
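
For reference, this is roughly how the log probs can be inspected (a minimal sketch reusing the DAG from vsl9's script above; it assumes the same list-based neural_factory.infer call used there):

import torch

# Request the raw log probs alongside the greedy predictions
# (log_probs and predictions are the DAG outputs defined in the script above)
log_probs_out, preds_out = neural_factory.infer([log_probs, predictions], verbose=False)

lp = log_probs_out[0]  # first (and only) batch
print('log_probs shape:', lp.shape)
print('contains NaN:', torch.isnan(lp).any().item())
print('contains inf:', torch.isinf(lp).any().item())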

wjyfelicity commented 4 years ago

@dmitrytyrin, I have met the same problem as you. Did you solve it?

dmitrytyrin commented 4 years ago

@wjyfelicity, I did not solve it. Maybe some of my audio files were corrupted. Try running one epoch with batch_size=1 to find the broken audio file (it gives NaN during training). Also, try rebuilding your venv with the latest torch and NeMo.
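
Something along these lines can help spot bad files up front (a rough sketch; 'path/to/manifest.json' is a placeholder, and it assumes the standard NeMo ASR manifest format with one JSON object per line containing an 'audio_filepath' key):

import json
import numpy as np
import scipy.io.wavfile as wave

# Scan every file referenced by the manifest for obvious problems
with open('path/to/manifest.json') as f:
    for line in f:
        entry = json.loads(line)
        path = entry['audio_filepath']
        rate, signal = wave.read(path)
        signal = signal.astype(np.float32)
        if signal.size == 0 or not np.isfinite(signal).all():
            print('suspicious file:', path)
        if rate != 16000:
            print('unexpected sample rate ({}): {}'.format(rate, path))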

wjyfelicity commented 4 years ago

@dmitrytyrin, I am sure my audio files are good, because they can be transcribed by another trained model. But with this trained model (I just adjusted some parameters), all the inference results are " ", and there were no problems when training these models. [screenshot]

vsl9 commented 4 years ago

NaNs, zeros, and empty strings during inference on valid audio files certainly don't look good. Can you please provide more information on your experiment? What version of NeMo do you use? How do you run training and inference? Which scripts/models do you use? Is there a small example to reproduce the issue?