dmitrytyrin closed this issue 3 years ago
Can you share more details on your dataset? What is the sampling rate (e.g. as reported by the `play` or `aplay` Linux tools)?
@okuchaiev, here is the output of sox info (soxi) for an audio file from the dataset:
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:09.88 = 158000 samples ~ 740.625 CDDA sectors
File Size : 316k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
I also ran inference on this audio with another model (quartznet5x3) trained on 8 kHz audio, and even then I got non-constant predictions.
Can you please help me set up inference exactly as in the training process (quartznet.py)?
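For readers without sox installed, the same header fields can be read with Python's standard `wave` module. This is only a minimal sketch: `describe_wav` is a hypothetical helper name, and the demo file is synthetic silence written to mimic the layout reported above (mono, 16 kHz, 16-bit, 158000 samples), not the actual clip from the dataset.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Read the WAV header and return the fields that soxi reports."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "bits_per_sample": 8 * wav.getsampwidth(),
            "frames": frames,
            "duration_s": frames / rate,
        }


# Synthetic stand-in for the clip described above (assumption: silence only).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)        # mono
    wav.setsampwidth(2)        # 16-bit samples
    wav.setframerate(16000)    # 16 kHz
    wav.writeframes(b"\x00\x00" * 158000)

info = describe_wav(demo_path)
print(info)
```

Comparing `sample_rate` against the rate the model was trained on is a quick way to rule out a sampling-rate mismatch before looking at the model itself.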
@dmitrytyrin, can you please try the following script:
```python
import nemo, nemo_asr
from nemo_asr.helpers import post_process_predictions
from ruamel.yaml import YAML
from nemo.backends.pytorch.nm import DataLayerNM
from nemo.core.neural_types import NeuralType, BatchTag, TimeTag, AxisType
import numpy as np  # used by AudioDataLayer.set_signal
import torch
import scipy.io.wavfile as wave

MODEL_YAML = 'examples/asr/configs/quartznet15x5.yaml'
# TODO: update to your checkpoints
CHECKPOINT_ENCODER = 'quartznet15x5/JasperEncoder-STEP-247400.pt'
CHECKPOINT_DECODER = 'quartznet15x5/JasperDecoderForCTC-STEP-247400.pt'
# TODO: update to your audio file
AUDIO_FILE = "./input.wav"


class AudioDataLayer(DataLayerNM):
    """Data layer that feeds a single in-memory audio signal."""

    @property
    def output_ports(self):
        return {
            "audio_signal": NeuralType({0: AxisType(BatchTag),
                                        1: AxisType(TimeTag)}),
            "a_sig_length": NeuralType({0: AxisType(BatchTag)}),
        }

    def __init__(self, **kwargs):
        DataLayerNM.__init__(self, **kwargs)
        self.output = True

    def __iter__(self):
        return self

    def __next__(self):
        # Yield the current signal once, then stop
        if not self.output:
            raise StopIteration
        self.output = False
        return torch.as_tensor(self.signal, dtype=torch.float32), \
            torch.as_tensor(self.signal_shape, dtype=torch.int64)

    def set_signal(self, signal):
        # Scale 16-bit PCM samples to [-1, 1) and add a batch dimension
        self.signal = np.reshape(signal.astype(np.float32)/32768., [1, -1])
        self.signal_shape = np.expand_dims(self.signal.size, 0).astype(np.int64)
        self.output = True

    def __len__(self):
        return 1

    @property
    def dataset(self):
        return None

    @property
    def data_iterator(self):
        return self


yaml = YAML(typ="safe")
with open(MODEL_YAML) as f:
    model_definition = yaml.load(f)
labels = model_definition['labels']
# Disable dithering for inference
model_definition['AudioToMelSpectrogramPreprocessor']['dither'] = 0

neural_factory = nemo.core.NeuralModuleFactory(
    placement=nemo.core.DeviceType.GPU,
    backend=nemo.core.Backend.PyTorch)
data_preprocessor = nemo_asr.AudioToMelSpectrogramPreprocessor(
    factory=neural_factory,
    **model_definition["AudioToMelSpectrogramPreprocessor"])
jasper_encoder = nemo_asr.JasperEncoder(
    feat_in=model_definition["AudioToMelSpectrogramPreprocessor"]["features"],
    **model_definition["JasperEncoder"])
jasper_decoder = nemo_asr.JasperDecoderForCTC(
    feat_in=model_definition["JasperEncoder"]["jasper"][-1]["filters"],
    num_classes=len(labels))
greedy_decoder = nemo_asr.GreedyCTCDecoder()

jasper_encoder.restore_from(CHECKPOINT_ENCODER)
jasper_decoder.restore_from(CHECKPOINT_DECODER)

# Instantiate necessary neural modules
data_layer = AudioDataLayer()

# Define inference DAG
audio_signal, audio_signal_len = data_layer()
processed_signal, processed_signal_len = data_preprocessor(
    input_signal=audio_signal,
    length=audio_signal_len)
encoded, encoded_len = jasper_encoder(audio_signal=processed_signal,
                                      length=processed_signal_len)
log_probs = jasper_decoder(encoder_output=encoded)
predictions = greedy_decoder(log_probs=log_probs)

# Load the audio, run inference, and decode the prediction
_, signal = wave.read(AUDIO_FILE)
data_layer.set_signal(signal)
tensors = neural_factory.infer([predictions], verbose=False)
preds = tensors[0][0]
transcript = post_process_predictions([preds], labels)[0]
print('Transcript: "{}"'.format(transcript))
```
@vsl9, thank you very much for your script! Unfortunately, I still get a constant prediction (the first letter of my labels)...
Same audio during training:
and on inference:
I tried changing the labels in the config file to English, but again I got just the first letter...
Some audio representations: wave:
Mel spectrogram:
Array from wave.read:
Is it possible that my checkpoints can't be restored correctly or are corrupted?
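One way to test the corrupted-checkpoint hypothesis is to scan the saved weights for NaN or inf values. The sketch below is an assumption-laden illustration, not something confirmed in this thread: it assumes the `.pt` files are plain PyTorch state dicts that `torch.load` can read, and `non_finite_entries` is a hypothetical helper name. The demo runs on a made-up dictionary rather than a real checkpoint.

```python
import torch


def non_finite_entries(state_dict):
    """Return {name: count} for floating-point tensors holding NaN or inf."""
    bad = {}
    for name, value in state_dict.items():
        if torch.is_tensor(value) and value.is_floating_point():
            count = int((~torch.isfinite(value)).sum())
            if count:
                bad[name] = count
    return bad


# Hypothetical usage with a checkpoint path from this thread
# (assumes the file is a state dict):
# state = torch.load('quartznet15x5/JasperEncoder-STEP-247400.pt',
#                    map_location='cpu')
# print(non_finite_entries(state))

# Made-up tensors standing in for a checkpoint
demo = {
    "conv.weight": torch.tensor([0.1, float("nan"), 0.3]),
    "bn.running_var": torch.tensor([1.0, float("inf")]),
    "bn.num_batches_tracked": torch.tensor(10),  # integer tensor, skipped
    "ok.bias": torch.zeros(2),
}
bad = non_finite_entries(demo)
print(bad)
```

An empty result would suggest the weights themselves are finite and the problem lies elsewhere, for example in the input features.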
The signal looks fine. Some things to check:
Are you sure that the model converged? I see from your log that WER is 55% on a training batch. Prediction/reference pair shows just a single utterance from a batch.
Can you please look at the `preds` tensor in my script? It should contain an array of character indices before greedy merging.
Another option is to look at the raw log probabilities at each time step. Just pass `log_probs` instead of `predictions` into the `neural_factory.infer` method.
Have you tried running evaluation on the whole train or val dataset (instead of inference on a single audio clip)? Are the WERs from the checkpoints similar to the evaluation WERs during training?
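To make "before greedy merging" concrete, here is a small pure-Python sketch of greedy CTC merging: repeated indices are collapsed and blanks are dropped. The vocabulary is hypothetical, and placing the blank at index `len(labels)` is an assumption about the decoder layout rather than something stated in this thread. It also shows why a `preds` array that holds the same index in every frame decodes to a single character.

```python
def greedy_ctc_merge(indices, labels, blank_id=None):
    """Collapse repeated indices, then drop blanks (greedy CTC decoding)."""
    if blank_id is None:
        # Assumption: the blank is the extra class after the real labels
        blank_id = len(labels)
    chars = []
    previous = None
    for idx in indices:
        if idx != previous and idx != blank_id:
            chars.append(labels[idx])
        previous = idx
    return "".join(chars)


labels = [" ", "a", "b", "c"]  # hypothetical vocabulary
blank = len(labels)

healthy = [blank, 3, 3, blank, 1, 1, blank, 2]  # per-frame indices for "cab"
constant = [0] * 8                              # every frame predicts index 0

print(repr(greedy_ctc_merge(healthy, labels)))   # 'cab'
print(repr(greedy_ctc_merge(constant, labels)))  # ' '
```

If `preds` looks like the `constant` case, the merged transcript is just the first label, which matches the symptom described above.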
I think the model is converging (it's not the final epoch, about 2/3 of the way through). After each step/epoch the train batch loss and WER are decreasing... During training I get some NaN/inf warnings, but training continues. Evaluation on the test dataset always gives me NaN loss and 100% WER. (I don't expect high quality because there is a lot of noisy audio in my dataset.)
preds:
How do I get log_probs? `tensors = neural_factory.infer([log_probs], verbose=False)` gave me the error "TypeError: unhashable type: 'list'", and `tensors = neural_factory.infer(log_probs, verbose=False)` gave me "TypeError: 'NmTensor' object is not iterable".
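For context on the WER figures quoted in this thread, word error rate is the word-level edit distance between reference and hypothesis divided by the number of reference words. The function below is a minimal per-utterance sketch with made-up sentences; how the NeMo scripts aggregate WER across a dataset is not shown here and may differ.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "a"))            # 1.0
print(word_error_rate("the cat sat", " "))            # 1.0
```

A hypothesis that is always a single letter or a single space therefore scores close to 100% WER regardless of the reference.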
I ran evaluation with NeMo/examples/asr/jasper_eval.py. It gave me a WER of 99.5%. I printed greedy_hypotheses and all of them are constant (the first letter of the dictionary). I also printed logprobs and got this:
So, for each letter the model gives NaN log probabilities, and they are converted to zeros. But why NaNs? And why can I see predictions during training? Do you have any ideas?
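The link between NaN or zeroed log probabilities and a constant first-label prediction can be illustrated on plain Python lists. This is only a sketch with made-up numbers: `frame_report` is a hypothetical helper, and the argmax here lets the first index win ties, which is an assumption about how the framework's greedy decoder behaves on degenerate rows.

```python
import math


def frame_report(log_probs):
    """Summarise a [time][classes] table of log probabilities.

    Returns (number of frames with NaN/inf, per-frame argmax indices).
    """
    bad_frames = sum(
        1 for frame in log_probs
        if not all(math.isfinite(v) for v in frame)
    )
    # Index of the largest value per frame; the first index wins ties
    best = [max(range(len(frame)), key=frame.__getitem__)
            for frame in log_probs]
    return bad_frames, best


nan = float("nan")
healthy = [[-0.1, -2.5, -3.0], [-4.0, -0.2, -2.0]]
broken = [[nan, nan, nan], [0.0, 0.0, 0.0]]

print(frame_report(healthy))  # (0, [0, 1])
print(frame_report(broken))   # (1, [0, 0])
```

When every row is all-NaN or all-zero, no class stands out, so index 0 is selected for every frame and the transcript collapses to the first label.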
@dmitrytyrin, I have run into the same problem. Did you solve it?
@wjyfelicity, I did not solve it. Maybe some of my audio files were corrupted. Try running one epoch with batch_size=1 to find the broken audio file (the one that produces NaN during training). Also, try rebuilding your venv with the latest torch and NeMo.
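Before spending a full epoch at batch_size=1, a cheap pre-check is to scan the dataset for files that cannot be read, contain no samples, or are entirely silent. The sketch below uses only the standard library and assumes 16-bit PCM WAV files; `check_wav` and `write_wav` are hypothetical helpers, and whether such files actually cause the NaNs seen here is not established in this thread.

```python
import array
import os
import sys
import tempfile
import wave


def check_wav(path):
    """Return a list of problems found in a 16-bit PCM WAV file."""
    try:
        with wave.open(path, "rb") as wav:
            width = wav.getsampwidth()
            raw = wav.readframes(wav.getnframes())
    except (wave.Error, EOFError) as err:
        return ["unreadable: {}".format(err)]
    if width != 2:
        return ["expected 16-bit samples, got {}-bit".format(8 * width)]
    raw = raw[:len(raw) - len(raw) % 2]  # drop a trailing partial sample
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    if len(samples) == 0:
        return ["no samples"]
    if max(samples) == 0 and min(samples) == 0:
        return ["all samples are zero"]
    return []


def write_wav(path, samples):
    """Write mono 16 kHz 16-bit PCM test data (demo helper)."""
    data = array.array("h", samples)
    if sys.byteorder == "big":
        data.byteswap()
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(16000)
        wav.writeframes(data.tobytes())


# Synthetic demo files standing in for a real dataset
tmp = tempfile.mkdtemp()
files = {"good.wav": [0, 1200, -1200, 300],
         "silent.wav": [0, 0, 0, 0],
         "empty.wav": []}
report = {}
for name, samples in files.items():
    path = os.path.join(tmp, name)
    write_wav(path, samples)
    report[name] = check_wav(path)
print(report)
```

Files flagged here are reasonable first candidates to exclude from the manifest before retraining.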
@dmitrytyrin, I am sure my audio files are fine, because another trained model transcribes them correctly. With this model (I only adjusted some parameters), every inference result is " ", even though there were no problems during training.
NaNs, zeros, and empty strings during inference on valid audio files certainly don't look good. Can you please provide more information on your experiment? What version of NeMo do you use? How do you run training and inference? Which scripts/models do you use? Is there a small example that reproduces the issue?
Hello! I trained quartznet15x5 and wanted to perform inference on some audio. I used this script (from examples/applications/asr_service):
```python
import os
from ruamel.yaml import YAML
import nemo
import nemo_asr
from nemo_asr.helpers import post_process_predictions

MODEL_YAML = 'path/to/quartznet15x5.yaml'
CHECKPOINT_ENCODER = 'path/to/JasperEncoder-STEP-200000.pt'
CHECKPOINT_DECODER = 'path/to/JasperDecoderForCTC-STEP-200000.pt'
manifest = 'path/to/manifest.json'

yaml = YAML(typ="safe")
with open(MODEL_YAML) as f:
    jasper_model_definition = yaml.load(f)
labels = jasper_model_definition['labels']

neural_factory = nemo.core.NeuralModuleFactory(
    placement=nemo.core.DeviceType.CPU,
    backend=nemo.core.Backend.PyTorch)
data_preprocessor = nemo_asr.AudioToMelSpectrogramPreprocessor(
    factory=neural_factory)
jasper_encoder = nemo_asr.JasperEncoder(
    jasper=jasper_model_definition['JasperEncoder']['jasper'],
    activation=jasper_model_definition['JasperEncoder']['activation'],
    feat_in=jasper_model_definition[
        'AudioToMelSpectrogramPreprocessor']['features'])
jasper_encoder.restore_from(CHECKPOINT_ENCODER, local_rank=0)
jasper_decoder = nemo_asr.JasperDecoderForCTC(
    feat_in=1024, num_classes=len(labels))
jasper_decoder.restore_from(CHECKPOINT_DECODER, local_rank=0)
greedy_decoder = nemo_asr.GreedyCTCDecoder()
data_layer = nemo_asr.AudioToTextDataLayer(
    shuffle=False, manifest_filepath=manifest, labels=labels, batch_size=1)

# Inference DAG: audio -> features -> encoder -> decoder -> greedy decoding
audio_signal, audio_signal_len, _, _ = data_layer()
processed_signal, processed_signal_len = data_preprocessor(
    input_signal=audio_signal, length=audio_signal_len)
encoded, encoded_len = jasper_encoder(audio_signal=processed_signal,
                                      length=processed_signal_len)
log_probs = jasper_decoder(encoder_output=encoded)
predictions = greedy_decoder(log_probs=log_probs)

eval_tensors = [predictions]
tensors = neural_factory.infer(tensors=eval_tensors)
prediction = post_process_predictions(tensors[0], labels)
```
But for every audio file I got a constant prediction (just the first letter from the labels). I tried inference on different audio files, including training audio. I also tried shuffling the labels in the config file, and then I got a different constant prediction (again the first letter in the list of labels).
During training I got some NaN/inf warnings but training continued. Also, I got correct predictions during training (not just constant letter).
How is it possible? How can I fix my problem?