dusty-nv / jetson-voice

ASR/NLP/TTS deep learning inference library for NVIDIA Jetson using PyTorch and TensorRT

Bad asr prediction on audio with a bit of noise #18

Open grahameth opened 1 year ago

grahameth commented 1 year ago

Hi, first of all, thank you for providing this repo! I was able to set up speech recognition on my Jetson Nano 2GB relatively easily with it. However, the quality of the prediction with the microphone I'm using is quite poor:

First I checked the provided dusty.wav file with the asr.py example. The predicted full sentences are, just as in the readme, pretty good:

hi hi this is dusty check on two two three.
what's the weather going to be tomorrow in pittsburg.
today is wednesday tomorrow is thursday.
i would like to order a large pepperoni pizza.

Then I played this audio on a speaker and recorded it with the microphone that I intend to use for detection. It produced this audio file. If you play it, you can hear some noise, but the voice is still very clear (apart from the first 5 seconds). Still, the prediction on it is pretty bad:

they're going to be.
dawned.
thursday.
larger.
i going tomorrow.
this.
chat.
so.
three.
what weather.
tomorrow pittsburgh.
today is wednesday.
rotary.
ron.
is going tomorrow.
this is dusty.
ca no.
the.
what the weather tomorrow in pittsburgh.
today is wednesday tomorrow's thursday.

When I talk myself, the prediction is similarly bad.

Do you have an idea what might be causing this? Maybe there is a relatively simple fix to the preprocessing pipeline, or some configuration I can try? I noticed that my recording has a very slight echo. Maybe it's worth augmenting the training data in a similar way and retraining? If you think that might help, can you outline how I would do that? Or is there a better version of the QuartzNet model out there? You mentioned Riva in another issue; sadly I cannot use that because I need this to work on the Jetson Nano 2GB, and QuartzNet already uses 95% of the memory I have. So it would be nice to make this work.

kurkovpavel commented 1 year ago

QuartzNet is a very light network for this task; try Citrinet or a Conformer network instead. Conformer gives much better results, but you need to implement your own BPE decoder.
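For reference, one way to pull and export one of those models to ONNX on a desktop machine is directly through the NeMo toolkit (a sketch, assuming nemo_toolkit[asr] is installed; the repo's scripts/nemo_export_onnx.py presumably does something similar):

```python
import nemo.collections.asr as nemo_asr

# Download the pretrained checkpoint from NGC and export it to ONNX.
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_en_conformer_ctc_small")
model.export("stt_en_conformer_ctc_small.onnx")
```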

grahameth commented 1 year ago

Thanks for the reply! I downloaded the stt_en_conformer_ctc_small.nemo model and converted it to ONNX with scripts/nemo_export_onnx.py (I had to do this on my desktop PC; the Nano didn't have enough RAM). I then copied the resulting .onnx and .json files into a copy of quartznet-15x5_en and edited the JSON file to include "ctc_decoder". When I run it I get this error message:

[2023-04-08 14:19:00] asr.py:77 - Traceback (most recent call last):
  File "examples/asr.py", line 63, in <module>
    results = asr(samples)
  File "/jetson-voice/jetson_voice/models/asr/asr_engine.py", line 180, in __call__
    logits = self.model.execute(x)
  File "/jetson-voice/jetson_voice/backends/tensorrt/trt_model.py", line 114, in execute
    setup_binding(self.bindings[idx], input)
  File "/jetson-voice/jetson_voice/backends/tensorrt/trt_model.py", line 109, in setup_binding
    binding.set_shape(input.shape)
  File "/jetson-voice/jetson_voice/backends/tensorrt/trt_binding.py", line 80, in set_shape
    raise ValueError(f"failed to set binding '{self.name}' with shape {shape}")
ValueError: failed to set binding 'audio_signal' with shape (1, 80, 154)

This is probably because the Conformer model outputs BPE tokens. Can you tell me how to implement this BPE decoder?

kurkovpavel commented 1 year ago

Conformers have a different input shape (1x80xX, plus a separate length input), so edit asr_engine.py like this:

    # load the model
    features = self.config.preprocessor.n_mels if self.classification else self.config.preprocessor.features
    time_to_fft = self.sample_rate * (1.0 / 160.0)     # rough conversion from samples to MEL spectrogram dims

    dynamic_shapes = {
        'min' : (1, features, int(0.1 * time_to_fft)), 'min2': (1,), 
        'opt' : (1, features, int(5 * time_to_fft)), 'opt2' : (1,), 
        'max' : (1, features, int(15 * time_to_fft)), 'max2' : (1,)  
    }

...

    # apply pre-processing
    preprocessed_signal, audio_length = self.preprocessor(
        input_signal=torch.as_tensor(self.buffer, dtype=torch.float32).unsqueeze(dim=0), 
        length=torch.as_tensor(self.buffer.size, dtype=torch.int32).unsqueeze(dim=0)
    )

...

    # run the asr model
    if not self.classification and self.config['preprocessor']['features'] == 64:
        logits = self.model.execute((torch_to_numpy(preprocessed_signal)))
    else:
        logits = self.model.execute((torch_to_numpy(preprocessed_signal),torch_to_numpy(audio_length)))
    logits = np.squeeze(logits)
    logits = softmax(logits, axis=-1)

...

    if self.classification:
        argmax = np.argmax(logits)
        prob = logits[argmax]
        return (self.config['labels'][argmax], prob)
    else:
        self.ctc_decoder.set_timestep_duration(self.timestep_duration)
        self.ctc_decoder.set_timestep_delta(self.n_timesteps_frame)

...

        transcripts = self.ctc_decoder.decode(logits)

...

        return transcripts

Edit the JSON file like this:

"preprocessor": { "_target_": "nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor", "normalize": "per_feature", "window_size": 0.025, "sample_rate": 16000, "window_stride": 0.01, "window": "hann", "features": 80, "n_fft": 512, "frame_splicing": 1, "dither": 1e-05, "stft_conv": false, "pad_to": 16 },

int(15 * time_to_fft) is approximately 3-4 chunk-seconds of input voice. In (1, 80, X): 1 is the batch size, 80 is the number of preprocessor features, and X is the time dimension of the input voice after preprocessing. I used the large Conformer model; the Jetson Nano can run it with onnxruntime if you have the 4 GB board. I didn't try the small or medium models, so I can't say anything about them. I also tried TensorRT, but my TRT version failed to build the engine from the ONNX.
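To make those profiles concrete, here is the arithmetic from the snippet above at the default 16 kHz sample rate (just re-doing the math, nothing repo-specific):

```python
sample_rate = 16000
time_to_fft = sample_rate * (1.0 / 160.0)   # 100 spectrogram frames per second (10 ms hop)

print(int(0.1 * time_to_fft))   # 'min' profile:   10 frames (~0.1 s of audio)
print(int(5 * time_to_fft))     # 'opt' profile:  500 frames (~5 s of audio)
print(int(15 * time_to_fft))    # 'max' profile: 1500 frames (~15 s of audio)
```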

You can change the default backend engine in utils/config.py by editing _default_global_config and setting 'default_backend' : 'onnxruntime'.
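For example (a sketch; only the 'default_backend' key comes from the comment above, the rest of _default_global_config is whatever jetson-voice already defines there):

```python
# jetson_voice/utils/config.py -- only the relevant key shown
_default_global_config = {
    # ... keep the existing keys as they are ...
    'default_backend': 'onnxruntime',   # use onnxruntime instead of the TensorRT backend
}
```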

kurkovpavel commented 1 year ago

What is your stt_en_conformer_ctc_small.nemo version? https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_ctc_small I see versions with both 1024 and 128 vocabulary sizes there.

I wrote decoders for both; my 128-token version looks like this:

class CTCConformerDecoder(CTCDecoder):
    def __init__(self, config, whole_config, vocab, resource_path=None):

    super().__init__(config, vocab)
    self.config.setdefault('word_threshold', -1000.0)
    self.reset()

    logging.info('creating CTCDecoder')
    logging.info(str(self.config))

    self.tokenizer_path = resource_path+whole_config['tokenizer']['dir'].lower()
    vocab = []
    with open(self.tokenizer_path+'/vocab.txt') as f:
        for line in f:
            vocab.append(line.rstrip('\n'))
        if '_' not in vocab:
            #vocab.insert(1023,'_')
            vocab[0]='n'
            vocab.insert(129,'_')
    self.vocab = vocab
    print(self.vocab)
    self.reset()

def decode(self, logits):
    text = []
    test = []

    test_probs = []
    prob = 1.0
    probs = []
    prev = ''
    timespend = 0
    new_word = False
    # select the chars with the max probability
    for i in range(logits.shape[0]):
        argmax = np.argmax(logits[i])
        argmax2 = np.argmax(np.delete(logits[i],argmax))
        test.append(self.vocab[argmax-1])
        test_probs.append(argmax-1)
        if "n" in self.vocab[argmax-1]:
            new_word = True

        elif "##" in self.vocab[argmax-1] and new_word == False:
            text.append((self.vocab[argmax-1]).replace('##',''))
            probs.append(logits[i][argmax])
        elif "##" in self.vocab[argmax-1] and new_word == True:
            text.append(' ')
            probs.append(1)     
            new_word = False
            timespend -= 1
            text.append((self.vocab[argmax-1]).replace('##',''))
            probs.append(logits[i][argmax])

        elif "_" in self.vocab[argmax-1]:
            text.append('')
            probs.append(1)
        elif self.vocab[argmax-1] != prev:
            text.append(' ')
            probs.append(1)
            text.append(self.vocab[argmax-1])
            probs.append(logits[i][argmax])
            timespend -= 1
        prev = self.vocab[argmax-1]

    #print("in test: {} ".format(test))
    # get the max number of sequential silent timesteps (continuing from last frame)
    silent_timesteps = self.end_silent_timesteps
    max_silent_timesteps = 0

    for i in range(len(text)):
        if text[i] == '':
            silent_timesteps += 1
        else:
            max_silent_timesteps = max(silent_timesteps, max_silent_timesteps) if i > 0 else 0
            silent_timesteps = 0

    if text[-1] == '':
        self.end_silent_timesteps = silent_timesteps

    # merge repeating chars and blank symbols
    text_merged, words = self.merge_chars(text, probs, timespend)  #text[:len(text)-self.config['offset']]
    #print("text_merged: {}, words: {}".format(text_merged, words))
    #print(text_merged)

    # merge new words with past words
    words = merge_words(self.words, words, self.config['word_threshold'], 'overlap')
    #print(words)

    # increment timestep (after this frame's timestep is done being used, and before a potential EOS reset)
    self.timestep += self.timestep_delta+1

    # check for EOS
    end = False

    transcript = transcript_from_words(words, scores=global_config.debug, times=global_config.debug, end=end, add_punctuation=self.config['add_punctuation'])
    words = words[punct_position:]     

    if silent_timesteps > self.timesteps_silence:
        end = True
        self.reset()
    else:
        self.words = words

    return [{
        'text' : transcript,
        'words' : words,
        'end' : end
    }]

def merge_chars(self, text, probs, timespend):
    """
    Merge repeating chars and blank symbols into words.
    """
    text_merged = ''

    word = None
    words = []

    for i in range(len(text)):
        #print("current:{} prev:{}".format(text[i],self.prev_char))
        timespend += 1

        if (text[i] != self.prev_char and text[i] != '_'):
            self.prev_char = text[i]

            if text[i] != '_':
                text_merged += text[i]

                if " " not in text[i]:
                    if word is None:
                        word = {
                            'text' : text[i],
                            'score' : probs[i],
                            'start_time' : self.timestep + timespend,
                            'end_time' : self.timestep + timespend
                        }
                    else:
                        word['text'] += text[i]
                        word['score'] = (word['score'] + probs[i])/2
                        word['end_time'] = self.timestep + timespend

            if " " in text[i] and word is not None:
                words.append(word)
                word = None

    if word is not None:
        words.append(word)

    return text_merged, words

def reset(self):
    """
    Reset the CTC decoder state at EOS (end of sentence)
    """
    self.prev_char = ''
    self.end_silent_timesteps = 0
    self.timestep = 0
    self.words = []   

@property
def language_model(self):
    return self.config['language_model']

You can create a new file ctc_conformer.py for this decoder and register it in ctc_decoder.py like this:


def from_config(config, vocab, whole_config, resource_path=None):
    type = config['type'].lower()

    if type == 'greedy':
        from .ctc_greedy import CTCGreedyDecoder
        return CTCGreedyDecoder(config, vocab)
    elif type == "beamsearch":
        from .ctc_beamsearch import CTCBeamSearchDecoder
        return CTCBeamSearchDecoder(config, vocab, resource_path)
    elif type == "subwords":
        from .ctc_subwords import CTCSubwordsDecoder
        return CTCSubwordsDecoder(config, whole_config, vocab, resource_path)      
    elif type == "ctc_conformer":
        from .ctc_conformer import CTCConformerDecoder
        return CTCConformerDecoder(config, whole_config, vocab, resource_path)        
    else:
        raise ValueError(f"invalid/unrecognized CTC decoder type '{type}'")


Edit the JSON file like this:

` "decoder": { "target": "nemo.collections.asr.modules.ConvASRDecoder", "feat_in": 512, "num_classes": 128 },

"ctc_decoder" : { "type": "ctc_conformer" }, `

grahameth commented 1 year ago

Thanks again for the detailed response!

I tried the "STT En Conformer-CTC Small" versions rc1.0.0 and 1.0.0 (both with vocabulary size 128), but wasn't able to make it work yet.

I have a question about this line of code:

self.model.execute((torch_to_numpy(preprocessed_signal), torch_to_numpy(audio_length)))

It seems the ONNX file you generated expects two input tensors, but mine only wants one, so when I call it the way you suggested, it crashes. When I remove the second tensor, it still crashes. It's possible that my ONNX file is corrupt; I suspect this because scripts/nemo_export_onnx.py prints this error message at the end (but still produces an .onnx file):

[W] Inference failed. You may want to try enabling partitioning to see better results. Note: Error was:
[ONNXRuntimeError] : 1 : FAIL : Node (Transpose_3244) Op (Transpose) [TypeInferenceError] Invalid attribute perm {1, 0}, input shape = {0}

exported data/stt_en_conformer_ctc_small10.nemo to data/con_small_10.onnx
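For what it's worth, here is a quick way to check how many inputs the exported graph actually has (a sketch with the onnx Python package; the filename is the one from the export log above):

```python
import onnx

model = onnx.load("data/con_small_10.onnx")
print("inputs: ", [i.name for i in model.graph.input])   # a two-input export should list something like ['audio_signal', 'length']
print("outputs:", [o.name for o in model.graph.output])
```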

Can you send me your onnx file?

kurkovpavel commented 1 year ago

Hi! I exported stt_en_conformer_ctc_small.nemo to ONNX: https://drive.google.com/file/d/1vRy695flgcvC5vPnZ9C4D8wUsiFA4tgl/view?usp=sharing. It has two inputs: audio_signal (float32[audio_signal_dynamic_axes_1, 80, audio_signal_dynamic_axes_2]) and length (int64[length_dynamic_axes_1]). The output is logprobs (float32[logprobs_dynamic_axes_1, logprobs_dynamic_axes_2, 129]). The model is trained on the LibriSpeech dataset; I'm not sure whether it needs fine-tuning for better quality. I checked the vocabulary for this model: it has 128 tokens (byte-pair encoding), but the model output has dimension 129. I think the extra token is a merge symbol, although I didn't dig into it deeply; you would need to print the raw logits coming out of the softmax function.
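A minimal way to exercise that file with onnxruntime on a desktop machine (a sketch; the frame count and the interpretation of length as the number of spectrogram frames are assumptions):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("stt_en_conformer_ctc_small.onnx")   # path to the file from the link above

frames = 200                                              # ~2 s of audio at a 10 ms hop (assumption)
audio_signal = np.random.randn(1, 80, frames).astype(np.float32)
length = np.array([frames], dtype=np.int64)               # assumed to be the frame count

logprobs = sess.run(["logprobs"], {"audio_signal": audio_signal, "length": length})[0]
print(logprobs.shape)                                     # expected (1, T, 129) per the shapes above
```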

grahameth commented 1 year ago

Thanks again! Sadly I cannot load your model:

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from /jetson-voice/data/networks/asr/stt_en_conformer_ctc_small/quartznet.onnx failed:/home/onnxruntime/onnxruntime-py36/onnxruntime/core/graph/model.cc:111 onnxruntime::Model::Model(onnx::ModelProto&&, const PathString&, const IOnnxRuntimeOpSchemaRegistryList*, const onnxruntime::logging::Logger&) Unknown model file format version.

I also tried other versions of onnxruntime, but none were successful.

Can you tell me which Docker container you are using? The one in this repository has onnxruntime 1.7 installed, and that version cannot load your model. Or did you install the Python packages yourself? If so, can you tell me which packages, and which versions, you installed?

kurkovpavel commented 1 year ago

You are right, the ONNX file doesn't load on the Jetson Nano; I have version 1.7 too. I will try to export the model with another opset and produce a constant-folded ONNX file once I recover my NVIDIA drivers and toolkit on x86... oh, Ubuntu... it happened again...
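In the meantime, a quick way to inspect the mismatch on the desktop is to print the exported model's IR and opset versions (a sketch with the onnx package; the filename is assumed). As far as I understand, onnxruntime raises "Unknown model file format version" when the file's IR version is newer than that onnxruntime build supports:

```python
import onnx

model = onnx.load("stt_en_conformer_ctc_small.onnx")
print("IR version:", model.ir_version)
print("opsets:", [(op.domain, op.version) for op in model.opset_import])
```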