NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

How to fix the delay in timestamps that builds up for large audio files? #1979

Closed Vishaal-MK closed 2 years ago

Vishaal-MK commented 3 years ago

I'm using the code provided here to extract timestamps along with the transcription. The audio files are fairly long (30 to 60 mins). The problem is that the timestamps get progressively offset until they are completely misaligned halfway through the file. Any idea how to fix this?


marlon-br commented 3 years ago

@vsl9 Could you please describe how the following parameters are calculated?

# 20ms is duration of a timestep at output of the model
time_stride = 0.02

# calibration offset for timestamps: 180 ms
offset = -0.18

And what should these values be for the stt_en_citrinet_1024 model?

vsl9 commented 3 years ago

@Vishaal-MK, have you solved the issue? If not, can you please share an audio file that reproduces it? I've tested the notebook on long conference talks (~40 minutes), and it works fine.

@marlon-br, time_stride is the duration of one timestep of the logits tensor (the output of the ASR model). QuartzNet (or Jasper) takes a mel spectrogram (computed with overlapping windows at a 10 ms stride) as input. The first convolutional layer has stride=2, so the ASR model downsamples its input by 2; that is why one timestep of the logits tensor lasts 20 ms.

The offset parameter compensates for a slight delay in the model's output. It was set manually by tuning the offset and listening to the audio (though it could, of course, be calibrated automatically if ground-truth word-level timestamps were available).

Citrinet models have 3 convolutional layers with stride=2, so the overall downsampling factor is 8 and time_stride=0.08. This also means the resolution of word timestamps is 4 times coarser than with QuartzNet/Jasper. Another difference is that Citrinet is a word-piece based model (not a character-based model like QuartzNet/Jasper), so the code for predicting space-character timestamps becomes slightly more complicated (it depends on Citrinet's tokens).
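To make the arithmetic above concrete, here is a minimal sketch (not NeMo code; the helper names are made up for illustration) of how a logit frame index maps to seconds for each model family:

```python
# Minimal sketch (not the NeMo API; helper names are illustrative).
# 10 ms mel-spectrogram hop, scaled by the model's total conv downsampling:
#   QuartzNet/Jasper: one stride-2 layer    -> factor 2 -> 20 ms per timestep
#   Citrinet:         three stride-2 layers -> factor 8 -> 80 ms per timestep
FEATURE_STRIDE_S = 0.01  # 10 ms window stride of the input spectrogram

def time_stride(downsampling_factor: int) -> float:
    """Duration of one logit timestep, in seconds."""
    return FEATURE_STRIDE_S * downsampling_factor

def frame_to_seconds(frame_idx: int, downsampling_factor: int,
                     offset_s: float = 0.0) -> float:
    """Map a logit frame index to a time in seconds, with calibration offset."""
    return frame_idx * time_stride(downsampling_factor) + offset_s

# QuartzNet frame 100 with the hand-tuned -180 ms offset: ~1.82 s
print(frame_to_seconds(100, downsampling_factor=2, offset_s=-0.18))
# The same frame index in a Citrinet model covers 4x more audio: ~8.0 s
print(frame_to_seconds(100, downsampling_factor=8))
```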

marlon-br commented 3 years ago

@vsl9 thanks for the answer, it is much clearer now. I am experimenting with the stt_en_citrinet_1024 model because it gives the best results on my test files.

Could you please confirm or correct my logic? (I use this wav file as an example.)

Model output is: let me guess you're the kind of guy that ignores the rules because it makes you feel uncontroll am i right you're not wrong you think that's cute do you think it's cute six feet at all times you both know the rules

And by chars:

blank
blank
let
blank
blank
blank
me
blank
guess
blank
blank
blank
blank
you
##'
##re
blank
the
blank
kind
blank
blank
of
blank
gu
##y
blank
that
blank
i
##g
##no
##re
##re
##s
blank
the
ru
blank
##les

etc.

If we ignore the offset for now, does the following word-level output look correct?

let 0.16 0.24
me 0.48 0.56
guess 0.64 0.72
you're 1.04 1.28
the 1.36 1.44
kind 1.52 1.6
of 1.76 1.84
guy 1.92 2.08
that 2.16 2.24
ignores 2.32 2.8000000000000003
the 2.88 2.96
rules 2.96 3.2800000000000002
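A small sketch (assumed helper, not NeMo code) of the logic implied above: collapse the per-frame Citrinet tokens into words, using time_stride = 0.08, with a word's start at its first frame and its end at one frame past its last piece. On the first frames from the token list above, it reproduces the first four rows of the table:

```python
# Sketch (illustrative, not the NeMo API): merge per-frame word-piece outputs
# into word-level timestamps. One token per logit frame; "##" marks a
# continuation piece, "blank" is the CTC blank.
TIME_STRIDE = 0.08  # Citrinet: 8x downsampling of 10 ms features

def words_with_timestamps(frame_tokens):
    words = []  # (word, start_s, end_s)
    word, start, end = None, None, None
    prev = "blank"
    for i, tok in enumerate(frame_tokens):
        if tok == "blank":
            prev = "blank"
            continue
        if tok == prev:
            end = i  # CTC repeat of the same emission: extend the span
            continue
        if tok.startswith("##") and word is not None:
            word += tok[2:]  # continuation piece: extend the current word
            end = i
        else:
            if word is not None:  # new word starts: flush the previous one
                words.append((word, start * TIME_STRIDE, (end + 1) * TIME_STRIDE))
            word, start, end = tok, i, i
        prev = tok
    if word is not None:
        words.append((word, start * TIME_STRIDE, (end + 1) * TIME_STRIDE))
    return words

# First frames of the token list quoted in this thread
frames = ["blank", "blank", "let", "blank", "blank", "blank", "me",
          "blank", "guess", "blank", "blank", "blank", "blank",
          "you", "##'", "##re", "blank"]
for w, s, e in words_with_timestamps(frames):
    print(f"{w} {s:.2f} {e:.2f}")
# let 0.16 0.24
# me 0.48 0.56
# guess 0.64 0.72
# you're 1.04 1.28
```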