Audio offset get wrong after some time when streaming audios

GrenardAntoine commented 2 months ago

Hello,

I use microsoft-cognitiveservices-speech-sdk (1.38.0) for real-time speech to text. The offset seems correct when I send the full audio in one piece, but it is wrong when I send the same audio split into many chunks.

The more audio chunks there are, the more inaccurate the offset becomes:

No chunks : 1 726 300 000
369 chunks of 0.5 seconds : 1 729 600 000
923 chunks of 0.2 seconds : 1 744 600 000
1443 chunks of 0.1 seconds : 1 757 900 000
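
For scale: the Speech SDK reports offsets in ticks of 100 nanoseconds, so the figures above can be converted into seconds of drift relative to the unchunked run. A small sketch of that arithmetic follows. The `measurements` array and the `driftReport` helper are illustrative names, not SDK APIs, and the per-chunk figure divides by the total chunk count, which is only an approximation because the thread does not say how many chunks precede the measured phrase.

```javascript
// 1 tick = 100 ns, so there are 10,000,000 ticks per second.
const TICKS_PER_SECOND = 10000000;

// Offsets reported above; chunks = 0 is the unchunked baseline.
const measurements = [
  { chunks: 0, offsetTicks: 1726300000 },
  { chunks: 369, offsetTicks: 1729600000 },
  { chunks: 923, offsetTicks: 1744600000 },
  { chunks: 1443, offsetTicks: 1757900000 },
];

// Returns the drift of each chunked run against the first (baseline) row.
function driftReport(rows) {
  const baseline = rows[0].offsetTicks;
  return rows.slice(1).map((row) => {
    const driftSeconds = (row.offsetTicks - baseline) / TICKS_PER_SECOND;
    return {
      chunks: row.chunks,
      driftSeconds,
      // Rough average only: divides by the total number of chunks.
      driftMsPerChunk: (driftSeconds * 1000) / row.chunks,
    };
  });
}

console.log(driftReport(measurements));
```

With these numbers the drift comes to roughly 0.33 s, 1.83 s and 3.16 s, or about 0.9, 2.0 and 2.2 ms per chunk.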

Here is some code to reproduce the issue:

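    // Imports added so the snippet runs on its own under Node.js.
    const fs = require("fs");
    const {
      SpeechConfig, AudioInputStream, AudioConfig, SpeechRecognizer,
    } = require("microsoft-cognitiveservices-speech-sdk");

    // Replace the placeholders with your own subscription key and region.
    const speechConfig = SpeechConfig.fromSubscription("<KEY>", "<REGION>");

    const pushStream = AudioInputStream.createPushStream();
    const audioConfig = AudioConfig.fromStreamInput(pushStream);
    const speechRecognizer = new SpeechRecognizer(speechConfig, audioConfig);

    // Log every final recognition result and every cancellation event.
    speechRecognizer.recognized = async (recognizer, event) => {console.log(event)}
    speechRecognizer.canceled = async (recognizer, event) => {console.log(event)}
    speechRecognizer.startContinuousRecognitionAsync();

    // Push each chunk file, in order, into the stream.
    for (let i = 1; i <= 1443; i++) {
      const formattedNumber = i.toString().padStart(4, '0');
      const buffer = fs.readFileSync(`/var/tmp/chunks/output_${formattedNumber}.wav`);
      pushStream.write(buffer);
    }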

The audio chunks were created with:

    ffmpeg -i <INPUT_FILE> -f segment -segment_time 0.1 -c copy output_%04d.wav
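
One thing that may be worth ruling out, although nothing in this thread confirms it as the cause: every file written by ffmpeg's segment muxer with a `.wav` name is a complete WAV file with its own RIFF header, and the reproduction pushes each whole file into the stream. If the push stream treats its input as raw PCM, those header bytes would be counted as audio once per chunk. The sketch below strips the container before writing. `extractPcm` is a hypothetical helper, not an SDK function, and it assumes a standard RIFF/WAVE chunk layout.

```javascript
// Hypothetical helper (not part of the Speech SDK): returns only the PCM
// payload of a RIFF/WAVE buffer. Assumes the standard chunk layout:
// 4-byte id, 4-byte little-endian size, then the chunk body.
function extractPcm(wavBuffer) {
  const isWav =
    wavBuffer.length >= 12 &&
    wavBuffer.toString("ascii", 0, 4) === "RIFF" &&
    wavBuffer.toString("ascii", 8, 12) === "WAVE";
  if (!isWav) {
    return wavBuffer; // No WAV container found: assume the bytes are raw PCM.
  }

  let pos = 12; // First chunk starts right after "RIFF<size>WAVE".
  while (pos + 8 <= wavBuffer.length) {
    const id = wavBuffer.toString("ascii", pos, pos + 4);
    const size = wavBuffer.readUInt32LE(pos + 4);
    const start = pos + 8;
    if (id === "data") {
      // Clamp in case the declared size exceeds the bytes actually present.
      const end = Math.min(start + size, wavBuffer.length);
      return wavBuffer.subarray(start, end);
    }
    pos = start + size + (size % 2); // Chunk bodies are padded to even length.
  }
  return Buffer.alloc(0); // Container without a data chunk.
}
```

In the reproduction loop this would mean calling `pushStream.write(extractPcm(buffer))` instead of `pushStream.write(buffer)`, then checking whether the offsets still drift with the chunk count.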

Here is the link to the audio file: https://drive.google.com/file/d/1H_RJuqMiBaVkpo9XHrgp1bpuFdgQl64O/view?usp=sharing

Thanks for your help

github-actions[bot] commented 1 month ago

This item has been open without activity for 19 days. Provide a comment on status and remove "update needed" label.

GrenardAntoine commented 1 month ago

Since the offset gets more and more wrong over time, I decided to restart recognition with startContinuousRecognitionAsync() every minute. This clearly doesn't fix the problem, but it mitigates it.
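
A restart like this raises the question of how to map per-session offsets back onto the original audio timeline. The sketch below shows one possible way to keep that bookkeeping on the client. It rests on assumptions the thread does not confirm: that offsets start again from zero after each restart, that the stream carries raw 16 kHz, 16-bit, mono PCM (32,000 bytes per second), and that the new session begins exactly at the audio pushed so far. `SessionOffsetTracker` is a hypothetical class, not part of the SDK.

```javascript
// 1 tick = 100 ns, the unit the Speech SDK uses for offsets.
const TICKS_PER_SECOND = 10000000;

// Hypothetical bookkeeping helper (not an SDK class).
class SessionOffsetTracker {
  // bytesPerSecond defaults to 16 kHz * 2 bytes * 1 channel (an assumption).
  constructor(bytesPerSecond = 32000) {
    this.bytesPerSecond = bytesPerSecond;
    this.bytesPushed = 0;
    this.sessionBaseTicks = 0;
  }

  // Call with the byte length of every buffer written to the push stream.
  recordWrite(byteLength) {
    this.bytesPushed += byteLength;
  }

  // Call immediately before restarting recognition.
  markRestart() {
    this.sessionBaseTicks = Math.round(
      (this.bytesPushed / this.bytesPerSecond) * TICKS_PER_SECOND
    );
  }

  // Converts an offset reported within the current session to the
  // timeline of the whole audio, under the assumptions above.
  toAbsoluteTicks(sessionOffsetTicks) {
    return this.sessionBaseTicks + sessionOffsetTicks;
  }
}
```

For example, after one minute of audio has been recorded with `recordWrite` and `markRestart` has been called, a session offset of 5,000,000 ticks (0.5 s) maps to 605,000,000 ticks (60.5 s) on the full timeline.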

CDSFounder commented 1 month ago

We are also seeing that the word offset resets after every ~10 min - the SDK creates a new connection and resets the word offsets.

@Azure Speech team - how are we supposed to keep track of word and phrase timestamps for real-time speech to text that runs longer than 10 min?

rhurey commented 3 weeks ago

@CDSFounder are you looking at the JSON or the .offset property on the result?

The .offset property should be fixed up to produce an increasing offset. The JSON was only being partially corrected, which is something we'll look at for a future release.
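
To see which of the two values an application is actually reading, the `.offset` property and the offset inside the JSON payload can be logged side by side. A minimal sketch, assuming the result object exposes `offset` (a number, in ticks) and `json` (a string), and that the JSON carries a top-level numeric `Offset` field. `compareOffsets` is a hypothetical helper, not an SDK function.

```javascript
// Hypothetical helper: compares the .offset property of a result with the
// "Offset" field inside its raw JSON payload. Both are in 100 ns ticks.
function compareOffsets(result) {
  const parsed = JSON.parse(result.json);
  // Assumption: the payload has a top-level numeric "Offset" field.
  const jsonOffset = typeof parsed.Offset === "number" ? parsed.Offset : null;
  return {
    propertyOffset: result.offset,
    jsonOffset,
    // A non-zero difference means the two sources disagree.
    differenceTicks: jsonOffset === null ? null : result.offset - jsonOffset,
  };
}

// Possible use inside the handler from the reproduction above:
// speechRecognizer.recognized = (recognizer, event) => {
//   console.log(compareOffsets(event.result));
// };
```

If the difference is zero early in a session and becomes non-zero after the reconnect described above, the application is most likely reading the JSON value rather than the property.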

Azure-Samples / cognitive-services-speech-sdk

Audio offset get wrong after some time when streaming audios #2578