Provide Server Timestamp so we can relate the Conversation Transcription Result 'offset' to a a real time

josephsctan commented 2 years ago

With Conversation Transcription (or Speech Recognition), the result that we receive contains an 'offset' value.

e.g.

            transcriber.transcribed = function (s, e) {
                console.log (e.offset); // offset given in 'ticks'
            }

When using a 'live' audio stream , like fromDefaultMicrophoneInput(), it is not certain where the offset is calculated from.

I tried to capture the 'start' time when the session starts:

            var CTSessionStartTime;
            ....
            transcriber.sessionStarted = function (s, e) {
                      CTSessionStartTime = dayjs();
            }

But that value is inaccurate, as is evident when I review the audio recording.

Worse, it drifts , if I start and stop the transcription without leaving the conversation:

           // #A 
           transcriber.stopTranscribingAsync( () => {    self.IsCTMuted = true;       }); // Mute 
           // #B 
           transcriber.startTranscribingAsync( () => {    self.IsCTMuted = false;       }); // UnMute <- this will trigger the 'sessionStarted ' event again

After #B, all the offsets 'reset' (start from zero) so I have to cumulatively keep track of CTSessionStartTime. After each mute/unmute, the offset drifts by 1-2 seconds, relative to CTSessionStartTime .

So, all that is to say, can the API provide a UTC timestamp which is definitive start time, from which each offset is calculated?

I know this would be the UTC time according to Microsoft's server, but we can work out the time difference between the browser and Microsoft's server if we assume that the latency is half the round trip time., e.g.

1. let browserUtcNow1 = dayjs() 
2. let serverUtcNow  = transcriber.getTime() // returns UTC NOW at Microsoft server
3. let browserUtcNow2 = dayjs() 
4. let browserUtcNow = browserUtcNow1  + (browserUtcNow2  - browserUtcNow1 ) / 2 ; // this is the time at the browser, at the (nearly same) moment in time as serverUtcNow   
5. let browserOffsetMs = serverUtcNow    - browserUtcNow

So, knowing the browserOffsetMs , and given a start Timestamp in each result, we can work out what e.result.offset actually means in actual time.

Unless there's an even easier solution?

BrianMouncer commented 2 years ago

I will open a work item to improve our documentation for our various "offset" properties. And I will do some experimenting with one of our samples to verify how this is actually working.

In the meantime, this is how I remember trying to calculate, wall clock times for recognition when I needed them some time ago. See if this will help you make progress.

The "offset" is from the start of the current audio stream. e.g. Using "session started", "audio started" could have a variable amount of drift because of startup delays for the connection and engine starting to read the beginning silence and skipping over it.

For a good "wall clock" time, try subscribing to the "SpeechStartDetected" event. This will have an offset that is the start of the first detectable speech in the stream. e.g. skipping over any beginning silence or non-speech noise.

From what I know, you can use offsets on later results to get an approximate "wall clock" for those event by adding the tick count "total time" to the wall clock time you recorded in SpeechStartDetected... If you stop recognition, and restart again, it will be a new stream, and will have a new "offsets" again with the first detected speech in the new stream.

I hope that helps,

josephsctan commented 2 years ago

Hi There @BrianMouncer

Thanks for the information.

That would probably work, if we capture the browser time when we receive the speechStartDetected event, and adjust by the ~~browserOffsetMs~~ browser-to-server latency

Unfortunately, that event seems to be missing in the Conversation Transcriber (CT). It does exist in the Speech Recognizer, though. Is there a equivalent for CT?

Thanks

josephsctan commented 2 years ago

Hi There,

Some updates:

The offset reported in the "speechStartDetected " is just the offset of the first snippet. But we can use it to work back to the start of the audio.

When using Conversation Transcriber, it is possible to do this:

                    transcriber.sessionStarted = function (s, e) {
                        // keep track of session start time (a session may start more than once 
                        // during the conversation because of mute/unmute)
                        CTSessionStartTime = dayjs();
                    }

                    transcriber.privRecognizer.speechStartDetected = function (s, e) {
                         // keep track of when speech detection started
                        let eventReceivedAtBrowser = dayjs();

                          // when did speech detection actually start?  A 'moment' ago, based on the network latency
                        let speechStartedUtc = eventReceivedAtBrowser.subtract(CTNetworkLatencyMs);

                        // start offset is the time between the start of the session 
                        // and the start of speech detection (approximately)
                        let speechStartOffsetMs = speechStartedUtc.diff(self.CTSessionStartTime);

                        // e.offset contains the offset (in ticks) from when the AudioStarted.
                        // therefore, AudioOffset time is 'speechStartOffset' - e.offset
                        CTAudioOffsetMs = speechStartOffsetMs - e.offset / 10000;

                    }

                    transcriber.transcribed = function (s, e) {
                               // adjust offset 
                               let adjOffsetMs = (e.result.offset/10000)  + CTAudioOffsetMs;

                    }

Using Adjusted Offset, it will be possible to work out the 'real' time an event happened.
If start/stop is used during the conversation, we also need to account for session drift. Here is an illustration:

BrianMouncer commented 2 years ago

@josephsctan

I'm talking with some team members about how to best do this, but there does not seem to be an easy way to do this. Internally, we do keep a timestamp of when the first events came back from the service, and we use that the calculate some "user perceived latency" metrics around how long the caller had to wait from starting the recognition, to receiving the first results back from the service. It might be possible to expose a timestamp like that to make what you are trying easier.

Before I add anything to our backlog of possible improvements, can you describe what the "use scenario" is for this? I want to make sure any change we might make would enable the entire scenario, rather than just this one property.

Thanks.

josephsctan commented 2 years ago

Hi There,

Thanks. I'm basically starting a recording and piping that audio stream through to Conversation transcriber using a push stream. My use case was to be able to relate the offset recorded in the result to the start time of the recording, which, for various unavoidable reasons, can be 2-3 seconds before the start of the session.

With the workaround that I used above, I am able to derive meaningful offsets - it doesn't drift, and its only off by about 300-500ms, which is acceptable.

Thanks

BrianMouncer commented 2 years ago

@chris-lindstrom

Do you have any other feedback from the work you did, either to help this customer or to take back to improve our API surface?

Thanks,

pankopon commented 2 years ago

Changed status to accepted, as we have created a work item to expose the timestamp mentioned in https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/1410#issuecomment-1043345580. (No ETA for now.)

pankopon commented 2 years ago

Internal work item ref. 4076527.

dargilco commented 2 years ago

Closing this GitHub issue as it is now tracked as a feature request in our internal system.

Azure-Samples / cognitive-services-speech-sdk

Provide Server Timestamp so we can relate the Conversation Transcription Result 'offset' to a a real time #1410