deepgram / deepgram-dotnet-sdk

.NET SDK for Deepgram's automated speech recognition APIs.
https://developers.deepgram.com
MIT License

CloseStream Option to Queue Message #341

Closed sgodin closed 3 weeks ago

sgodin commented 1 month ago

Proposed changes

Add an option to the current SendClose() API to queue the request behind any audio packets still waiting to be sent.

Context

We are doing call transcriptions using the ListenWebSocketClient. We need a reliable way to know when all transcriptions are complete so that we can perform post-transcription analysis (e.g., Deepgram TextAnalysis) on the transcribed text. Initially I was using the Finalize request and waiting for results where FromFinalize=True; however, the documentation states there are cases where FromFinalize=True may not be returned (https://developers.deepgram.com/docs/finalize). After speaking with a Deepgram integration engineer, they recommended using the CloseStream request instead: all pending audio is processed, final results are sent, and we then wait for the websocket to be closed to know transcription is complete.

I have done this and it seems to be working well. However, since the SendClose API uses SendMessageImmediately, I have seen cases where I don't get transcription text for the last few seconds of audio. On the theory that there was still unsent audio queued up in the client, I implemented a new version of SendClose myself that uses SendMessage instead, so the request is queued behind all of the audio, and this appeared to fix the issue.

Possible Implementation

Since I could see that one might want SendMessageImmediately behaviour, and in my case I want the queued SendMessage behaviour, it might be best to add an option to the SendClose API to send immediately or send queued. Probably also makes sense to add this option to the SendFinalize API as well.
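To make the ordering concern concrete, here is a minimal, self-contained sketch. It does not use the SDK: ToySender, its Wire list, and the `queue` parameter on SendClose are hypothetical names invented for illustration, and the real client's internals may differ. It only models what happens when a close request bypasses a FIFO of pending audio versus being queued behind it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a client with a queued and an immediate send path.
// This is NOT the SDK's implementation, only a model of the ordering problem.
public class ToySender
{
    private readonly Queue<string> _pending = new Queue<string>();

    // Records the order in which messages would reach the server.
    public List<string> Wire { get; } = new List<string>();

    // Queued path: the message waits behind everything already enqueued.
    public void SendMessage(string message) => _pending.Enqueue(message);

    // Immediate path: the message goes out now, ahead of anything still queued.
    public void SendMessageImmediately(string message) => Wire.Add(message);

    // Hypothetical option from this proposal: choose queued or immediate close.
    public void SendClose(bool queue = false)
    {
        if (queue) SendMessage("CloseStream");
        else SendMessageImmediately("CloseStream");
    }

    // Drains the queue in FIFO order, as a background send loop would.
    public void Flush()
    {
        while (_pending.Count > 0) Wire.Add(_pending.Dequeue());
    }
}

public static class Demo
{
    public static List<string> Run(bool queueClose)
    {
        var sender = new ToySender();
        sender.SendMessage("audio-1");
        sender.SendMessage("audio-2");
        sender.SendClose(queue: queueClose);
        sender.Flush();
        return sender.Wire;
    }

    public static void Main()
    {
        // Immediate close overtakes the audio still waiting in the queue.
        Console.WriteLine(string.Join(", ", Run(queueClose: false)));
        // Queued close arrives only after all pending audio.
        Console.WriteLine(string.Join(", ", Run(queueClose: true)));
    }
}
```

In this model the immediate close reaches the wire before the two audio packets, while the queued close arrives last, which matches the behaviour described above.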

Somewhat related question

What is the purpose of the nullbyte optional bool argument on the current SendClose method?

davidvonthenen commented 1 month ago

Hi @sgodin

I will take a look at this. I noticed this was missing in the last release.

In the other SDK, we send the CloseStream message on behalf of the user for convenience, and we introduce a small delay for any final messages to arrive on the client side. This usually works for 99% of customers.

For the 1% of customers for whom this does not work (latency, etc.), we recommend sending the message manually with the send function and then waiting for an amount of time determined by your needs. In the meantime, you can do this today by calling this function: https://github.com/deepgram/deepgram-dotnet-sdk/blob/main/Deepgram/Clients/Listen/v1/WebSocket/Client.cs#L328

sgodin commented 1 month ago

Thanks for the quick response David!

I don't see the point in waiting some amount of time and only catching 99% of cases, when all that is needed is to queue the request behind the audio waiting to go out. :) Note: I have already implemented CloseStream myself, so I'm in no rush - it was just a suggestion. Here's what I did for reference....

// This message will send a shutdown command to the server instructing it to finish processing any cached data,
// send the response to the client, send a summary metadata object, and then terminate the WebSocket connection.
// activeRecognition.SpeechClient.SendClose();  // Sends immediately - we want to queue it, so implementing manually below
byte[] data = Encoding.ASCII.GetBytes("{\"type\": \"CloseStream\"}");
speechClient.SendMessage(data);  // queued to ensure all audio gets there first

On a related note (I'm hoping you can shed some light on): What is the purpose of the nullbyte optional bool argument on the current SendClose method?

/// <summary>
/// Sends a Close message to Deepgram
/// </summary>
public void SendClose(bool nullByte = false);

Thanks for your time, Scott

sgodin commented 1 month ago

FYI - I see you send the CloseStream in the Stop API. However, since you are shutting down the websocket (cancelling it), the application cannot receive any final transcription responses. Remember, the whole problem I was trying to solve was knowing when all the results were in and it was OK to close the websocket, without some sort of odd delay logic. :) Cheers :)

davidvonthenen commented 1 month ago

> I see you send the CloseStream in the Stop API - however, since you are shutting down the websocket (cancelling it), then the application cannot receive any final transcription responses.

This is the part that needs to be fixed. There needs to be a slight delay before cancelling/exiting.

So the options are:

  1. Call SendClose(), then wait as long as you need to collect the messages coming from DG. Once you have collected what you wanted OR your chosen timeout has elapsed, call Stop(), which just cleans things up (it doesn't actually close the connection, because it should already be closed at this point), OR
  2. Wait for the fix and just call Stop(), which will handle this for you by introducing a small delay.
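
Option 1 could be sketched roughly as follows. This is only an illustration under stated assumptions: IStreamingClient, its Closed event, and CloseAndWaitAsync are hypothetical names, not the SDK's actual surface, so the real client would need to be adapted to whatever close notification it exposes.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical minimal surface; the real client's members may differ.
public interface IStreamingClient
{
    event EventHandler Closed;   // assumed: raised once the server closes the socket
    void SendClose();
    void Stop();
}

public static class Shutdown
{
    // Sends the close request, waits for the close notification or the timeout
    // (whichever comes first), then calls Stop() to clean up.
    // Returns true if the close was observed before the timeout elapsed.
    public static async Task<bool> CloseAndWaitAsync(IStreamingClient client, TimeSpan timeout)
    {
        var closed = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        EventHandler handler = (sender, args) => closed.TrySetResult(true);

        // Subscribe before sending so a fast close is not missed.
        client.Closed += handler;
        try
        {
            client.SendClose();
            var finished = await Task.WhenAny(closed.Task, Task.Delay(timeout));
            return finished == closed.Task;
        }
        finally
        {
            client.Closed -= handler;
            client.Stop();
        }
    }
}
```

A caller would await something like `Shutdown.CloseAndWaitAsync(client, TimeSpan.FromSeconds(5))` and treat a false result as "timed out before the close was seen"; the five-second value is arbitrary.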

> What is the purpose of the null byte optional bool argument on the current SendClose method?

The null byte is just another method for stopping or canceling the websocket. It is a deepgram-specific method that signals "I'm done" at the API level.

davidvonthenen commented 3 weeks ago

Merged. We will have a release after addressing another issue.