deepgram / deepgram-js-sdk

Official JavaScript SDK for Deepgram's automated speech recognition APIs.
https://developers.deepgram.com
MIT License
Deepgram API does not return data upon disconnecting and reconnecting Websocket #301

Closed tieje3 closed 1 week ago

tieje3 commented 2 weeks ago

What is the current behavior?

Deepgram API does not return data upon disconnecting and reconnecting Websocket.

Disconnect definition

const disconnectFromDeepgram = () => {
    if (connection) {
        connection.finish();
        setConnection(null);
    }
};

First Connection

First Connection is successful. Deepgram sends transcription data. The connection is closed properly.

first_connection.txt

Response Headers

HTTP/1.1 101 Switching Protocols
connection: upgrade
upgrade: websocket
sec-websocket-accept: 4+QamH27qm2OIALhefVT5Uo5yss=
access-control-allow-origin: http://localhost:60330
access-control-allow-credentials: true
vary: origin
vary: access-control-request-method
vary: access-control-request-headers
access-control-expose-headers: dg-model-name,dg-model-uuid,dg-char-count,dg-request-id,dg-error
sec-websocket-protocol: token
dg-request-id: 0caee309-30e3-4bf1-9bcc-478437df5c9d
date: Tue, 11 Jun 2024 23:52:34 GMT

Request Headers

GET /v1/listen?language=en-US&model=nova-2-medical&vad_events=true&smart_format=true&interim_results=true&utterance_end_ms=2000 HTTP/1.1
Host: api.deepgram.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:127.0) Gecko/20100101 Firefox/127.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br, zstd
Sec-WebSocket-Version: 13
Origin: http://localhost:60330
Sec-WebSocket-Protocol: token, efc6[same token as second connection]
Sec-WebSocket-Extensions: permessage-deflate
Sec-WebSocket-Key: 9ipjJXwMJD2BpBtqcEG45A==
Connection: keep-alive, Upgrade
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: websocket
Sec-Fetch-Site: cross-site
Pragma: no-cache
Cache-Control: no-cache
Upgrade: websocket

Second Connection

The second connection is established successfully, but Deepgram does not send any transcription data. Deepgram then times out the WebSocket connection, reporting that no data was sent, even though audio was being sent.

second_connection.txt

Response Headers

HTTP/1.1 101 Switching Protocols
connection: upgrade
upgrade: websocket
sec-websocket-accept: h+nFfjAKiy/zx698pSwXCu765+Q=
access-control-allow-origin: http://localhost:60330
access-control-allow-credentials: true
vary: origin
vary: access-control-request-method
vary: access-control-request-headers
access-control-expose-headers: dg-model-name,dg-model-uuid,dg-char-count,dg-request-id,dg-error
sec-websocket-protocol: token
dg-request-id: 9f153fcd-bdfd-44c7-badc-9edd06c6ef66
date: Tue, 11 Jun 2024 23:52:37 GMT

Request Headers

GET /v1/listen?language=es&model=nova-2&vad_events=true&smart_format=true&interim_results=true&utterance_end_ms=2000 HTTP/1.1
Host: api.deepgram.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:127.0) Gecko/20100101 Firefox/127.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br, zstd
Sec-WebSocket-Version: 13
Origin: http://localhost:60330
Sec-WebSocket-Protocol: token, efc6[same token as first connection]
Sec-WebSocket-Extensions: permessage-deflate
Sec-WebSocket-Key: KxAN8XlgnOQ0QEzjXWJZjg==
Connection: keep-alive, Upgrade
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: websocket
Sec-Fetch-Site: cross-site
Pragma: no-cache
Cache-Control: no-cache
Upgrade: websocket

Steps to reproduce

Use useEffect() from React to disconnect from and then reconnect to the Deepgram API.

    useEffect(() => {
        if ((microphoneState === MicrophoneState.Ready // initial setup
            || microphoneState === MicrophoneState.Paused) // language change
            && connectionState === LiveConnectionState.CLOSED) {
            console.log('connect to deepgram');
            const newLiveSchema = LiveSchemaBuilder(DEEPGRAM_DEFAULTS, languageSelected);
            connectToDeepgram(newLiveSchema);

            return;
        }

        // Language change in progress: close the open connection; the first branch reconnects once connectionState is CLOSED

        if (microphoneState === MicrophoneState.Paused
            && connectionState === LiveConnectionState.OPEN
            && changingLanguage.current
        ) {
            changingLanguage.current = false;
            disconnectFromDeepgram();

            return;
        }

        // Connection handling: attach listeners only when both the connection and the microphone exist

        if (!connection) return;
        if (!microphone) return;

        const onTranscript = (data: LiveTranscriptionEvent) => {
            const { is_final: isFinal } = data;
            const thisTranscription = data.channel.alternatives[0].transcript;

            if (isFinal && thisTranscription !== '') {
                const spacer = transcriptionRef.current === '' ? '' : ' ';
                const newTranscription = `${transcriptionRef.current}${spacer}${thisTranscription}`;
                transcriptionRef.current = newTranscription;
                setCommunication(newTranscription);
            }
        };

        const onUtteranceEnd = () => {
            onTranscriptionComplete(transcriptionRef.current);
        };

        // TODO: Add if connectionState is Error then...

        if ((microphoneState === MicrophoneState.Ready
            || microphoneState === MicrophoneState.Paused)
            && connectionState === LiveConnectionState.OPEN) {
            connection.addListener(LiveTranscriptionEvents.Transcript, onTranscript);
            connection.addListener(LiveTranscriptionEvents.UtteranceEnd, onUtteranceEnd);
            microphone.addEventListener(MicrophoneEvents.DataAvailable, onDataSendToDeepgram);

            startMicrophone(); // does not change micState

            return () => {
                changingLanguage.current = true;
                connection.removeListener(LiveTranscriptionEvents.Transcript, onTranscript);
                connection.removeListener(LiveTranscriptionEvents.UtteranceEnd, onUtteranceEnd);
                microphone.removeEventListener(MicrophoneEvents.DataAvailable, onDataSendToDeepgram);
                stopMicrophone();
                // invalidate temporary token
                // function to invalidate temporary token
            };
        }
    // eslint-disable-next-line react-hooks/exhaustive-deps
    }, [microphoneState, connectionState, languageSelected]);

Expected behavior

After a successful reconnection, the Deepgram API should return transcription results.

lukeocodes commented 2 weeks ago

This is not an SDK issue, but an idiosyncrasy with the API. When you send an audio stream, the start (first 4 bytes) of the audio stream contains the header information that helps Deepgram discover the audio type.

You can tackle this in a few ways:

  1. Keep those 4 bytes in your app and, when you reconnect, insert them at the start of the first packet you send to the API.
  2. (Probably easier) Configure the Deepgram client with the encoding, etc.
  3. (Easiest, but use-case specific) Stop your audio stream and start it again after the connection is live.
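
A minimal sketch of option 1, assuming the audio chunks are available as `Uint8Array`s. `HeaderReplayer` and its methods are hypothetical names, not part of the Deepgram SDK, and the default header length of 4 simply mirrors the figure quoted above; a real container format may need more bytes replayed.

```typescript
// Hypothetical helper (not part of the Deepgram SDK): remembers the start of
// the audio stream and prepends it to the first chunk sent after a reconnect.
class HeaderReplayer {
  private header: Uint8Array | null = null;
  private pendingReplay = false;
  private readonly headerLength: number;

  // headerLength = 4 is an assumption taken from the comment above.
  constructor(headerLength: number = 4) {
    this.headerLength = headerLength;
  }

  /** Call once the new connection is open, before sending more audio. */
  markReconnected(): void {
    // Only replay if a header has actually been captured.
    this.pendingReplay = this.header !== null;
  }

  /** Returns the bytes that should go over the socket for this chunk. */
  prepare(chunk: Uint8Array): Uint8Array {
    if (this.header === null) {
      // Very first chunk of the stream: capture its leading bytes.
      this.header = chunk.slice(0, this.headerLength);
      return chunk;
    }
    if (!this.pendingReplay) {
      return chunk;
    }
    // First chunk after a reconnect: header bytes first, then the chunk.
    this.pendingReplay = false;
    const out = new Uint8Array(this.header.length + chunk.length);
    out.set(this.header, 0);
    out.set(chunk, this.header.length);
    return out;
  }
}
```

In the handler that forwards microphone data, the idea would be to send `replayer.prepare(bytes)` instead of the raw chunk, and to call `replayer.markReconnected()` once the new connection reports that it is open.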

tieje3 commented 2 weeks ago

Thank you, I've got it working now. Instead of pausing the mic, I should have stopped it and set it up again. Specifying encoding did not work, however.
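
The approach that worked here (stop the recorder, then start a new one once the connection is open) might look roughly like the sketch below. `Recorder` is a hypothetical minimal interface that a browser `MediaRecorder` satisfies through its `start(timeslice)` and `stop()` methods; `RecorderSession` and the 250 ms timeslice are illustrative assumptions, not taken from the SDK.

```typescript
// Hypothetical minimal recorder shape; a browser MediaRecorder fits it.
interface Recorder {
  start(timesliceMs?: number): void;
  stop(): void;
}

// Hypothetical helper: guarantees that every connection gets a recorder that
// was started after the connection opened, instead of a paused/resumed one.
class RecorderSession {
  private recorder: Recorder | null = null;
  private readonly createRecorder: () => Recorder;

  constructor(createRecorder: () => Recorder) {
    this.createRecorder = createRecorder;
  }

  /** Stop the current recorder, e.g. right before closing the connection. */
  stop(): void {
    if (this.recorder !== null) {
      this.recorder.stop();
      this.recorder = null;
    }
  }

  /**
   * Call from the connection's "open" handler. A newly started recorder is
   * expected to begin a new stream, so its first chunk carries the header.
   */
  startFresh(timesliceMs: number = 250): void {
    this.stop();
    this.recorder = this.createRecorder();
    this.recorder.start(timesliceMs);
  }

  isRecording(): boolean {
    return this.recorder !== null;
  }
}
```

With this shape, a language change would call `stop()` before disconnecting and `startFresh()` only after the new connection is open, so no mid-stream chunks are sent to a connection that never saw the start of the stream.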