deepgram / deepgram-python-sdk

Official Python SDK for Deepgram's automated speech recognition APIs.
https://developers.deepgram.com
MIT License

Implement Flush Feature #351

Closed dvonthenen closed 2 weeks ago

dvonthenen commented 3 months ago

Proposed changes

Context

Possible Implementation

Other information

saleshwaram commented 1 month ago

Hi @dvonthenen,

I've noticed that the recent changes involving the flush feature for speech-to-text are not reflected in the SDK. I'm currently using deepgram-sdk 3.2.7 and have not seen the expected functionality (Finalize). Could you please provide some guidance on this or a timeline for when these changes might be integrated into the SDK?

Thank you!

dvonthenen commented 1 month ago

hi @saleshwaram

It's in the queue.

You don't need to wait for this to be implemented in the SDK. You can use this right now. You can send the following message in the send() function:

{ "type": "Finalize" }
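As a minimal sketch of that workaround: the control message goes out as a JSON string on the same connection that carries the audio bytes. `FakeConnection` and `request_finalize` below are hypothetical names used only so the snippet runs without a network connection; with the SDK, the same string would be passed to the live connection's send().

```python
import json

# The control message from the comment above, serialized as JSON text.
FINALIZE_MESSAGE = json.dumps({"type": "Finalize"})


class FakeConnection:
    """Hypothetical stand-in for the SDK's live connection; records what is sent."""

    def __init__(self):
        self.sent = []

    def send(self, data):
        self.sent.append(data)
        return True


def request_finalize(connection):
    """Ask the server to flush by sending the Finalize control message."""
    return connection.send(FINALIZE_MESSAGE)


conn = FakeConnection()
conn.send(b"\x00\x01")   # audio chunks are sent as bytes
request_finalize(conn)   # the control message is sent as a JSON string
print(conn.sent[-1])
```

The point of keeping the message in one constant is that audio and control traffic share one send() path, so the only difference between them is bytes versus a JSON string.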

saleshwaram commented 1 month ago

Hi @dvonthenen,

There seems to be some confusion about how the "Finalize" message sent through send() is meant to work: my implementation does not receive the expected final transcription after sending it. Specifically, I am trying to handle an edge case where I never receive speech_final as true after I finish speaking. To handle this, I send a "Finalize" payload when no interim transcript has arrived for 2 seconds, expecting it to return a finalized transcript up to that point. Below are the relevant code snippets, the output I'm receiving, and the output I expect. Could you please clarify how the flush feature should work in this context? Is there an implementation detail that I am missing or need to adjust in my code?

Thank you for your help!

Here's my code:

deepgramstt.py

  import datetime
  import threading
  from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions
  from dotenv import load_dotenv
  import json
  load_dotenv()

  class DeepgramSTT:
      def __init__(self):
          self.full_transcription = ""
          self.final_transcription = ""
          self.other_text = ""
          self.transcript_ready = threading.Event()
          self.connection_status = False
          self.deepgram = DeepgramClient()
          self.connection = self.deepgram.listen.live.v("1")
          self.setup_events()
          self.timer = None

      def setup_events(self):
          self.connection.on(LiveTranscriptionEvents.Open, self.on_open)
          self.connection.on(LiveTranscriptionEvents.Close, self.on_close)
          self.connection.on(LiveTranscriptionEvents.Transcript, self.on_message)
          self.connection.on(LiveTranscriptionEvents.SpeechStarted, self.on_speech_started)
          self.connection.on(LiveTranscriptionEvents.Metadata, self.on_metadata)

      def on_open(self, *args, **kwargs):
          self.connection_status = True
          print("Connection opened")

      def on_speech_started(self, x, speech_started, **kwargs):
          print("Speech started")

      def on_metadata(self, x, metadata, **kwargs):
          print(f"\n\n{metadata}\n\n")

      def on_close(self, *args, **kwargs):
          self.connection_status = False
          print("Connection closed")

      def on_message(self, x, result, **kwargs):
          sentence = result.channel.alternatives[0].transcript
          # print(f"{datetime.datetime.now()}: {result.is_final}: {result.speech_final}  {sentence}")
          # Reset the timer whenever a new sentence is received
          if len(sentence) == 0:
              return
          if result.is_final and result.speech_final:
              self.final_transcription = self.full_transcription + sentence
              if self.final_transcription!="":
                  self.transcript_ready.set()
                  return
              else:
                  print("final")
                  return
          elif result.is_final and not result.speech_final:
              self.reset_timer() 
              self.full_transcription += sentence + " "
              return
          else:
              self.reset_timer() 
              self.other_text = sentence
              print("Interim sentence: ", sentence)

      def reset_timer(self):
          if self.timer and self.other_text!="":
              self.timer.cancel()
          self.timer = threading.Timer(2.0, self.send_finalize)
          self.timer.start()

      def send_finalize(self):
          self.connection.send(json.dumps({"type": "Finalize"}))
          print("Finalize sent due to 2 seconds of silence")

      def start_connection(self):
          options = LiveOptions(
              model="nova-2",
              language="en-US",
              punctuate=True,
              encoding="linear16",
              channels=1,
              sample_rate=16000,
              vad_events=True,
              endpointing=300,
              interim_results=True,
              utterance_end_ms="1000",
          )
          if not self.connection.start(options):
              print("Failed to start connection")
              return False
          return True

      def send_audio_data(self, data):
          self.connection.send(data)

      def finish(self):
          if self.timer:
              self.timer.cancel()
          self.connection.finish()
          print("Finished")
          self.print_final_transcript()

      def print_final_transcript(self):
          print("Complete final transcript:")
          print(self.full_transcription)

      def is_connection_active(self):
          return self.connection_status

test.py

  from deepgramstt import DeepgramSTT
  from datetime import datetime
  import threading
  import pyaudio

  def main():
      # Audio stream configuration
      FORMAT = pyaudio.paInt16
      CHANNELS = 1
      SAMPLE_RATE = 16000
      FRAMES_PER_BUFFER = 3200

      # Initialize PyAudio
      p = pyaudio.PyAudio()
      try:
          stream = p.open(format=FORMAT, channels=CHANNELS, rate=SAMPLE_RATE, input=True, frames_per_buffer=FRAMES_PER_BUFFER)
      except IOError as e:
          print(f"Could not open audio stream: {e}")
          p.terminate()
          return

      # Initialize DeepgramSTT
      dg_connection = DeepgramSTT()
      if not dg_connection.start_connection():
          print("Failed to start Deepgram connection")
          stream.stop_stream()
          stream.close()
          p.terminate()
          return

      print("Connection started. Begin speaking now.")

      # Start the audio stream thread immediately
      exit_flag = False

      def audio_stream_thread():

          try:
              while not exit_flag and dg_connection.is_connection_active():
                  try:
                      data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
                  except IOError as e:
                      print(f"Error reading audio data: {e}")
                      break  # Exit the loop if we can't read the data
                  dg_connection.send_audio_data(data)

                  if dg_connection.transcript_ready.is_set():  # Non-blocking check for the event
                      print(f"final: {dg_connection.final_transcription}\ttime: {datetime.utcnow().isoformat(timespec='milliseconds') + 'Z'}")
                      dg_connection.final_transcription = ""
                      dg_connection.transcript_ready.clear()  # Reset the event
          except Exception as e:
              print(f"Unexpected error: {e}")
          finally:
              stream.stop_stream()
              stream.close()
              p.terminate()
              dg_connection.finish()

      audio_thread = threading.Thread(target=audio_stream_thread)
      audio_thread.start()

      input("Press Enter to stop recording...\n")

      exit_flag = True
      audio_thread.join()

      print("Finished recording and processing.")

  if __name__ == "__main__":
      main()

Output:

Received behaviour:


$ python -m test
Connection opened
Connection started. Begin speaking now.
Press Enter to stop recording...
Speech started
Interim sentence:  Early one morning,
Interim sentence:  Early one morning, while the sun was
Interim sentence:  Early one morning, while the sun was just
Interim sentence:  Early one morning, while the sun was just starting to rise, a
Interim sentence:  Early one morning, while the sun was just starting to rise, a young and energetic dog
Speech started
Interim sentence:  excitedly ran around
Interim sentence:  excitedly ran around the park. Juncker gave a
Interim sentence:  excitedly ran around the park, jumping over small bushes, and chasing
Interim sentence:  excitedly ran around the park, jumping over small bushes and chasing after brightly colored
Speech started
Interim sentence:  a group of children
Interim sentence:  a group of children laughed and played nearby
Interim sentence:  a group of children laughed and played nearby, enjoying the
Interim sentence:  a group of children laughed and played nearby, enjoying the warm weather and the free
Speech started
Interim sentence:  before school started.
final: Early one morning, while the sun was just starting to rise, a young and energetic dog excitedly ran around the park, jumping over small bushes and chasing after brightly colored butterflies a group of children laughed and played nearby, enjoying the warm weather and the freedom of being outside before school started.  time: 2024-05-22T11:44:15.713Z
Finalize sent due to 2 seconds of silence
Speech started

Connection closed

{
    "type": "Metadata",
    "transaction_key": "deprecated",
    "request_id": "457e4f1b-e9a7-4e99-a704-d2f0f045d00a",
    "sha256": "f91f59bcb63d46d4ea6e3a9b647d65e940d83373d9f929f71ff32940342c578e",
    "created": "2024-05-22T11:43:53.803Z",
    "duration": 23.6,
    "channels": 1,
    "models": [
        "1dbdfb4d-85b2-4659-9831-16b3c76229aa"
    ],
    "model_info": {
        "1dbdfb4d-85b2-4659-9831-16b3c76229aa": {
            "name": "2-general-nova",
            "version": "2024-01-11.36317",
            "arch": "nova-2"
        }
    }
}

Finished

Another output:


$ python -m test
Connection opened
Connection started. Begin speaking now.
Press Enter to stop recording...
Speech started
Interim sentence:  Early one more
Interim sentence:  Early one morning, while the sun was just
Interim sentence:  Early one morning, while the sun was just starting to rise, a
Interim sentence:  Early one morning, while the sun was just starting to rise, a young and energetic
Speech started
Interim sentence:  a young and energetic dog excited
Interim sentence:  a young and energetic dog excitedly ran around the path
Interim sentence:  a young and energetic dog excitedly ran around the park jumping over small
Speech started
Interim sentence:  and chasing after prey
Interim sentence:  and chasing up brightly colored butterflies.
Interim sentence:  and chasing up brightly colored butterflies as a group of children
Interim sentence:  and chasing after brightly colored butterflies as a group of children laughed and played near
Speech started
Interim sentence:  and played nearby, enjoying the
Interim sentence:  and played nearby, enjoying the warm weather and the free
Interim sentence:  and played nearby, enjoying the warm weather and the freedom of being outside
Interim sentence:  and played nearby, enjoying the warm weather and the freedom of being outside before school started.
Speech started
Finalize sent due to 2 seconds of silence

Connection closed

{
    "type": "Metadata",
    "transaction_key": "deprecated",
    "request_id": "d3380ca8-175b-470f-b514-84f4199b5baa",
    "sha256": "798f63a6df80a3ae1bee4548708d5ea0190e5508e4d357debe807402cf944e31",
    "created": "2024-05-22T11:43:07.732Z",
    "duration": 25.4,
    "channels": 1,
    "models": [
        "1dbdfb4d-85b2-4659-9831-16b3c76229aa"
    ],
    "model_info": {
        "1dbdfb4d-85b2-4659-9831-16b3c76229aa": {
            "name": "2-general-nova",
            "version": "2024-01-11.36317",
            "arch": "nova-2"
        }
    }
}

Finished

Expected output:


$ python -m test
Connection opened
Connection started. Begin speaking now.
Press Enter to stop recording...
Speech started
Interim sentence:  Early one more
Interim sentence:  Early one morning, while the sun was just
Interim sentence:  Early one morning, while the sun was just starting to rise, a
Interim sentence:  Early one morning, while the sun was just starting to rise, a young and energetic
Speech started
Interim sentence:  a young and energetic dog excited
Interim sentence:  a young and energetic dog excitedly ran around the path
Interim sentence:  a young and energetic dog excitedly ran around the park jumping over small
Speech started
Interim sentence:  and chasing after prey
Interim sentence:  and chasing up brightly colored butterflies.
Interim sentence:  and chasing up brightly colored butterflies as a group of children
Interim sentence:  and chasing after brightly colored butterflies as a group of children laughed and played near
Speech started
Interim sentence:  and played nearby, enjoying the
Interim sentence:  and played nearby, enjoying the warm weather and the free
Interim sentence:  and played nearby, enjoying the warm weather and the freedom of being outside
Interim sentence:  and played nearby, enjoying the warm weather and the freedom of being outside before school started.
Speech started
Finalize sent due to 2 seconds of silence
final: Early one morning, while the sun was just starting to rise, a young and energetic dog excitedly ran around the park, jumping over small bushes and chasing after brightly colored butterflies a group of children laughed and played nearby, enjoying the warm weather and the freedom of being outside before school started.  time: 2024-05-22T11:44:15.713Z

Connection closed

{
    "type": "Metadata",
    "transaction_key": "deprecated",
    "request_id": "d3380ca8-175b-470f-b514-84f4199b5baa",
    "sha256": "798f63a6df80a3ae1bee4548708d5ea0190e5508e4d357debe807402cf944e31",
    "created": "2024-05-22T11:43:07.732Z",
    "duration": 25.4,
    "channels": 1,
    "models": [
        "1dbdfb4d-85b2-4659-9831-16b3c76229aa"
    ],
    "model_info": {
        "1dbdfb4d-85b2-4659-9831-16b3c76229aa": {
            "name": "2-general-nova",
            "version": "2024-01-11.36317",
            "arch": "nova-2"
        }
    }
}

Finished

dvonthenen commented 1 month ago

If I understand the output correctly, I wouldn't expect anything to happen in the first example, since the final: transcript arrived just before the flush was sent.

The second doesn't seem right, but I haven't experimented with the feature much. There are people using this in production, so it seems like there might be an issue in your code.
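One thing that may be worth ruling out is the timer handling. In the posted reset_timer, the previous timer is cancelled only when other_text is non-empty, so more than one timer can be pending at once. This is not confirmed to be the cause of the missing transcript. A minimal sketch of a silence timer that always cancels the previous one follows; `SilenceTimer` and `on_silence` are hypothetical names, not part of the SDK.

```python
import threading


class SilenceTimer:
    """Calls on_silence once if reset() is not called again within timeout_s."""

    def __init__(self, timeout_s, on_silence):
        self._timeout_s = timeout_s
        self._on_silence = on_silence
        self._timer = None
        self._lock = threading.Lock()

    def reset(self):
        """Cancel any pending timer unconditionally, then start a fresh one."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._timeout_s, self._on_silence)
            self._timer.daemon = True
            self._timer.start()

    def cancel(self):
        """Stop the pending timer, for example when the connection is closing."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None


# Demo: three quick resets should lead to exactly one callback.
fired = threading.Event()
calls = []


def on_silence():
    # In the real client this is where the Finalize message would be sent.
    calls.append("Finalize")
    fired.set()


timer = SilenceTimer(0.05, on_silence)
for _ in range(3):
    timer.reset()
fired.wait(timeout=1.0)
print(calls)
```

In the posted class, reset() would be called from on_message for every non-empty transcript and cancel() from finish().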

saleshwaram commented 1 month ago

In the first example, the 'final' transcript is received and then 'finalize' is sent. Since the final transcript has already been received, I am not expecting anything further.

In the second example, however, I tried a few times and never received any final response. If you could provide a working sample, I could test it on my side, because I don't see any issue in my code.

dvonthenen commented 2 weeks ago

Apparently, I partially implemented this: https://github.com/deepgram/deepgram-python-sdk/pull/396

Going to reproduce what I did in the Go SDK now: https://github.com/deepgram/deepgram-go-sdk/pull/237

dvonthenen commented 2 weeks ago

This is available in the latest release: https://github.com/deepgram/deepgram-python-sdk/releases/tag/v3.3.0
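For code that has to run against both older and newer SDK versions, a sketch along these lines may help. It assumes, without confirmation from this thread, that the newer release exposes a finalize() method on the live connection; check the release notes for the actual name. If no such method is found, it falls back to the raw { "type": "Finalize" } message shown earlier. `flush`, `OldConnection` and `NewConnection` are hypothetical names used only to make the example runnable.

```python
import json


def flush(connection):
    """Request a flush, preferring a native finalize() when one exists."""
    finalize = getattr(connection, "finalize", None)  # assumed method name
    if callable(finalize):
        return finalize()
    # Fallback: the raw control message from earlier in this thread.
    return connection.send(json.dumps({"type": "Finalize"}))


class OldConnection:
    """Hypothetical stand-in for a connection that only has send()."""

    def __init__(self):
        self.sent = []

    def send(self, data):
        self.sent.append(data)
        return True


class NewConnection(OldConnection):
    """Hypothetical stand-in for a connection with an assumed finalize()."""

    def __init__(self):
        super().__init__()
        self.finalize_calls = 0

    def finalize(self):
        self.finalize_calls += 1
        return True


old, new = OldConnection(), NewConnection()
flush(old)   # no finalize(): the JSON message is sent
flush(new)   # finalize() present: it is called instead
print(old.sent, new.finalize_calls)
```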