deepgram / deepgram-rust-sdk

Rust SDK for Deepgram's automated speech recognition APIs.
https://developers.deepgram.com
MIT License
39 stars 23 forks

Why is the `simple_stream` example so slow? Over 31 seconds to transcribe the wav, which is slower than typing #99

Open Boscop opened 1 month ago

Boscop commented 1 month ago

Hi, thanks for making this crate 🙂

I'm trying to figure out why websocket transcription (e.g. via the simple_stream example) is slow for me. I already commented out these lines, hoping that would speed it up:

//        .endpointing(Endpointing::CustomDurationMs(300))
//        .interim_results(true)
//        .utterance_end_ms(1000)
//        .vad_events(true)

But it still takes over 31 seconds to transcribe this wav audio, which is MUCH slower than it would be to type what is said in that wav!

$ time cargo run --example simple_stream
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.17s
     Running `target\debug\examples\simple_stream.exe`
Deepgram Request ID: [...]
got: Ok(TranscriptResponse { type_field: "Results", start: 0.0, duration: 2.24, is_final: true, speech_final: true, from_finalize: false, channel: Channel { alternatives: [Alternatives { transcript: "", words: [], confidence: 0.0 }] }, metadata: Metadata { request_id: "e438b919-52a7-4e6e-a53c-b5e06f6a97ac", model_info: ModelInfo { name: "general", version: "2024-01-26.8851", arch: "base" }, model_uuid: "1ed36bac-f71c-4f3f-a31f-02fd6525c489" }, channel_index: [0, 1] })
got: Ok(TranscriptResponse { type_field: "Results", start: 2.24, duration: 1.53, is_final: true, speech_final: true, from_finalize: false, channel: Channel { alternatives: [Alternatives { transcript: "", words: [], confidence: 0.0 }] }, metadata: Metadata { request_id: "e438b919-52a7-4e6e-a53c-b5e06f6a97ac", model_info: ModelInfo { name: "general", version: "2024-01-26.8851", arch: "base" }, model_uuid: "1ed36bac-f71c-4f3f-a31f-02fd6525c489" }, channel_index: [0, 1] })
got: Ok(TranscriptResponse { type_field: "Results", start: 3.77, duration: 2.46, is_final: true, speech_final: true, from_finalize: false, channel: Channel { alternatives: [Alternatives { transcript: "Yep.", words: [Word { word: "yep", start: 5.6272583, end: 5.864355, confidence: 0.99365234, speaker: None, punctuated_word: Some("Yep.") }], confidence: 0.99365234 }] }, metadata: Metadata { request_id: "e438b919-52a7-4e6e-a53c-b5e06f6a97ac", model_info: ModelInfo { name: "general", version: "2024-01-26.8851", arch: "base" }, model_uuid: "1ed36bac-f71c-4f3f-a31f-02fd6525c489" }, channel_index: [0, 1] })
got: Ok(TranscriptResponse { type_field: "Results", start: 6.23, duration: 0.84000015, is_final: true, speech_final: true, from_finalize: false, channel: Channel { alternatives: [Alternatives { transcript: "", words: [], confidence: 0.0 }] }, metadata: Metadata { request_id: "e438b919-52a7-4e6e-a53c-b5e06f6a97ac", model_info: ModelInfo { name: "general", version: "2024-01-26.8851", arch: "base" }, model_uuid: "1ed36bac-f71c-4f3f-a31f-02fd6525c489" }, channel_index: [0, 1] })
got: Ok(TranscriptResponse { type_field: "Results", start: 7.07, duration: 2.0899997, is_final: true, speech_final: true, from_finalize: false, channel: Channel { alternatives: [Alternatives { transcript: "I said it before and I'll say it again.", words: [Word { word: "i", start: 7.27, end: 7.4300003, confidence: 0.9091797, speaker: None, punctuated_word: Some("I") }, Word { word: "said", start: 7.4300003, end: 7.59, confidence: 0.85791016, speaker: None, punctuated_word: Some("said") }, Word { word: "it", start: 7.59, end: 7.83, confidence: 0.9980469, speaker: None, punctuated_word: Some("it") }, Word { word: "before", start: 7.83, end: 8.07, confidence: 0.9970703, speaker: None, punctuated_word: Some("before") }, Word { word: "and", start: 8.07, end: 8.15, confidence: 0.9980469, speaker: None, punctuated_word: Some("and") }, Word { word: "i'll", start: 8.2300005, end: 8.39, confidence: 0.9897461, speaker: None, punctuated_word: Some("I'll") }, Word { word: "say", start: 8.39, end: 8.47, confidence: 0.99853516, speaker: None, punctuated_word: Some("say") }, Word { word: "it", start: 8.47, end: 8.71, confidence: 0.9980469, speaker: None, punctuated_word: Some("it") }, Word { word: "again", start: 8.71, end: 8.87, confidence: 0.9995117, speaker: None, punctuated_word: Some("again.") }], confidence: 0.9980469 }] }, metadata: Metadata { request_id: "e438b919-52a7-4e6e-a53c-b5e06f6a97ac", model_info: ModelInfo { name: "general", version: "2024-01-26.8851", arch: "base" }, model_uuid: "1ed36bac-f71c-4f3f-a31f-02fd6525c489" }, channel_index: [0, 1] })
got: Ok(TranscriptResponse { type_field: "Results", start: 9.16, duration: 0.89000034, is_final: true, speech_final: true, from_finalize: false, channel: Channel { alternatives: [Alternatives { transcript: "", words: [], confidence: 0.0 }] }, metadata: Metadata { request_id: "e438b919-52a7-4e6e-a53c-b5e06f6a97ac", model_info: ModelInfo { name: "general", version: "2024-01-26.8851", arch: "base" }, model_uuid: "1ed36bac-f71c-4f3f-a31f-02fd6525c489" }, channel_index: [0, 1] })
got: Ok(TranscriptResponse { type_field: "Results", start: 10.05, duration: 1.6599998, is_final: true, speech_final: true, from_finalize: false, channel: Channel { alternatives: [Alternatives { transcript: "Life moves pretty fast.", words: [Word { word: "life", start: 10.167857, end: 10.403572, confidence: 0.97802734, speaker: None, punctuated_word: Some("Life") }, Word { word: "moves", start: 10.403572, end: 10.717857, confidence: 0.99072266, speaker: None, punctuated_word: Some("moves") }, Word { word: "pretty", start: 10.717857, end: 11.032143, confidence: 0.99853516, speaker: None, punctuated_word: Some("pretty") }, Word { word: "fast", start: 11.032143, end: 11.425, confidence: 0.99902344, speaker: None, punctuated_word: Some("fast.") }], confidence: 0.99853516 }] }, metadata: Metadata { request_id: "e438b919-52a7-4e6e-a53c-b5e06f6a97ac", model_info: ModelInfo { name: "general", version: "2024-01-26.8851", arch: "base" }, model_uuid: "1ed36bac-f71c-4f3f-a31f-02fd6525c489" }, channel_index: [0, 1] })
got: Ok(TranscriptResponse { type_field: "Results", start: 11.71, duration: 2.87, is_final: true, speech_final: true, from_finalize: false, channel: Channel { alternatives: [Alternatives { transcript: "You don't stop and look around once in a while.", words: [Word { word: "you", start: 12.146944, end: 12.305834, confidence: 0.99902344, speaker: None, punctuated_word: Some("You") }, Word { word: "don't", start: 12.305834, end: 12.623611, confidence: 0.99658203, speaker: None, punctuated_word: Some("don't") }, Word { word: "stop", start: 12.623611, end: 12.7825, confidence: 0.99902344, speaker: None, punctuated_word: Some("stop") }, Word { word: "and", start: 12.7825, end: 12.941389, confidence: 0.97021484, speaker: None, punctuated_word: Some("and") }, Word { word: "look", start: 12.941389, end: 13.179722, confidence: 0.9941406, speaker: None, punctuated_word: Some("look") }, Word { word: "around", start: 13.179722, end: 13.418056, confidence: 0.9995117, speaker: None, punctuated_word: Some("around") }, Word { word: "once", start: 13.418056, end: 13.656389, confidence: 0.9995117, speaker: None, punctuated_word: Some("once") }, Word { word: "in", start: 13.656389, end: 13.735833, confidence: 0.97802734, speaker: None, punctuated_word: Some("in") }, Word { word: "a", start: 13.735833, end: 13.894722, confidence: 0.95654297, speaker: None, punctuated_word: Some("a") }, Word { word: "while", start: 13.894722, end: 14.053612, confidence: 0.98535156, speaker: None, punctuated_word: Some("while.") }], confidence: 0.99658203 }] }, metadata: Metadata { request_id: "e438b919-52a7-4e6e-a53c-b5e06f6a97ac", model_info: ModelInfo { name: "general", version: "2024-01-26.8851", arch: "base" }, model_uuid: "1ed36bac-f71c-4f3f-a31f-02fd6525c489" }, channel_index: [0, 1] })
got: Ok(TranscriptResponse { type_field: "Results", start: 14.58, duration: 3.0094376, is_final: true, speech_final: false, from_finalize: true, channel: Channel { alternatives: [Alternatives { transcript: "You could miss it.", words: [Word { word: "you", start: 14.777369, end: 14.935263, confidence: 0.99853516, speaker: None, punctuated_word: Some("You") }, Word { word: "could", start: 14.935263, end: 15.093158, confidence: 0.99365234, speaker: None, punctuated_word: Some("could") }, Word { word: "miss", start: 15.093158, end: 15.251053, confidence: 0.9975586, speaker: None, punctuated_word: Some("miss") }, Word { word: "it", start: 15.251053, end: 15.408947, confidence: 0.9946289, speaker: None, punctuated_word: Some("it.") }], confidence: 0.9975586 }] }, metadata: Metadata { request_id: "e438b919-52a7-4e6e-a53c-b5e06f6a97ac", model_info: ModelInfo { name: "general", version: "2024-01-26.8851", arch: "base" }, model_uuid: "1ed36bac-f71c-4f3f-a31f-02fd6525c489" }, channel_index: [0, 1] })
got: Ok(TranscriptResponse { type_field: "Results", start: 17.589437, duration: 6.2942505e-5, is_final: true, speech_final: false, from_finalize: false, channel: Channel { alternatives: [Alternatives { transcript: "", words: [], confidence: 0.0 }] }, metadata: Metadata { request_id: "e438b919-52a7-4e6e-a53c-b5e06f6a97ac", model_info: ModelInfo { name: "general", version: "2024-01-26.8851", arch: "base" }, model_uuid: "1ed36bac-f71c-4f3f-a31f-02fd6525c489" }, channel_index: [0, 1] })
got: Ok(TerminalResponse { request_id: [...], created: "2024-10-09T09:06:55.964Z", duration: 17.5895, channels: 1 })

real    0m31.475s
user    0m0.015s
sys     0m0.015s

I'm looking for a way to get transcription that is faster than typing, so I can implement push-to-talk voice typing for my app. If it's slower than typing, there's no point in transcribing voice input.

Any idea how I can speed it up? I'd really appreciate it 🙂

DamienDeepgram commented 23 hours ago

Streaming transcribes real-time audio.

If the file you are streaming is 31 seconds long, it is streamed at 1 second of audio per second, so it takes about 31 seconds to process.

If you want to transcribe a file rather than a real-time stream, use our pre-recorded API.
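To make the pacing point concrete, here is a minimal std-only sketch of the arithmetic. When a file is sent in fixed-size chunks with a fixed delay after each chunk, the wall-clock time is set by the number of chunks and the delay, not by how fast the audio can be transcribed. The function names and all numbers (16 kHz mono 16-bit PCM, 3200-byte chunks, 100 ms and 200 ms delays) are assumptions for illustration. They are not taken from the `simple_stream` example or from the SDK.

```rust
use std::time::Duration;

/// Playback length of raw PCM audio (hypothetical helper, not part of the SDK).
fn audio_duration(total_bytes: u64, sample_rate: u32, channels: u16, bytes_per_sample: u16) -> Duration {
    let bytes_per_second =
        u64::from(sample_rate) * u64::from(channels) * u64::from(bytes_per_sample);
    Duration::from_secs_f64(total_bytes as f64 / bytes_per_second as f64)
}

/// Lower bound on the wall-clock time needed to send `total_bytes` in chunks of
/// `chunk_size` bytes while sleeping `delay_per_chunk` after each chunk
/// (hypothetical helper, not part of the SDK).
fn paced_send_time(total_bytes: u64, chunk_size: u64, delay_per_chunk: Duration) -> Duration {
    // Round up: a final partial chunk still costs one full delay.
    let chunks = (total_bytes + chunk_size - 1) / chunk_size;
    let chunks = u32::try_from(chunks).expect("chunk count fits in u32");
    delay_per_chunk * chunks
}

fn main() {
    // Assumed input: 640,000 bytes of 16 kHz, mono, 16-bit PCM.
    let total_bytes = 640_000;
    let audio = audio_duration(total_bytes, 16_000, 1, 2);

    // Assumed pacing: 3200-byte chunks hold 100 ms of this audio, so a 100 ms
    // delay per chunk sends at real-time speed.
    let real_time = paced_send_time(total_bytes, 3_200, Duration::from_millis(100));

    // Doubling the delay doubles the wall-clock time for the same file.
    let half_speed = paced_send_time(total_bytes, 3_200, Duration::from_millis(200));

    println!("audio length:          {:?}", audio);
    println!("send time at 100 ms:   {:?}", real_time);
    println!("send time at 200 ms:   {:?}", half_speed);
}
```

Under these assumptions, a paced stream can never finish sooner than the number of chunks times the delay, which is why a pre-recorded file is better sent in a single request when real-time results are not needed.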