Closed sibbl closed 2 years ago
Thanks for getting in touch, @sibbl. I very much agree that the manual JSON parsing is undesirable and a distraction.
Your comment is very timely since we're working on adding more formal support for this property right now--it's in scope for the next (v1.21) release. Given this is a small and related addition to word-level details, the intended approach is that you'll just get word-level confidence scores automatically when word-level timings are requested, with the C#-specific DetailedSpeechRecognitionResult's Words collection augmented to include a Confidence property (alongside Offset and Duration). Further, we're also considering (not finalized yet) making word-level details (timing, confidence) automatically included when detailed output format (SpeechConfig.OutputFormat = OutputFormat.Detailed) is requested, having word-level information (via the existing WordLevelTiming property) become an "opt out" rather than an "opt in" for detailed results.
Would those approaches meet your needs? This is a great case where we may be able to incorporate your feedback in real-time.
Thanks again!
Hi @trrwilson, thanks for the quick response!
Sounds like very good timing indeed!
To be honest I don't recall any specific documentation on OutputFormat.Detailed but would agree that it makes sense to include as much detail as possible with this setting by default. On the other hand, I also don't have any insight into what this does performance-wise and whether you'd get faster responses with a more fine-grained API in time-critical applications.
For our particular use case, we only need the confidence and no timings for now. So we would be happy to be able to disable things via opt-out if we benefit from it, but it's not a must-have.
From a user perspective, I could imagine a flag enum like the following as well to a) have very fine-grained control and b) let users know that they're not just setting some output format to detailed (which can be anything...) but requesting word level details with specific values:
// proposal:
speechConfig.WordLevelDetails = WordLevelDetails.All;
// or
speechConfig.WordLevelDetails = WordLevelDetails.Duration | WordLevelDetails.Confidence;
Again, I'm not sure how this integrates with the same SpeechConfig entity being shared with the speech synthesizing APIs and what side effects it might have there.
But first and foremost, we're looking forward to version 1.21 and the possibility to get the Confidence in the WordLevelTimingResult! 👍
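To make the flag enum proposal above a bit more concrete, here is a minimal sketch of how such a type could be declared and checked. Everything in it is hypothetical: WordLevelDetails is not part of the Speech SDK, the Offset member is an assumption added for completeness (the proposal only names All, Duration and Confidence), and the Includes helper exists only for illustration.

```csharp
using System;

// Hypothetical enum mirroring the proposal; NOT part of the Speech SDK.
[Flags]
public enum WordLevelDetails
{
    None = 0,
    Offset = 1 << 0,      // assumed member, not named in the proposal
    Duration = 1 << 1,
    Confidence = 1 << 2,
    All = Offset | Duration | Confidence
}

public static class WordLevelDetailsDemo
{
    // Returns true when every flag in 'wanted' is also set in 'requested'.
    public static bool Includes(WordLevelDetails requested, WordLevelDetails wanted)
        => (requested & wanted) == wanted;

    public static void Main()
    {
        var requested = WordLevelDetails.Duration | WordLevelDetails.Confidence;

        // Check which details the caller asked for.
        Console.WriteLine(Includes(requested, WordLevelDetails.Confidence));
        Console.WriteLine(Includes(requested, WordLevelDetails.Offset));
    }
}
```

A flags enum keeps the request explicit at the call site, which is the point b) made above; whether the service can actually skip individual fields is a separate question that this sketch does not address.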
I implemented a speech-to-SRT function using the Recognizing and Recognized events, and it is word level (Infinity Approach).
I didn't find an open source implementation of the same method as mine, so I'm going to put my method on GitHub in the next few days, and I'll let you know then.
add speech synthesis sample to generate srt subtitle file #1286
[DRAFT] [DO NOT MERGE] Add captioning samples #1435
Update: my implementation: Azure speech to subtitle (word-level timestamp)
@CodingOctocat just to make sure that you've not replied to the wrong issue or got us wrong: this issue is about the confidence values not being part of the word level entities provided by the SDK.
While it's in the JSON delivered from the MS server to our client (as can be seen in the JsonResult string), it's currently not parsed and not accessible on an SDK level.
I'm not sure how your project helps here, as it doesn't make use of the confidence at all. But thanks anyway for sharing! Using STT for creating subtitles might also be a great case for a sample project, but that might be another feature/sample request and not be in scope for this issue ☺️👍
Update:
WordLevelTimingResult.Confidence is available with Speech SDK 1.22.0 and later (ref. https://docs.microsoft.com/dotnet/api/microsoft.cognitiveservices.speech.wordleveltimingresult.confidence).
With speechConfig.OutputFormat = OutputFormat.Detailed, there is no need to specifically request word level detail anymore.
Closed as the API enhancements have been implemented and released. Please open a new issue if further support is needed.
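For readers landing here later, reading the confidence values after this change could look roughly like the sketch below. It assumes Speech SDK 1.22.0 or later, a default microphone as the audio input, and placeholder credentials ("YourSubscriptionKey" and "YourServiceRegion" are stand-ins, not real values). Treat it as an illustration of the properties discussed in this issue rather than an official sample.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;

public static class WordConfidenceSample
{
    public static async Task Main()
    {
        // Placeholder credentials: replace with your own key and region.
        var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");

        // Detailed output; per the update above, word level details no longer need a separate request.
        speechConfig.OutputFormat = OutputFormat.Detailed;

        // Uses the default microphone as input.
        using var recognizer = new SpeechRecognizer(speechConfig);
        var result = await recognizer.RecognizeOnceAsync();

        if (result.Reason != ResultReason.RecognizedSpeech)
        {
            Console.WriteLine($"No speech recognized: {result.Reason}");
            return;
        }

        // Best() returns the detailed alternatives; take the first one.
        var best = result.Best().FirstOrDefault();
        if (best?.Words == null)
        {
            Console.WriteLine("No word level details in the result.");
            return;
        }

        foreach (var word in best.Words)
        {
            Console.WriteLine(
                $"{word.Word}: confidence={word.Confidence}, offset={word.Offset}, duration={word.Duration}");
        }
    }
}
```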
It's currently already possible to request word level timings using
The results then are accessible in each WordLevelTimingResult of the enumerable speechRecognitionResult.Best().FirstOrDefault().Words.
On the same level, I'd appreciate if a RequestWordLevelConfidence method could be implemented.
I know that it's currently already possible to use
and then parse
into my own C# entities.
However, as the query parameter is already there and you already parse the JSON result into an entity, what would speak against implementing this field as well? It would save us users some critical time if we didn't have to parse the JSON again into our own entities.
I think the SDK should take care of it and expose this part of the backend's API as well.
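For context, the manual parsing workaround described above might look something like the following sketch. The JSON shape (an NBest array whose entries carry a Words array with Word and Confidence fields) is an assumption based on the detailed response format, and the sample payload is made up for illustration. In a real application the string would come from the recognition result, for example via result.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult).

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class JsonConfidenceParser
{
    // Extracts (word, confidence) pairs from the first NBest entry.
    // Returns an empty list if the assumed structure is not present.
    public static List<(string Word, double Confidence)> ParseWordConfidences(string json)
    {
        var words = new List<(string Word, double Confidence)>();
        using var doc = JsonDocument.Parse(json);
        var root = doc.RootElement;

        if (root.ValueKind != JsonValueKind.Object ||
            !root.TryGetProperty("NBest", out var nbest) ||
            nbest.ValueKind != JsonValueKind.Array ||
            nbest.GetArrayLength() == 0)
        {
            return words;
        }

        var first = nbest[0];
        if (!first.TryGetProperty("Words", out var wordArray) ||
            wordArray.ValueKind != JsonValueKind.Array)
        {
            return words;
        }

        foreach (var w in wordArray.EnumerateArray())
        {
            var text = w.TryGetProperty("Word", out var t) ? t.GetString() ?? "" : "";
            // NaN marks a word that carries no confidence value.
            var confidence = w.TryGetProperty("Confidence", out var c) ? c.GetDouble() : double.NaN;
            words.Add((text, confidence));
        }
        return words;
    }

    public static void Main()
    {
        // Made-up payload in the assumed shape, for illustration only.
        const string sample =
            "{\"NBest\":[{\"Confidence\":0.93,\"Words\":[" +
            "{\"Word\":\"hello\",\"Offset\":100000,\"Duration\":4000000,\"Confidence\":0.97}," +
            "{\"Word\":\"world\",\"Offset\":4200000,\"Duration\":3900000,\"Confidence\":0.88}]}]}";

        foreach (var (word, confidence) in ParseWordConfidences(sample))
        {
            Console.WriteLine($"{word}: {confidence}");
        }
    }
}
```

This is the kind of duplicated parsing the issue asks the SDK to take over: the SDK already reads the same JSON for Offset and Duration.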