Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

C# SDK: add RequestWordLevelConfidence and expose word level confidence in WordLevelTimingResult #1438

Closed sibbl closed 2 years ago

sibbl commented 2 years ago

It's already possible to request word-level timestamps using

var speechConfig = SpeechConfig.FromSubscription(subscriptionKey, region);
speechConfig.RequestWordLevelTimestamps();

The results are then accessible in each WordLevelTimingResult of the enumerable speechRecognitionResult.Best().FirstOrDefault().Words.

Along the same lines, I'd appreciate it if a RequestWordLevelConfidence method could be implemented.

I know it's already possible to use

speechConfig.SetServiceProperty("wordLevelConfidence", "true", ServicePropertyChannel.UriQueryParameter);

and then parse

speechRecognitionResult.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult)

into my own C# entities.
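For illustration, that manual parsing might look roughly like the following. This is only a sketch using System.Text.Json; the WordResult record is a hypothetical entity, and the JSON shape (NBest, Words, Offset, Duration, Confidence) is an assumption based on the service's detailed recognition output, not an SDK contract:

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.Json;

// Hypothetical entity for one word-level result; not an SDK type.
public record WordResult(string Word, long Offset, long Duration, double Confidence);

public static class JsonResultDemo
{
    public static void Main()
    {
        // Sample payload shaped like SpeechServiceResponse_JsonResult when
        // wordLevelConfidence=true is set (field names are assumptions here).
        const string json = @"{
          ""NBest"": [{
            ""Confidence"": 0.95,
            ""Words"": [
              { ""Word"": ""hello"", ""Offset"": 500000, ""Duration"": 2500000, ""Confidence"": 0.97 },
              { ""Word"": ""world"", ""Offset"": 3100000, ""Duration"": 2800000, ""Confidence"": 0.91 }
            ]
          }]
        }";

        using var doc = JsonDocument.Parse(json);

        // Take the top recognition hypothesis and project its words
        // into our own entities.
        var words = doc.RootElement
            .GetProperty("NBest")[0]
            .GetProperty("Words")
            .EnumerateArray()
            .Select(w => new WordResult(
                w.GetProperty("Word").GetString()!,
                w.GetProperty("Offset").GetInt64(),
                w.GetProperty("Duration").GetInt64(),
                w.GetProperty("Confidence").GetDouble()))
            .ToList();

        foreach (var w in words)
            Console.WriteLine(
                $"{w.Word}: confidence {w.Confidence.ToString(CultureInfo.InvariantCulture)}");
    }
}
```

This is exactly the kind of boilerplate every consumer currently has to duplicate, which is why exposing the field in the SDK entity would help.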

However, since the query parameter already exists and you already parse the JSON result into an entity, what would argue against exposing this field as well? It would save users valuable time by not having to re-parse the JSON into their own entities.

I think the SDK should take care of it and expose this part of the backend's API as well.

trrwilson commented 2 years ago

Thanks for getting in touch, @sibbl. I very much agree that the manual JSON parsing is undesirable and a distraction.

Your comment is very timely, since we're working on adding more formal support for this property right now; it's in scope for the next (v1.21) release. Given that this is a small addition closely related to word-level details, the intended approach is that you'll get word-level confidence scores automatically whenever word-level timings are requested, with the C#-specific DetailedSpeechRecognitionResult's Words collection augmented to include a Confidence property (alongside Offset and Duration). Further, we're also considering (not finalized yet) automatically including word-level details (timing, confidence) whenever the detailed output format (SpeechConfig.OutputFormat = OutputFormat.Detailed) is requested, making word-level information (via the existing WordLevelTiming property) an "opt out" rather than an "opt in" for detailed results.

Would those approaches meet your needs? This is a great case where we may be able to incorporate your feedback in real-time.

Thanks again!

sibbl commented 2 years ago

Hi @trrwilson, thanks for the quick response!

Sounds like very good timing indeed!

To be honest, I don't recall any specific documentation on OutputFormat.Detailed, but I agree it makes sense to include as much detail as possible with that setting by default. On the other hand, I don't have any insight into the performance implications, or whether a more fine-grained API would yield faster responses in time-critical applications.

For our particular use case, we only need the confidence and no timings for now. So we'd be happy to be able to disable the timings via an opt-out if that brings a benefit, but it's not a must-have.

From a user perspective, I could also imagine a flags enum like the following, to a) allow very fine-grained control and b) make it clear to users that they're not just setting some output format to "detailed" (which could mean anything) but are requesting word-level details with specific values:

// proposal:
speechConfig.WordLevelDetails = WordLevelDetails.All;
// or
speechConfig.WordLevelDetails = WordLevelDetails.Duration | WordLevelDetails.Confidence;

Again, I'm not sure how this would interact with the same SpeechConfig entity being shared with the speech synthesis APIs, and what side effects it might have there.
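As a sketch, the proposed flags enum could be defined as below. WordLevelDetails and its members are purely hypothetical names from this proposal, not part of the SDK; the member names follow the properties mentioned in the thread (Duration, Confidence):

```csharp
using System;

// Hypothetical flags enum from the proposal above; not an SDK type.
[Flags]
public enum WordLevelDetails
{
    None       = 0,
    Duration   = 1 << 0,
    Confidence = 1 << 1,
    All        = Duration | Confidence
}

public static class WordLevelDetailsDemo
{
    public static void Main()
    {
        // A caller requests only the details it actually needs.
        var requested = WordLevelDetails.Duration | WordLevelDetails.Confidence;

        // HasFlag lets the SDK (or a caller) check which details were requested.
        Console.WriteLine(requested.HasFlag(WordLevelDetails.Confidence)); // True
        Console.WriteLine(requested == WordLevelDetails.All);              // True
    }
}
```

A [Flags] enum like this would let the SDK add further word-level detail kinds later without new boolean setters, at the cost of a slightly larger public surface.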

But first and foremost, we're looking forward to version 1.21 and the possibility to get the Confidence in the WordLevelTimingResult! 👍

CodingOctocat commented 2 years ago

I implemented a speech-to-SRT function using the Recognizing and Recognized events, and it works at word level (Infinity Approach). I didn't find an open-source implementation of the same method as mine, so I'm going to put my method on GitHub in the next few days, and I'll let you know then.

add speech synthesis sample to generate srt subtitle file #1286
[DRAFT] [DO NOT MERGE] Add captioning samples #1435

Update: my implementation: Azure speech to subtitle (word-level timestamp)

sibbl commented 2 years ago

@CodingOctocat just to make sure you haven't replied to the wrong issue or misunderstood us: this issue is about the confidence values not being part of the word-level entities provided by the SDK.

While the confidence is present in the JSON delivered from the Microsoft server to our client (as can be seen in the JsonResult string), it's currently neither parsed nor accessible at the SDK level.

I'm not sure how your project helps here, as it doesn't use the confidence values at all. But thanks for sharing anyway! Using STT to create subtitles might also be a great case for a sample project, but that would be a separate feature/sample request and out of scope for this issue ☺️👍

pankopon commented 2 years ago

Update:

pankopon commented 2 years ago

Closed as the API enhancements have been implemented and released. Please open a new issue if further support is needed.