Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Multiple speakers (Skype, Teams meeting, or any audio source like YouTube) audio input in continuous speech recognition, Python #802

Closed yashugupta786 closed 3 years ago

yashugupta786 commented 4 years ago

I have created a small application for continuous speech-to-text transcription and translation. For a single user, getting input from the microphone works fine. But when there are multiple speakers (a Skype meeting, Teams conference call, Zoom meeting, or any other audio source), how do we fetch the audio for all speakers and pass it to the Azure speech-to-text service? As of now the only options are microphone and audio file.

How can this be achieved in Python, so that multiple speakers' voices can be fed to Azure Speech services and transcribed or translated?

BrianMouncer commented 4 years ago

In case you haven't noticed, the Teams client has the transcription feature built in now. :-)

We do not have a Python API surface for our Conversation Transcription service yet, or for the Speaker ID API (but we are working on them). You could do your experiments in C#, though.

Keep in mind that the conversation transcription APIs are in preview and will undergo some refactoring as we take customer feedback into the design.

https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/quickstarts/multi-device-conversation?pivots=programming-language-csharp

If your question was actually about how to do the audio input when you don't have a file or a microphone as input: you would usually get the audio from the call/meeting as a stream, and then put that stream into a push or pull audio stream class when passing it to the Speech SDK.

https://docs.microsoft.com/en-us/dotnet/api/microsoft.cognitiveservices.speech.audio.pushaudioinputstream
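The link above is the C# reference, but the Python SDK exposes the same concept. A minimal sketch of the push-stream approach in Python, assuming `azure-cognitiveservices-speech` is installed; `get_next_chunk`, `key`, and `region` are hypothetical placeholders you would supply from your own capture pipeline:

```python
# Sketch: feeding call/meeting audio into continuous recognition via a push stream.
try:
    import azure.cognitiveservices.speech as speechsdk
except ImportError:
    speechsdk = None  # SDK not installed; the function below then cannot run

def run_push_stream_recognition(get_next_chunk, key, region):
    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    # By default the push stream expects 16 kHz, 16-bit, mono PCM audio.
    stream = speechsdk.audio.PushAudioInputStream()
    audio_config = speechsdk.audio.AudioConfig(stream=stream)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                            audio_config=audio_config)
    recognizer.recognized.connect(lambda evt: print(evt.result.text))
    recognizer.start_continuous_recognition()
    while True:
        chunk = get_next_chunk()  # raw PCM bytes from your meeting/audio capture
        if not chunk:
            break
        stream.write(chunk)
    stream.close()  # signals end of stream to the recognizer
    recognizer.stop_continuous_recognition()
```

The key point is that `stream.write()` decouples audio capture from the SDK: anything that yields raw PCM bytes (a call recording hook, a virtual audio device, a network socket) can drive recognition.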

yashugupta786 commented 4 years ago

Thanks for the response, Brian. Teams does have the transcription functionality. I am looking for translation functionality, so that we can transcribe and translate conference meetings into different languages. However, on Teams only transcription is available, and only for English (speech to text in English only). So, according to you, we cannot fetch the audio of multiple speakers (in a meeting) from Teams and pass it to Azure speech translation.

Is there any other workaround in Python? I have observed that when using the Python SpeechRecognition library I am able to capture the audio of all speakers/users, but the accuracy is very bad. If there is any solution in Python for capturing the audio of all users/speakers in a meeting using the Azure service, that would be great.
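For the translation half of the question, the Python SDK does support continuous speech translation, and it accepts the same push-stream audio input as plain recognition. A hedged sketch, again assuming `azure-cognitiveservices-speech` is installed and `get_next_chunk`, `key`, and `region` are your own hypothetical placeholders:

```python
# Sketch: continuous speech translation over a push stream in Python.
try:
    import azure.cognitiveservices.speech as speechsdk
except ImportError:
    speechsdk = None  # SDK not installed; the function below then cannot run

def run_push_stream_translation(get_next_chunk, key, region):
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=key, region=region)
    translation_config.speech_recognition_language = "en-US"
    translation_config.add_target_language("hi")  # e.g. translate into Hindi
    stream = speechsdk.audio.PushAudioInputStream()  # expects 16 kHz 16-bit mono PCM
    audio_config = speechsdk.audio.AudioConfig(stream=stream)
    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config, audio_config=audio_config)
    # evt.result.translations maps target language codes to translated text
    recognizer.recognized.connect(lambda evt: print(evt.result.translations))
    recognizer.start_continuous_recognition()
    while True:
        chunk = get_next_chunk()  # raw PCM bytes captured from the meeting
        if not chunk:
            break
        stream.write(chunk)
    stream.close()
    recognizer.stop_continuous_recognition()
```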

pankopon commented 3 years ago

@yashugupta786 Sorry about the lack of response - is this issue still valid for you? Unfortunately there is probably not much to add to what has been written so far.

In general, if you have direct access to an audio stream from a source other than the microphone or a file, the recommended approach currently is to use a push audio stream to feed audio data from the source stream to the Speech SDK. If there are several such source streams that you want processed simultaneously, then you need to mix their audio together before passing it to the Speech SDK.
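The mixing step above can be done on the raw samples before they reach the push stream. A minimal sketch, assuming each source delivers 16-bit little-endian mono PCM chunks of matching length; the helper name is hypothetical:

```python
import struct

def mix_pcm16(a: bytes, b: bytes) -> bytes:
    """Add two 16-bit little-endian PCM buffers sample-by-sample, clipping to int16."""
    n = min(len(a), len(b)) // 2          # number of whole samples in both buffers
    fmt = "<%dh" % n
    samples_a = struct.unpack(fmt, a[:2 * n])
    samples_b = struct.unpack(fmt, b[:2 * n])
    # Sum each sample pair and clip to the valid int16 range to avoid overflow.
    mixed = (max(-32768, min(32767, x + y))
             for x, y in zip(samples_a, samples_b))
    return struct.pack(fmt, *mixed)
```

The mixed bytes would then go to a single `PushAudioInputStream.write()` call, so the SDK sees one combined audio stream. For production use, an attenuation or limiter step is preferable to hard clipping, but the idea is the same.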

pankopon commented 3 years ago

Closed as answered. Please create a new issue if you need further support on any specific topic.