Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Does AudioConfig.fromDefaultMicrophoneInput() in the Java SDK have any system dependency? #986

Closed bfoysal closed 3 years ago

bfoysal commented 3 years ago

Hi,

I am trying to use the speech translation service with the system audio/speaker output. For continuous translation the microphone input works fine, but the results are inconsistent when the AudioConfig object is created using the fromDefaultSpeakerOutput() method. With the Java SDK it sometimes captures part or all of the first sentence and then keeps returning ResultReason 0. Sometimes it captures and translates one or two sentences from the middle of the run. I have found that Java audio capture requires Stereo Mix to be enabled. Are there any system-level dependencies for speaker audio capture to be consistent? Also, is there any dependency on the system volume?

I have also tried fromDefaultSpeakerOutput() with the JavaScript (browser) SDK with no success. Could someone provide a working sample or explain how to capture/tap into speaker audio for translation?

Thank you.

brandom-msft commented 3 years ago

Hi @bfoysal

I'm querying the team about your Java question and I'll get back to you.

For your JavaScript question, here's a draft code snippet for using microphone input in the browser. This is not an end-to-end sample (we're working to improve our public sample offerings across our languages), but hopefully it can help you make some progress. Of course, please reach out with further questions.

// Starts continuous speech translation.
sdkStartContinousTranslationBtn.addEventListener("click", function () {
    audioConfig = SpeechSDK.AudioConfig.fromDefaultMicrophoneInput();

    var speechConfig = SpeechSDK.SpeechTranslationConfig.fromSubscription(key.value, regionOptions.value);

    // Set the source language.
    speechConfig.speechRecognitionLanguage = languageOptions.value;
    speechConfig.speechSynthesisOutputFormat = SpeechSDK.SpeechSynthesisOutputFormat.Riff16Khz16BitMonoPcm;

    // Defines the language(s) that speech should be translated to.
    speechConfig.addTargetLanguage( <set your target language here> );

    // If voice output is requested, set the target voice.
    if (voiceOutput.checked) {
        speechConfig.setProperty(SpeechSDK.PropertyId.SpeechServiceConnection_TranslationVoice, languageTargetOptions.value);
    }

    reco = new SpeechSDK.TranslationRecognizer(speechConfig, audioConfig);

    // Before beginning speech recognition, set up the callbacks to be invoked when an event occurs.

    // The event recognizing signals that an intermediate recognition result is received.
    reco.recognizing = function (s, e) {

        // Handle intermediate results as desired
    };

    // The event recognized signals that a final recognition result is received.
    reco.recognized = function (s, e) {
        // Handle recognition results
    };

    // Set up the remaining event handlers.

    // Begin the recognition
    reco.startContinuousRecognitionAsync();

});
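
As a follow-up to the draft above (just a sketch - it assumes you also wire up a corresponding stop button), stopping and cleaning up would look roughly like this:

// Stops continuous speech translation (sketch; the stop button element is assumed).
sdkStopContinuousTranslationBtn.addEventListener("click", function () {
    reco.stopContinuousRecognitionAsync(
        function () {
            // Recognition stopped; release the recognizer.
            reco.close();
            reco = undefined;
        },
        function (err) {
            // Handle the error as desired.
            console.error(err);
        });
});
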
bfoysal commented 3 years ago

Hi @brandom-msft

Thank you for sharing the snippet, but I am trying to do the same thing with JavaScript as with Java: not from the microphone, but rather initializing the audioConfig with fromDefaultSpeakerOutput(). Please share an example of that if you have one.

brandom-msft commented 3 years ago

@bfoysal Yep, I definitely misread that - apologies and thanks for restating it! Let's try this again for usages of fromDefaultSpeakerOutput.

For usage in JavaScript, you can view the tests we have. Here is one using fromDefaultSpeakerOutput.

For Java, there is this sample.

If these don't help with your scenario, could you also include details about your dev environment, like OS and SDK version, when you follow up? (If they do help, it'd be great to know what the issue was - for our own learning and/or for improving our docs, etc.)

bfoysal commented 3 years ago

@brandom-msft My Java configuration is JDK 8, Speech SDK version 1.10.0, on Windows 10. The problem I'm facing is that audio recognition is inconsistent when DefaultSpeakerOutput is used; it works just fine with the default microphone input.

For JavaScript I'm using the latest bundle.js file available from the documentation, and I am running the program inside an iframe. It can capture and translate microphone audio, but when DefaultSpeakerOutput() is used as the audioConfig there are no event triggers. Does the TranslationRecognizer have to run in the same context as the audio output source? Are there any permission dependencies?
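
To make this concrete, here is a simplified sketch of what I'm attempting (key, region, and languages are placeholders); in this configuration neither the recognizing nor the recognized event ever fires:

// Simplified sketch: same translation setup as the snippet above,
// but with the audio config created from the default speaker output.
var speechConfig = SpeechSDK.SpeechTranslationConfig.fromSubscription("<key>", "<region>");
speechConfig.speechRecognitionLanguage = "<source language>";
speechConfig.addTargetLanguage("<target language>");

var audioConfig = SpeechSDK.AudioConfig.fromDefaultSpeakerOutput();
var reco = new SpeechSDK.TranslationRecognizer(speechConfig, audioConfig);

reco.recognizing = function (s, e) {
    // Never invoked in this configuration.
};
reco.recognized = function (s, e) {
    // Never invoked in this configuration.
};

reco.startContinuousRecognitionAsync();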

brandom-msft commented 3 years ago

@bfoysal Can you share more about the scenario you're attempting to achieve? You mention you're using speech translation and that it does work as expected with the audio config set to microphone input. What more/different are you trying to achieve beyond that? I'm curious to learn/understand what it is you're wanting to do so we can help get you to that goal :)

bfoysal commented 3 years ago

@brandom-msft Let's say in a virtual meeting or conference I'm not familiar with the language being spoken. I want to translate the speech for my understanding. Right now I can get the microphone audio translated, but I want the audio I'm receiving to be translated. Hope this explains my intention :)

brandom-msft commented 3 years ago

@bfoysal Thanks for the clarification, now I see what you're aiming for. The fromDefaultSpeakerOutput API provides configuration for playback on the default speaker setup, but won't provide a direct "pipe" to use speaker output as an input. This is an interesting translation scenario! I have opened an item on the team backlog to consider for future deliveries.
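
To illustrate the distinction, here's a rough sketch of the kind of scenario fromDefaultSpeakerOutput is designed for - routing synthesized speech to the default speaker rather than capturing audio from it (not an end-to-end sample; key and region are placeholders, and it assumes the JavaScript SDK's SpeechSynthesizer):

// Sketch of the intended use of fromDefaultSpeakerOutput: play synthesized speech
// on the default speaker. It is an output configuration, not an audio source.
var speechConfig = SpeechSDK.SpeechConfig.fromSubscription("<key>", "<region>");
var audioConfig = SpeechSDK.AudioConfig.fromDefaultSpeakerOutput();

var synthesizer = new SpeechSDK.SpeechSynthesizer(speechConfig, audioConfig);

synthesizer.speakTextAsync(
    "Hello, world.",
    function (result) {
        // Synthesis finished; the audio was rendered to the default speaker.
        synthesizer.close();
    },
    function (err) {
        // Handle synthesis errors as desired.
        console.error(err);
        synthesizer.close();
    });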

I'll update the tags and close this issue but please reach out with any additional questions!