Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Android custom AudioTrack implementation: .play() after .pause() does not resume (TTS starts from the first word instead of resuming where it was paused) #1610

Closed. satish-osi closed this issue 2 years ago

satish-osi commented 2 years ago

Please find the source code attached below:

This is an Android implementation of a custom AudioTrack, based on the sample Android project provided on GitHub. We are able to call .pause() while TTS is speaking, but after pausing, calling .play() again with a 2-second delay does not resume playback. Once .pause() has been used, .play() no longer works to resume. Please let us know how to fix this issue.

//
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE.md file in the project root for full license information.
//
package com.microsoft.cognitiveservices.speech.samples.speechsynthesis;

import androidx.appcompat.app.AppCompatActivity;
import androidx.core.app.ActivityCompat;

import android.graphics.Color;
import android.media.AudioAttributes;
import android.media.AudioDeviceCallback;
import android.media.AudioFormat;
import android.media.AudioManager;
import android.media.AudioPlaybackCaptureConfiguration;
import android.media.AudioTrack;
import android.os.Bundle;
import android.os.Handler;
import android.provider.MediaStore;
import android.text.Spannable;
import android.text.method.ScrollingMovementMethod;
import android.text.style.ForegroundColorSpan;
import android.util.Log;
import android.view.View;
import android.widget.EditText;
import android.widget.TextView;
import android.widget.Toast;

import com.microsoft.cognitiveservices.speech.AudioDataStream;
import com.microsoft.cognitiveservices.speech.Connection;
import com.microsoft.cognitiveservices.speech.KeywordRecognitionEventArgs;
import com.microsoft.cognitiveservices.speech.PropertyId;
import com.microsoft.cognitiveservices.speech.SpeechConfig;
import com.microsoft.cognitiveservices.speech.SpeechSynthesisBookmarkEventArgs;
import com.microsoft.cognitiveservices.speech.SpeechSynthesisCancellationDetails;
import com.microsoft.cognitiveservices.speech.SpeechSynthesisOutputFormat;
import com.microsoft.cognitiveservices.speech.SpeechSynthesisResult;
import com.microsoft.cognitiveservices.speech.SpeechSynthesisWordBoundaryEventArgs;
import com.microsoft.cognitiveservices.speech.SpeechSynthesizer;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;
import com.microsoft.cognitiveservices.speech.util.EventHandler;
import com.microsoft.cognitiveservices.speech.util.EventHandlerImpl;

import java.io.File;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import static android.Manifest.permission.INTERNET;

public class MainActivity extends AppCompatActivity {

    // Replace below with your own subscription key
    private static String speechSubscriptionKey = "subscription_key";
    // Replace below with your own service region (e.g., "westus").
    private static String serviceRegion = "region_key";

    private SpeechConfig speechConfig;
    private SpeechSynthesizer synthesizer;
    private Connection connection;
    private AudioTrack audioTrack;

    private TextView outputMessage;

    // Runnable that consumes the text to be converted to speech
    private SpeakingRunnable speakingRunnable;
    private ExecutorService singleThreadExecutor;
    private final Object synchronizedObj = new Object();
    private boolean stopped = false;
    private boolean paused = false;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        // Note: we need to request the permissions
        int requestCode = 5; // Unique code for the permission request
        ActivityCompat.requestPermissions(MainActivity.this, new String[]{INTERNET}, requestCode);

        singleThreadExecutor = Executors.newSingleThreadExecutor();
        speakingRunnable = new SpeakingRunnable();

        outputMessage = this.findViewById(R.id.outputMessage);
        outputMessage.setMovementMethod(new ScrollingMovementMethod());

        audioTrack = new AudioTrack(
                new AudioAttributes.Builder()
                        .setUsage(AudioAttributes.USAGE_MEDIA)
                        .setContentType(AudioAttributes.CONTENT_TYPE_SPEECH)
                        .build(),
                new AudioFormat.Builder()
                        .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
                        .setSampleRate(24000)
                        .setChannelMask(AudioFormat.CHANNEL_OUT_MONO)
                        .build(),
                AudioTrack.getMinBufferSize(
                        24000,
                        AudioFormat.CHANNEL_OUT_MONO,
                        AudioFormat.ENCODING_PCM_16BIT) * 2,
                AudioTrack.MODE_STREAM,
                AudioManager.AUDIO_SESSION_ID_GENERATE);

        /*audioTrack.play();
        audioTrack.pause();*/
        //audioTrack.getFormat().

    }

    @Override
    protected void onDestroy() {
        super.onDestroy();

        // Release speech synthesizer and its dependencies
        if (synthesizer != null) {
            synthesizer.close();
            connection.close();
        }
        if (speechConfig != null) {
            speechConfig.close();
        }
        singleThreadExecutor.shutdownNow();
        if (audioTrack != null) {
            audioTrack.flush();
            audioTrack.stop();
            audioTrack.release();
        }
    }

    public void onCreateSynthesizerButtonClicked(View v) {
        paused = false;
        if (synthesizer != null) {
            speechConfig.close();
            synthesizer.close();
            connection.close();
        }

        // Reuse the synthesizer to lower the latency.
        // I.e. create one synthesizer and speak many times using it.
        clearOutputMessage();
        updateOutputMessage("Initializing synthesizer...\n");

        speechConfig = SpeechConfig.fromSubscription(speechSubscriptionKey, serviceRegion);
        // Use 24k Hz format for higher quality.
        speechConfig.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Raw24Khz16BitMonoPcm);
        // Set voice name.
        speechConfig.setSpeechSynthesisVoiceName("en-US-JennyNeural");
        //speechConfig.getSpeechSynthesisOutputFormat().toString();

      //  File editionImagesDir = new File(getExternalFilesDir(null),"path/to/write/file.wav");
        //AudioConfig audioConfig = AudioConfig.fromWavFileOutput(editionImagesDir.getAbsolutePath());
       // AudioConfig audioConfig = AudioConfig.fromDefaultSpeakerOutput();
        /*AudioConfig audioConfig = AudioConfig.fromDefaultSpeakerOutput();*/
        //AudioConfig audioConfig = AudioConfig.fromSpeakerOutput(audioTrack.toString());

        //audioOutputKeepAlive
        synthesizer = new SpeechSynthesizer(speechConfig, null);

        connection = Connection.fromSpeechSynthesizer(synthesizer);

        Locale current = getResources().getConfiguration().locale;

        connection.connected.addEventListener((o, e) -> {
            updateOutputMessage("Connection established.\n");
        });

        connection.disconnected.addEventListener((o, e) -> {
            updateOutputMessage("Disconnected.\n");
        });
        /*connection.messageReceived.addEventListener();*/

        synthesizer.BookmarkReached.addEventListener((o, e) -> {
            // Speech SDK events fire on a background thread, so post the toast to the UI thread.
            runOnUiThread(() -> Toast.makeText(this, "BookmarkReached " + e.getText(), Toast.LENGTH_SHORT).show());
            /*updateOutputMessage(String.format(current,
                    "BookmarkReached started. Result Id: %s.\n",
                    e.getText().toString()));*/
            //e.close();
        });

        synthesizer.SynthesisStarted.addEventListener((o, e) -> {
            updateOutputMessage(String.format(current,
                "Synthesis started. Result Id: %s.\n",
                e.getResult().getResultId()));
            e.close();
        });

        synthesizer.Synthesizing.addEventListener((o, e) -> {
            updateOutputMessage(String.format(current,
                "Synthesizing. received %d bytes.\n",
                e.getResult().getAudioLength()));
            e.close();
        });

        synthesizer.SynthesisCompleted.addEventListener((o, e) -> {
            updateOutputMessage("Synthesis finished.\n");
            updateOutputMessage("\tFirst byte latency: " + e.getResult().getProperties().getProperty(PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs) + " ms.\n");
            updateOutputMessage("\tFinish latency: " + e.getResult().getProperties().getProperty(PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs) + " ms.\n");
            e.close();
        });

        synthesizer.SynthesisCanceled.addEventListener((o, e) -> {
            String cancellationDetails =
                    SpeechSynthesisCancellationDetails.fromResult(e.getResult()).toString();
            updateOutputMessage("Error synthesizing. Result ID: " + e.getResult().getResultId() +
                    ". Error detail: " + System.lineSeparator() + cancellationDetails +
                    System.lineSeparator() + "Did you update the subscription info?\n",
                true, true);
            e.close();
        });

        synthesizer.WordBoundary.addEventListener((o, e) -> {

            updateOutputMessage(String.format(current,
                "Word boundary. Text offset %d, length %d; audio offset %d ms.\n",
                e.getTextOffset(),
                e.getWordLength(),
                e.getAudioOffset() / 10000));

        });

     /*   speechRecognizer.recognized.addEventListener((o, speechRecognitionEventArgs) -> {
                    parsePhrase(speechRecognitionEventArgs.getResult().getText());
                    showUiAppIsListening(false);
                }*/

    }

    public void onPreConnectButtonClicked(View v) {
        // This method could pre-establish the connection to service to lower the latency
        // This method is useful when you want to synthesize audio in a short time, but the text is
        // not available. E.g. for speech bot, you can warm up the TTS connection when the user is speaking;
        // then call speak() when dialogue utterance is ready.
        if (connection == null) {
            updateOutputMessage("Please initialize the speech synthesizer first\n", true, true);
            return;
        }
        connection.openConnection(true);
        updateOutputMessage("Opening connection.\n");
    }

    public void onSpeechButtonClicked(View v) {
        /*if(paused){
            if (synthesizer != null) {
                //synthesizer.Start();
            }
            //audioTrack.play();

        }else{*/
            clearOutputMessage();

            if (synthesizer == null) {
                updateOutputMessage("Please initialize the speech synthesizer first\n", true, true);
                return;
            }

            EditText speakText = this.findViewById(R.id.speakText);

            speakingRunnable.setContent(speakText.getText().toString());
            singleThreadExecutor.execute(speakingRunnable);
        /*}*/

    }

    private void stopSynthesizing() {
        if (synthesizer != null) {
            //  synthesizer.StopSpeakingAsync();
        }

        // audioTrack.pause();

        if (audioTrack != null) {
            synchronized (synchronizedObj) {
                //stopped = true;
            }
            // Pause the TTS audio that is currently playing
            audioTrack.pause();
            paused = true;
            showToast("Pause method called");
            //  audioTrack.flush();
        }
        paused = true;

        // Resume the audio after a 2-second delay; this is where it does not work.
        new Handler().postDelayed(new Runnable() {
            @Override
            public void run() {
                audioTrack.play();
                showToast("Play method called");
            }
        }, 2000);
    }

    private void showToast(String message) {
        Toast.makeText(this,message,Toast.LENGTH_SHORT).show();
    }

    public void onStopButtonClicked(View v) {
        if (synthesizer == null) {
            updateOutputMessage("Please initialize the speech synthesizer first\n", true, true);
            return;
        }

        stopSynthesizing();
    }

    class SpeakingRunnable implements Runnable {
        private String content;

        public void setContent(String content) {
            this.content = content;
        }

        @Override
        public void run() {
            try {
                audioTrack.play();
                synchronized (synchronizedObj) {
                    stopped = false;
                }

                // StartSpeakingTextAsync returns once synthesis has started; the audio
                // is then pulled incrementally from the AudioDataStream below.
                SpeechSynthesisResult result = synthesizer.StartSpeakingTextAsync(content).get();
                //synthesizer.wordBoundaryEventCallback()
                //SpeechSynthesisWordBoundaryEventArgs ards =new SpeechSynthesisWordBoundaryEventArgs(1);
                AudioDataStream audioDataStream = AudioDataStream.fromResult(result);

                // Set the chunk size to 50 ms. 24000 * 16 * 0.05 / 8 = 2400
                byte[] buffer = new byte[2400];
                while (!stopped) {
                    Log.d("buffer : ",""+buffer.length);
                    long len = audioDataStream.readData(buffer);
                    if (len == 0) {
                        break;
                    }
                    audioTrack.write(buffer, 0, (int) len);
                }

                //File editionImagesDir = new File(getExternalFilesDir(null), "/recordedAudio");
                //AudioConfig.fromWavFileOutput(editionImagesDir.getAbsolutePath().toString());
                //audioDataStream.saveToWavFile(editionImagesDir.getAbsolutePath().toString());
                //audioDataStream.saveToWavFileAsync(editionImagesDir.getAbsolutePath().toString());

                audioDataStream.close();
            } catch (Exception ex) {
                Log.e("Speech Synthesis Demo", "unexpected " + ex.getMessage());
                ex.printStackTrace();
                assert(false);
            }
        }
    }

    private void updateOutputMessage(String text) {
        updateOutputMessage(text, false, true);
    }

    private synchronized void updateOutputMessage(String text, boolean error, boolean append) {
        this.runOnUiThread(() -> {
            if (append) {
                outputMessage.append(text);
            } else {
                outputMessage.setText(text);
            }
            if (error) {
                Spannable spannableText = (Spannable) outputMessage.getText();
                spannableText.setSpan(new ForegroundColorSpan(Color.RED),
                    spannableText.length() - text.length(),
                    spannableText.length(),
                    0);
            }
        });
    }

    private void clearOutputMessage() {
        updateOutputMessage("", false, false);
    }
}
ralph-msft commented 2 years ago

Could you please enable SDK logging and share those logs? You can find the instructions on how to do that here: https://docs.microsoft.com/azure/cognitive-services/speech-service/how-to-use-logging#android
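
In case it saves a lookup, the gist of that page is setting PropertyId.Speech_LogFilename on the SpeechConfig before the synthesizer is created. A minimal Kotlin sketch (the helper name and log file name below are just examples, not from the docs):

    import android.content.Context
    import com.microsoft.cognitiveservices.speech.PropertyId
    import com.microsoft.cognitiveservices.speech.SpeechConfig
    import java.io.File

    // Enable Speech SDK file logging; the log file path is arbitrary,
    // any app-writable location works.
    fun createLoggingSpeechConfig(context: Context, key: String, region: String): SpeechConfig {
        val logFile = File(context.getExternalFilesDir(null), "speech-sdk.log")
        return SpeechConfig.fromSubscription(key, region).apply {
            setProperty(PropertyId.Speech_LogFilename, logFile.absolutePath)
        }
    }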

Could you please also include any logcat logs - particularly errors?

satish-osi commented 2 years ago

Hi @ralph-msft, please find the log file attached: logfile.txt

Case: I installed the application with the source code above, clicked the play button, then clicked the stop button, and collected the log file shared here.

Please let me know if you need any additional information. I'm glad to help!

jpalvarezl commented 2 years ago

Hi @satish-osi,

I tried running your code locally, wiring the callbacks to buttons like so:

        findViewById<Button>(R.id.buttonCreateSynth)?.let { button ->
            button.setOnClickListener {
                onCreateSynthesizerButtonClicked(button)
            }
        }

        findViewById<Button>(R.id.buttonPlayVoice)?.let { button ->
            button.setOnClickListener {
                onSpeechButtonClicked(button)
            }
        }

        findViewById<Button>(R.id.buttonStopVoice)?.let { button ->
            button.setOnClickListener {
                onStopButtonClicked(button)
            }
        }

I tapped the buttons in the following order:

  1. Tap buttonCreateSynth
  2. Paste a long text into the content EditText so that playback goes on long enough
  3. Tap buttonPlayVoice
  4. Tap buttonStopVoice (at this point I noticed some UI stuttering)
  5. The audio playback eventually stops after a little over a second

I also noticed that if you let the audio play out, it is possible to play the TTS again.

I ended up commenting out the contents of your updateOutputMessage method. Since you use it to log things to the UI, and given that you are logging a lot of events, the UI thread gets spammed and then has a hard time registering the tap on buttonStopVoice. Once the contents of updateOutputMessage are commented out, my reproduction of your code runs smoothly.
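
Just as an illustration of one way to do that (a sketch, not necessarily exactly what I did), the frequent Synthesizing/WordBoundary messages can be routed to Logcat instead of the TextView so the UI thread stays responsive:

    // Kotlin sketch: log to Logcat instead of appending to the TextView.
    // The "TTS_SAMPLE" tag is arbitrary; the unused parameters just keep the
    // existing call sites compiling.
    private fun updateOutputMessage(text: String, error: Boolean = false, append: Boolean = true) {
        Log.d("TTS_SAMPLE", text)
    }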

satish-osi commented 2 years ago

Hi @jpalvarezl, you are correct. But, following your sequence of steps:

  1. Tap buttonCreateSynth
  2. Paste a long text into the content EditText so that playback goes on long enough
  3. Tap buttonPlayVoice
  4. Tap buttonStopVoice (at this point I noticed some UI stuttering)
  5. The audio playback eventually stops after a little over a second

If we tap buttonPlayVoice (step 3) again after step 4, it does work for me (as you noted, letting the audio play out also makes it possible to play the TTS again), but playback starts from the first word of the sentence. We want TTS to resume from where it was previously paused/stopped.

Let me know if you need any additional information.

Please find my entire sample app here: https://drive.google.com/drive/folders/1VwPtVda82rLabnx7Kg6tCUtqqaD31oOp?usp=sharing

jpalvarezl commented 2 years ago

OK, I think I understand your use case better now.

The sample you are basing your code on is, I believe, meant to always play the TTS audio from the top. To achieve that, it requests a new result each time, and each time a new audio stream is returned.

I modified the sample so that we keep track of the audio stream returned in the first result. Whenever we pause playback, we need to stop passing frames to the Android AudioTrack (for some reason, it doesn't seem to buffer while paused, but that could be a configuration issue on my side).

This function is what you call when you want to start both reading from the stream and playback. Notice that we check whether audioDataStream == null so that we reuse the same audio stream later when we resume playback (we basically keep a reference to it, which is just one way to do it).

    private fun startStreamPlaybackPressed() {
        if (synthesizer == null) {
            Log.e("TTS_SAMPLE", "Please initialize the speech synthesizer first")
            return
        }
        val speakText = findViewById<EditText>(R.id.speakText)

        audioTrack!!.play()
        if (audioDataStream == null) {
            singleThreadExecutor!!.execute { startAudioStream(speakText.text.toString()) }
        } else {
            synchronized(synchronizedObj) { stopped = false }
            singleThreadExecutor!!.execute { readAudioData() }
        }
    }

You will then need these two additional functions:

    private fun startAudioStream(content: String) {
        synchronized(synchronizedObj) { stopped = false }
        val result = synthesizer!!.StartSpeakingTextAsync(content).get()
        audioDataStream = AudioDataStream.fromResult(result)

        // Set the chunk size to 50 ms. 24000 * 16 * 0.05 / 8 = 2400
        readAudioData()
        Log.d("AUDIOTRACK", "Finished reading from audio stream result")
        // audioDataStream!!.close()
    }

    private fun readAudioData() {
        val buffer = ByteArray(2400)
        while (!stopped) {
            val len = audioDataStream!!.readData(buffer)
            if (len == 0L) {
                break
            }
            val bytesWritten = audioTrack!!.write(buffer, 0, len.toInt())
            Log.d("AUDIOTRACK", "$bytesWritten bytes")
        }
    }

These will start or resume writing data from the audio stream into your AudioTrack object.

Finally, pausing playback will look like this:

    private fun pausePlayback() {
        if (audioTrack != null) {
            synchronized(synchronizedObj) { stopped = true }
            audioTrack!!.pause()
        }
    }

I would also point out that this will only work once. You would need to implement the logic for creating a new audio stream once you are done with the current one (similar to what you can see in the original Android sample).
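
One possible way to handle that, just as a sketch (not the official sample): clear the saved audioDataStream once readAudioData() has drained it, so the next tap on the play button creates a fresh stream from a new result instead of trying to resume an exhausted one.

    private fun readAudioData() {
        val buffer = ByteArray(2400)
        while (!stopped) {
            val len = audioDataStream!!.readData(buffer)
            if (len == 0L) {
                // Stream fully consumed: close it and drop the reference so the
                // next playback request starts a fresh synthesis.
                audioDataStream!!.close()
                audioDataStream = null
                break
            }
            audioTrack!!.write(buffer, 0, len.toInt())
        }
    }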

pankopon commented 2 years ago

There will be a new public sample for Android that demonstrates how to pause an audio stream generated by TTS, expected to be available at the time of the Speech SDK 1.24.0 release.

Internal work item ref. 4454557.

satish-osi commented 2 years ago

Thank you @pankopon. This feature should add a lot of value to mobile applications.

pankopon commented 2 years ago

To be closed when the Speech SDK 1.24.0 release and updated samples (@jpalvarezl) are available.

pankopon commented 2 years ago

Closed as the Speech SDK 1.24.0 has been released and samples are available (https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/kotlin/android/tts-pause-example).