Closed: harsha-osi closed this issue 2 years ago.
Have you had an opportunity to look at the Java Android sample? From the description it sounds like the SynthesisCompleted and WordBoundary events would be helpful.
Yes, but during audio playback we don't get any callback. Even if we call .pause() and then .play(), it doesn't resume correctly. [WordBoundary] fires for each word of the given input as it is sent to the service. [SynthesisCompleted] fires automatically once all the words have been sent and processing is complete. So there is no way to detect where playback was stopped. Please let me know if you need any more information. PFA the example code.
//
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE.md file in the project root for full license information.
//
package com.microsoft.cognitiveservices.speech.samples.speechsynthesis;
import androidx.appcompat.app.AppCompatActivity;
import androidx.core.app.ActivityCompat;
import android.graphics.Color;
import android.media.AudioAttributes;
import android.media.AudioDeviceCallback;
import android.media.AudioFormat;
import android.media.AudioManager;
import android.media.AudioPlaybackCaptureConfiguration;
import android.media.AudioTrack;
import android.os.Bundle;
import android.provider.MediaStore;
import android.text.Spannable;
import android.text.method.ScrollingMovementMethod;
import android.text.style.ForegroundColorSpan;
import android.util.Log;
import android.view.View;
import android.widget.EditText;
import android.widget.TextView;
import com.microsoft.cognitiveservices.speech.AudioDataStream;
import com.microsoft.cognitiveservices.speech.Connection;
import com.microsoft.cognitiveservices.speech.KeywordRecognitionEventArgs;
import com.microsoft.cognitiveservices.speech.PropertyId;
import com.microsoft.cognitiveservices.speech.SpeechConfig;
import com.microsoft.cognitiveservices.speech.SpeechSynthesisCancellationDetails;
import com.microsoft.cognitiveservices.speech.SpeechSynthesisOutputFormat;
import com.microsoft.cognitiveservices.speech.SpeechSynthesisResult;
import com.microsoft.cognitiveservices.speech.SpeechSynthesizer;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;
import java.io.File;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import static android.Manifest.permission.INTERNET;
public class MainActivity extends AppCompatActivity {
// Replace below with your own subscription key
private static String speechSubscriptionKey = "";
// Replace below with your own service region (e.g., "westus").
private static String serviceRegion = "";
private SpeechConfig speechConfig;
private SpeechSynthesizer synthesizer;
private Connection connection;
private AudioTrack audioTrack;
private TextView outputMessage;
// Runnable that consumes the text to be converted to speech
private SpeakingRunnable speakingRunnable;
private ExecutorService singleThreadExecutor;
private final Object synchronizedObj = new Object();
private boolean stopped = false;
private boolean paused = false;
@Override
protected void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.activity_main);
// Note: we need to request the permissions
int requestCode = 5; // Unique code for the permission request
ActivityCompat.requestPermissions(MainActivity.this, new String[]{INTERNET}, requestCode);
singleThreadExecutor = Executors.newSingleThreadExecutor();
speakingRunnable = new SpeakingRunnable();
outputMessage = this.findViewById(R.id.outputMessage);
outputMessage.setMovementMethod(new ScrollingMovementMethod());
audioTrack = new AudioTrack(
new AudioAttributes.Builder()
.setUsage(AudioAttributes.USAGE_MEDIA)
.setContentType(AudioAttributes.CONTENT_TYPE_SPEECH)
.build(),
new AudioFormat.Builder()
.setEncoding(AudioFormat.ENCODING_PCM_16BIT)
.setSampleRate(24000)
.setChannelMask(AudioFormat.CHANNEL_OUT_MONO)
.build(),
AudioTrack.getMinBufferSize(
24000,
AudioFormat.CHANNEL_OUT_MONO,
AudioFormat.ENCODING_PCM_16BIT) * 2,
AudioTrack.MODE_STREAM,
AudioManager.AUDIO_SESSION_ID_GENERATE);
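// Start the track and immediately pause it so it is ready; data written later plays once play() is called in SpeakingRunnable.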
audioTrack.play();
audioTrack.pause();
}
@Override
protected void onDestroy() {
super.onDestroy();
// Release speech synthesizer and its dependencies
if (synthesizer != null) {
synthesizer.close();
connection.close();
}
if (speechConfig != null) {
speechConfig.close();
}
if (audioTrack != null) {
singleThreadExecutor.shutdownNow();
audioTrack.flush();
audioTrack.stop();
audioTrack.release();
}
}
public void onCreateSynthesizerButtonClicked(View v) {
paused=false;
if (synthesizer != null) {
speechConfig.close();
synthesizer.close();
connection.close();
}
// Reuse the synthesizer to lower the latency.
// I.e. create one synthesizer and speak many times using it.
clearOutputMessage();
updateOutputMessage("Initializing synthesizer...\n");
speechConfig = SpeechConfig.fromSubscription(speechSubscriptionKey, serviceRegion);
// Use 24k Hz format for higher quality.
speechConfig.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Raw24Khz16BitMonoPcm);
// Set voice name.
speechConfig.setSpeechSynthesisVoiceName("en-US-JennyNeural");
/*AudioConfig audioConfig = AudioConfig.fromDefaultSpeakerOutput();*/
// Audio is written to the AudioTrack manually in SpeakingRunnable, so no audio output config is passed here.
synthesizer = new SpeechSynthesizer(speechConfig, null);
connection = Connection.fromSpeechSynthesizer(synthesizer);
Locale current = getResources().getConfiguration().locale;
connection.connected.addEventListener((o, e) -> {
updateOutputMessage("Connection established.\n");
});
connection.disconnected.addEventListener((o, e) -> {
updateOutputMessage("Disconnected.\n");
});
/*connection.messageReceived.addEventListener();*/
synthesizer.SynthesisStarted.addEventListener((o, e) -> {
updateOutputMessage(String.format(current,
"Synthesis started. Result Id: %s.\n",
e.getResult().getResultId()));
e.close();
});
synthesizer.Synthesizing.addEventListener((o, e) -> {
updateOutputMessage(String.format(current,
"Synthesizing. received %d bytes.\n",
e.getResult().getAudioLength()));
e.close();
});
synthesizer.SynthesisCompleted.addEventListener((o, e) -> {
updateOutputMessage("Synthesis finished.\n");
updateOutputMessage("\tFirst byte latency: " + e.getResult().getProperties().getProperty(PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs) + " ms.\n");
updateOutputMessage("\tFinish latency: " + e.getResult().getProperties().getProperty(PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs) + " ms.\n");
e.close();
});
synthesizer.SynthesisCanceled.addEventListener((o, e) -> {
String cancellationDetails =
SpeechSynthesisCancellationDetails.fromResult(e.getResult()).toString();
updateOutputMessage("Error synthesizing. Result ID: " + e.getResult().getResultId() +
". Error detail: " + System.lineSeparator() + cancellationDetails +
System.lineSeparator() + "Did you update the subscription info?\n",
true, true);
e.close();
});
synthesizer.WordBoundary.addEventListener((o, e) -> {
updateOutputMessage(String.format(current,
"Word boundary. Text offset %d, length %d; audio offset %d ms.\n",
e.getTextOffset(),
e.getWordLength(),
e.getAudioOffset() / 10000));
});
/* speechRecognizer.recognized.addEventListener((o, speechRecognitionEventArgs) -> {
parsePhrase(speechRecognitionEventArgs.getResult().getText());
showUiAppIsListening(false);
}*/
}
public void onPreConnectButtonClicked(View v) {
// This method pre-establishes the connection to the service to lower latency.
// It is useful when you will need to synthesize audio shortly, but the text is not yet
// available. E.g. for a speech bot, you can warm up the TTS connection while the user
// is speaking, then call speak() when the dialogue utterance is ready.
if (connection == null) {
updateOutputMessage("Please initialize the speech synthesizer first\n", true, true);
return;
}
connection.openConnection(true);
updateOutputMessage("Opening connection.\n");
}
public void onSpeechButtonClicked(View v) {
if(paused){
audioTrack.play();
}else{
clearOutputMessage();
if (synthesizer == null) {
updateOutputMessage("Please initialize the speech synthesizer first\n", true, true);
return;
}
EditText speakText = this.findViewById(R.id.speakText);
speakingRunnable.setContent(speakText.getText().toString());
singleThreadExecutor.execute(speakingRunnable);
}
}
public void onStopButtonClicked(View v) {
if (synthesizer == null) {
updateOutputMessage("Please initialize the speech synthesizer first\n", true, true);
return;
}
stopSynthesizing();
}
class SpeakingRunnable implements Runnable {
private String content;
public void setContent(String content) {
this.content = content;
}
@Override
public void run() {
try {
audioTrack.play();
synchronized (synchronizedObj) {
stopped = false;
}
SpeechSynthesisResult result = synthesizer.StartSpeakingTextAsync(content).get();
AudioDataStream audioDataStream = AudioDataStream.fromResult(result);
// Set the chunk size to 50 ms. 24000 * 16 * 0.05 / 8 = 2400
byte[] buffer = new byte[2400];
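// Note: audioTrack.write() below blocks once the track's internal buffer is full, so this
// loop is paced by playback; while the track is paused, write() blocks until play() is called again.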
while (!stopped) {
long len = audioDataStream.readData(buffer);
if (len == 0) {
break;
}
audioTrack.write(buffer, 0, (int) len);
}
//File editionImagesDir = new File(getExternalFilesDir(null), "/edition-thumbnails1.amr");
//AudioConfig.fromWavFileOutput(editionImagesDir.getAbsolutePath().toString());
//audioDataStream.saveToWavFile(editionImagesDir.getAbsolutePath().toString());
//audioDataStream.saveToWavFileAsync(editionImagesDir.getAbsolutePath().toString());
audioDataStream.close();
} catch (Exception ex) {
Log.e("Speech Synthesis Demo", "unexpected " + ex.getMessage());
ex.printStackTrace();
assert(false);
}
}
}
private void stopSynthesizing() {
if (synthesizer != null) {
synthesizer.StopSpeakingAsync();
}
if (audioTrack != null) {
synchronized (synchronizedObj) {
// stopped = true;
}
audioTrack.pause();
paused = true;
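// stopped is left false, so SpeakingRunnable keeps feeding the track; pause() retains the buffered audio,
// and playback resumes from the same position when play() is called in onSpeechButtonClicked().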
//audioTrack.flush();
}
}
private void updateOutputMessage(String text) {
updateOutputMessage(text, false, true);
}
private synchronized void updateOutputMessage(String text, boolean error, boolean append) {
this.runOnUiThread(() -> {
if (append) {
outputMessage.append(text);
} else {
outputMessage.setText(text);
}
if (error) {
Spannable spannableText = (Spannable) outputMessage.getText();
spannableText.setSpan(new ForegroundColorSpan(Color.RED),
spannableText.length() - text.length(),
spannableText.length(),
0);
}
});
}
private void clearOutputMessage() {
updateOutputMessage("", false, false);
}
}
I'm going to need to talk to some of my team members to get some additional ideas.
I also edited the code to remove the keys from it. You should rotate the ones that were in the message.
Noted @rhurey , thank you.
Hi @rhurey, I'm sharing exactly what we are looking for in this SDK.
For example, say we have 10 sentences containing n words in total. We click the play button, 4 sentences finish speaking, and we are in the middle of the 5th sentence when we click the stop/pause button. In this case, is it possible to let the user continue from the middle of the 5th sentence where playback was paused? We use the same SDK on our website, where the player has callbacks for play, pause, resume, and stop. We are now trying to implement the same in our Android and iOS applications, but we are blocked on the case above.
Please let us know if you need any additional information. We are glad to help.
Example:
SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, null);
synthesizer.WordBoundary.addEventListener((o, e) -> {
// Unit of AudioOffset and Duration is 1 tick = 100 nanoseconds
System.out.println("Word \"" + e.getText() + "\", offset: " + e.getAudioOffset() / 10000.0 + "ms, duration: " + e.getDuration() / 10000.0 + "ms");
});
String text = "what's the weather like";
SpeechSynthesisResult result = synthesizer.SpeakTextAsync(text).get();
if (result.getReason() == ResultReason.SynthesizingAudioCompleted) {
AudioDataStream audioDataStream = AudioDataStream.fromResult(result);
byte[] buffer = new byte[16000];
long totalSize = 0;
long filledSize = audioDataStream.readData(buffer);
while (filledSize > 0) {
double filledLengthMs = filledSize / 2.0 / 16; // 16-bit, 16 kHz
System.out.println("Read " + filledSize + " bytes (" + filledLengthMs + "ms)");
totalSize += filledSize;
filledSize = audioDataStream.readData(buffer);
}
double totalLengthMs = totalSize / 2.0 / 16; // 16-bit, 16 kHz
System.out.println("Total " + totalSize + " bytes (" + totalLengthMs + "ms) for text \"" + text + "\"");
audioDataStream.close();
}
Output:
Word "what's", offset: 50.0ms, duration: 312.5ms
Word "the", offset: 375.0ms, duration: 112.5ms
Word "weather", offset: 500.0ms, duration: 312.5ms
Word "like", offset: 825.0ms, duration: 487.5ms
Read 16000 bytes (500.0ms)
Read 16000 bytes (500.0ms)
Read 16000 bytes (500.0ms)
Read 16000 bytes (500.0ms)
Read 4402 bytes (137.5625ms)
Total 68402 bytes (2137.5625ms) for text "what's the weather like"
Word boundary events give you the offset and duration of each word in the synthesis output. There is no other way at the moment. With this information, if you play audio from the result data stream and pause at X bytes / Y ms, you should be able to determine which word was the last one played (fully or partially). See the JavaScript example for matching the current playback time to word boundaries: https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/4af1b8e7f2ff0c47a12d627a0b7a4bb37b7b4063/samples/js/browser/synthesis.html#L276
To know the playback position at the time of pause, use a playback API method that returns the current position if one is available, or maintain this info in the application based on the total bytes fed into the playback API (i.e. via audioTrack.write). In any case, you first need to synthesize all the audio you want to play at a time, then use the data from the result AudioDataStream as input to the player. Currently there are no "player helpers" like those in the mentioned JavaScript sample, but we may add them in the future.
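To make that concrete, here is a rough sketch in Java (assuming the 24 kHz, 16-bit mono audioTrack from the code earlier in this thread; the wordOffsetsMs/words lists and the lastWordPlayed() helper are illustrative names, not SDK APIs):
// Sketch: record each word boundary during synthesis, then map the AudioTrack
// playback position to the last word that started playing at pause time.
private final java.util.List<Long> wordOffsetsMs = new java.util.ArrayList<>();
private final java.util.List<String> words = new java.util.ArrayList<>();
private void hookWordBoundary(SpeechSynthesizer synthesizer) {
    synthesizer.WordBoundary.addEventListener((o, e) -> {
        // AudioOffset is in ticks (1 tick = 100 ns); convert to milliseconds.
        wordOffsetsMs.add(e.getAudioOffset() / 10000);
        words.add(e.getText());
    });
}
private String lastWordPlayed() {
    // getPlaybackHeadPosition() returns frames; 24 kHz mono PCM means 24 frames per ms.
    long playedMs = audioTrack.getPlaybackHeadPosition() * 1000L / 24000;
    String last = null;
    for (int i = 0; i < wordOffsetsMs.size(); i++) {
        if (wordOffsetsMs.get(i) <= playedMs) {
            last = words.get(i);
        }
    }
    return last;
}
Calling lastWordPlayed() right after audioTrack.pause() in stopSynthesizing() would tell you roughly where playback stopped.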
Closed as answered, please open a new issue if more support is needed.
Thanks for the support! Play and pause are now working as expected. But while testing we hit one more scenario: once the given string has played completely, how do we reset and play the same text again? Please help us.
Hi, I'm currently looking to identify or track which part of the given input the synthesizer is currently reading for the audioTrack. Also, which callback is called once speaking is completed? Please help me solve this issue.