alphacep / vosk-android-demo

Offline speech recognition for Android with Vosk library.
Apache License 2.0

Playback Capture #3

Closed MuhammadRashid closed 5 years ago

MuhammadRashid commented 5 years ago

Hi Nickolay, first of all, I really appreciate such a great effort.

I am using your demo app for continuous speech recognition with the microphone. I get about 80% accuracy when transcribing audio from a video or audio clip playing on the Android device or nearby. For in-person conversation, however, the accuracy is much poorer. How can I overcome this problem?

I have a few other questions. Could you please address them?

  1. Does accent play a role? How can a specific accent, such as North American English, be used?
  2. Is it possible to use audio input without a microphone? Suppose we capture audio from other apps internally, without the microphone: how can we feed this captured data (audio buffers) into the Kaldi Android app and get the transcription back?

Kind regards, Muhammad Rashid

nshmyrev commented 5 years ago

Does accent play a role? How can a specific accent, such as North American English, be used?

Accuracy tuning for a mobile device is complicated and might require analysis of the data, training of the model, and so on. The current model is fairly basic and optimized for real-time use; more advanced models could be more accurate. Also, real-time conversations are hard to transcribe, much harder than broadcast or dictation.

Is it possible to use audio input without a microphone? Suppose we capture audio from other apps internally, without the microphone: how can we feed this captured data (audio buffers) into the Kaldi Android app and get the transcription back?

Yes, it is demonstrated in the code, see

https://github.com/alphacep/kaldi-android-demo/blob/e4053d67ff626f0e24e5e36fdcf1360c67b6199d/app/src/main/java/org/kaldi/demo/KaldiActivity.java#L235

https://github.com/alphacep/kaldi-android-demo/blob/e4053d67ff626f0e24e5e36fdcf1360c67b6199d/app/src/main/java/org/kaldi/demo/KaldiActivity.java#L115

MuhammadRashid commented 5 years ago

Thank you very much for your kind response.

Yes, I found the code you referred to and went through it again (previously I thought it only worked for an audio file already kept in the raw/assets folder).

KaldiRecognizer rec = new KaldiRecognizer(activityReference.get().model);
InputStream ais = ... // any audio source: another app's playback (e.g. YouTube), a file on the SD card or in the gallery, or an assets/raw resource inside the app
if (ais.skip(44) != 44) {          // skip the 44-byte WAV header
    return "";
}
byte[] b = new byte[4096];
int nbytes;
while ((nbytes = ais.read(b)) >= 0) {
    rec.AcceptWaveform(b, nbytes);
}

So it means we can feed the recognizer an input-stream buffer from any audio that is playing, either inside the current app or captured silently from other apps (YouTube, etc.) without the microphone, by using Android's AudioPlaybackCapture API (Android 10 and later only) or other third-party APIs, roughly as in the sketch below.
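
Something like this is what I have in mind (just a hypothetical sketch, not taken from the demo): an AudioRecord backed by an AudioPlaybackCaptureConfiguration on Android 10+, feeding its PCM buffers to the same rec recognizer created above. Here mediaProjection is a MediaProjection obtained elsewhere, capturing is a volatile stop flag, and the RECORD_AUDIO permission plus the screen-capture consent are assumed.

AudioPlaybackCaptureConfiguration config =
        new AudioPlaybackCaptureConfiguration.Builder(mediaProjection)
                .addMatchingUsage(AudioAttributes.USAGE_MEDIA)   // capture other apps' media playback
                .build();

AudioRecord record = new AudioRecord.Builder()
        .setAudioPlaybackCaptureConfig(config)
        .setAudioFormat(new AudioFormat.Builder()
                .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
                .setSampleRate(16000)                            // the model expects 16 kHz mono PCM
                .setChannelMask(AudioFormat.CHANNEL_IN_MONO)
                .build())
        .setBufferSizeInBytes(4096)
        .build();

record.startRecording();
byte[] buffer = new byte[4096];
int nread;
while (capturing && (nread = record.read(buffer, 0, buffer.length)) > 0) {
    rec.AcceptWaveform(buffer, nread);                           // feed each captured chunk
}
Log.d("PlaybackCapture", rec.FinalResult());
record.stop();
record.release();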

Kindly confirm?

nshmyrev commented 5 years ago

Correct, but for longer files the processing should be a bit different: it should use the voice activity decision returned by AcceptWaveform(), as the Python API does. When AcceptWaveform() returns true an utterance has just ended and Result() holds its final text; otherwise PartialResult() holds the running hypothesis:

while ((nbytes = ais.read(b)) >= 0) {
    if (rec.AcceptWaveform(b, nbytes))
        System.out.println(rec.Result());
    else
        System.out.println(rec.PartialResult());
}
System.out.println(rec.FinalResult());

MuhammadRashid commented 5 years ago

Okay, thanks a lot.

nshmyrev commented 5 years ago

Feel free to reopen

MuhammadRashid commented 5 years ago

Feel free to reopen

I am stuck with a scenario and need your help. I can get a byte array of audio data continuously from playback; each time I receive 1024 bytes. How can I pass this data continuously to the Kaldi Android speech API in order to transcribe it? Suppose I am getting the audio data inside the app by using Android's Visualizer class.

nshmyrev commented 5 years ago

I can get a byte array of audio data continuously from playback; each time I receive 1024 bytes. How can I pass this data continuously to the Kaldi Android speech API in order to transcribe it? Suppose I am getting the audio data inside the app by using Android's Visualizer class.

In the constructor or some other init code:

Model model = new Model(assetDir.toString() + "/model-android");
KaldiRecognizer rec = new KaldiRecognizer(model);

When you receive the bytes:

if (rec.AcceptWaveform(b, nbytes)) {
    Log.d(TAG, rec.Result());        // an utterance just ended
} else {
    Log.d(TAG, rec.PartialResult()); // running hypothesis
}
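
Putting the pieces together, a minimal wrapper might look like the sketch below (hypothetical; StreamTranscriber and onAudioChunk are made-up names). Note that the recognizer here expects 16 kHz, 16-bit mono PCM, while Visualizer delivers 8-bit waveform data, so a capture source such as AudioRecord with AudioPlaybackCapture is probably a better fit, or the buffers need to be converted first.

// Hypothetical helper: create the model and recognizer once,
// then feed every chunk the playback source delivers.
class StreamTranscriber {
    private static final String TAG = "StreamTranscriber";
    private final KaldiRecognizer rec;

    StreamTranscriber(Model model) {
        // e.g. model = new Model(assetDir.toString() + "/model-android")
        rec = new KaldiRecognizer(model);
    }

    // Call from the capture callback; b holds nbytes of 16 kHz 16-bit mono PCM.
    void onAudioChunk(byte[] b, int nbytes) {
        if (rec.AcceptWaveform(b, nbytes)) {
            Log.d(TAG, rec.Result());        // an utterance just ended
        } else {
            Log.d(TAG, rec.PartialResult()); // running hypothesis
        }
    }

    // Call once when the stream stops, to flush the last utterance.
    void onStreamEnded() {
        Log.d(TAG, rec.FinalResult());
    }
}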