
Microphone buffer to Predictions.convert using React Native #6404

Open nathander opened 4 years ago

nathander commented 4 years ago

**Describe the bug**
When I send my microphone buffer to Transcribe using Predictions.convert, I only get empty strings back. I'm not sure whether this should be a bug report or a feature request. I'm guessing my audio buffer is formatted incorrectly for Predictions.convert, but the docs don't give enough information to verify that. This may be related to this open issue: https://github.com/aws-amplify/amplify-js/issues/4163

**To Reproduce**
Steps to reproduce the behavior:

  1. Follow the Amplify React-Native tutorial to this point: https://docs.amplify.aws/start/getting-started/nextsteps/q/integration/react-native
  2. Import this module to read the microphone stream: https://github.com/chadsmith/react-native-microphone-stream (the only one I've found that works with React Native)
  3. Convert the buffer using the pcmEncode function here: https://github.com/aws-samples/amazon-transcribe-websocket-static/blob/master/lib/audioUtils.js
  4. Send buffer using Predictions.convert as described here: https://docs.amplify.aws/lib/predictions/transcribe/q/platform/js#working-with-the-api
  5. Build app on Android phone (tested on Pixel 2). Verify that the app has microphone permissions. Press Start and talk into the microphone.

**Expected behavior**
Expected a transcription of the spoken text to be returned -- instead, only empty strings came back.

**Code Snippet**
My App.js is here:

import React, {useState} from 'react';
import {
  View,
  Text,
  StyleSheet,
  TextInput,
  Button,
  TouchableOpacity,
} from 'react-native';
import Amplify from 'aws-amplify';
import Predictions, {
  AmazonAIPredictionsProvider,
} from '@aws-amplify/predictions';
import awsconfig from './aws-exports';
// import LiveAudioStream from 'react-native-live-audio-stream';
// import AudioRecord from 'react-native-audio-record';
import MicStream from 'react-native-microphone-stream';

// from https://github.com/aws-samples/amazon-transcribe-websocket-static/tree/master/lib
const util_utf8_node = require('@aws-sdk/util-utf8-node'); // utilities for encoding and decoding UTF8
const marshaller = require('@aws-sdk/eventstream-marshaller'); // for converting binary event stream messages to and from JSON
const eventStreamMarshaller = new marshaller.EventStreamMarshaller(
  util_utf8_node.toUtf8,
  util_utf8_node.fromUtf8,
);

Amplify.configure(awsconfig);
Amplify.addPluggable(new AmazonAIPredictionsProvider());
const initialState = {name: '', description: ''};

global.Buffer = global.Buffer || require('buffer').Buffer;

function VoiceCapture() {
  const [text, setText] = useState('');

  // from https://github.com/aws-samples/amazon-transcribe-websocket-static/tree/master/lib
  function pcmEncode(input) {
    var offset = 0;
    var buffer = new ArrayBuffer(input.length * 2);
    var view = new DataView(buffer);
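    // clamp each float sample to [-1, 1] and write it as a signed 16-bit little-endian integer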
    for (var i = 0; i < input.length; i++, offset += 2) {
      var s = Math.max(-1, Math.min(1, input[i]));
      view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7fff, true);
    }
    return buffer;
  }

  async function transcribe(bytes) {
    await Predictions.convert({
      transcription: {
        source: {
          bytes,
        },
        language: 'en-US',
      },
    })
      .then(({transcription: {fullText}}) => console.log({fullText}))
      .catch((err) => console.log({err}));
  }

  var listener = MicStream.addListener((data) => {
    // console.log(data);

    // encode the mic input
    let pcmEncodedBuffer = pcmEncode(data);

    // // add the right JSON headers and structure to the message
    // let audioEventMessage = getAudioEventMessage(
    //   global.Buffer.from(pcmEncodedBuffer),
    // );

    // //convert the JSON object + headers into a binary event stream message
    // let binary = eventStreamMarshaller.marshall(audioEventMessage);

    // the docs say this takes a PCM audio byte buffer, so I assume the wrappers above aren't necessary. Tried them anyway with no luck.
    // (https://docs.amplify.aws/lib/predictions/transcribe/q/platform/js#set-up-the-backend)
    transcribe(pcmEncodedBuffer);
  });

  function startTranscribing() {
    MicStream.init({
      bufferSize: 4096 * 32, // tried multiplying this buffer size to send longer chunks - still no luck
      // sampleRate: 44100,
      sampleRate: 16000,
      bitsPerChannel: 16,
      channelsPerFrame: 1,
    });

    MicStream.start();
    console.log('Started mic stream');
  }

  function stopTranscribing() {
    MicStream.stop();
    listener.remove();
  }

  return (
    <View style={styles.container}>
      <View style={styles.horizontalView}>
        <TouchableOpacity
          style={styles.mediumButton}
          onPress={() => {
            // Voice.start('en_US');
            // transcribeAudio();
            startTranscribing();
          }}>
          <Text style={styles.mediumButtonText}>START</Text>
        </TouchableOpacity>

        <TouchableOpacity
          style={styles.mediumButton}
          onPress={() => {
            stopTranscribing();
          }}>
          <Text style={styles.mediumButtonText}>STOP</Text>
        </TouchableOpacity>
      </View>
      <TextInput
        style={styles.editableText}
        multiline
        onChangeText={(editedText) => setText(editedText)}>
        {text}
      </TextInput>
    </View>
  );
}

const App = () => {
  return (
    <View style={styles.container}>
      <VoiceCapture />
    </View>
  );
};

export const colors = {
  primary: '#0049bd',
  white: '#ffffff',
};

export const padding = {
  sm: 8,
  md: 16,
  lg: 24,
  xl: 32,
};
const styles = StyleSheet.create({
  container: {
    flex: 1,
    backgroundColor: 'white',
  },
  bodyText: {
    fontSize: 16,
    height: 20,
    fontWeight: 'normal',
    fontStyle: 'normal',
  },
  mediumButtonText: {
    fontSize: 16,
    height: 20,
    fontWeight: 'normal',
    fontStyle: 'normal',
    color: colors.white,
  },
  smallBodyText: {
    fontSize: 14,
    height: 18,
    fontWeight: 'normal',
    fontStyle: 'normal',
  },
  mediumButton: {
    alignItems: 'center',
    justifyContent: 'center',
    width: 132,
    height: 48,
    padding: padding.md,
    margin: 14,
    backgroundColor: colors.primary,
    fontSize: 20,
    fontStyle: 'normal',
    elevation: 1,
    shadowOffset: {width: 1, height: 1},
    shadowOpacity: 0.2,
    shadowRadius: 2,
    borderRadius: 2,
  },
  editableText: {
    textAlign: 'left',
    textAlignVertical: 'top',
    borderColor: 'black',
    borderWidth: 2,
    padding: padding.md,
    margin: 14,
    flex: 5,
    fontSize: 16,
    height: 20,
  },
  horizontalView: {
    flex: 1,
    flexDirection: 'row',
    alignItems: 'stretch',
    justifyContent: 'center',
  },
});

export default App;

ashika01 commented 4 years ago

@cedricgrothues thoughts on this? Is this related to what you were looking at?

cedricgrothues commented 4 years ago

@ashika01 – You're right, this is related to the docs issue. I'll add it to the contribution board and look into it.

cedricgrothues commented 4 years ago

> instead got only empty strings back.

@nathander – Are you getting any error message? I get a promise rejection with the following error: "The requested language doesn't support the specified sample rate. Use the correct sample rate then try again." (using 16000 as the sampleRate).

nathander commented 4 years ago

@cedricgrothues -- It's not throwing an error. I'm just getting back an empty string.

cedricgrothues commented 4 years ago

> I'm guessing my audio buffer is formatted incorrectly

@nathander – Quick update: you're right, the audio buffer is formatted incorrectly, but that's an issue with the data stream from the react-native-microphone-stream library. I'm looking into finding a fix for this issue and updating the documentation to avoid any further confusion.

nathander commented 4 years ago

Thanks for the update, @cedricgrothues. Is there another package you'd recommend for pulling the microphone stream?

ashika01 commented 4 years ago

@nathander, the only format the Transcribe streaming API supports as of now is PCM, which means the input somehow needs to be formatted that way. There doesn't seem to be any reliable streaming library that I could find (@cedricgrothues, could you comment on this?). Meanwhile, I have reached out to the AWS Transcribe team for some guidance.

cedricgrothues commented 4 years ago

Sorry for the late response, @nathander. While we wait for a response from the AWS Transcribe team, react-native-pcm-audio might be worth a look (disclaimer: I haven't tested the library myself, it only supports Android, and it was last updated in late 2017).

nathander commented 4 years ago

Hi @cedricgrothues - thanks, unfortunately I need it to work on iOS. Any updates from the Transcribe team?

mauerbac commented 4 years ago

hi @nathander - just brought this up with the team. At this time, we are working with the service team to support more format types. We will mark this as a feature request, and we will follow up with UI components to better support this.

nathander commented 4 years ago

Hi @mauerbac - I don't think I'm asking for support for more format types; I'm just trying to figure out how to generate an accepted format type in React Native. Is streaming the mic buffer to Amplify Predictions from React Native currently supported? If so, is there any code you could share?

ashika01 commented 4 years ago

@nathander The library you are using is the same one we are using for our docs work. We dug deep into this issue of the PCM buffer while writing our docs and while this issue was open. We are trying a couple of approaches, like finding a good library we could use to get it working, and forking the library code to make some changes ourselves, but there is no good way as of now. We will be looking deeper into this when updating the docs. But we feel the best solution might be for the Transcribe team to open up other audio formats for ease of use from the mobile side.

nathander commented 4 years ago

@ashika01 I appreciate the update -- thanks a lot for looking into this!

jeffsteinmetz commented 4 years ago

Similar issue. In Expo / React Native / Android, how would you specify .mp4 as the encoding / file type for AWS Transcribe via the JavaScript Amplify Predictions class (instead of PCM)?

https://docs.amplify.aws/lib/predictions/transcribe/q/platform/js#working-with-the-api

Expo on Android supports .mp4 AAC and not PCM.

According to the Transcribe docs, Transcribe supports FLAC, .mp3, .mp4 and .wav

https://docs.aws.amazon.com/transcribe/latest/dg/input.html

It wasn't clear where the docs are that show all of the available {"transcription" : ... } options

I have an Expo application that successfully records from the microphone and saves it locally as an .m4a file (on Android). It would be easiest if you could take a local .mp4, .m4a, or .wav file (and, if needed, ingest it into a buffer) and send it with a config stating the file type, or better yet, give Amplify for React Native the ability to transcribe directly from a local file (reading and buffering behind the scenes).

expo snippet


import { Audio } from 'expo-av';

const recordingOptions = {
    android: {
        extension: '.m4a',
        outputFormat: Audio.RECORDING_OPTION_ANDROID_OUTPUT_FORMAT_MPEG_4,
        audioEncoder: Audio.RECORDING_OPTION_ANDROID_AUDIO_ENCODER_AAC,
        sampleRate: 44100,
        numberOfChannels: 2,
        bitRate: 128000,
    },
    ios: {
        extension: '.wav',
        audioQuality: Audio.RECORDING_OPTION_IOS_AUDIO_QUALITY_HIGH,
        sampleRate: 44100,
        numberOfChannels: 1,
        bitRate: 128000,
        linearPCMBitDepth: 16,
        linearPCMIsBigEndian: false,
        linearPCMIsFloat: false,
    },
};

etc...
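
For context, here is a minimal sketch of how options like these are typically passed to expo-av's Audio.Recording to produce the local file (the recordToFile helper name is just for illustration, not from my app):

import { Audio } from 'expo-av';

async function recordToFile() {
  // ask for microphone permission and allow recording on iOS
  await Audio.requestPermissionsAsync();
  await Audio.setAudioModeAsync({ allowsRecordingIOS: true, playsInSilentModeIOS: true });

  // feed the recordingOptions from above into a new Recording instance
  const recording = new Audio.Recording();
  await recording.prepareToRecordAsync(recordingOptions);
  await recording.startAsync();

  // ... record for a while, then stop and return the local file URI (.m4a on Android)
  await recording.stopAndUnloadAsync();
  return recording.getURI();
}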

cedricgrothues commented 4 years ago

@jeffsteinmetz – Amplify uses the Transcribe Streaming web socket API instead of Transcribe's REST API, and that, as of now, only supports PCM. https://github.com/aws-amplify/amplify-js/blob/abf8e824308c229e09e585a7995d17a51f36c652/packages/predictions/src/Providers/AmazonAIConvertPredictionsProvider.ts#L398

For reference, the Transcribe Streaming docs

jeffsteinmetz commented 4 years ago

@cedricgrothues Ahh! Gotcha. Looking at https://github.com/aws-amplify/amplify-js/blob/abf8e824308c229e09e585a7995d17a51f36c652/packages/predictions/src/Providers/AmazonAIConvertPredictionsProvider.ts#L401

These docs https://docs.amplify.aws/lib/predictions/transcribe/q/platform/js#set-up-the-backend don't mention the format or sample rate it expects, so that may cause some confusion for devs using Predictions.convert.

It will throw `Source types other than byte source are not supported.` if you send it a JavaScript object containing just the raw binary data. It's not clear what it is expecting (or what type, as it relates to TypeScript).

Looking at the test https://github.com/aws-amplify/amplify-js/blob/abf8e824308c229e09e585a7995d17a51f36c652/packages/predictions/__tests__/Providers/AWSAIConvertPredictionsProvider-unit-test.ts#L114

It appears to expect something like source: { bytes: ...

    Predictions.convert({
      transcription: {
        source: {
          bytes: new Buffer([0, 1, 2])
        },
        // language: "en-US", // other options are "en-GB", "fr-FR", "fr-CA", "es-US"
      }
    })

An example of how to call it with a JavaScript object would be beneficial.
(The test also appears to use a type of "Buffer" from a lib which isn't imported by default.)

It also throws an error (note: I do not reference Buffer in my code).

[Unhandled promise rejection: ReferenceError: Can't find variable: Buffer]

Stack trace:
  http://192.168.1.105:19001/node_modules/expo/AppEntry.bundle?platform=ios&dev=true&minify=false&hot=false:175114:68 in <unknown>
  node_modules/promise/setimmediate/core.js:45:6 in tryCallTwo
  node_modules/promise/setimmediate/core.js:200:22 in doResolve
  node_modules/promise/setimmediate/core.js:66:11 in Promise
  node_modules/@aws-amplify/predictions/lib-esm/Providers/AmazonAIConvertPredictionsProvider.js:270:8 in 
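
For reference, here is a minimal sketch of the call shape the test implies, assuming 16 kHz, 16-bit, mono, little-endian PCM samples and pulling Buffer from the 'buffer' package (the transcribePcm helper and pcmBytes argument are just for illustration; untested):

import Predictions from '@aws-amplify/predictions';
import { Buffer } from 'buffer'; // React Native has no global Buffer, so take it from the 'buffer' package

// pcmBytes: an ArrayBuffer of 16 kHz, 16-bit, mono, little-endian PCM samples
async function transcribePcm(pcmBytes) {
  try {
    const { transcription } = await Predictions.convert({
      transcription: {
        source: { bytes: Buffer.from(pcmBytes) },
        language: 'en-US',
      },
    });
    console.log(transcription.fullText);
  } catch (err) {
    console.log({ err });
  }
}
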
cedricgrothues commented 4 years ago

@jeffsteinmetz – You're right, the docs are still missing a Predictions.convert example, but as far as I know, that's because there is currently no library that reliably supports either streaming or recording a PCM buffer from react-native.

@ashika01's comment sounds promising, though:

> But we feel the best solution might be for the Transcribe team to open up other audio formats for ease of use from the mobile side.

meherhowji commented 2 years ago

@nathander Were you able to reliably use a streaming or PCM buffer package on react native that worked well for you?

@ashika01 did you later use a library that worked well for your docs? I am working on a similar problem statement and am stuck on finding a good mic audio buffer streaming library. Any recommendations?

Thanks in advance :)

nathander commented 2 years ago

@meherranjan I ended up abandoning React Native for my project and building in Java and Swift instead.

ashirkhan94 commented 1 year ago

Hi team, same issue for me. Any updates?