aws / amazon-kinesis-video-streams-parser-library

The Amazon Kinesis Video Streams parser library is for developers to include in their applications; it makes it easy to work with the output of video streams, such as retrieving frame-level objects, metadata for fragments, and more.
Apache License 2.0

How to send audio to AWS Transcribe #97

Closed ajhool closed 4 years ago

ajhool commented 4 years ago

We are currently using the gstreamer plugin to stream audio + video into KVS. We can see the video + hear the audio in the KVS AWS console / dashboard. We can also see that the audio has the trackname: "audio". ( https://github.com/awslabs/amazon-kinesis-video-streams-producer-sdk-cpp/issues/465 )

The gstreamer command we used:

    gst-launch-1.0 -v v4l2src device=/dev/video0 ! videoconvert ! video/x-raw,width=640,height=480,framerate=30/1,format=I420 ! x264enc bframes=0 key-int-max=45 bitrate=500 tune=zerolatency ! h264parse ! video/x-h264,stream-format=avc,alignment=au,profile=baseline ! kvssink name=sink stream-name="testing-kvs-to-transcribe" alsasrc device=hw:1,0 ! audioconvert ! avenc_aac ! queue ! sink.

Our pipeline is: gstreamer (or KVS Producer SDK) -> AWS Lambda with KVS streams parser library -> AWS Transcribe + AWS Rekognition

For Transcribe, the closest ref arch that we could find was: https://github.com/amazon-connect/amazon-connect-realtime-transcription. That ref arch uses AWS Connect as the audio input, whereas we are using KVS Producer.

AWS Transcribe expects 16-bit PCM as input. We believe that our KVS producer is uploading AAC. Does this library contain any convenience functions to convert audio to PCM? Are there any suggestions for converting the audio to a format that is compatible with Transcribe?

bkneff commented 4 years ago

Hello,

Your gstreamer pipeline does indicate that you are using AAC audio. You have a couple of options available.

  1. You can modify your gstreamer pipeline to send PCM audio. However, this will probably break HLS playback as HLS expects AAC audio. (HLS is used for playback in the Console.)
  2. You can add a transcode step in your lambda function to convert from AAC to PCM.
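For option 2, a minimal sketch of what the Lambda-side transcode step could look like — assuming an `ffmpeg` binary is available to the function (e.g. via a Lambda layer); the file paths and helper names here are placeholders, not part of this library:

```python
# Hypothetical helper for option 2: shell out to ffmpeg (assumed to be
# available to the Lambda, e.g. bundled as a layer) to transcode the AAC
# track to the raw 16-bit little-endian PCM that Transcribe expects.
import subprocess

def build_ffmpeg_cmd(src_path, dst_path, sample_rate=16000):
    """Build an ffmpeg invocation: AAC in, signed 16-bit mono PCM out."""
    return [
        "ffmpeg", "-y",
        "-i", src_path,           # AAC audio extracted from the KVS fragment
        "-vn",                    # drop any video track
        "-acodec", "pcm_s16le",   # 16-bit little-endian PCM samples
        "-ar", str(sample_rate),  # resample (16 kHz is typical for Transcribe)
        "-ac", "1",               # downmix to mono
        "-f", "s16le",            # raw PCM output, no container
        dst_path,
    ]

def transcode_aac_to_pcm(src_path, dst_path):
    """Run the transcode; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(src_path, dst_path), check=True)
```

The resulting raw `s16le` file (or pipe) can then be chunked and fed to Transcribe's streaming API.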
ajhool commented 4 years ago

@bkneff thanks. Does the parser library provide any convenience tools, like an ffmpeg wrapper, for pulling out that PCM stream?

bkneff commented 4 years ago

@ajhool We have an example that will save the stream into an MKV file using gstreamer. You could then pipe it through ffmpeg to extract the audio, or even modify the gstreamer pipeline to output the audio in the desired format. The example to save the file as an MKV can be found here: https://github.com/aws/amazon-kinesis-video-streams-parser-library/blob/master/src/test/java/com/amazonaws/kinesisvideo/parser/utilities/consumer/MergedOutputPiperTest.java

I realize this isn't exactly what you are looking for, but it may help you achieve your goals.

ajhool commented 4 years ago

@bkneff Thanks for the idea. Without a solid understanding of the inputs and outputs, we find that example a little hard to follow. It appears to be merging the many MKV "chunks" that KVS is producing/parsing into a single MKV?

After a stream has completed, is there a simple way to retrieve the MKV chunks that KVS produced during the stream, i.e. are they saved to S3 or anything? KVS has proven difficult to work with because there is never a "ground truth" that we can build pipelines around, so we have to run a live stream every time and then debug in real time. Is the easiest way to retrieve these chunks to use the StreamParser library to visit each MKV chunk, save it to a tmp local file, then put that to S3?

MushMal commented 4 years ago

@ajhool if you have persistence enabled on your stream then you could retrieve the clip with https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/API_reader_GetClip.html

ajhool commented 4 years ago

It looks like GetClip is retrieving an MP4. We are seeking to understand the MKV chunks because we will be performing this processing in real time. MP4 seems like a post-processing step while MKV chunks are how KVS is sending the files down.

Would GetMedia be a better API call? If we call GetMedia with the "Earliest" start selector[1], would that pull down the first MKV chunk that was available in our stream? Are those MKV chunks being recorded anywhere? Ideally we could just go to an S3 file and click download.

[1] https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/API_dataplane_StartSelector.html

MushMal commented 4 years ago

Indeed, the GetClip API is an "Offline" API - it operates on data that's already been persisted. HLS or MPEG-DASH are "Live" APIs - they operate at the granularity of a fragment, whereas GetMedia can operate in "Realtime" - at a byte granularity. When selecting Earliest as a selector, the GetMedia API will start retrieving the fragment bits from the earliest fragment and operate "faster-than-realtime" to catch up with the "head" - that is, if your processing and networking allow it to move fast enough to catch up.

MKV is the underlying packaging format we use for the stream. Internal storage provides the durability guarantees (listed in the docs) but it's not exposed publicly as its implementation specific. In order for you to persist the fragments in S3, you will need to fetch them and persist them directly - again, the main question is whether you actually need those to be stored in S3 (but it's application specific).
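To illustrate the "fetch them and persist them directly" route, a hedged sketch: pull the stream with GetMedia, split the byte stream at each MKV/EBML header (each fragment returned by GetMedia starts a fresh MKV stream), and put each chunk to S3. Bucket and stream names are placeholders, and the naive 4-byte scan below can in principle false-positive on payload bytes — a real parser (like this library) walks element sizes instead:

```python
# Sketch: archive GetMedia MKV chunks to S3.
EBML_HEADER = b"\x1a\x45\xdf\xa3"  # magic bytes opening each MKV stream

def split_mkv_chunks(data):
    """Split a GetMedia payload at each EBML header (naive byte scan)."""
    starts, i = [], data.find(EBML_HEADER)
    while i != -1:
        starts.append(i)
        i = data.find(EBML_HEADER, i + 1)
    return [data[s:e] for s, e in zip(starts, starts[1:] + [len(data)])]

def archive_stream_to_s3(stream_name, bucket):
    import boto3  # requires AWS credentials; not exercised in this sketch
    kv = boto3.client("kinesisvideo")
    endpoint = kv.get_data_endpoint(
        StreamName=stream_name, APIName="GET_MEDIA")["DataEndpoint"]
    media = boto3.client("kinesis-video-media", endpoint_url=endpoint)
    resp = media.get_media(StreamName=stream_name,
                           StartSelector={"StartSelectorType": "EARLIEST"})
    s3 = boto3.client("s3")
    # Reading the whole payload at once is only workable for short streams;
    # a real consumer would process the streaming body incrementally.
    for n, chunk in enumerate(split_mkv_chunks(resp["Payload"].read())):
        s3.put_object(Bucket=bucket, Key=f"{stream_name}/chunk-{n:06d}.mkv",
                      Body=chunk)
```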

Some of the customers which already have workloads operating from S3 do require persistence to S3 but later they migrate to using KVS directly.

MushMal commented 4 years ago

Resolving this issue. Please feel free to reopen or cut a new one

ajhool commented 4 years ago

I certainly don't feel equipped to process audio in real-time KVS streams using this parser library, these responses, or the examples...

The parser library itself is quite complicated; even basic tasks like retrieving metadata are really involved and undocumented. It's proving to be really, really difficult to build a development/debug environment that would allow our code to simulate a live KVS stream so that we can iterate quickly.

Without documentation, iterating code and doing trial-and-error is all that we can do. The current debug process of starting a livestream -> starting the program -> testing the code -> inspecting debug logs -> changing code -> starting another stream -> etc. is really clunky. I understand that we could use an MP4 as the input to the system instead of a live stream, but I'm not convinced by documentation that the KVS gstreamer plugin will treat the mp4 input the same way as our live input.

> You can add a transcode step in your lambda function to convert from AAC to PCM.

Where? How? This product is marketed as a real-time analytics + ML platform for video streams, so are there standard transcoding or analytics tools like FFMPEG or OpenCV included in the platform with easy integration into these streams?

bkneff commented 4 years ago

@ajhool

Please email kinesis-video-support@amazon.com. I will respond directly to you, and we can set up some time to discuss your use-case and architecture.

bml1g12 commented 2 years ago

> Hello,
>
> Your gstreamer pipeline does indicate that you are using AAC audio. You have a couple of options available.
>
>   1. You can modify your gstreamer pipeline to send PCM audio. However, this will probably break HLS playback as HLS expects AAC audio. (HLS is used for playback in the Console.)
>   2. You can add a transcode step in your lambda function to convert from AAC to PCM.

I'm currently trying to construct a GStreamer pipeline that sends audio only to KVS.

One thing caught my eye here regarding your suggestion:

> You can modify your gstreamer pipeline to send PCM audio

The kvssink gstreamer element has the following caps:

 SINK template: 'audio_%u'
    Availability: On request
    Capabilities:
      audio/mpeg
            mpegversion: { (int)2, (int)4 }
          stream-format: raw
               channels: [ 1, 2147483647 ]
                   rate: [ 1, 2147483647 ]
      audio/x-alaw
               channels: { (int)1, (int)2 }
                   rate: [ 8000, 192000 ]
      audio/x-mulaw
               channels: { (int)1, (int)2 }
                   rate: [ 8000, 192000 ]

  SINK template: 'video_%u'
    Availability: On request
    Capabilities:
      video/x-h264
          stream-format: avc
              alignment: au
                  width: [ 16, 2147483647 ]
                 height: [ 16, 2147483647 ]
      video/x-h265
              alignment: au
                  width: [ 16, 2147483647 ]
                 height: [ 16, 2147483647 ]

I thought the lack of an S16LE capability for the sink pad would mean that kvsink cannot accept PCM audio, no matter what pipeline we create?

I'm currently trying to make an artificial H264 video from the audio (based on a suggestion in another issue) and stream that instead.

dcoder4 commented 1 year ago

Has anyone found a solution to this? It's a bit mental. As I understand it, Kinesis Video Streams (KVS) through GStreamer only supports uploading AAC (or audio wrapped in an H264 video), while the Amazon Transcribe streaming service only supports PCM, Ogg, or FLAC. Surely these product teams need to talk to each other! How are we supposed to use these tools?
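For what it's worth, the path that seems workable given these constraints is: keep AAC (or mu-law) going into KVS, decode to PCM on the consumer side (e.g. with ffmpeg), then feed Transcribe streaming fixed-duration PCM chunks. A sketch of that last chunking step, with illustrative rate/duration values:

```python
# Sketch: slice raw s16le mono PCM into fixed-duration chunks, the shape a
# Transcribe streaming client sends as audio events. Values are illustrative.
def pcm_chunks(pcm, sample_rate=16000, chunk_ms=100):
    """Yield chunk_ms-sized slices of s16le mono PCM (2 bytes per sample)."""
    chunk_bytes = sample_rate * 2 * chunk_ms // 1000
    for off in range(0, len(pcm), chunk_bytes):
        yield pcm[off:off + chunk_bytes]
```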