awslabs / amazon-transcribe-streaming-sdk

The Amazon Transcribe Streaming SDK is an async Python SDK for converting audio into text via Amazon Transcribe.
Apache License 2.0

CPU heavy consumption #84

Open david-oliveira-br opened 1 year ago

david-oliveira-br commented 1 year ago

Hello guys, during some simple local tests I noticed my CPU usage hitting 100% while running the examples (mic or audio file). I spent a few hours reviewing the API in an attempt to find a bottleneck but haven't found anything relevant yet. Do you have any recommendations or thoughts about it? Thanks in advance.

laar789 commented 1 year ago

Hi. I just noticed this, and you are totally right. The moment we call client.start_stream_transcription, one CPU core becomes fully occupied, which is a problem. @david-oliveira-br, have you found a solution to this?

mikedavidson-evertz commented 1 year ago

Hello! I noticed this same issue as well, where my CPU would spike to 100%. Interestingly enough, this only happens when I build my code on the python:3.9.16-slim-bullseye, ubuntu:20.04, or ubuntu:22.04 Docker images.

When I use python:3.9-alpine for my Docker image, the CPU inside the container sits at around 1-5%!
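Since the behaviour differs by base image, it may help to record what actually differs between the containers. Here is a minimal sketch that prints the Python version, platform, detected libc, and installed package versions. The helper name describe_environment is made up for illustration, and I'm assuming the relevant distributions are named awscrt and amazon-transcribe on PyPI.

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages=("awscrt", "amazon-transcribe")):
    """Collect details that may differ between the base images being compared."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # libc_ver() reports glibc where it can detect it. Treat an empty
        # result as "unknown" rather than as proof of musl.
        "libc": " ".join(platform.libc_ver()).strip() or "unknown",
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this in each image and diffing the output would show whether the containers differ in libc or in the awscrt version that pip installed, which could narrow down where to look.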

Stepping through the transcribe sdk code and monitoring my CPU usage, I found the exact spot where my CPU spikes. The method that triggers the spike is here:

https://github.com/awslabs/amazon-transcribe-streaming-sdk/blob/develop/amazon_transcribe/client.py#L174

response = await self._session_manager.make_request(
    signed_request.uri,
    method=signed_request.method,
    headers=signed_request.headers.as_list(),
    body=signed_request.body,
)

More specifically, when stepping through that request call, the spike happens when the stream is activated here: https://github.com/awslabs/amazon-transcribe-streaming-sdk/blob/develop/amazon_transcribe/httpsession.py#L56

def _set_stream(self, stream: http.HttpClientStream):
    if self._stream is not None:
        raise HTTPException("Stream already set on AwsCrtHttpResponse object")
    self._stream = stream
    self._stream.completion_future.add_done_callback(self._on_complete)
    self._stream.activate()  # <- this call triggers the spike

self._stream.activate() calls into the awscrt library here, and this is where the spike happens: https://github.com/awslabs/aws-crt-python/blob/main/awscrt/http.py#L286

def activate(self):
    """Begin sending the request.

    The HTTP stream does nothing until this is called. Call activate() when you
    are ready for its callbacks and events to fire.
    """
    _awscrt.http_client_stream_activate(self)  # <-- 100% CPU SPIKE HAPPENS HERE

C code for the awscrt http_client_stream_activate python bindings above:

https://github.com/awslabs/aws-crt-python/blob/58de212a9288e64cdb5f698f782abf4281ba8bf6/source/http_stream.c#L301

Also, this spike occurs before I even begin transcribing any audio! It happens the moment this stream is activated.

Do you guys have any idea what's causing this? Thanks.
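Because the spike starts inside native code, a Python-level profiler probably won't show much. One way to see which thread is burning CPU is to read per-thread CPU time from /proc. This is a Linux-only sketch. The helper name thread_cpu_seconds is hypothetical, and I don't know what names the awscrt threads carry, so the output needs to be interpreted by hand.

```python
import os
from pathlib import Path


def thread_cpu_seconds(pid="self"):
    """Return [(tid, thread_name, cpu_seconds)] for a Linux process, busiest first."""
    task_dir = Path("/proc") / str(pid) / "task"
    if not task_dir.is_dir():
        return []  # not Linux, or /proc is unavailable
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    rows = []
    for stat_file in task_dir.glob("*/stat"):
        try:
            raw = stat_file.read_text()
        except OSError:
            continue  # the thread exited while we were scanning
        # Format: "tid (name) state ..."; the name may contain spaces or ")".
        head, _, tail = raw.rpartition(")")
        tid, _, name = head.partition(" (")
        fields = tail.split()
        # proc(5): utime and stime are fields 14 and 15; `fields` starts at field 3.
        cpu_ticks = int(fields[11]) + int(fields[12])
        rows.append((int(tid), name, cpu_ticks / ticks_per_second))
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for tid, name, seconds in thread_cpu_seconds():
        print(f"{tid:>8}  {name:<20} {seconds:8.2f}s")
```

Calling it twice a few seconds apart (or from another process, passing the target PID) and comparing the results should show which thread's CPU time keeps growing while the stream is idle.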

mikedavidson-evertz commented 1 year ago

Steps to recreate this issue

If you're on Ubuntu 20.04 or 22.04, you can run this code directly with python3.9 and monitor your CPU usage with top.

Note: Change the region in the transcribe client to the region you want to test with.

from amazon_transcribe.client import TranscribeStreamingClient

import asyncio

async def start_stream():
    transcribe_client = TranscribeStreamingClient(region="us-east-1")
    transcribe_stream = await transcribe_client.start_stream_transcription(
        language_code="en-US",
        media_sample_rate_hz=16000,
        media_encoding="pcm",
        language_model_name=None,
        vocabulary_name=None,
        vocab_filter_method=None,
        vocab_filter_name=None,
        show_speaker_label=None,
        enable_channel_identification=None,
        number_of_channels=None,
        enable_partial_results_stabilization=None,
        partial_results_stability=None,
        session_id=None,
    )

    # put a breakpoint here and look at your CPU usage.
    print("put breakpoint here")

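    # Keep the process alive so we can monitor CPU usage. Sleep instead of
    # spinning on `pass`, which would itself pin a core at 100%.
    while True:
        await asyncio.sleep(1)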

def main():
    asyncio.run(start_stream())

if __name__ == "__main__":
    main()

top CPU output: image
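If top isn't convenient (for example inside a container), CPU usage can also be sampled from inside the process. This is a rough sketch using only the standard library. sample_cpu is a name I made up, and it compares process CPU time (all threads, including native ones) against wall-clock time.

```python
import asyncio
import time


async def sample_cpu(interval_s=1.0, samples=5):
    """Return per-interval CPU usage of this process, where 1.0 means one full core."""
    usage = []
    for _ in range(samples):
        wall_start = time.perf_counter()
        cpu_start = time.process_time()  # user + system CPU time of the whole process
        await asyncio.sleep(interval_s)  # yields to the event loop, so sampling is cheap
        cpu_used = time.process_time() - cpu_start
        wall_elapsed = time.perf_counter() - wall_start
        usage.append(cpu_used / wall_elapsed)
    return usage


if __name__ == "__main__":
    print([f"{u:.0%}" for u in asyncio.run(sample_cpu())])
```

In the repro script, awaiting sample_cpu() right after start_stream_transcription returns would show whether usage sits near 1.0 (one core saturated) while no audio is being sent.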

Build the python code using docker

If you're not using Ubuntu, you can build the code using Docker. Put the Python code above in main.py and the Dockerfile below in the same directory:

Note: Fill in your AWS credentials in the Dockerfile so it can authenticate with Transcribe.

FROM ubuntu:20.04

ENV AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID>
ENV AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>
ENV AWS_SESSION_TOKEN=<AWS_SESSION_TOKEN>

ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y software-properties-common && \
    add-apt-repository -y ppa:deadsnakes/ppa

RUN apt-get install --no-install-recommends -y \
    python3.9=3.9.16-1+focal1 \
    python3-pip \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

RUN python3.9 -m pip install amazon-transcribe==0.6.1

WORKDIR /transcribe_high_cpu_test

COPY main.py ./

CMD ["python3.9", "main.py"]

Now build and run the image while monitoring your CPU usage. Run these commands in the directory that contains main.py and the Dockerfile.

docker build -t transcribe-cpu-usage-test .
docker run transcribe-cpu-usage-test:latest