Open david-oliveira-br opened 1 year ago
Hi. I just noticed this, and you are totally right. It's like the moment we run client.start_stream_transcription, one CPU core gets completely occupied, and that will be a problem. @david-oliveira-br, have you found any solution to this?
Hello! I noticed this same issue as well, where my CPU would spike to 100%. Interestingly enough, this only happens when building my code using the python:3.9.16-slim-bullseye, ubuntu:20.04, or ubuntu:22.04 Docker images.
When I use python:3.9-alpine for my Docker image, the CPU inside the container sits at around 1 - 5%!
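Since the behaviour differs between base images, it may help to record exactly what each container is running when comparing results. Below is a minimal sketch using only the standard library; the function name describe_environment is made up for illustration, and the package names are the pip distribution names I assume are installed (amazon-transcribe and awscrt).

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages=("amazon-transcribe", "awscrt")):
    """Collect interpreter, libc, and package versions for a bug report.

    Hypothetical helper for illustration; not part of the Transcribe SDK.
    """
    libc_name, libc_version = platform.libc_ver()
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # libc_ver() may return empty strings when it cannot identify the
        # C library, which can happen on non-glibc images.
        "libc": f"{libc_name} {libc_version}".strip() or "unknown",
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this in each image and pasting the output next to the CPU numbers should make it easier to see whether the C library or the awscrt build is what differs.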
Stepping through the transcribe sdk code and monitoring my CPU usage, I found the exact spot where my CPU spikes. The method that triggers the spike is here:
response = await self._session_manager.make_request(
    signed_request.uri,
    method=signed_request.method,
    headers=signed_request.headers.as_list(),
    body=signed_request.body,
)
and more specifically, when stepping through that request call above, the spike happens when the stream is activated here: https://github.com/awslabs/amazon-transcribe-streaming-sdk/blob/develop/amazon_transcribe/httpsession.py#L56
def _set_stream(self, stream: http.HttpClientStream):
    if self._stream is not None:
        raise HTTPException("Stream already set on AwsCrtHttpResponse object")
    self._stream = stream
    self._stream.completion_future.add_done_callback(self._on_complete)
    self._stream.activate()  # <- this call triggers the spike
self._stream.activate() calls into the awscrt lib here, and this is where the spike happens:
https://github.com/awslabs/aws-crt-python/blob/main/awscrt/http.py#L286
def activate(self):
    """Begin sending the request.

    The HTTP stream does nothing until this is called. Call activate() when you
    are ready for its callbacks and events to fire.
    """
    _awscrt.http_client_stream_activate(self)  # <-- 100% CPU SPIKE HAPPENS HERE
C code for the awscrt http_client_stream_activate Python binding above:
Also, this spike occurs before I even begin transcribing any audio! It happens the moment this stream is activated.
Do you guys have any idea what's causing this? Thanks.
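To narrow down which thread is actually burning the CPU (for example, the Python main thread versus a native background thread), per-thread CPU time can be sampled from /proc. This is a Linux-only sketch under my own assumptions: the helper names thread_cpu_seconds and busiest_threads are hypothetical, and the field positions follow the proc(5) layout of /proc/[pid]/task/[tid]/stat.

```python
import os
import time


def thread_cpu_seconds(pid="self"):
    """Return {tid: (thread_name, cpu_seconds)} for every thread of a Linux process."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    task_dir = f"/proc/{pid}/task"
    usage = {}
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as stat_file:
                raw = stat_file.read()
        except OSError:
            continue  # the thread exited while we were reading
        # The thread name sits in parentheses and may contain spaces,
        # so split around the outermost parentheses.
        name = raw[raw.index("(") + 1 : raw.rindex(")")]
        fields = raw[raw.rindex(")") + 2 :].split()
        # fields[0] is the state (field 3 in proc(5));
        # utime and stime are fields 14 and 15, i.e. indexes 11 and 12 here.
        utime, stime = int(fields[11]), int(fields[12])
        usage[int(tid)] = (name, (utime + stime) / ticks_per_second)
    return usage


def busiest_threads(interval=2.0, pid="self"):
    """Sample twice and return (name, tid, cpu_fraction) sorted by CPU use."""
    before = thread_cpu_seconds(pid)
    time.sleep(interval)
    after = thread_cpu_seconds(pid)
    deltas = [
        (name, tid, (cpu - before[tid][1]) / interval)
        for tid, (name, cpu) in after.items()
        if tid in before
    ]
    return sorted(deltas, key=lambda item: item[2], reverse=True)
```

Passing the PID of the running repro (as printed by top) from a second Python shell avoids blocking the repro's own event loop with the time.sleep call. A fraction close to 1.0 for a single thread would match the "one core fully occupied" symptom.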
If you're on ubuntu 20.04 or 22.04, you can run this code directly using python3.9 and monitor your CPU usage with top.
Note: Change the region in the transcribe client to the region you want to test with.
import asyncio

from amazon_transcribe.client import TranscribeStreamingClient


async def start_stream():
    transcribe_client = TranscribeStreamingClient(region="us-east-1")
    transcribe_stream = await transcribe_client.start_stream_transcription(
        language_code="en-US",
        media_sample_rate_hz=16000,
        media_encoding="pcm",
        language_model_name=None,
        vocabulary_name=None,
        vocab_filter_method=None,
        vocab_filter_name=None,
        show_speaker_label=None,
        enable_channel_identification=None,
        number_of_channels=None,
        enable_partial_results_stabilization=None,
        partial_results_stability=None,
        session_id=None,
    )
    # put a breakpoint here and look at your CPU usage.
    print("put breakpoint here")
    # Keep the process alive so we can monitor CPU usage. Sleep rather than
    # busy-wait, so this loop does not itself occupy a core.
    while True:
        await asyncio.sleep(1)


def main():
    asyncio.run(start_stream())


if __name__ == "__main__":
    main()
top CPU output:
If you're not using ubuntu, you can build the code using Docker. Put the Python code above in a main.py and the Dockerfile below in the same directory:
Note: Fill in your AWS creds in the Dockerfile so it can authenticate with Transcribe.
FROM ubuntu:20.04

ENV AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID>
ENV AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>
ENV AWS_SESSION_TOKEN=<AWS_SESSION_TOKEN>

ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y software-properties-common && \
    add-apt-repository -y ppa:deadsnakes/ppa

RUN apt-get install --no-install-recommends -y \
    python3.9=3.9.16-1+focal1 \
    python3-pip \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

RUN python3.9 -m pip install amazon-transcribe==0.6.1

WORKDIR /transcribe_high_cpu_test
COPY main.py ./

CMD ["python3.9", "main.py"]
Now build the image and run the image while monitoring your CPU usage. Run these commands in the same directory as the python/Dockerfile.
docker build -t transcribe-cpu-usage-test .
docker run transcribe-cpu-usage-test:latest
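To put a number on the spike instead of eyeballing top, one option is to compare the process's CPU time against wall-clock time over a window where the Python code itself is idle. This is only a sketch: the function name cpu_utilization is made up for illustration, and it relies on time.process_time() reporting user plus system CPU time for the whole process, so work done by native background threads is counted too.

```python
import time


def cpu_utilization(idle_seconds=5.0):
    """Return process CPU time divided by wall time over an idle window.

    Hypothetical helper for illustration. A result near 0.0 means the
    process is idle; a result near 1.0 means roughly one core is busy.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    time.sleep(idle_seconds)  # this thread is idle; other threads keep running
    cpu_used = time.process_time() - cpu_start
    wall_elapsed = time.perf_counter() - wall_start
    return cpu_used / wall_elapsed


if __name__ == "__main__":
    print(f"CPU utilization while idle: {cpu_utilization(1.0):.2%}")
```

In the repro above, this could be called right after start_stream_transcription returns, for example with await asyncio.get_running_loop().run_in_executor(None, cpu_utilization) so the event loop is not blocked by the sleep. Comparing the value between the Ubuntu and Alpine images would give a like-for-like measurement.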
Hello guys, during some simple local tests I noticed my CPU usage topping 100% while running the examples (mic or audio file). I spent some hours reviewing the API in an attempt to find a possible bottleneck but haven't found anything relevant yet. Do you guys have any recommendations or thoughts about it? Thanks in advance.