awslabs / amazon-transcribe-streaming-sdk

The Amazon Transcribe Streaming SDK is an async Python SDK for converting audio into text via Amazon Transcribe.
Apache License 2.0

amazon_transcribe.exceptions.BadRequestException: Signature expired Exception #17

Open bangnguyen opened 3 years ago

bangnguyen commented 3 years ago

Hi,

I used this sample to test with a local WAV file; I only changed a parameter to point to my local audio file. The script worked perfectly and I was able to get the transcript text for the first 5 minutes.
But after 5 minutes, I got the exception below:

File "test_amazon_transcribe.py", line 82, in basic_transcribe await asyncio.gather(write_chunks(), handler.handle_events()) File "/root/apps/build/python36/lib/python3.6/site-packages/amazon_transcribe/handlers.py", line 26, in handle_events async for event in self._transcript_result_stream: File "/root/apps/build/python36/lib/python3.6/site-packages/amazon_transcribe/eventstream.py", line 666, in __aiter__ parsed_event = self._parser.parse(event) File "/root/apps/build/python36/lib/python3.6/site-packages/amazon_transcribe/deserialize.py", line 147, in parse raise self._parse_event_exception(raw_event) amazon_transcribe.exceptions.BadRequestException: Signature expired: 20201123T084949Z is now earlier than 20201123T084949Z (20201123T085449Z - 5 min.)

I have re-run the script a few times, but the issue still exists. Could you please help me?

bangnguyen commented 3 years ago

Just to give an update on my issue: by reading the TranscribeStreamingClient code, I understood why the issue happens. The audio file on my local machine is about 100 MB; in streaming mode I need to control the sending rate by choosing an appropriate value of chunk_size and setting a sleep time between sends. However, after 30 minutes of getting transcripts from the audio stream, it raised another exception, below. Could you please give me any help on how to avoid or resolve it?

Traceback (most recent call last):
  File "test_amazon_transcribe.py", line 111, in <module>
    main()
  File "test_amazon_transcribe.py", line 99, in main
    loop.run_until_complete(basic_transcribe())
  File "/root/apps/deploy/build/python36/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "test_amazon_transcribe.py", line 96, in basic_transcribe
    await asyncio.gather(write_chunks(), handler.handle_events())
  File "/root/apps/deploy/build/python36/lib/python3.6/site-packages/amazon_transcribe/handlers.py", line 26, in handle_events
    async for event in self._transcript_result_stream:
  File "/root/apps/deploy/build/python36/lib/python3.6/site-packages/amazon_transcribe/eventstream.py", line 666, in __aiter__
    parsed_event = self._parser.parse(event)
  File "/root/apps/deploy/build/python36/lib/python3.6/site-packages/amazon_transcribe/deserialize.py", line 147, in parse
    raise self._parse_event_exception(raw_event)
amazon_transcribe.exceptions.BadRequestException: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

The Canonical String for this request should have been ''

The String-to-Sign should have been 'AWS4-HMAC-SHA256-PAYLOAD
20201123T123617Z
20201123/us-west-2/transcribe/aws4_request
1f820fe62677f9f7c045b44fe5d90942c31e5d1ea9c7a2ca2cefdd9efe4f2906d
5bf08571efbd195df2b9b22e5797b089b2d57206f1eece7db5ab6541bdb59d91a6
4894047d84b322a05fd95db4ceecf8b8ef6432659924ffceadc2ec669afbbd3ce2'

lalogonzalez commented 3 years ago

Hello @bangnguyen, I have the same issue you had at first. Can you provide details on how you fixed it?

I'm not sure how to determine the chunk size for an 8000 Hz pre-recorded audio file. I've tried several values without success.

_"I understood why issue happen, the audio file in my local has the file-size about 100 Mb, in streaming mode i will need to control the speed of data sending by choose the appropriate value of chunksize and set sleep time between data sending."

bangnguyen commented 3 years ago

Hi @lalogonzalez

For the first issue, I used the sample code from this example. All parameters are the same (chunk_size, sample_rate, ..); the only difference is that my local audio file is larger than 100 MB and its duration is longer than 1 hour. The original example did not reproduce the issue because its audio duration is too short. To fix the first issue, I just added one more line of code to sleep 0.5 seconds right after sending an audio chunk:

await stream.input_stream.send_audio_event(audio_chunk=chunk)
time.sleep(0.5)

I am not sure if this is a good fix or just a workaround, but I definitely no longer get any Signature expired exceptions.
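
(Side note: time.sleep blocks the asyncio event loop while waiting; a non-blocking version of the same throttle, sketched against the write_chunks loop from the example, would be:)

import asyncio

async def write_chunks(stream, chunks):
    # Same 0.5 s pause between sends, but awaited, so the event handler
    # task keeps running while we wait. `chunks` is assumed to be an
    # iterable of audio byte strings.
    for chunk in chunks:
        await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await asyncio.sleep(0.5)
    await stream.input_stream.end_stream()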

But right now I still have the second issue, The request signature we calculated does not match the signature, raised permanently after roughly 30 minutes or more. For this issue I am thinking of retrying with a new TranscribeStreamingClient and reusing the session_id; even if this solution works, it sounds like a workaround, as I would have to retry every time.

Has anyone tested successfully with an audio file whose duration is longer than 1 hour?

bangnguyen commented 3 years ago

Hi Eduardo,

I don't have an issue like this. I tested with two different audio files, was able to get transcripts for more than 20 minutes, and did not lose anything; it only raised the exception related to the signature. Did you test with an audio file whose duration is longer than 1 hour? Let me know if you hit any issue. I suggest you test another audio file; maybe the one you're currently using has a problem? Bang

On Thu, 26 Nov 2020 at 22:46, Eduardo notifications@github.com wrote:

Hi @bangnguyen

I tried out your solution and there's a problem. I am debugging the transcript in real time, and after 5 minutes (in the conversation, not in processing time) the transcription skips a big chunk of the audio.

I suggest you debug as well like this, because you are most likely losing some part of your transcription too.

Let me know.

class MyEventHandler(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, transcript_event: TranscriptEvent):
        result = transcript_event.transcript
        for r in result.results:
            if r.is_partial == False:
                for a in r.alternatives:
                    print("[" + str(r.start_time) + "-" + str(r.end_time) + "] " + a.transcript)



joguSD commented 3 years ago

@bangnguyen There's a note on this in the example: prerecorded streams need to match the realtime bitrate, to avoid producing events with signatures that will be expired by the time the service actually reads them. How long the stream keeps working depends on how big the delta is between the actual transmission rate and the realtime bitrate.

This does seem like a snag, and I'm curious how we can help improve it. I have mixed feelings about including functionality within the library to handle this, as this kind of pseudo-playback implementation could be a little out of scope. That being said, I don't have a tremendous amount of knowledge of the binary format of a PCM WAV file, so I'm not sure how much effort it would take to accurately produce the realtime bit stream. If it's simple enough, I think it might make sense to at least include some sample code showing how to do it properly.

joguSD commented 3 years ago

@bangnguyen

Here's a quick example of what I explained in my previous message, using a modified version of the file-based example.

Big disclaimer: I have not thoroughly tested the following code beyond a couple of test files. I have no idea how robust it is, but it should be close enough to get something functional for long-running streams. In particular, the parsing code makes some assumptions, but it should work for basic PCM WAV files.

import asyncio
import aiofile

from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent

async def parse_int(file, byte_length=4):
    chunk = await file.read(byte_length)
    return int.from_bytes(chunk, 'little')

async def parse_wav_metadata(file):
    riff = await file.read(4)
    assert riff == b'RIFF'

    overall_size = await parse_int(file)

    wave = await file.read(4)
    assert wave == b'WAVE'

    fmt = await file.read(4)
    assert fmt == b'fmt '

    fmt_data_len = await parse_int(file)
    fmt_type = await parse_int(file, byte_length=2)
    num_channels = await parse_int(file, byte_length=2)
    sample_rate = await parse_int(file)
    byte_rate = await parse_int(file)
    block_align = await parse_int(file, byte_length=2)
    bits_per_sample = await parse_int(file, byte_length=2)

    # Byte rate should equal (Sample Rate * BitsPerSample * Channels) / 8
    assert (sample_rate * bits_per_sample * num_channels) / 8 == byte_rate

    data_header = await file.read(4)
    assert data_header == b'data'

    data_len = await parse_int(file)

    wav_metadata = {
        'OverallSize': overall_size,
        'FormatLength': fmt_data_len,
        'FormatType': fmt_type,
        'Channels': num_channels,
        'SampleRate': sample_rate,
        'ByteRate': byte_rate,
        'BlockAlign': block_align,
        'BitsPerSample': bits_per_sample,
        'DataLength': data_len,
    }

    return wav_metadata

async def rate_limit(file, byte_rate):
    chunk = await file.read(byte_rate)
    loop = asyncio.get_event_loop()
    last_yield_time = -1.0 # -1 to allow the first yield immediately
    while chunk:
        time_since_last_yield = loop.time() - last_yield_time
        if time_since_last_yield < 1.0:
            # Yield at most once per second, compensating for how long
            # it's been since the last yield
            await asyncio.sleep(1.0 - time_since_last_yield)
        last_yield_time = loop.time()
        yield chunk
        chunk = await file.read(byte_rate)

class MyEventHandler(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, transcript_event: TranscriptEvent):
        results = transcript_event.transcript.results
        for result in results:
            for alt in result.alternatives:
                print(alt.transcript)

async def write_chunks(stream, f, wav_metadata):
    async for chunk in rate_limit(f, wav_metadata['ByteRate']):
        await stream.input_stream.send_audio_event(audio_chunk=chunk)
    await stream.input_stream.end_stream()

async def basic_transcribe(filepath):
    # Set up our client with our chosen AWS region
    client = TranscribeStreamingClient(region="us-west-2")

    async with aiofile.async_open(filepath, 'rb') as f:
        wav_metadata = await parse_wav_metadata(f)

        # Start transcription to generate our async stream
        stream = await client.start_stream_transcription(
            language_code="en-US",
            media_sample_rate_hz=wav_metadata['SampleRate'],
            media_encoding="pcm",
        )

        # Instantiate our handler and start processing events
        await asyncio.gather(
            write_chunks(stream, f, wav_metadata),
            MyEventHandler(stream.output_stream).handle_events(),
        )

loop = asyncio.get_event_loop()
loop.run_until_complete(basic_transcribe('tests/integration/assets/test.wav'))
loop.close()

bangnguyen commented 3 years ago

Thank you so much @joguSD for your very useful solution. I took your sample code and tested it with a big audio file; so far everything is good. Streaming from a local file is just the first step of my project. My main use case is streaming from a website that provides a large number of concurrent streaming links, in m3u8 format. I will come back with further feedback.

joguSD commented 3 years ago

@bangnguyen I believe that's actually a playlist format, so as long as all of the items in the playlist are standard WAV files, I'd imagine something like the above would be sufficient. If you need to process many different formats, perhaps using something like ffmpeg and asking it to transcode and output in realtime would be a better/more robust solution.
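
For illustration, one way to drive ffmpeg from Python for that (a sketch; the input URL and output rate are placeholder choices, and the flags are standard ffmpeg options):

import subprocess

# Ask ffmpeg to transcode any input to 16 kHz mono 16-bit PCM and emit it
# at realtime speed (-re), so reads from stdout arrive at roughly the
# realtime byte rate. "input.m3u8" is a placeholder.
proc = subprocess.Popen(
    ["ffmpeg", "-re", "-i", "input.m3u8",
     "-f", "s16le", "-ar", "16000", "-ac", "1", "pipe:1"],
    stdout=subprocess.PIPE,
)
chunk = proc.stdout.read(32000)  # roughly one second of audio at this rate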

bangnguyen commented 3 years ago

@joguSD

I tested the parse_wav_metadata function with my audio file.

From this result, if I initiate the streaming client with sample_rate 16000 and call send_audio_event with chunk_size 64000, I get an exception that the chunk_size is too big. When I change ByteRate to 32000, it works well. I think that to make this approach robust it will need testing with more data.

For my use case, the audio data comes from only one website; the m3u8 has the playlist format. The playlist contains multiple audio files in 'aac' format, and I used the pydub lib to convert them into WAV format at sample rate 16000, sending data with ByteRate 32000. I have tested this solution with a few streaming links, each longer than 1 hour, and it works like a charm.

So now my prototype streaming from the website is working; the next step is to integrate it into our system. Our current system handles up to 100 concurrent streams at peak time (we used another transcription service, quite expensive). I am very excited to see how it goes with AWS Transcribe.
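
For reference, the pydub conversion step described above might look like this (a sketch; the segment filename is a placeholder):

from pydub import AudioSegment

# Convert one AAC playlist segment to 16 kHz mono 16-bit WAV.
segment = AudioSegment.from_file("segment.aac", format="aac")
segment = segment.set_frame_rate(16000).set_channels(1).set_sample_width(2)
segment.export("segment.wav", format="wav")
# Realtime byte rate: 16000 Hz * 2 bytes * 1 channel = 32,000 bytes/s,
# which matches the ByteRate above.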

bangnguyen commented 3 years ago

@joguSD I was confused when I said it was working well for more than 1 hour. I read your code, saw that it controls the amount of data sent per second, and thought it was the best solution. In fact, when I tested with a new audio stream, I still had the signature issue. My log below shows the exception thrown at second 1735.82. Do you have any other suggestion? I am really looking forward to one. Is it still an issue with the amount of data sent in a short period?

10:25:07,968 root INFO start_time 1688.56, end_time 1691.77 : We will hear first today from Shannon Cross Cross research.
10:25:19,467 root INFO start_time 1692.49, end_time 1702.92 : Thank you very much, um, empty talk a bit more about China. And you know, in terms of Lenny already, I think look atyou nights in that service is, um
10:25:35,322 root INFO start_time 1703.46, end_time 1718.09 : In all regions were about an all time high. I'm not sure exactly what your comment was, but, you know, maybe give us a little idea of you know whether you're seeing any blowback or benefit from the lottery situation and just dig a bit more into the transition in China.
10:25:36,574 root INFO start_time 1718.4, end_time 1719.57 : Don't ever follow it. Thank you.
10:25:44,784 root INFO start_time 1720.48, end_time 1727.47 : Thanks, Shannon. If you look at China and look at last quarter's I'll talk about both last quarter in this quarter of that.
10:25:51,331 root INFO start_time 1728.01, end_time 1734.01 : Last quarter. What we saw was our non iPhone business was up strong double digits.
10:25:53,388 root INFO start_time 1734.64, end_time 1735.82 : For the full quarter.
10:25:57,631 root ERROR ex :The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

The Canonical String for this request should have been ''

The String-to-Sign should have been 'AWS4-HMAC-SHA256-PAYLOAD
20201204T102557Z
20201204/us-west-2/transcribe/aws4_request
f444d66ebfbd4a822831fefe453b7791707190635ba161415d46d6be9c993503
b8bc85946f1b68cc1480b651f8cb1a710b737a07ed6b168f27c97e02fc71e35a
1990f1e24f9f1a75642a6f6c790016dde90107d5ca41cd88e3aba9b6f544735e'

Traceback (most recent call last):
  File "streaming_from_website.py", line 56, in handle_events
    async for event in self._transcript_result_stream:
  File "/dat/wqatrax/workspace/amazon_transcribe/eventstream.py", line 668, in __aiter__
    parsed_event = self._parser.parse(event)
  File "/dat/wqatrax/workspace/amazon_transcribe/deserialize.py", line 147, in parse
    raise self._parse_event_exception(raw_event)
amazon_transcribe.exceptions.BadRequestException: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

The Canonical String for this request should have been ''

The String-to-Sign should have been 'AWS4-HMAC-SHA256-PAYLOAD
20201204T102557Z
20201204/us-west-2/transcribe/aws4_request
f444d66ebfbd4a822831fefe453b7791707190635ba161415d46d6be9c993503
b8bc85946f1b68cc1480b651f8cb1a710b737a07ed6b168f27c97e02fc71e35a
1990f1e24f9f1a75642a6f6c790016dde90107d5ca41cd88e3aba9b6f544735e'

bangnguyen commented 3 years ago

@joguSD I put logging into the sign_event method of eventstream, just to capture the parameter values before each chunk's signature is sent. I found that, after getting transcripts for some number of minutes, the temporary access_key & secret_key changed automatically just before the signature exception. My EC2 instance is configured with an IAM instance role, so I don't have a specific access_key/secret_key for AWS Transcribe; it seems the Python SDK uses temporary credentials generated each time the streaming client starts. Could you please guide me on how to avoid this behavior of the access_key changing (set a bigger expiry time?), or is the only solution to use a static access_key/secret_key?

11:53:32,523 root INFO 111---CHUNK_SIZE real 32000, ByteRate 32000
11:53:32,524 root INFO --- string_to_sign b'AWS4-HMAC-SHA256-PAYLOAD
 access_key_id **{ACCESS_KEY}**, secret_access_key **{SECRET_KEY}**''
11:53:32,849 root INFO start_time 761.17, end_time 772.74 : There is great pain of a lost loved one
11:53:33,625 root INFO 111---CHUNK_SIZE real 32000, ByteRate 32000
11:53:33,628 root INFO --- string_to_sign b'AWS4-HMAC-SHA256-PAYLOAD
access_key_id **{ANOTHER_ACCESS_KEY}**, secret_access_key **{ANOTHER_SECRET_KEY}**
11:53:33,642 root ERROR ex :The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

bangnguyen commented 3 years ago

Hi @joguSD It seems the current code base uses temporary security credentials. I am wondering: during streaming, if the credentials change and are used to create the signature on the client side, does the AWS service also pick up that credential change? To test the assumption that the signature issue goes away if the client side doesn't change the credentials during the stream, I wrote the code below and tested it with several m3u8 links. It works well for more than 1 hour per link. So my solution for this issue, before running an m3u8 link (audio duration ~1 hour), is to use the code below to make sure only one credential is used during the stream. But I still have a concern about the credential expiring during streaming. I would love to hear your feedback.

import asyncio
import logging
from typing import Optional

# Imports below reference SDK internals; module paths are as of the SDK
# version discussed in this thread.
from awscrt.auth import AwsCredentialsProvider
from amazon_transcribe import AWSCRTEventLoop
from amazon_transcribe.auth import Credentials, CredentialResolver
from amazon_transcribe.client import TranscribeStreamingClient

# MyEventHandler and chunk_generator_from_playlist are defined elsewhere
# in my script.

class AwsCustomCredentialResolver(CredentialResolver):
    def __init__(self, eventloop):
        self._crt_resolver = AwsCredentialsProvider.new_default_chain(eventloop)
        self.credentials = None

    async def get_credentials(self) -> Optional[Credentials]:
        if self.credentials is None:
            self.credentials = await asyncio.wrap_future(self._crt_resolver.get_credentials())
        return self.credentials

async def basic_transcribe(url, wav_metadata, output_file):
    client = TranscribeStreamingClient(
        region="us-west-2",
        credential_resolver=AwsCustomCredentialResolver(AWSCRTEventLoop().bootstrap),
    )
    stream = await client.start_stream_transcription(
        language_code="en-US",
        media_sample_rate_hz=wav_metadata['SampleRate'],
        media_encoding="pcm"
    )
    handler = MyEventHandler(stream.output_stream)
    request_id = stream._response.request_id
    session_id = stream._response.session_id
    logging.info("request_id {}, session_id {}".format(request_id, session_id))
    await asyncio.gather(
        chunk_generator_from_playlist(stream, url, wav_metadata),
        handler.handle_events(output_file),
        return_exceptions=False,
    )

joguSD commented 3 years ago

@bangnguyen Yeah, this seems like a slightly different issue now, and it could definitely be related to temporary credentials expiring during a stream. I'll see if we can get some time to dig into this a little more.

bangnguyen commented 3 years ago

@joguSD

My workaround for now is to reuse the existing credentials for each stream-chunk signature; to do this I use my MyCustomCredentialResolver instead of AwsCrtCredentialResolver. It runs smoothly for several hours without any exception. But ideally there should be an API to control the credential expiry time, so I could set it before starting the stream.

I am still looking forward to your proper solution.

class MyCustomCredentialResolver(CredentialResolver):
    def __init__(self, eventloop):
        self._crt_resolver = AwsCredentialsProvider.new_default_chain(eventloop)
        self.credentials = None

    async def get_credentials(self) -> Optional[Credentials]:
        if self.credentials is None:
            self.credentials = await asyncio.wrap_future(self._crt_resolver.get_credentials())
        return self.credentials

joguSD commented 3 years ago

@bangnguyen Ahhh, I see what you mean now. I thought your temporary credentials were expiring during the stream and refreshed credentials weren't being picked up. It seems you're having almost the opposite problem: the underlying credentials rotate during the stream, so the access key changes within a single stream. This is a tough one, as neither behavior is really correct. If we freeze credentials for the entirety of the stream, refreshes within the same access key won't work; but if we don't freeze them, we might get a different access key. I'll do some poking to see what the general behavior across SDKs is here.

bangnguyen commented 3 years ago

@joguSD I checked from the log: the refreshed credentials were being picked up and were used on the client side to calculate the signature, so I wondered whether the server side also caught the refreshed credentials; if yes, why did it raise the mismatched-signature issue constantly every time I tested? By freezing the credentials for the entirety of the stream, I ran 5 whole streams concurrently for 2-3 hours with no exception raised, but maybe that doesn't work for longer streams. I will be happy to see your official solution.

mikeballou-augmedix commented 3 years ago

Thanks for all the background info @bangnguyen and @joguSD. Using your posts, I was able to understand the same issue you reported originally here. For those coming across it, hopefully this summary helps, and then someone from AWS can answer my question:

  1. Original issue: amazon_transcribe.exceptions.BadRequestException: Signature expired. The write_chunks() function in the sample reads a local file and creates all the data packets to send as fast as possible, which puts the AWS signature on each packet from the initial time. After about 5 minutes those signatures expire, so if your audio is longer than 5 minutes you will hit this. bangnguyen's original solution is to add a sleep of 0.5 s as a rough estimate so the chunks are sent at an approximation of the audio's realtime rate.
    BTW, you could simplify the math of the WAV file by using the built-in Python wave library:

import wave

with wave.open(file_path, "rb") as wave_file:
    frames = wave_file.getnframes()
    sample_rate_hertz = wave_file.getframerate()
    duration = frames / float(sample_rate_hertz)

In my case the sample rate is 44,100 Hz, so to hit the max chunk size of 32 KB you divide 32,000 / 44,100 ≈ 0.72. That means 0.72 seconds of audio fit into one 32 KB chunk, so the sleep time should be that calculation to be accurate. (At least it seems to work that way; see the sketch after this list.)

  2. bangnguyen then ran into a separate issue above where his access keys were set to expire every 30 minutes (which I believe is a configuration on the AWS account), so he had to make the separate custom credential resolver to deal with the key rotation.
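
A sketch of that sleep-time calculation, generalized with the wave library so it also accounts for sample width and channel count (file_path and chunk_size are placeholders):

import wave

def sleep_per_chunk(file_path, chunk_size=32000):
    # Seconds of audio contained in one chunk, i.e. how long to sleep
    # between sends to approximate realtime.
    with wave.open(file_path, "rb") as wav:
        byte_rate = wav.getframerate() * wav.getsampwidth() * wav.getnchannels()
    return chunk_size / byte_rate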

#1 leads me to this question: @joguSD does the API require us to send audio in real time? In the example above we add a sleep to approximate real time, but I'm transcribing a lot of saved audio, so sending it faster than real time would be a big help.

Also, there is something the AWS team seems to be missing in how we are putting this together. We are using Medical Transcribe (with the help of another issue to make it work). Medical Transcribe has different models available based on specialty, but those models are only available via the streaming API. So even though I have a lot of saved audio files, I cannot use batch mode, because those medical models are not available there. Hence the requirement to use the streaming API only. I understand why you made the comment above about "out of scope", but for Medical Transcribe it's really a requirement to stream audio files, since batch mode is not available.

dbalosh commented 2 years ago

Hi, thanks for this repo and for this issue thread. I created a pull request that works for me on files longer than 5 minutes.

It includes @joguSD's suggested solution, @mikeballou-augmedix's WAV-metadata section, and @bangnguyen's custom credentials.

Here: https://github.com/awslabs/amazon-transcribe-streaming-sdk/pull/62

joguSD commented 2 years ago

@mikeballou-augmedix This SDK / this particular API (transcribe-streaming) is really intended for streaming real-time audio. If you need to process prerecorded audio faster than real time, the standard Transcribe service is more apt.

Unfortunately, we don't have an async SDK for the standard Transcribe service, but it is available in the synchronous AWS Python SDK, boto3. See the docs for start_transcription_job here.
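
A minimal sketch of that batch call (the job name, bucket, and key are placeholders):

import boto3

transcribe = boto3.client("transcribe", region_name="us-west-2")
transcribe.start_transcription_job(
    TranscriptionJobName="my-batch-job",
    Media={"MediaFileUri": "s3://my-bucket/audio/test.wav"},
    MediaFormat="wav",
    LanguageCode="en-US",
)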

For what it's worth, I didn't use the wave library because it uses the standard open and thus has the potential to block the event loop if not used correctly. It's likely not a huge deal, but it could lead to issues depending on exactly how it's used.
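
If you do want the wave library inside async code, one option (a sketch) is to push the blocking calls onto the default executor:

import asyncio
import wave

def _read_wav_params(file_path):
    # Blocking: runs in a worker thread, not on the event loop.
    with wave.open(file_path, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth(), wav.getnchannels()

async def read_wav_params(file_path):
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(None, _read_wav_params, file_path)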

Lastly, medical is definitely missing, but I'm not sure if or when we'll have time to add it.

dbalosh commented 2 years ago

Ignore https://github.com/awslabs/amazon-transcribe-streaming-sdk/pull/62. I've closed it; it is not stable. My current guess is that the best way is to stream to a virtual mic, then run the mic example.

@joguSD is the streaming mic example capable of streaming for three hours without code modification, just by running the script as-is?

If so, I would create a Docker example where FFmpeg plays a file to a virtual mic while https://github.com/awslabs/amazon-transcribe-streaming-sdk/blob/develop/examples/simple_mic.py runs in parallel.