dazza-codes / aio-aws

Asyncio utils for AWS Services
Apache License 2.0

Revise batch jobs with a client config that does not timeout #40

Open dazza-codes opened 3 years ago

dazza-codes commented 3 years ago

This is related to https://github.com/aio-libs/aiobotocore/issues/864

A batch job monitor can crash hard due to an expired signature in the client, e.g.

      File "/opt/conda/envs/gis/lib/python3.7/site-packages/aio_aws/aio_aws_batch.py", line 1102, in aio_batch_job_manager
        await aio_batch_job_waiter(job, config=config)
      File "/opt/conda/envs/gis/lib/python3.7/site-packages/aio_aws/aio_aws_batch.py", line 1001, in aio_batch_job_waiter
        response = await aio_batch_job_status([job.job_id], config)
      File "/opt/conda/envs/gis/lib/python3.7/site-packages/aio_aws/aio_aws_batch.py", line 824, in aio_batch_job_status
        return await batch_client.describe_jobs(jobs=jobs)
      File "/opt/conda/envs/gis/lib/python3.7/site-packages/aiobotocore/client.py", line 155, in _make_api_call
        raise error_class(parsed_response, operation_name)
    botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the DescribeJobs operation: Signature expired: 20210908T202653Z is now earlier than 20210908T202940Z (20210908T203440Z - 5 min.)
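The timestamps in the error message can be checked directly. The request was signed at 20:26:53 UTC and rejected at 20:34:40 UTC; the service's cutoff is its current time minus 5 minutes (20:29:40), so the request was about 7m47s old, roughly 2m47s past the cutoff. The stdlib sketch below reproduces that arithmetic. The 5-minute window is taken from the message itself, and the helper name `signature_age` is hypothetical.

```python
from datetime import datetime, timedelta

# Timestamp format used in the error message (compact UTC, e.g. 20210908T202653Z).
AMZ_DATE = "%Y%m%dT%H%M%SZ"
# Maximum accepted age of a signed request, as stated in the message ("- 5 min.").
MAX_SIGNATURE_AGE = timedelta(minutes=5)


def signature_age(signed_at: str, server_now: str) -> timedelta:
    """Return how old the signed request was when the service evaluated it."""
    return datetime.strptime(server_now, AMZ_DATE) - datetime.strptime(signed_at, AMZ_DATE)


signed_at = "20210908T202653Z"   # when the client signed DescribeJobs
server_now = "20210908T203440Z"  # service time when it checked the signature

age = signature_age(signed_at, server_now)
cutoff = datetime.strptime(server_now, AMZ_DATE) - MAX_SIGNATURE_AGE

print(age)                        # 0:07:47
print(cutoff.strftime(AMZ_DATE))  # 20210908T202940Z
print(age > MAX_SIGNATURE_AGE)    # True
```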

It might be possible to work around this by adding options to increase a connection or read timeout, e.g.

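    from aiobotocore.config import AioConfig

    # Longer connect/read timeouts for a long-lived batch monitoring client;
    # max_pool_connections and aio_batch_config are defined by the caller.
    client_config = AioConfig(
        connect_timeout=20,
        read_timeout=900,
        max_pool_connections=max_pool_connections,
    )
    async with aio_batch_config.create_client("batch", config=client_config) as batch_client:
        # run batch monitoring for any long-running batch jobs
        pass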

Another similar pattern uses a default config:

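    from aiobotocore.config import AioConfig
    from botocore import UNSIGNED

    # Start from the session's default client config, if one is set
    # (get_default_client_config() returns None when there is none).
    client_config = aio_config.session.get_default_client_config()
    s3_config = AioConfig(signature_version=UNSIGNED)
    if client_config is not None:
        s3_config = client_config.merge(s3_config)

    async with aio_config.create_client("s3", config=s3_config) as s3_client:
        # do s3 stuff
        pass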

The monitoring code might need to detect and catch ClientError exceptions for invalid signatures (error code InvalidSignatureException). It could then replace the client with a new one, or find some way to re-sign the request with the existing client.
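A minimal sketch of the detect-and-replace idea follows. It assumes the client comes from a factory returning an async context manager (as aiobotocore's `create_client(...)` does) and that the failure surfaces as a botocore `ClientError`, whose `response["Error"]["Code"]` holds `"InvalidSignatureException"` as in the traceback above. The helper names `is_invalid_signature` and `call_with_fresh_client` are hypothetical and not part of aio-aws.

```python
from typing import Any, AsyncContextManager, Awaitable, Callable, TypeVar

T = TypeVar("T")


def is_invalid_signature(exc: BaseException) -> bool:
    """Return True if exc looks like a botocore ClientError for a rejected signature.

    botocore's ClientError keeps the parsed error in ``exc.response``;
    duck typing avoids importing botocore here.
    """
    response = getattr(exc, "response", None)
    if not isinstance(response, dict):
        return False
    return response.get("Error", {}).get("Code") == "InvalidSignatureException"


async def call_with_fresh_client(
    create_client: Callable[[], AsyncContextManager[Any]],
    operation: Callable[[Any], Awaitable[T]],
    max_attempts: int = 3,
) -> T:
    """Run operation(client), replacing the client after an invalid signature.

    Each attempt opens a new client; leaving the ``async with`` block closes
    the client whose request was rejected before the next attempt starts.
    """
    for attempt in range(1, max_attempts + 1):
        async with create_client() as client:
            try:
                return await operation(client)
            except Exception as exc:
                # Re-raise unrelated errors, and the error from the last attempt.
                if attempt == max_attempts or not is_invalid_signature(exc):
                    raise
    raise RuntimeError("max_attempts must be at least 1")
```

For example, a status poll could be wrapped as `await call_with_fresh_client(lambda: session.create_client("batch"), lambda c: c.describe_jobs(jobs=[job_id]))`, where `session` is an aiobotocore session (e.g. from `aiobotocore.session.get_session()`) and `job_id` is the job being monitored. Opening a new client per attempt adds some connection overhead, but an expired signature then costs one retry instead of crashing the monitor.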

dazza-codes commented 3 years ago

This might be solved by limiting the config to a single connection in each client's connection pool (max_pool_connections=1), so that nothing can go stale in the pool, e.g.

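    import asyncio

    # jobs_db and jobs are created elsewhere; AWSBatchConfig and
    # aio_batch_monitor_jobs are from aio-aws (imports omitted).
    aio_batch_config = AWSBatchConfig(
        aio_batch_db=jobs_db,
        min_pause=20,
        max_pause=40,
        start_pause=60,
        max_pool_connections=1,
        sem=500,
    )
    asyncio.run(aio_batch_monitor_jobs(jobs=jobs, config=aio_batch_config))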