aws-deepracer-community / deepracer-for-cloud

Creates an AWS DeepRacing training environment which can be deployed in the cloud, or locally on Ubuntu Linux, Windows or Mac.
MIT No Attribution
337 stars 184 forks

Intermittent failure to connect to S3 with credentials issue #174

Closed MarkRoss-Eviden closed 5 months ago

MarkRoss-Eviden commented 6 months ago

This issue occurs randomly and has done so for a long time (i.e. to my knowledge it was not introduced by recent changes in 5.1 or 5.2). It does not happen very often, which makes it difficult to troubleshoot. Training works fine for hours (perhaps days), and then robomaker exits.

For example, this one happened 960 episodes in; you can see it working and then suddenly it is not (image). The instance uses an IAM instance profile with full access to S3, so if permissions were the issue it would fail immediately.

This does not seem to be limited to DRfC; other users doing other things see it too:
https://github.com/boto/botocore/issues/2117
https://github.com/rom1504/img2dataset/issues/137

There's a suggestion that increasing the variable 'AWS_METADATA_SERVICE_NUM_ATTEMPTS' could work, as we might be getting throttled (image):

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
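
To illustrate why more attempts could help if the failures are transient throttling, here is a minimal Python sketch of retrying a lookup with exponential backoff. It is not botocore's implementation; `call_with_retries`, its parameters, and the delay values are hypothetical names chosen for illustration.

```python
import time


def call_with_retries(fetch, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call fetch() up to `attempts` times, backing off between failures.

    Hypothetical helper: `fetch` stands in for any call that can fail
    transiently, such as a credentials lookup that gets throttled.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as error:  # deliberately broad for this sketch
            last_error = error
            if attempt < attempts - 1:
                # Wait 0.1s, 0.2s, 0.4s, ... before the next try.
                sleep(base_delay * (2 ** attempt))
    # Every attempt failed; surface the last error to the caller.
    raise last_error
```

With a single attempt, one throttled request is enough to fail the whole call; with several attempts, the call only fails if every try fails.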

larsll commented 6 months ago

So two possible workarounds:

MarkRoss-Eviden commented 6 months ago

I'll work on 'Add the environment variable into docker/docker-compose-training.yml' and see what happens. As long as it doesn't introduce new issues it should be safe to merge, since adding static creds to instances isn't AWS best practice and is actively discouraged for security reasons.
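
For a process that is not started through the compose file, the same setting could be applied from Python before any AWS client is created. This is a rough sketch, not the change made in this repository: `apply_metadata_retry_settings` is a hypothetical helper, the value 5 is an arbitrary assumption, and `AWS_METADATA_SERVICE_TIMEOUT` is a second botocore variable that this thread does not mention.

```python
import os


def apply_metadata_retry_settings(env, attempts=5, timeout_seconds=5):
    """Return a copy of env with botocore's metadata-service settings added.

    Values that are already present are left alone, so an operator's
    explicit choice wins over these assumed defaults.
    """
    updated = dict(env)
    updated.setdefault("AWS_METADATA_SERVICE_NUM_ATTEMPTS", str(attempts))
    # Assumption: a longer timeout is also wanted; not discussed in the thread.
    updated.setdefault("AWS_METADATA_SERVICE_TIMEOUT", str(timeout_seconds))
    return updated


if __name__ == "__main__":
    # Apply to the current process before importing or creating AWS clients.
    os.environ.update(apply_metadata_retry_settings(os.environ))
    print(os.environ["AWS_METADATA_SERVICE_NUM_ATTEMPTS"])
```

Using `setdefault` keeps the helper safe to call when the variable has already been set in the container environment.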

MarkRoss-Eviden commented 6 months ago

Are there any commands in the containers that would be fetching the creds specifically, or is it just background activity on the instance?

MarkRoss-Eviden commented 6 months ago

Fixed by https://github.com/aws-deepracer-community/deepracer-for-cloud/pull/178