aws-deepracer-community / deepracer-for-cloud

Creates an AWS DeepRacing training environment which can be deployed in the cloud, or locally on Ubuntu Linux, Windows or Mac.
MIT No Attribution

DRfC intermittently hangs and instance is totally unresponsive on AWS G6 instances #168

Open MarkRoss-Eviden opened 2 months ago

MarkRoss-Eviden commented 2 months ago

Using a fresh install of DRfC (or using DOTS) on G-series instances in AWS results in the servers intermittently hanging and becoming totally unresponsive. The only way to access them again is to reboot the server; even Systems Manager cannot connect to the instance. This can happen at any time and has happened to me multiple times, usually after around 8 to 24 hours. The problem seems to have started only in May, with the new 5.2.2 release. Prior to 5.2.2, instances wouldn't hang at all and could train for weeks during the March and April races.

There is little in syslog to indicate the problem, but it shows an unresponsive network: (image)

The same is true of journalctl: (image)
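Because the system logs say so little, a heartbeat written to disk at short intervals could help narrow down when the hang begins and whether memory or load was climbing beforehand. The following is only a minimal sketch, assuming a Linux host with `/proc/meminfo`; it is not part of DRfC, and the names `parse_meminfo`, `heartbeat_line`, `run`, the log path, and the interval are all hypothetical.

```python
import os
import time
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   12345 kB" style lines.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def heartbeat_line(meminfo: dict[str, int], load1: float, now: float) -> str:
    """Format one heartbeat record with a UTC timestamp."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    avail = meminfo.get("MemAvailable", -1)  # -1 if the field is missing
    return f"{stamp} load1={load1:.2f} mem_available_kb={avail}"


def run(log_path: str = "/var/tmp/heartbeat.log", interval_s: float = 10.0) -> None:
    """Append a heartbeat forever; the last line approximates the hang time."""
    with open(log_path, "a") as log:
        while True:
            mem = parse_meminfo(Path("/proc/meminfo").read_text())
            load1 = os.getloadavg()[0]
            log.write(heartbeat_line(mem, load1, time.time()) + "\n")
            log.flush()
            os.fsync(log.fileno())  # force to disk so the record survives a hard reboot
            time.sleep(interval_s)
```

Calling `run()` in the background before starting training and reading the last few lines of the log after the reboot would give an approximate time of the hang to compare against the robomaker errors.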

Final error in robomaker, one screenshot per worker (six workers in total): (images)

No errors in sagemaker or rlcoach.

System.env settings: (image)

MarkRoss-Eviden commented 1 month ago

The issue has also occurred with this set of config: (image)