A fresh install of DRfC (or DOTS) on G-series instances in AWS intermittently hangs: the server becomes totally unresponsive, and the only way to regain access is to reboot it. Even Systems Manager cannot connect to the instance. This can happen at any time and has happened to me multiple times, usually after around 8 to 24 hours. The problem seems to have started in May with the new 5.2.2 release. Prior to 5.2.2, instances didn't hang at all and could train for weeks during the March and April races.
There is little in syslog to indicate the problem, but it shows an unresponsive network: -![image](https://github.com/aws-deepracer-community/deepracer-for-cloud/assets/53598199/ff13d0c3-fb8b-49e8-9f7f-163da59a4384)
Same with journalctl: -![image](https://github.com/aws-deepracer-community/deepracer-for-cloud/assets/53598199/9b1e5ad7-dab4-4f65-9d7f-aee1825eadea)
Final error in robomaker, from one of the workers: -![image](https://github.com/aws-deepracer-community/deepracer-for-cloud/assets/53598199/8ad24316-8b11-4046-b03e-c4536570a965)
Another of the workers: -![image](https://github.com/aws-deepracer-community/deepracer-for-cloud/assets/53598199/bd9e6aa0-c631-4616-99ba-c76e97f9ab3a)
Another of the workers: -![image](https://github.com/aws-deepracer-community/deepracer-for-cloud/assets/53598199/557c42f2-7c27-4d77-b858-8ad6ce861b43)
Another of the workers: -![image](https://github.com/aws-deepracer-community/deepracer-for-cloud/assets/53598199/b87d7e2e-bed1-4d2a-8cc8-285184b4e1e7)
Another of the workers: -![image](https://github.com/aws-deepracer-community/deepracer-for-cloud/assets/53598199/05d241c7-ae45-4eb0-baa2-298535076cc0)
Another of the workers: -![image](https://github.com/aws-deepracer-community/deepracer-for-cloud/assets/53598199/23938231-ad6f-42a9-bd1d-a16ca14beb03)
No errors in sagemaker or rlcoach
system.env settings: -![image](https://github.com/aws-deepracer-community/deepracer-for-cloud/assets/53598199/20b8c018-7311-4c44-801e-fd8404c85ffd)