I started one batch job and then started another identical one by accident (didn't see the progress bar for the first one which was higher up in the results list). Just after launching the second batch, the workers all ground to a halt and EC2 showed 0 CPU usage. I was unable to determine what happened, but here are some log messages from around that time:
13:12:56.876 ERROR c.c.a.s.utils.ClusterQueueManager - Error retrieving job status org.apache.http.conn.HttpHostConnectException: Connect to 10.0.0.130:9001 [/10.0.0.130] failed: Connection refused
13:12:59.673 ERROR c.c.a.s.utils.ClusterQueueManager - Network error enqueing requests, trying again in 15 seconds org.apache.http.conn.HttpHostConnectException: Connect to 10.0.0.130:9001 [/10.0.0.130] failed: Connection refused`
This is not a super detailed/helpful bug report, but mostly a reminder to check that Analyst system properly handles multiple simultaneous enqueued analyses.
I started one batch job and then started another identical one by accident (didn't see the progress bar for the first one which was higher up in the results list). Just after launching the second batch, the workers all ground to a halt and EC2 showed 0 CPU usage. I was unable to determine what happened, but here are some log messages from around that time:
This is not a super detailed/helpful bug report, but mostly a reminder to check that Analyst system properly handles multiple simultaneous enqueued analyses.