einaregilsson / beanstalk-deploy

GitHub action (and command line script) to deploy apps to Elastic Beanstalk
MIT License
621 stars 135 forks

Failure undetected, resulting in 6 hour timeout #86

Open melvyn-apryl opened 2 years ago

melvyn-apryl commented 2 years ago

This is with v16, but looking at the code, it can still happen with v20. The timeout is only checked when the status is Ready and the version labels match; the other branches have no timeout. In this case the status is Ready, but the version labels do not seem to match, or more likely the "deployment failed" message was not caught. Log from the action:

Deployment started, "wait_for_deployment" was true...
16:16:40 INFO: Environment update is starting.
16:17:20 INFO: Deploying new version to instance(s).
16:17:34 INFO: Still updating, status is "Updating", health is "Green", health status is "Info"
16:17:38 INFO: Environment health has transitioned from Ok to Info. Application update in progress (running for 13 seconds).
16:17:43 INFO: Instance deployment successfully generated a 'Procfile'.
16:18:01 ERROR: Instance deployment failed. For details, see 'eb-engine.log'.
16:18:05 ERROR: [Instance: i-02e54e6924cc0c92a] Command failed on instance. Return code: 1 Output: Engine execution has encountered an error..
16:18:05 INFO: Command execution completed on all instances. Summary: [Successful: 0, Failed: 1].
16:18:05 ERROR: Unsuccessful command execution on instance id(s) 'i-02e54e6924cc0c92a'. Aborting the operation.
16:18:37 INFO: Still updating, status is "Updating", health is "Red", health status is "Degraded"
16:18:38 WARN: Environment health has transitioned from Info to Degraded. Command failed on all instances. Incorrect application version found on all instances. Expected version "staging-533c634" (deployment 277). Application update is aborting. 1 out of 1 instance completed (running for 2 minutes). Impaired services on all instances.
16:19:40 INFO: Still updating, status is "Ready", health is "Red", health status is "Degraded"
16:20:44 INFO: Still updating, status is "Ready", health is "Red", health status is "Degraded"
16:21:47 INFO: Still updating, status is "Ready", health is "Red", health status is "Degraded"
... # etc

And eb environment:

2022-06-28 18:18:05 UTC+0200    ERROR Failed to deploy application.
2022-06-28 18:18:05 UTC+0200    ERROR Unsuccessful command execution on instance id(s) 'i-02e54e6924cc0c92a'. Aborting the operation.
2022-06-28 18:18:05 UTC+0200    INFO Command execution completed on all instances. Summary: [Successful: 0, Failed: 1].
2022-06-28 18:18:05 UTC+0200    ERROR [Instance: i-02e54e6924cc0c92a] Command failed on instance. Return code: 1 Output: Engine execution has encountered an error..

So the last of the four events emitted in the same second was missed here, and then the loop runs forever. The setting wait_for_environment_recovery: 120 is set in the job:

 Version description: 
          AWS Region: eu-central-1
                File: Deploy.zip
      AWS Access Key: 20 characters long, starts with H
      AWS Secret Key: 40 characters long, starts with D
 Wait for deployment: true
  Recovery wait time: 120
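A minimal sketch of a polling loop whose timeout is checked on every iteration, not only in the "Ready and versions match" branch. The function name and the shape returned by getStatus are hypothetical illustrations, not beanstalk-deploy's actual code:

```javascript
// Sketch: wait for a deployment, enforcing one wall-clock deadline in every
// branch so a missed failure event cannot leave the loop running for hours.
async function waitForDeployment(getStatus, expectedVersion, waitSeconds, pollMs = 1000) {
    const deadline = Date.now() + waitSeconds * 1000;
    for (;;) {
        // getStatus is a placeholder for polling the environment/events.
        const { status, versionLabel, failed } = await getStatus();
        if (failed) throw new Error('Deployment failed');
        // Success only when the environment is Ready AND running the new version.
        if (status === 'Ready' && versionLabel === expectedVersion) return;
        // Deadline applies even when status is "Ready" with a stale version,
        // which is exactly the state this issue got stuck in.
        if (Date.now() >= deadline) {
            throw new Error(`Timed out after ${waitSeconds}s (status=${status}, version=${versionLabel})`);
        }
        await new Promise(resolve => setTimeout(resolve, pollMs));
    }
}
```

With that structure, wait_for_environment_recovery would bound the wait regardless of which events were or were not observed.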
dantehemerson commented 2 years ago

I suggest you check /var/log/eb-engine.log on the associated EC2 instance to see the error in more detail. You probably included a change that is breaking your application.

melvyn-apryl commented 2 years ago

I know the cause. The problem is that AWS sent this message: 2022-06-28 18:18:05 UTC+0200 ERROR Failed to deploy application.

But it was not processed at this line: if (ev.Message.match(/Failed to deploy application/)) {

And so the action never terminates, which means the 120-second recovery wait that exists to prevent exactly this kind of hang cannot be relied on.
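One way to avoid dropping a failure event that shares a timestamp with other events is to scan every event in each fetched batch for the failure pattern, rather than keying off "events strictly newer than the last one seen". This is an illustrative sketch with a hypothetical helper name, not beanstalk-deploy's actual event-handling code:

```javascript
// Sketch: scan a whole batch of environment events for a deployment failure.
// Several events can carry the same EventDate (same second), so filtering by
// "newer than the last seen timestamp" can silently skip one of them.
function findFailureEvent(events) {
    return events.find(ev =>
        ev.Severity === 'ERROR' && /Failed to deploy application/.test(ev.Message)
    ) || null;
}
```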

tomgrowflow commented 1 year ago

This happened to me too. I set a GitHub Actions job timeout to prevent it; not great, though:

https://stackoverflow.com/a/59076067/1869299

my-job:
  runs-on: ubuntu-latest
  timeout-minutes: 30