F5Networks / f5-aws-cloudformation

CloudFormation Templates for quickly deploying BIG-IP services in Amazon Web Services EC2

v5.7.0: signaling to CF stack not working #120

Closed amolari closed 4 years ago

amolari commented 4 years ago

Do you already have an issue opened with F5 support?

Yes

Description

The new function that signals the CloudFormation stack is not working (the stack never receives the signal).

I see in /var/log/cloud/aws/install.log

2020-07-24T11:44:38.080Z info: [pid: 16106] [scripts/runScript.js] 2020-07-24T11:44:38.080Z info: [pid: 2007] [scripts/verifyDeploymentCompletion.js] Device is in cluster.
2020-07-24T11:44:38.081Z info: [pid: 16106] [scripts/runScript.js] 2020-07-24T11:44:38.081Z info: [pid: 2007] [scripts/verifyDeploymentCompletion.js] Sending DONE signal to CloudFormation.
2020-07-24T11:44:38.373Z info: [pid: 16106] [scripts/runScript.js] /config/cloud/aws/node_modules/@f5devcentral/f5-cloud-libs-aws/scripts/verifyDeploymentCompletion.js exited with code 0

Looking in /config/cloud/aws/node_modules/@f5devcentral/f5-cloud-libs-aws/lib/awsCloudProvider.js for:

function signalResourceReady(cloudFormation, stackName, instanceId) {
    const deferred = q.defer();

    getStackResources(cloudFormation, stackName)
        .then((resources) => {
            resources.forEach((resource) => {
                if (resource.ResourceType === 'AWS::AutoScaling::AutoScalingGroup') {
                    const signalParams = {
                        LogicalResourceId: resource.LogicalResourceId,
                        StackName: stackName,
                        Status: 'SUCCESS',
                        UniqueId: instanceId
                    };

                    cloudFormation.signalResource(signalParams, (err, data) => {
                        if (err) {
                            // May signal outside a CloudFormation event, and shouldn't reject then
                            logger.warn('Unable to signal resource', err);
                            deferred.resolve();
                        }
                        logger.info(`Signaled Stack for instance: ${instanceId}`);
                        deferred.resolve(data);
                    });
                }
            });
        })
        .catch((err) => {
            deferred.reject(err);
        });

    return deferred.promise;
}

As far as I understand, the message "Signaled Stack for instance: ${instanceId}" should be logged on success, and "Unable to signal resource" should be logged on error. I can see neither message in any log file (I grepped all files in /var/log/) on the primary instance. Am I missing something?
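A side note on the function quoted above: its error branch has no return, so after logging the warning it would go on to log the "Signaled Stack" message as well. The sketch below shows one way the callback could keep the two outcomes separate. makeSignalCallback and the array-backed logger are hypothetical names used only for illustration; they are not part of f5-cloud-libs-aws.

```javascript
// Hypothetical stand-in for the callback passed to cloudFormation.signalResource.
function makeSignalCallback(logger, instanceId, done) {
    return (err, data) => {
        if (err) {
            // May signal outside a CloudFormation event, so warn instead of rejecting.
            logger.warn('Unable to signal resource', err);
            done();
            return; // stop here so the success message is not logged as well
        }
        logger.info(`Signaled Stack for instance: ${instanceId}`);
        done(data);
    };
}

// Array-backed logger (assumption) so the outcome of each call can be inspected.
const messages = [];
const logger = {
    warn: (msg) => messages.push(`warn: ${msg}`),
    info: (msg) => messages.push(`info: ${msg}`)
};

// Simulate one failed and one successful signal.
makeSignalCallback(logger, 'i-0123456789abcdef0', () => {})(new Error('AccessDenied'));
makeSignalCallback(logger, 'i-0123456789abcdef0', () => {})(null, {});
console.log(messages);
```

With the early return, each call produces exactly one message: a warning for the failed signal and an info line for the successful one.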

Template

f5-aws-cloudformation/supported/autoscale/ltm/via-lb/1nic/existing-stack/bigiq/ v5.7.0

Severity Level

Severity: 5

andreykashcheev commented 4 years ago

Yes, that log message should appear whenever the signal is sent to AWS CloudFormation. Here is an example of the log messages from /var/log/cloud/aws/install.log:

[admin@ip-10-0-10-51:Active:In Sync] ~ # tail -f /var/log/cloud/aws/install.log 
2020-07-23T19:33:22.992Z info: [pid: 17712] [scripts/runScript.js] 2020-07-23T19:33:22.992Z info: [pid: 22942] [scripts/verifyDeploymentCompletion.js] Device is in cluster. 
2020-07-23T19:33:22.992Z info: [pid: 17712] [scripts/runScript.js] 2020-07-23T19:33:22.992Z info: [pid: 22942] [scripts/verifyDeploymentCompletion.js] Sending DONE signal to CloudFormation. 
2020-07-23T19:33:23.304Z info: [pid: 17712] [scripts/runScript.js] 2020-07-23T19:33:23.304Z info: [pid: 22942] [scripts/verifyDeploymentCompletion.js] Signaled Stack for instance: i-01cd2bebe6c8783aa 
2020-07-23T19:33:23.305Z info: [pid: 17712] [scripts/runScript.js] 2020-07-23T19:33:23.305Z info: [pid: 22942] [scripts/verifyDeploymentCompletion.js] Signal response: undefined 
2020-07-23T19:33:23.306Z info: [pid: 17712] [scripts/runScript.js] 2020-07-23T19:33:23.305Z info: [pid: 22942] [scripts/verifyDeploymentCompletion.js] Finally case got executed. 
2020-07-23T19:33:23.325Z info: [pid: 17712] [scripts/runScript.js] /config/cloud/aws/node_modules/@f5devcentral/f5-cloud-libs-aws/scripts/verifyDeploymentCompletion.js exited with code 0 

Additional information:

I would like to ask for additional details:

amolari commented 4 years ago

@andreykashcheev I've opened case 1-6538561390 and uploaded a qkview of the primary. Yes, the deployment times out and tries to roll back (stack delete).

andreykashcheev commented 4 years ago

Thanks for providing the qkview! I am looking at this issue and will provide an update today.

Looking at the results of our daily test runs, I can tell that the AWS WAF Autoscale via BIGIQ template was deployed 7 times in the last 5 days and all deployments were successful.

Question:

amolari commented 4 years ago

@andreykashcheev I think I've found the issue (on my side). I still have some pre-5.7.0 config parts (need NLB not ELB) and the PolicyDocument still has this code:

"Fn::If": [
       "useDefaultCert",

I haven't ported the new actions "cloudformation:ListStackResources" and "cloudformation:SignalResource" to both the if and else parts. I'm unable to test right now, but I will ASAP. In any case, shouldn't the function report an error (lack of permissions)? I see logger.warn('Unable to signal resource', err); in the code, but that message does not appear in my logs.
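For anyone carrying a customized template, a check along these lines might catch a branch that was not updated. It is only a sketch: the fragment below is a made-up policy, not the actual template, and collectActions and missingActionsPerBranch are hypothetical helper names.

```javascript
// Actions named in this thread as required for stack signaling.
const REQUIRED_ACTIONS = [
    'cloudformation:ListStackResources',
    'cloudformation:SignalResource'
];

// Recursively collect every value listed under an "Action" key.
function collectActions(node, found = new Set()) {
    if (Array.isArray(node)) {
        node.forEach((child) => collectActions(child, found));
    } else if (node && typeof node === 'object') {
        Object.entries(node).forEach(([key, value]) => {
            if (key === 'Action') {
                [].concat(value).forEach((action) => found.add(action));
            } else {
                collectActions(value, found);
            }
        });
    }
    return found;
}

// Report which required actions are missing from each branch of an Fn::If.
function missingActionsPerBranch(fnIf) {
    const [, whenTrue, whenFalse] = fnIf['Fn::If'];
    const missing = (branch) => {
        const actions = collectActions(branch);
        return REQUIRED_ACTIONS.filter((action) => !actions.has(action));
    };
    return { whenTrue: missing(whenTrue), whenFalse: missing(whenFalse) };
}

// Hypothetical fragment in which only the first branch was updated.
const statements = {
    'Fn::If': [
        'useDefaultCert',
        [{
            Effect: 'Allow',
            Action: [
                'ec2:DescribeInstances',
                'cloudformation:ListStackResources',
                'cloudformation:SignalResource'
            ],
            Resource: '*'
        }],
        [{ Effect: 'Allow', Action: ['ec2:DescribeInstances'], Resource: '*' }]
    ]
};

const report = missingActionsPerBranch(statements);
console.log(report);
```

Here the report lists both actions as missing from the second branch and none from the first.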

andreykashcheev commented 4 years ago

Here is the list of changes made to the template to enable signaling:

  1. Actions were added to BigipAutoScalingAccessRole:
    "cloudformation:ListStackResources",
    "cloudformation:SignalResource"

    https://github.com/F5Networks/f5-aws-cloudformation/blob/master/supported/autoscale/waf/via-lb/1nic/existing-stack/bigiq/f5-bigiq-autoscale-bigip-waf.template#L886

  2. Added CreationPolicy to BigipAutoscaleGroup:
    "BigipAutoscaleGroup": {
        "CreationPolicy": {
            "ResourceSignal": {
                "Count": {
                    "Ref": "scalingMinSize"
                },
                "Timeout": "PT30M"
            }
        }

    https://github.com/F5Networks/f5-aws-cloudformation/blob/master/supported/autoscale/waf/via-lb/1nic/existing-stack/bigiq/f5-bigiq-autoscale-bigip-waf.template#L924

  3. Included the verifyDeploymentCompletion.js script, which completes the AWS CloudFormation deployment once the devices are in sync: https://github.com/F5Networks/f5-aws-cloudformation/blob/master/supported/autoscale/waf/via-lb/1nic/existing-stack/bigiq/f5-bigiq-autoscale-bigip-waf.template#L1434
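With a CreationPolicy like the one above, CloudFormation waits for Count success signals within the Timeout ("PT30M" is an ISO 8601 duration of 30 minutes); if they do not arrive, stack creation fails and rolls back. As a rough sketch, a parsed template could be checked for Auto Scaling groups that lack this policy. groupsWithoutResourceSignal and the LegacyGroup resource are hypothetical and used only for illustration.

```javascript
// List Auto Scaling groups in a parsed template that have no ResourceSignal policy.
function groupsWithoutResourceSignal(template) {
    return Object.entries(template.Resources || {})
        .filter(([, resource]) => resource.Type === 'AWS::AutoScaling::AutoScalingGroup')
        .filter(([, resource]) => !(resource.CreationPolicy
            && resource.CreationPolicy.ResourceSignal))
        .map(([logicalId]) => logicalId);
}

// Made-up template: one group with the policy from this thread, one without.
const exampleTemplate = {
    Resources: {
        BigipAutoscaleGroup: {
            Type: 'AWS::AutoScaling::AutoScalingGroup',
            CreationPolicy: {
                ResourceSignal: { Count: { Ref: 'scalingMinSize' }, Timeout: 'PT30M' }
            }
        },
        LegacyGroup: { Type: 'AWS::AutoScaling::AutoScalingGroup' }
    }
};

const unsignaled = groupsWithoutResourceSignal(exampleTemplate);
console.log(unsignaled);
```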

The script did not work because the Actions were missing from BigipAutoScalingAccessRole; I was able to reproduce the issue after removing them:

2020-07-27T19:29:12.662Z info: [pid: 18685] [scripts/runScript.js] 2020-07-27T19:29:12.662Z silly: [pid: 12323] [scripts/verifyDeploymentCompletion.js] solution: autoscale 
2020-07-27T19:29:12.663Z info: [pid: 18685] [scripts/runScript.js] 2020-07-27T19:29:12.662Z silly: [pid: 12323] [scripts/verifyDeploymentCompletion.js] instance-count: 1 
2020-07-27T19:29:12.663Z info: [pid: 18685] [scripts/runScript.js] 2020-07-27T19:29:12.663Z info: [pid: 12323] [scripts/verifyDeploymentCompletion.js] This solution does not require clustering or less than 2 instances were provisioned with deployment. 
2020-07-27T19:29:12.666Z info: [pid: 18685] [scripts/runScript.js] 2020-07-27T19:29:12.666Z info: [pid: 12323] [scripts/verifyDeploymentCompletion.js] Sending DONE signal to CloudFormation. 
2020-07-27T19:29:12.935Z info: [pid: 18685] [scripts/runScript.js] /config/cloud/aws/node_modules/@f5devcentral/f5-cloud-libs-aws/scripts/verifyDeploymentCompletion.js exited with code 0 

After looking at the source code, I suspect that we do not see an error in the logs because the getStackResources method returns an empty list when the cloudformation:ListStackResources action is missing: https://github.com/F5Networks/f5-cloud-libs-aws/blob/586c37eccb873ba369afdf4d1cd67f40679ac6b8/lib/awsCloudProvider.js#L2033
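To illustrate that suspicion: the function logs only inside the per-resource callback, so if the resource list comes back empty, neither message can appear. The sketch below reduces the loop to its essentials. signalEach and the stubbed signal function are hypothetical, and the empty list is an assumption about what the missing permission produces.

```javascript
// Reduced version of the loop: logging happens only inside the signal callback.
function signalEach(resources, signal, logger) {
    let attempts = 0;
    resources.forEach((resource) => {
        if (resource.ResourceType === 'AWS::AutoScaling::AutoScalingGroup') {
            attempts += 1;
            signal(resource, (err) => {
                if (err) {
                    logger.warn('Unable to signal resource');
                    return;
                }
                logger.info('Signaled Stack');
            });
        }
    });
    return attempts;
}

const lines = [];
const logger = {
    warn: (msg) => lines.push(`warn: ${msg}`),
    info: (msg) => lines.push(`info: ${msg}`)
};
// Stub that reports success immediately (assumption, stands in for signalResource).
const alwaysSucceeds = (resource, callback) => callback(null);

// Empty list: the loop body never runs, so nothing is signaled or logged.
const attemptsWithEmptyList = signalEach([], alwaysSucceeds, logger);
const linesAfterEmptyList = lines.length;

// One Auto Scaling group: a single signal and a single log line.
const attemptsWithGroup = signalEach(
    [{ ResourceType: 'AWS::AutoScaling::AutoScalingGroup', LogicalResourceId: 'BigipAutoscaleGroup' }],
    alwaysSucceeds,
    logger
);
console.log(attemptsWithEmptyList, linesAfterEmptyList, attemptsWithGroup, lines);
```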

Today, I did several (~10) deployments using v5.7.0 and they all worked fine; in addition, there were 45 AWS Autoscale deployments in our daily tests and they also worked fine.

amolari commented 4 years ago

@andreykashcheev Thank you for the detailed explanation. I confirm that, after adding the missing Actions, it works as expected.