equinix / terraform-equinix-metal-anthos-on-vsphere

[Deprecated] Automated Anthos Installation via Terraform for Equinix Metal with vSphere
https://registry.terraform.io/modules/equinix/anthos-on-vsphere/metal/latest
Apache License 2.0
62 stars, 41 forks

improve error handling in deploy_admin_ws.sh #79

Closed dfong closed 3 years ago

dfong commented 4 years ago

the "if(/BEGIN/) {a++};" seemed pointless, so i got rid of it. also it is easier to redirect the printout externally to awk, so i did that.

my version:

awk '/BEGIN/,/END/ {print}'
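
For readers unfamiliar with awk range patterns, here is a minimal sketch of what the one-liner above does. The sample input and the temporary output file are made up for illustration; they are not taken from deploy_admin_ws.sh.

```shell
# Hypothetical sample input: a delimited block surrounded by unrelated output.
sample='connecting...
-----BEGIN CERTIFICATE-----
MIIBexample
-----END CERTIFICATE-----
done'

out_file=$(mktemp)

# The range pattern selects every line from the first match of /BEGIN/
# through the next match of /END/, inclusive. The shell, not awk, handles
# the redirection into the output file.
printf '%s\n' "$sample" | awk '/BEGIN/,/END/ {print}' > "$out_file"

extracted=$(cat "$out_file")
rm -f "$out_file"
printf '%s\n' "$extracted"
```

Because the range pattern already limits output to the delimited block, no counter variable is needed to track whether a BEGIN line has been seen.
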

paemason commented 4 years ago

What testing has been performed here? My Admin workstation deployed but there were some 'inconsequential failures' in the whole process, which are expected, but it caused my terraform to exit when it shouldn't have:

null_resource.anthos_deploy_workstation[0] (remote-exec): ********************************************************************
null_resource.anthos_deploy_workstation[0] (remote-exec): Some admin workstation preparation failed.
null_resource.anthos_deploy_workstation[0] (remote-exec): Check the messages to fix the error and finish preparation manually.

null_resource.anthos_deploy_workstation[0] (remote-exec): Admin workstation information saved to /root/anthos/admin-workstation
null_resource.anthos_deploy_workstation[0] (remote-exec): This file is required for future upgrades
null_resource.anthos_deploy_workstation[0] (remote-exec): SSH into the admin workstation with the following command:
null_resource.anthos_deploy_workstation[0] (remote-exec): ssh -i /root/anthos/ssh_key ubuntu@172.16.0.3
null_resource.anthos_deploy_workstation[0] (remote-exec): ********************************************************************
null_resource.anthos_deploy_workstation[0] (remote-exec): Exit with error:
null_resource.anthos_deploy_workstation[0] (remote-exec): Failed to create and prepare admin workstation: Failed to complete admin workstation preparation. Check above messages to decide what preparation is still needed.
null_resource.anthos_deploy_workstation[0] (remote-exec): edge-gateway01: Cmd /root/anthos/gkeadm failed, exit 1

The above failures are actually normal and expected

dfong commented 4 years ago

"What testing has been done here?"

i ran it locally on my mac, and it passed.

is there a standard set of tests that gets run on all merges automatically? or that i should run manually before i submit a pull request?

"there were some 'inconsequential failures' in the whole process, which are expected, but it caused my terraform to exit when it shouldn't have:"

is there documentation on which commands have expected inconsequential failures?

my goal is to make the process more reliable by catching errors early. many times i have seen terraform report success where it should have reported failure. or terraform fails but far downstream of the root cause, which wastes time and makes debugging more difficult.

i'm also trying to make the process more understandable by printing the command that is running, so the user can see which command is generating what errors/warnings.

paemason commented 4 years ago

For /root/anthos/gkeadm create admin-workstation --config /root/anthos/admin-ws-config.yaml --ssh-key-path /root/anthos/ssh_key --skip-validation, we would expect this output:

null_resource.anthos_deploy_workstation[0] (remote-exec): Getting whitelisted service account...

null_resource.anthos_deploy_workstation[0] (remote-exec): Enabling APIs...
null_resource.anthos_deploy_workstation[0] (remote-exec):     - project xxxx (for gke-on-prem-lab)
null_resource.anthos_deploy_workstation[0] (remote-exec):         - serviceusage.googleapis.com
null_resource.anthos_deploy_workstation[0] (remote-exec): failed to enable API "serviceusage.googleapis.com" in project "xxxx": error running command 'gcloud services enable serviceusage.googleapis.com --project xxxx --verbosity=error --quiet': error: exit status 1, stderr: 'ERROR: (gcloud.services.enable) PERMISSION_DENIED: The caller does not have permission
null_resource.anthos_deploy_workstation[0] (remote-exec): ', output: ""

null_resource.anthos_deploy_workstation[0] (remote-exec):         - iam.googleapis.com
null_resource.anthos_deploy_workstation[0] (remote-exec): failed to enable API "iam.googleapis.com" in project "xxxx": error running command 'gcloud services enable iam.googleapis.com --project xxxx --verbosity=error --quiet': error: exit status 1, stderr: 'ERROR: (gcloud.services.enable) PERMISSION_DENIED: The caller does not have permission
null_resource.anthos_deploy_workstation[0] (remote-exec): ', output: ""

null_resource.anthos_deploy_workstation[0] (remote-exec):         - cloudresourcemanager.googleapis.com
null_resource.anthos_deploy_workstation[0] (remote-exec): failed to enable API "cloudresourcemanager.googleapis.com" in project "xxxx": error running command 'gcloud services enable cloudresourcemanager.googleapis.com --project xxxx --verbosity=error --quiet': error: exit status 1, stderr: 'ERROR: (gcloud.services.enable) PERMISSION_DENIED: The caller does not have permission
null_resource.anthos_deploy_workstation[0] (remote-exec): ', output: ""

null_resource.anthos_deploy_workstation[0] (remote-exec): failed to enable APIs "serviceusage.googleapis.com,iam.googleapis.com,cloudresourcemanager.googleapis.com" in project "xxxx"

null_resource.anthos_deploy_workstation[0] (remote-exec): Configuring IAM roles for service accounts...
null_resource.anthos_deploy_workstation[0] (remote-exec):     - gke-on-prem-lab for project xxxx
null_resource.anthos_deploy_workstation[0] (remote-exec):         - roles/serviceusage.serviceUsageViewer
null_resource.anthos_deploy_workstation[0] (remote-exec): failed to add IAM role "roles/serviceusage.serviceUsageViewer" to service account "gke-on-prem-lab@xxxx.iam.gserviceaccount.com": error running command 'gcloud projects add-iam-policy-binding xxxx --member serviceAccount:gke-on-prem-lab@xxxx.iam.gserviceaccount.com --role roles/serviceusage.serviceUsageViewer --verbosity=error --quiet': error: exit status 1, stderr: 'ERROR: (gcloud.projects.add-iam-policy-binding) User [gke-on-prem-lab@xxxx.iam.gserviceaccount.com] does not have permission to access project [xxxx:setIamPolicy] (or it may not exist): Policy update access denied.
null_resource.anthos_deploy_workstation[0] (remote-exec): ', output: ""

null_resource.anthos_deploy_workstation[0] (remote-exec):         - roles/iam.serviceAccountCreator
null_resource.anthos_deploy_workstation[0] (remote-exec): failed to add IAM role "roles/iam.serviceAccountCreator" to service account "gke-on-prem-lab@xxxx.iam.gserviceaccount.com": error running command 'gcloud projects add-iam-policy-binding xxxx --member serviceAccount:gke-on-prem-lab@xxxx.iam.gserviceaccount.com --role roles/iam.serviceAccountCreator --verbosity=error --quiet': error: exit status 1, stderr: 'ERROR: (gcloud.projects.add-iam-policy-binding) User [gke-on-prem-lab@xxxx.iam.gserviceaccount.com] does not have permission to access project [xxxx:setIamPolicy] (or it may not exist): Policy update access denied.
null_resource.anthos_deploy_workstation[0] (remote-exec): ', output: ""

null_resource.anthos_deploy_workstation[0] (remote-exec):         - roles/iam.roleViewer
null_resource.anthos_deploy_workstation[0]: Still creating... [5m50s elapsed]
null_resource.anthos_deploy_workstation[0] (remote-exec): failed to add IAM role "roles/iam.roleViewer" to service account "gke-on-prem-lab@xxxx.iam.gserviceaccount.com": error running command 'gcloud projects add-iam-policy-binding xxxx --member serviceAccount:gke-on-prem-lab@xxxx.iam.gserviceaccount.com --role roles/iam.roleViewer --verbosity=error --quiet': error: exit status 1, stderr: 'ERROR: (gcloud.projects.add-iam-policy-binding) User [gke-on-prem-lab@xxxx.iam.gserviceaccount.com] does not have permission to access project [xxxx:setIamPolicy] (or it may not exist): Policy update access denied.
null_resource.anthos_deploy_workstation[0] (remote-exec): ', output: ""

null_resource.anthos_deploy_workstation[0] (remote-exec): failed to set IAM roles "roles/serviceusage.serviceUsageViewer,roles/iam.serviceAccountCreator,roles/iam.roleViewer" to service account "gke-on-prem-lab@xxxx.iam.gserviceaccount.com"

null_resource.anthos_deploy_workstation[0] (remote-exec): Copying files to admin workstation...
null_resource.anthos_deploy_workstation[0] (remote-exec):     - /root/anthos/vspherecert.pem
null_resource.anthos_deploy_workstation[0] (remote-exec):     - /root/anthos/gcp_keys/whitelisted-key.json

null_resource.anthos_deploy_workstation[0] (remote-exec): Preparing "admin-cluster.yaml" for gkectl...
null_resource.anthos_deploy_workstation[0] (remote-exec): Preparing "user-cluster.yaml" for gkectl...

null_resource.anthos_deploy_workstation[0] (remote-exec): ********************************************************************
null_resource.anthos_deploy_workstation[0] (remote-exec): Some admin workstation preparation failed.
null_resource.anthos_deploy_workstation[0] (remote-exec): Check the messages to fix the error and finish preparation manually.

null_resource.anthos_deploy_workstation[0] (remote-exec): Admin workstation information saved to /root/anthos/admin-workstation
null_resource.anthos_deploy_workstation[0] (remote-exec): This file is required for future upgrades
null_resource.anthos_deploy_workstation[0] (remote-exec): SSH into the admin workstation with the following command:
null_resource.anthos_deploy_workstation[0] (remote-exec): ssh -i /root/anthos/ssh_key ubuntu@172.16.0.3

I'm not sure how, from your code, this output triggers a failure, but it does and it should not.

paemason commented 4 years ago

"is there a standard set of tests that gets run on all merges automatically? or that i should run manually before i submit a pull request?"

Not yet, would be fantastic to have. Currently I run all the tests manually in a number of scenarios before I merge.

dfong commented 4 years ago

"I'm not sure how, from your code, this output triggers a failure, but it does and it should not."

how it works: in my change, every "major" command is called via the wrapper function "xrun". xrun prints the command to stderr, runs it, and checks the exit status. if the exit status is nonzero, the script exits.

the command "/root/anthos/gkeadm create ..." is a "major command" in this regard.
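
A minimal sketch of such a wrapper, for illustration only. This is not the code from the pull request; the failure message format is modeled on the "edge-gateway01: Cmd /root/anthos/gkeadm failed, exit 1" line in the log above, and the demo commands are placeholders.

```shell
# xrun: print the command to stderr, run it, and exit the script if it fails.
xrun() {
    printf '+ %s\n' "$*" >&2
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        printf '%s: Cmd %s failed, exit %s\n' "$(hostname 2>/dev/null)" "$1" "$status" >&2
        exit "$status"
    fi
}

# Demo inside a command substitution, so that "exit" ends only the subshell
# and not the surrounding script.
demo_status=0
demo_out=$( { xrun true; xrun false; echo "not reached"; } 2>/dev/null ) || demo_status=$?
printf 'demo exit status: %s\n' "$demo_status"
```

In the demo, the second call fails, so the wrapper exits before the final echo runs. That is the same mechanism that makes a nonzero exit from gkeadm stop the whole script.
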

if this command is "expected" to fail, the downside is that there is no way to distinguish an unexpected failure (ie, a "real" failure) from an expected failure. consequently the script cannot honor the shell script convention of exiting with nonzero status on failure. in turn, this means terraform won't become aware of the problem until much later. it's even possible that terraform apply could eventually appear to "succeed", even though the cluster wasn't correctly provisioned.

if there is no way to avoid the "expected" failures from gkeadm, i would propose adding a command afterward that validates the conditions that gkeadm was expected to create.
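
One possible shape for such a follow-up check, as a rough sketch. The function names are hypothetical, and the conditions checked are assumptions inferred from the gkeadm log output quoted earlier (the saved admin workstation information file and the SSH login); the real set of conditions would need to be confirmed.

```shell
# Hypothetical check: the admin workstation information file exists and is
# not empty.
check_admin_ws_info() {
    info_file=$1
    if [ ! -s "$info_file" ]; then
        printf 'missing or empty: %s\n' "$info_file" >&2
        return 1
    fi
}

# Hypothetical check: the admin workstation accepts a non-interactive SSH login.
check_admin_ws_ssh() {
    key=$1
    target=$2
    ssh -i "$key" -o BatchMode=yes -o ConnectTimeout=10 "$target" true
}

# Possible usage after the gkeadm call (paths taken from the log output above):
#   check_admin_ws_info /root/anthos/admin-workstation || exit 1
#   check_admin_ws_ssh /root/anthos/ssh_key ubuntu@172.16.0.3 || exit 1
```

With checks like these, the script could ignore the exit status of gkeadm itself and still exit nonzero when the workstation was not actually created.
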

which raises an obvious question: what does "--skip-validation" do? naively it seems that some kind of validation would be a good thing here.

the same goes for any other command that is "expected" to fail. are you aware of any other places where failures are expected?

dfong commented 4 years ago

"Currently I run all the tests manually in a number of scenarios before I merge."

can you describe these scenarios? i'm thinking that i might be able to create a kokoro job to do this. i already have a kokoro job that is intended as CI for my own automation.

of course with all the recent packet.com issues, my kokoro job fails a lot for reasons having nothing to do with my stuff or google-anthos.

dfong commented 4 years ago

also, regarding this:

null_resource.anthos_deploy_workstation[0] (remote-exec): Enabling APIs...
null_resource.anthos_deploy_workstation[0] (remote-exec):     - project xxxx (for gke-on-prem-lab)
null_resource.anthos_deploy_workstation[0] (remote-exec):         - serviceusage.googleapis.com
null_resource.anthos_deploy_workstation[0] (remote-exec): failed to enable API "serviceusage.googleapis.com" in project "xxxx": error running command 'gcloud services enable serviceusage.googleapis.com --project xxxx --verbosity=error --quiet': error: exit status 1, stderr: 'ERROR: (gcloud.services.enable) PERMISSION_DENIED: The caller does not have permission
null_resource.anthos_deploy_workstation[0] (remote-exec): ', output: ""

under what conditions are you getting these messages? i don't see them in my own logfiles.

rather than treat this as an "expected" error, i think it'd be better to avoid the problem one way or another. if the caller doesn't have permission, change the setup procedure so that it does have permission, or so that the permission isn't needed.

i am also wondering why this enabling of APIs is happening in the remote-exec? i thought that enabling APIs was something done "locally" in the script create_service_accounts.sh. is there a need to do this remotely?

i also thought that enabling APIs was only needed once per "GCP project", not once per cluster creation?

if you were able to create the cluster successfully in spite of the PERMISSION_DENIED errors, then i would reason that the APIs were already enabled. so what is the point of redoing these calls remotely and in a context where the operations fail?

dfong commented 4 years ago

i have updated my pull request so that a failure on /root/anthos/gkeadm will not fail the script.

is that sufficient to satisfy the objections?

dfong commented 3 years ago

FYI, this is the comment that Paul Mason sent me in email:

The issue is that the service account (following least permission model) does not have the ability to edit IAM roles and enable services. However, the IAM roles have already been set and services enabled by the create_service_accounts.sh.

This is because we have fully automated the process and use service accounts to work on behalf of the user where the GKE on-prem code expects an authorized user to be running these commands.

You may not be hitting this error because you have given your service account the proper permissions (project editor) for your project. We however do not want to encourage users to give a service account such permissions to the project. 

i probably did add the needed permission to my own SA automation, so i don't see the error.

therefore under the constraints described by Paul, the error is indeed expected.

therefore i modified my CL, so it will not check the exit status of that command.
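
A sketch of what running a command without checking its exit status could look like while still printing the command. The wrapper name xrun_nofail is hypothetical and this is not the code from the CL.

```shell
# xrun_nofail: print the command to stderr and run it, but only warn when it
# exits nonzero; the wrapper itself always returns 0.
xrun_nofail() {
    printf '+ %s\n' "$*" >&2
    "$@" || printf 'warning: %s exited %s (ignored)\n' "$1" "$?" >&2
    return 0
}

# Demo with a placeholder command that exits with status 3.
demo_msg=$(xrun_nofail sh -c 'exit 3' 2>&1)
printf '%s\n' "$demo_msg"
```

The warning keeps the failure visible in the terraform log even though it no longer stops the script.
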

dfong commented 3 years ago

i have modified the function to call hostname directly instead of using the HOSTNAME var.
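
For context, a small sketch of the difference. bash sets HOSTNAME automatically, but other shells such as dash do not, so a script run by a different interpreter may see it empty. The helper name host_name is hypothetical, and the fallback to uname -n is an addition for illustration.

```shell
# Ask the system for the host name instead of relying on the HOSTNAME variable.
host_name() {
    hostname 2>/dev/null || uname -n
}

printf '%s: example message\n' "$(host_name)" >&2
```
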