GoogleCloudPlatform / fda-mystudies

FDA MyStudies

Load Balancer: Configure and Deploy Your Applications - Verify the Status of the Kubernetes Cluster #4495

Closed jaseva closed 2 years ago

jaseva commented 2 years ago

What are you trying to accomplish? Hi, I am deploying the platform following the semi-automated instructions for the most recent release version v2.0.9. I am currently on the third bullet of Step 6:

image

What challenge are you running into? I am unsure if the Load Balancers were configured properly. For the TCP/UDP (internal) load balancer type using the TCP protocol, the health check displays that 1 backend service is unhealthy.

image

All 3 backend instance groups appear unhealthy:

image

What have you tried so far? When I inspect each backend instance group separately, all appear healthy:

image

image

image

In the Logs Explorer for the apps project, there are many errors reported in the log:

image

I haven't completed the database migration steps yet though I completed the rest of the deployment up to the configure your first study section.

Neither the studybuilder nor the participant-manager page loads; both time out when I visit the URL in a web browser. My domains are from a third party other than Google, so I may have configured Cloud DNS incorrectly. The certificates issued to the studies and participants domains are both Google-managed and are currently active with healthy checkmarks.

I mapped the {PREFIX}-{ENV}.{DOMAIN} to the first 2 Google domains listed through Cloud DNS > Registrar Setup and used the domain's IP address for 2 A records through my hosting provider. For the participants.{PREFIX}-{ENV}.{DOMAIN} and studies.{PREFIX}-{ENV}.{DOMAIN} I used the last 2 Google domains listed through the Registrar setup and used the IP addresses for the A records. This may be incorrect. I am able to verify that the domains have been mapped to the proper domain IP addresses by using nslookup through the command prompt.

image

image

When I check the APIs & Services, I have the following errors for the past hour for the Apps and Network projects:

{PREFIX}-{DEV}-apps
Cloud Monitoring API: google.monitoring.v3.MetricService.CreateTimeSeries
Compute Engine API: compute.googleapis.com: compute.beta.HealthChecksService.Get
Compute Engine API: compute.googleapis.com: compute.beta.InstanceGroupManagersService.Get
Compute Engine API: compute.googleapis.com: compute.v1.BackendServicesService.Get
Compute Engine API: compute.googleapis.com: compute.v1.GlobalForwardingRulesService.Get
Compute Engine API: compute.googleapis.com: compute.v1.HealthChecksService.Get
Compute Engine API: compute.googleapis.com: compute.v1.HttpHealthChecksService.Get
Compute Engine API: compute.googleapis.com: compute.v1.InstanceGroupManagersService.Get
Compute Engine API: compute.googleapis.com: compute.v1.InstanceGroupsService.Get
Compute Engine API: compute.googleapis.com: compute.v1.InstancesService.Get
Compute Engine API: compute.googleapis.com: compute.v1.InstanceTemplatesService.Delete
Compute Engine API: compute.googleapis.com: compute.v1.NetworkEndpointGroupsService.Get
Compute Engine API: compute.googleapis.com: compute.v1.ProjectsService.GetXpnResources
Compute Engine API: compute.googleapis.com: compute.v1.RegionAddressesService.Get
Compute Engine API: compute.googleapis.com: compute.v1.RegionBackendServicesService.Get
Compute Engine API: compute.googleapis.com: compute.v1.RegionForwardingRulesService.Get
Compute Engine API: compute.googleapis.com: compute.v1.RegionsService.Get
Compute Engine API: compute.googleapis.com: compute.v1.RegionUrlMapsService.Get
Compute Engine API: compute.googleapis.com: compute.v1.SslCertificatesService.Get
Compute Engine API: compute.googleapis.com: compute.v1.TargetHttpProxiesService.Get
Compute Engine API: compute.googleapis.com: compute.v1.TargetHttpsProxiesService.Get
Compute Engine API: compute.googleapis.com: compute.v1.TargetPoolsService.Get
Compute Engine API: compute.googleapis.com: compute.v1.TargetSslProxiesService.Get
Compute Engine API: compute.googleapis.com: compute.v1.UrlMapsService.Get
Compute Engine API: compute.googleapis.com: compute.v1.ZonesService.Get

{PREFIX}-{DEV}-network Compute Engine API: compute.v1.FirewallsService.Update

Please advise @mohangmk or anyone else who may have encountered a similar problem. Thank you.

Labels Kubernetes Cluster, Verify Status, Load Balancer, Configure and Deploy Your Applications, TCP/UDP (Internal), TCP Protocol

jaseva commented 2 years ago

In the {PREFIX}-{ENV} apps project one of the messages is for the Kubernetes cluster {PREFIX}-{ENV}-gke-cluster:

{PREFIX}-{ENV}-gke-cluster scale down blocked by pod - Kubernetes Engine - {PREFIX}-{ENV}-apps project in GCP. Pod is blocking scale down because it has local storage.

Perhaps this is not an issue to be concerned about?

The recommended action is: Set annotation "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" for Pod

image

It seems there are 3 storage classes in this Kubernetes cluster:

image

Which storage class would I add the recommended action command to?
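On the storage-class question: the cluster-autoscaler.kubernetes.io/safe-to-evict annotation is set on Pods (normally through the pod template of the workload that owns them), not on a StorageClass. Below is a minimal Python sketch that builds a patch body for a Deployment's pod template. The workload name example-workload in the comment is hypothetical, and whether a given pod is actually safe to evict should be confirmed before applying anything like this.

```python
import json

# Annotation named in the autoscaler's recommended action.
ANNOTATION = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def safe_to_evict_patch(value: bool = True) -> dict:
    """Build a patch that sets the annotation on a Deployment's pod template."""
    return {
        "spec": {
            "template": {
                "metadata": {
                    # Annotation values must be strings, not booleans.
                    "annotations": {ANNOTATION: "true" if value else "false"}
                }
            }
        }
    }


if __name__ == "__main__":
    # The printed JSON could be passed to something like:
    #   kubectl patch deployment example-workload --patch '<json>'
    # (example-workload is a hypothetical name, not one from this thread).
    print(json.dumps(safe_to_evict_patch()))
```

Patching the pod template rather than a running Pod means replacement Pods created by the Deployment carry the annotation too.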

This issue may be related to IAM/ permissions as one of the error messages states that there was an unrecoverable error sending samples to remote storage.

image

I also see a cloudsql-proxy error and a hydra-ic error. Prior to this deployment attempt, I had configured another folder and set of projects, but I ran into issues with the database and deleted them. Perhaps these issues are related to my prior installation attempt? I thought I had deleted all resources beforehand. If not, my only other thought is that yesterday, during step 4 of the Configure and Deploy Your Application section, I entered all secret values for the first time and accidentally refreshed the Kubernetes cluster and restarted the relevant pods, though this was unnecessary as it was my first time entering secret values during the initial deployment. Here is a screenshot of the semi-automated deployment instructions I am referring to:

image

Please advise, thank you.

yugandhar-btc commented 2 years ago

@jaseva The health check is failing for one of the backend services. Could you please tell us which TCP port number the unhealthy backend is using for the TCP/UDP (internal) load balancer?

jaseva commented 2 years ago

@yugandhar-btc thanks for the reply.

I deleted the Kubernetes cluster earlier today prior to receiving your message and tried re-deploying the platform following the semi-automated deployment steps from the section 'Configure and deploy your applications'. I am currently at step 6: 'Verify the status of your Kubernetes Cluster'.

The Kubernetes cluster still produces the "scale down blocked by pod" notification.

image

What I notice is that the workloads keep failing for participant-manager-datastore and promsd. The promsd workload intermittently becomes healthy and then fails again, while participant-manager-datastore remains in an unhealthy state. This has persisted for about an hour now. Perhaps I will wait until tomorrow and re-check to see if the workloads become healthy.

image

image

image

image

image

Please indicate where I can find the TCP Port number for the backend service(s) that are failing.

I did notice a warning message about the networking.k8s.io/v1beta1 Ingress being deprecated when I was running step 5. When I ran the command to update the firewall, I got an error message on the first attempt. On the second attempt it worked.

image

image

image

When running the command kubectl describe pod, I receive an error and another failure in participant-manager-datastore:

image

image

image

image

image

jaseva commented 2 years ago

I also configured my participants.{PREFIX}-{ENV}.{DOMAIN}. and studies.{PREFIX}-{ENV}.{DOMAIN}. A-records through my 3rd party hosting provider to point to the IP Address 34.120.219.140. Not sure if that is the right thing to do.

image

The command I ran above (I blacked out the prefix-env.domain for privacy) was: nslookup {PREFIX}-{ENV}.{DOMAIN}.

image

Through the Registrar Setup in Cloud DNS, I was provided these domains:

image

image

image

yugandhar-btc commented 2 years ago

@jaseva It looks like the participant manager was not deployed properly. Please try redeploying the participant manager datastore; all of the participant manager errors reported above are related to it. Because the participant manager datastore failed to deploy in the back end, the scheduler will keep trying to schedule the pod until it is deployed successfully.

The warning message about the networking.k8s.io/v1beta1 Ingress can be ignored for now. It is not creating any issues in our deployment, and it will be updated to the latest version in a future release.

To find the internal load balancer TCP port number for the backend services, please follow these steps: GCP Console > search for the Apps project > navigate to VM instances > Health checks

The backend health check failure is a known issue. We have found the root cause and a solution, and you can expect the fix as part of the next platform release.

jaseva commented 2 years ago

Hi @yugandhar-btc, thank you for the info. I followed the steps, deleted the workloads that were giving errors, and re-built all the application triggers. Now, all workloads have healthy checkmarks, however for the Ingress, there are 2 errors related to the backend for participant manager:

image

image

image

image

image

image

image

Here are the internal load balancer TCP port numbers for the backend services, as per your instructions:

image

Please advise. Thanks again for your assistance.

yugandhar-btc commented 2 years ago

@jaseva The ingress will not be healthy until all your backend services reach the OK state. It looks like your backend services are running, and it will take a while for them to become healthy.

jaseva commented 2 years ago

Okay, great, thanks for the reply. I'll give it some time and report back after a bit if there is a change. Much appreciated @yugandhar-btc.

jaseva commented 2 years ago

Hi @yugandhar-btc it looks like only one of the backend services for the load balancing resources started working. There is still one backend service load balancing resource that is unhealthy.

image

image

Both frontend TLS certificates are healthy.

When I look in the logs and events, the message states that the ingress is ready to sync and that a firewall change is required by the security admin (which I have run successfully multiple times).

image

image

image

image

The only thing I can think of that I may have set up incorrectly on my end is the domain configuration for the deployment.

image

I use separate providers for domain name registration (NameCheap) and DNS (HostGator) and I think that I may have incorrectly configured my DNS settings for the delegated subzone step in the semi-automated deployment instructions.

The instructions state to create an NS resource record; however, through my DNS provider I am only able to configure A, CNAME, MX, SRV, and TXT zone records.

The provided nameservers in the registrar setup are:

ns-cloud-b1.googledomains.com. ns-cloud-b2.googledomains.com. ns-cloud-b3.googledomains.com. ns-cloud-b4.googledomains.com.

I used the nslookup command to get the IP address of each nameserver above:

216.239.32.107 216.239.34.107 216.239.36.107 216.239.38.107

Through HostGator, should I create 4 separate A records for each IP address configured to the {PREFIX}-{ENV} subdomain? I have only configured one A record for one IP address for {PREFIX}-{ENV}-{DOMAIN}.

image

Do I need to configure any A records for the participants.{PREFIX}-{ENV}.{DOMAIN} and studies.{PREFIX}-{ENV}.{DOMAIN} subdomains? I noticed that the external HTTP(S) load balancer IP address 34.120.219.140 is listed for these two subdomains in the routing policy column. Which IP address should I configure for the studies and participants subdomains, if any at all?
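For context on the question above: NS and A records do different jobs. An NS record names the servers that answer DNS queries for a zone, while an A record gives the address a browser actually connects to. An A record that targets a nameserver IP therefore sends web traffic to a DNS server rather than to the application. The Python sketch below classifies a set of resolved addresses on that basis. The function name and verdict strings are illustrative, and it assumes, as stated in this thread, that 34.120.219.140 is the external HTTP(S) load balancer and that the 216.239.x.107 addresses belong to the Cloud DNS nameservers.

```python
import ipaddress


def classify_a_record(resolved_ips, load_balancer_ip, nameserver_ips):
    """Return a short verdict describing where a hostname's A records point."""
    resolved = {ipaddress.ip_address(ip) for ip in resolved_ips}
    lb = ipaddress.ip_address(load_balancer_ip)
    nameservers = {ipaddress.ip_address(ip) for ip in nameserver_ips}

    if not resolved:
        return "no A record"
    if resolved == {lb}:
        return "points at load balancer"
    if resolved & nameservers:
        return "points at a nameserver"
    return "points elsewhere"


if __name__ == "__main__":
    # Addresses quoted in this thread.
    LB_IP = "34.120.219.140"
    NS_IPS = ["216.239.32.107", "216.239.34.107",
              "216.239.36.107", "216.239.38.107"]

    print(classify_a_record(["216.239.34.107"], LB_IP, NS_IPS))
    print(classify_a_record(["34.120.219.140"], LB_IP, NS_IPS))
```

To check a live hostname, the resolved addresses could be taken from socket.gethostbyname_ex(hostname)[2] and passed in as resolved_ips.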

image

Thanks!

yugandhar-btc commented 2 years ago

@jaseva I just looked at your ingress events. Did you update the firewalls by following the steps below?

Update firewalls:

The unhealthy backend service will not create any issues. We found the root cause, and you can expect the fix as part of the next public platform release.

If you are using third-party domain name registration, make sure you are providing the right host details; there is no need to follow the "Configure your domain for the deployment" steps.

Hosts ex:-

jaseva commented 2 years ago

Hi @yugandhar-btc, I repeatedly tried updating the firewall yesterday with the steps you outlined; however, it did not seem to work even though it says that it was updated. I tried again this morning with the same result:

image

I'll ignore the Unhealthy Backend service issue with the k8s1-f048d143-default-participant-manager-datasto-5000-975ff333k8s1 load balancing resource since it will not cause any issues.

image

As per connectivity to the {PREFIX}-{ENV}.{DOMAIN} that I configured through my hosting provider's DNS Zone settings, it returns connectivity when I run a nslookup command with one of the IP addresses 216.239.32.107 for the provided nameserver ns-cloud-b1.googledomains.com.

image

This is expected as I have configured an A record for my domain through my hosting provider's DNS Zone settings for the 216.239.32.107 domain. I have not configured any A or other records for the domain's DNS Zone settings.

When I visit the url: https://studies.{PREFIX}-{ENV}.{DOMAIN}/studybuilder it returns a message in the browser stating it can't reach the page.

image

I receive the same browser message when I visit the url: https://participants.{PREFIX}-{ENV}.{DOMAIN}/participant-manager/

I should mention that I am following the same process as I did when I first attempted to deploy this platform. I was successful doing so however ran into issues with the database when trying to create a first study. At least I was able to log in the first time to the studybuilder. With this deployment, I do not seem to have the same connectivity. Here is the thread with the issue I was experiencing with my first deployment. https://github.com/GoogleCloudPlatform/fda-mystudies/issues/4480 I eventually gave up and deleted all projects and re-ran the deployment steps again starting fresh.

yugandhar-btc commented 2 years ago

@jaseva It looks like your domain is not pointed properly. Could you please check and confirm the status of the TLS certificates?

I hope you replaced the {PREFIX}-{ENV}.{DOMAIN} variables in your host details

jaseva commented 2 years ago

@yugandhar-btc both certificate statuses are active and healthy:

image

image

Yes, I can confirm that I replaced the variables with my host details.

As a test, I just created two unique A records in my hosting provider for studies.{PREFIX}-{ENV}.{DOMAIN} and participants.{PREFIX}-{ENV}.{DOMAIN} pointing to the IP addresses 216.239.34.107 and 216.239.36.107, though I think that is unnecessary as I already have an A record for {PREFIX}-{ENV}.{DOMAIN} at 216.239.32.107.

I should mention that for this deployment attempt, I have not run any of the database migration steps below in the semi-automated instructions. I am not sure whether I need to; however, I did so with my first deployment and seemed to make further progress than I am making now, as I cannot seem to connect to the studybuilder site.

In the Load Balancing GCP screen, I see that the TCP/UDP load balancer type is unhealthy though I think that is the same known issue you mentioned earlier that will not have any impact:

image

image

When I click each instance group listed, it says that the status is unmanaged though each instance group member appears healthy.

yugandhar-btc commented 2 years ago

@jaseva As I mentioned earlier, it is a known issue and it will not create any issues for the application.

jaseva commented 2 years ago

@yugandhar-btc When re-running the step 7 for configuring the initial application credentials, I receive an error while running the registration commands, I assume that is because I had already previously completed this step successfully:

image

image

image

Do I need to proceed to the database migration steps or is this unnecessary? https://github.com/GoogleCloudPlatform/fda-mystudies/blob/master/db-migration/README.md I saw in another thread that it was necessary to do so; however, the semi-automated deployment instructions were not that clear to me.

yugandhar-btc commented 2 years ago

@jaseva If you have already run step 7 to configure the initial application credentials, you do not need to re-run it. The data already exists, so those errors are expected.

jaseva commented 2 years ago

@yugandhar-btc thanks for confirming. Do I need to conduct the database migration steps? I doubt that has anything to do with the domain being improperly pointed. Perhaps the root cause of the issue is related to the firewall? I still seem to be receiving the same event message for the ingress after having re-run the firewall steps successfully earlier. Perhaps it could be IAM/permission related? If that is the case, do you think it would be resolved by the database migration steps, as it seems there were IAM changes in some of the database version upgrades?

image

--Update-- I inspected all terraform and kubernetes scripts for the database migration semi-automated deployment steps and all the correct updates have been applied in the v2.0.9 release version copy of the Github repo master branch that I am using for this deployment.

In the apps project, for the past 1 minute, the log data is showing IAM, sidecar, and hydra-ic errors.

image

Here is some of the log data for the errors that seem to keep repeating. I replaced all environment variables with {PREFIX}-{ENV}:

{ "protoPayload": { "@type": "type.googleapis.com/google.cloud.audit.AuditLog", "status": { "code": 7, "message": "Permission monitoring.timeSeries.create denied (or the resource may not exist)." }, "authenticationInfo": { "serviceAccountDelegationInfo": [ {} ], "principalSubject": "serviceAccount:{PREFIX}-{ENV}-apps.svc.id.goog[istio-system/promsd]" }, "requestMetadata": { "callerIp": "10.0.0.29", "callerSuppliedUserAgent": "StackdriverPrometheus/0.4.0 grpc-go/1.10.0-dev,gzip(gfe)", "callerNetwork": "//compute.googleapis.com/projects/{PREFIX}-{ENV}-networks/global/networks/unknown", "requestAttributes": { "time": "2022-03-29T21:09:59.507398651Z", "auth": {} }, "destinationAttributes": {} }, "serviceName": "monitoring.googleapis.com", "methodName": "google.monitoring.v3.MetricService.CreateTimeSeries", "authorizationInfo": [ { "resource": "130817636032", "permission": "monitoring.timeSeries.create", "resourceAttributes": {} } ], "resourceName": "projects/{PREFIX}-{ENV}-apps", "request": { "@type": "type.googleapis.com/google.monitoring.v3.CreateTimeSeriesRequest", "name": "projects/{PREFIX}-{ENV}-apps" } }, "insertId": "1vfbfbieth3td", "resource": { "type": "audited_resource", "labels": { "service": "monitoring.googleapis.com", "project_id": "{PREFIX}-{ENV}-apps", "method": "google.monitoring.v3.MetricService.CreateTimeSeries" } }, "timestamp": "2022-03-29T21:09:59.424378716Z", "severity": "ERROR", "logName": "projects/{PREFIX}-{ENV}-apps/logs/cloudaudit.googleapis.com%2Fdata_access", "receiveTimestamp": "2022-03-29T21:10:00.263812852Z" }


{ "textPayload": "level=warn ts=2022-03-29T21:09:59.509304398Z caller=queue_manager.go:551 component=queue_manager msg=\"Unrecoverable error sending samples to remote storage\" err=\"rpc error: code = PermissionDenied desc = Permission monitoring.timeSeries.create denied (or the resource may not exist).\"\n", "insertId": "p9qve3z9tyn4ncut", "resource": { "type": "k8s_container", "labels": { "container_name": "sidecar", "cluster_name": "{PREFIX}-{ENV}-gke-cluster", "project_id": "{PREFIX}-{ENV}-apps", "location": "us-central1", "pod_name": "promsd-574ccb9745-r8c2m", "namespace_name": "istio-system" } }, "timestamp": "2022-03-29T21:09:59.509683341Z", "severity": "ERROR", "labels": { "k8s-pod/pod-template-hash": "574ccb9745", "compute.googleapis.com/resource_name": "gke-{PREFIX}-{ENV}-gke-default-node-poo-14798bb5-tvk9", "k8s-pod/app": "promsd" }, "logName": "projects/{PREFIX}-{ENV}-apps/logs/stderr", "receiveTimestamp": "2022-03-29T21:10:01.990543196Z" }


{ "textPayload": "time=2022-03-29T21:09:59Z level=info msg=started handling request http_request=map[headers:map[user-agent:GoogleHC/1.0] host:172.16.2.12 method:GET path:/health/ready query: remote:35.191.10.114:32896 scheme:http]\n", "insertId": "vd05ubg4ma6bo96a", "resource": { "type": "k8s_container", "labels": { "container_name": "hydra-ic", "cluster_name": "{PREFIX}-{ENV}-gke-cluster", "pod_name": "hydra-ic-7857b69d8f-h294x", "project_id": "{PREFIX}-{ENV}-apps", "namespace_name": "default", "location": "us-central1" } }, "timestamp": "2022-03-29T21:09:59.962019001Z", "severity": "ERROR", "labels": { "compute.googleapis.com/resource_name": "gke-{PREFIX}-{ENV}-gke-default-node-poo-1e92dbeb-km9x", "k8s-pod/app": "hydra-ic", "k8s-pod/pod-template-hash": "7857b69d8f" }, "logName": "projects/{PREFIX}-{ENV}-apps/logs/stderr", "receiveTimestamp": "2022-03-29T21:10:03.758474005Z" }


{ "textPayload": "time=2022-03-29T21:09:59Z level=info msg=completed handling request http_request=map[headers:map[user-agent:GoogleHC/1.0] host:172.16.2.12 method:GET path:/health/ready query: remote:35.191.10.114:32896 scheme:http] http_response=map[status:200 text_status:OK took:1.996124ms]\n", "insertId": "nl5q1qi7bno15yap", "resource": { "type": "k8s_container", "labels": { "container_name": "hydra-ic", "namespace_name": "default", "project_id": "{PREFIX}-{ENV}-apps", "pod_name": "hydra-ic-7857b69d8f-h294x", "cluster_name": "{PREFIX}-{ENV}-gke-cluster", "location": "us-central1" } }, "timestamp": "2022-03-29T21:09:59.963946968Z", "severity": "ERROR", "labels": { "k8s-pod/app": "hydra-ic", "k8s-pod/pod-template-hash": "7857b69d8f", "compute.googleapis.com/resource_name": "gke-{PREFIX}-{ENV}-gke-default-node-poo-1e92dbeb-km9x" }, "logName": "projects/{PREFIX}-{ENV}-apps/logs/stderr", "receiveTimestamp": "2022-03-29T21:10:03.758474005Z" }
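Reading the four entries above side by side suggests they are not all the same kind of problem. The audit entry and the sidecar entry both report that the promsd workload identity was denied monitoring.timeSeries.create, while the two hydra-ic entries carry level=info and a 200 response to a GoogleHC health probe, so their ERROR severity most likely reflects only that the container wrote them to stderr. The Python sketch below is one way to triage such entries. The function name and verdict strings are illustrative, and the sample entries are trimmed copies of the ones quoted above.

```python
import re

LEVEL_RE = re.compile(r"\blevel=(\w+)")
PERMISSION_DENIED = 7  # gRPC status code for PERMISSION_DENIED


def triage(entry: dict) -> str:
    """Classify one log entry as a permission failure, informational, or unknown."""
    payload = entry.get("protoPayload", {})
    if payload.get("status", {}).get("code") == PERMISSION_DENIED:
        auth = payload.get("authorizationInfo") or [{}]
        permission = auth[0].get("permission", "unknown")
        principal = payload.get("authenticationInfo", {}).get(
            "principalSubject", "unknown")
        return f"permission denied: {permission} for {principal}"

    text = entry.get("textPayload", "")
    if "PermissionDenied" in text:
        return "permission denied (reported by container)"

    # Application log lines embed their own level, independent of severity.
    match = LEVEL_RE.search(text)
    if match and match.group(1) == "info":
        return "informational"
    return "needs review"


if __name__ == "__main__":
    samples = [
        {"protoPayload": {
            "status": {"code": 7},
            "authenticationInfo": {
                "principalSubject":
                    "serviceAccount:{PREFIX}-{ENV}-apps.svc.id.goog[istio-system/promsd]"},
            "authorizationInfo": [{"permission": "monitoring.timeSeries.create"}]},
         "severity": "ERROR"},
        {"textPayload": "level=warn msg=\"Unrecoverable error sending samples\" "
                        "err=\"rpc error: code = PermissionDenied\"",
         "severity": "ERROR"},
        {"textPayload": "time=2022-03-29T21:09:59Z level=info msg=completed "
                        "handling request",
         "severity": "ERROR"},
    ]
    for sample in samples:
        print(triage(sample))
```

If the denial is genuine, the role roles/monitoring.metricWriter is one that includes monitoring.timeSeries.create; whether the promsd service account should hold it in this deployment is not confirmed anywhere in this thread.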


I see errors in the auth-server workload of the apps project:

image

Errors in the hydra-ic workload:

image

Errors in the istio-ingressgateway workload:

image

Errors in the istio-policy workload:

image

Errors in the istio-telemetry workload:

image

Errors in the participant-enroll-datastore workload:

image

Errors in the participant-manager datastore workload:

image

Errors in the participant-user-datastore workload:

image

Errors in the prometheus workload:

image

Errors in the promsd workload:

image

Errors in the response datastore workload:

image

Errors in the study builder workload:

image

Errors in the study-datastore workload:

image

yugandhar-btc commented 2 years ago

@jaseva I can see that two TCP target pool-based load balancers were created; there should be only one. Could you please verify this and remove the extra one?

image

If all the workloads are in the OK state, you can conduct the database migration steps.

jaseva commented 2 years ago

@yugandhar-btc good eye for catching that extra TCP network. I deleted the instance that was unhealthy.

I am following the database migration instructions here: https://github.com/GoogleCloudPlatform/fda-mystudies/blob/master/db-migration/README.md and either the documentation is out of date or my bastion-vm environment is different from the instructions, as many of the commands involving paths are incorrect. I'll report back once I am done.

When I inspected all the terraform/main.tf files, all the script file updates outlined in the semi-automated instructions for the v2.0.5 upgrade through v2.0.8 have been applied and are present in the v2.0.9 master branch of the GoogleCloudPlatform/fda-mystudies repo that I am using. Are the separate database migration instructions redundant, or do I still need to go through these steps every time a new release of the platform repo is available?

yugandhar-btc commented 2 years ago

@jaseva I hope you deleted the wrongly created TCP target pool-based load balancer and not the unhealthy one. And yes, you need to follow the same steps every time you do a database migration. If there are any changes to the db migration steps, they will be available in the new version.

jaseva commented 2 years ago

@yugandhar-btc thanks for confirming about the database migration steps. Do you know when the new version is planned to be released? This is how the load balancers remain:

image

Upon inspecting the load balancers listed in the above screenshot and comparing them to the first screenshot of my issue thread, I may have deleted the incorrect load balancer. If that is the case, what are your thoughts on deleting the Kubernetes cluster and restarting from this section of the semi-automated deployment instructions: https://github.com/GoogleCloudPlatform/fda-mystudies/blob/master/deployment/README.md#configure-and-deploy-your-applications

image

When I first load the bastion-vm, it produces an error message while connecting.

image

I retry and then am successful connecting so something is going on with it.

I also cannot seem to connect using the gcloud auth login --update-adc command as the bastion-vm lacks access to a web browser.

image

I upgraded the bastion-vm environment in an attempt to fix the initial bastion-vm connection attempt error and that had no impact. Hopefully upgrading the vm environment did not further mess things up.

image

I'll have a bit more time to spend on it tomorrow. Thanks for your help, much appreciated.

jaseva commented 2 years ago

Hi @yugandhar-btc, I reached out to Google Support, they were unable to help resolve the issues I was experiencing. I'm going to delete all projects, and conduct another attempt to re-deploy. I'll close this ticket for now and if I run into issues again, I'll reach out. Thanks again for your help.