dynatrace-oss / dynatrace-gcp-monitor

Dynatrace integration for Google Cloud Platform monitoring
https://www.dynatrace.com/support/help/technology-support/cloud-platforms/google-cloud-platform/
Apache License 2.0

Metrics and Logs Containers crash after some time with install on existing GKE Standard Cluster #527

Closed nmarchini closed 3 weeks ago

nmarchini commented 1 month ago

Describe the bug

Running ./deploy-helm.sh gives a prompt at step 5 that is not mentioned in the documentation. The containers start but then crash after a few minutes.

To Reproduce

Steps to reproduce the behavior:

  1. Configure Values.yaml
  2. Authenticate to existing K8s Cluster in GCP
  3. Execute ./deploy-helm.sh

Expected behavior

Helm deployment completes, and the containers start and send metrics to Dynatrace.

Additional context

Running the deployment in GCP Cloud Shell

~/helm-deployment-package (<project_id>)$ ./deploy-helm.sh
Dynatrace GCP integration on GKE

Updated property [core/project].
- Deploying dynatrace-gcp-monitor in [<project_id>]
Deploying metrics and logs ingest
Sending logs through selected dynatraceLogIngestUrl

- checking activated extensions in Dynatrace

- Getting list of google extensions available on environment

- There are some google extensions already enabled on the tenant.

- 1. Create dynatrace namespace in k8s cluster.
namespace dynatrace already exists

- 2. Create IAM service account.
Service Account [dynatrace-gcp-monitor-sa] already exists, skipping

- 3. Configure the IAM service account for Workload Identity.
Updated IAM policy for serviceAccount [dynatrace-gcp-monitor-sa@<project_id>.iam.gserviceaccount.com].
bindings:
- members:
  - serviceAccount:<project_id>.svc.id.goog[dynatrace/dynatrace-gcp-monitor-sa]
  role: roles/iam.workloadIdentityUser
etag: BwYdSufAcZo=
version: 1

- 4. Create dynatrace-gcp-monitor IAM role(s).
Updating existing IAM role dynatrace_function.logs. It was probably created for previous GCP integration deployment and you can safely replace it.
description: Role for Dynatrace GCP Monitor operating in logs mode
etag: BwYdSuf-680=
includedPermissions:
- monitoring.dashboards.create
- monitoring.dashboards.list
- monitoring.metricDescriptors.create
- monitoring.metricDescriptors.delete
- monitoring.metricDescriptors.list
- monitoring.timeSeries.create
- pubsub.subscriptions.consume
name: projects/<project_id>/roles/dynatrace_function.logs
stage: GA
title: Dynatrace GCP Logs Monitor
Updating existing IAM role dynatrace_function.metrics. It was probably created for previous GCP integration deployment and you can safely replace it.
description: Role for Dynatrace GCP Monitor operating in metrics mode
etag: BwYdSug_b2k=
includedPermissions:
- cloudfunctions.functions.list
- cloudsql.instances.list
- compute.instances.list
- compute.zones.list
- monitoring.dashboards.create
- monitoring.dashboards.list
- monitoring.metricDescriptors.create
- monitoring.metricDescriptors.delete
- monitoring.metricDescriptors.list
- monitoring.monitoredResourceDescriptors.get
- monitoring.monitoredResourceDescriptors.list
- monitoring.timeSeries.create
- monitoring.timeSeries.list
- pubsub.subscriptions.list
- resourcemanager.projects.get
- serviceusage.services.list
name: projects/<project_id>/roles/dynatrace_function.metrics
stage: GA
title: Dynatrace GCP Metrics Monitor

- 5. Grant the required IAM policies to the service account.
 [1] EXPRESSION=api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/viewer','roles/storage.admin','roles/iam.securityReviewer','roles/compute.serviceAgent','roles/websecurityscanner.serviceAgent','roles/containerregistry.ServiceAgent','roles/bigquery.jobUser','roles/container.admin','roles/storage.objectAdmin','roles/bigquery.dataEditor'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/dns.admin','roles/bigquery.admin','roles/cloudbuild.serviceAgent','roles/dataproc.worker','roles/pubsub.serviceAgent','roles/cloudbuild.builds.builder','roles/storage.objectViewer','roles/cloudsql.admin','roles/container.developer','roles/monitoring.admin'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/bigquery.dataViewer','roles/bigquery.user','roles/firebaserules.system','roles/compute.instanceAdmin','roles/iam.serviceAccountTokenCreator','roles/dataproc.editor','roles/appengine.serviceAgent','roles/firestore.serviceAgent','roles/iam.serviceAccountKeyAdmin','roles/secretmanager.secretAccessor'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/aiplatform.admin','roles/compute.instanceAdmin.v1','roles/logging.logWriter','roles/logging.privateLogViewer','roles/cloudscheduler.serviceAgent','roles/bigquerydatatransfer.serviceAgent','roles/logging.viewAccessor','roles/servicenetworking.serviceAgent','roles/containerthreatdetection.serviceAgent','roles/cloudtrace.user'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/composer.admin','roles/firebase.sdkAdminServiceAgent','roles/monitoring.metricWriter','roles/secretmanager.admin','roles/container.nodeServiceAgent','roles/networkmanagement.serviceAgent','roles/pubsub.subscriber','roles/cloudkms.serviceAgent','roles/container.clusterAdmin','roles/dataflow.developer'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', 
[]).hasOnly(['roles/pubsub.publisher','roles/containeranalysis.ServiceAgent','roles/pubsub.admin','roles/pubsub.editor','roles/bigquery.dataOwner','roles/cloudscheduler.admin','roles/compute.osAdminLogin','roles/compute.osLogin','roles/compute.osLoginExternalUser','roles/serviceusage.serviceUsageAdmin'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/iap.tunnelResourceAccessor','roles/logging.admin','roles/logging.configWriter','roles/storage.objectCreator','roles/firebase.admin','roles/redis.serviceAgent','roles/cloudfunctions.admin','roles/iam.serviceAccountCreator','roles/ml.serviceAgent','roles/monitoring.viewer'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/source.writer','roles/artifactregistry.serviceAgent','roles/cloudfunctions.developer','roles/run.admin','roles/appengine.appAdmin','roles/bigquery.readSessionUser','roles/cloudsql.client','roles/bigquery.resourceViewer','roles/stackdriver.resourceMetadata.writer','roles/artifactregistry.admin'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/compute.storageAdmin','roles/containerscanning.ServiceAgent','roles/dataflow.admin','roles/dataproc.viewer','roles/monitoring.notificationChannelEditor','roles/networkmanagement.admin','roles/source.reader','roles/automl.serviceAgent','roles/container.viewer','roles/dataproc.admin'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/gkehub.admin','roles/iam.workloadIdentityUser','roles/sourcerepo.serviceAgent','roles/aiplatform.customCodeServiceAgent','roles/automl.predictor','roles/deploymentmanager.editor','roles/gkehub.serviceAgent','roles/iam.serviceAccountDeleter','roles/monitoring.editor','roles/apigee.admin'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', 
[]).hasOnly(['roles/bigquery.resourceAdmin','roles/cloudkms.admin','roles/iam.serviceAccountUser','roles/documentai.viewer','roles/documentai.apiUser','roles/documentai.editor','roles/identitytoolkit.admin','roles/notebooks.admin','roles/vpcaccess.admin','roles/cloudsupport.techSupportEditor'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/composer.ServiceAgentV2Ext','roles/composer.worker','roles/osconfig.patchJobExecutor','roles/backupdr.admin','roles/compute.admin','organizations/773567514706/roles/ACMEProjectBillingAccessor']), TITLE=allowed roles, DESCRIPTION=Allows to assign only specific roles
 [2] EXPRESSION=api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/viewer','roles/storage.admin','roles/iam.securityReviewer','roles/compute.serviceAgent','roles/websecurityscanner.serviceAgent','roles/containerregistry.ServiceAgent','roles/bigquery.jobUser','roles/container.admin','roles/storage.objectAdmin','roles/bigquery.dataEditor'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/dns.admin','roles/bigquery.admin','roles/cloudbuild.serviceAgent','roles/dataproc.worker','roles/pubsub.serviceAgent','roles/cloudbuild.builds.builder','roles/storage.objectViewer','roles/cloudsql.admin','roles/container.developer','roles/monitoring.admin'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/bigquery.dataViewer','roles/bigquery.user','roles/firebaserules.system','roles/compute.instanceAdmin','roles/iam.serviceAccountTokenCreator','roles/dataproc.editor','roles/appengine.serviceAgent','roles/firestore.serviceAgent','roles/iam.serviceAccountKeyAdmin','roles/secretmanager.secretAccessor'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/aiplatform.admin','roles/compute.instanceAdmin.v1','roles/logging.logWriter','roles/logging.privateLogViewer','roles/cloudscheduler.serviceAgent','roles/bigquerydatatransfer.serviceAgent','roles/logging.viewAccessor','roles/servicenetworking.serviceAgent','roles/containerthreatdetection.serviceAgent','roles/cloudtrace.user'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/composer.admin','roles/firebase.sdkAdminServiceAgent','roles/monitoring.metricWriter','roles/secretmanager.admin','roles/container.nodeServiceAgent','roles/networkmanagement.serviceAgent','roles/pubsub.subscriber','roles/cloudkms.serviceAgent','roles/container.clusterAdmin','roles/dataflow.developer'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', 
[]).hasOnly(['roles/pubsub.publisher','roles/containeranalysis.ServiceAgent','roles/pubsub.admin','roles/pubsub.editor','roles/bigquery.dataOwner','roles/cloudscheduler.admin','roles/compute.osAdminLogin','roles/compute.osLogin','roles/compute.osLoginExternalUser','roles/serviceusage.serviceUsageAdmin'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/iap.tunnelResourceAccessor','roles/logging.admin','roles/logging.configWriter','roles/storage.objectCreator','roles/firebase.admin','roles/redis.serviceAgent','roles/cloudfunctions.admin','roles/iam.serviceAccountCreator','roles/ml.serviceAgent','roles/monitoring.viewer'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/source.writer','roles/artifactregistry.serviceAgent','roles/cloudfunctions.developer','roles/run.admin','roles/appengine.appAdmin','roles/bigquery.readSessionUser','roles/cloudsql.client','roles/bigquery.resourceViewer','roles/stackdriver.resourceMetadata.writer','roles/artifactregistry.admin'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/compute.storageAdmin','roles/containerscanning.ServiceAgent','roles/dataflow.admin','roles/dataproc.viewer','roles/monitoring.notificationChannelEditor','roles/networkmanagement.admin','roles/source.reader','roles/automl.serviceAgent','roles/container.viewer','roles/dataproc.admin'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/gkehub.admin','roles/iam.workloadIdentityUser','roles/sourcerepo.serviceAgent','roles/aiplatform.customCodeServiceAgent','roles/automl.predictor','roles/deploymentmanager.editor','roles/gkehub.serviceAgent','roles/iam.serviceAccountDeleter','roles/monitoring.editor','roles/apigee.admin'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', 
[]).hasOnly(['roles/bigquery.resourceAdmin','roles/cloudkms.admin','roles/iam.serviceAccountUser','roles/documentai.viewer','roles/documentai.apiUser','roles/documentai.editor','roles/identitytoolkit.admin','roles/notebooks.admin','roles/vpcaccess.admin','roles/cloudsupport.techSupportEditor'])||api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly(['roles/composer.ServiceAgentV2Ext','roles/composer.worker','roles/osconfig.patchJobExecutor','roles/backupdr.admin','roles/compute.admin','organizations/773567514706/roles/ACMEProjectBillingAccessor']), TITLE=role_list, DESCRIPTION=Allows to assign only specific roles
 [3] EXPRESSION=request.time < timestamp("2024-04-24T14:05:03.964Z"), TITLE=cloudbuild-connection-setup
 [4] None
 [5] Specify a new condition
The policy contains bindings with conditions, so specifying a condition is required when adding a binding. Please specify a condition.:  1
Updated IAM policy for project [<project_id>].

- 6. Enable the APIs required for monitoring.
Operation "operations/acat.p2-528051941672-7cc6e42d-57e0-45b6-8a56-612557ce659a" finished successfully.

- 7. Install dynatrace-gcp-monitor with helm chart in gke_<project_id>_europe-west4_acme-01
Release "dynatrace-gcp-monitor" does not exist. Installing it now.
NAME: dynatrace-gcp-monitor
LAST DEPLOYED: Mon Jul 15 15:46:51 2024
NAMESPACE: dynatrace
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing dynatrace-gcp-monitor.

Your release is named dynatrace-gcp-monitor.

To learn more about the release, try:

  $ helm -n dynatrace status dynatrace-gcp-monitor
  $ helm -n dynatrace get all dynatrace-gcp-monitor

- Deployment complete, check if containers are running:
kubectl -n dynatrace logs -l app=dynatrace-gcp-monitor -c dynatrace-gcp-monitor-logs
kubectl -n dynatrace logs -l app=dynatrace-gcp-monitor -c dynatrace-gcp-monitor-metrics

- Check logs in Dynatrace in 5 min. Log Viewer: https://redacted.live.dynatrace.com/ui/log-monitoring?query=cloud.provider%3D%22gcp%22

- cleaning up
- removing extensions files
You can verify if the installation was successful by following the steps from: https://www.dynatrace.com/support/help/shortlink/deploy-k8#anchor_verify
Additionally you can enable self-monitoring for quick diagnosis: https://www.dynatrace.com/support/help/how-to-use-dynatrace/infrastructure-monitoring/cloud-platform-monitoring/google-cloud-platform-monitoring/set-up-gcp-integration-on-new-cluster#verify

Error message from container logs (metrics container)

2024-07-15 15:23:19.133996 Dynatrace GCP Monitor startup
2024-07-15 15:23:19.134816 GCP Monitor - Dynatrace integration for Google Cloud Platform monitoring

2024-07-15 15:23:19.135132 Release version: release-1.4.5
2024-07-15 15:23:19.135908 Trying to use default service account
2024-07-15 15:23:19.141720 [webserver] Setting up webserver... 

2024-07-15 15:23:19.203935 Resolved host: metadata.google.internal info: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('169.254.169.254', 80)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('169.254.169.254', 80)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_RAW: 3>, 0, '', ('169.254.169.254', 80))]
2024-07-15 15:23:19.286055 GCP instance metadata: InstanceMetadata(project_id=None, container_name='gke-acmh-01-pool-2-2868729b-8dpq.europe-west4-c.c.hl2-seme-mach-t1iylu.internal', token_scopes='https://www.googleapis.com/auth/cloud-platform\nhttps://www.googleapis.com/auth/userinfo.email\n', service_account='default/\ndynatrace-gcp-monitor-sa@hl2-seme-mach-t1iylu.iam.gserviceaccount.com/\n', audience={'aud': 'https://accounts.google.com', 'azp': '116583019369936416822', 'email': 'dynatrace-gcp-monitor-sa@hl2-seme-mach-t1iylu.iam.gserviceaccount.com', 'email_verified': True, 'exp': 1721060599, 'iat': 1721056999, 'iss': 'https://accounts.google.com', 'sub': '116583019369936416822'}, hostname='dynatrace-gcp-monitor-d5d48555b-zhk8h', zone='europe-west4-c')
2024-07-15 15:23:19.288294 Trying to use default service account
2024-07-15 15:23:19.557563 Failed to import a self monitoring dashboard, because: 'NoneType' object is not iterable
2024-07-15 15:23:19.559436 Operation mode: Metrics
2024-07-15 15:23:19.560577 Trying to use default service account
Traceback (most recent call last):
  File "./run_docker.py", line 201, in <module>
    main()
  File "./run_docker.py", line 191, in main
    asyncio.run(run_metrics_fetcher_forever())
  File "/usr/local/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "./run_docker.py", line 129, in run_metrics_fetcher_forever
    pre_launch_check_result = await metrics_pre_launch_check()
  File "./run_docker.py", line 66, in metrics_pre_launch_check
    extensions_fetch_result = await extensions_fetch(gcp_session, dt_session, token)
  File "/code/lib/dt_extensions/dt_extensions.py", line 33, in extensions_fetch
    extension_fetcher_result = await ExtensionsFetcher(
  File "/code/lib/dt_extensions/extensions_fetcher.py", line 43, in execute
    extension_name_to_version_dict = await self._get_extensions_dict_from_dynatrace_cluster()
  File "/code/lib/dt_extensions/extensions_fetcher.py", line 63, in _get_extensions_dict_from_dynatrace_cluster
    dynatrace_extensions = await self._get_extension_list_from_dynatrace_cluster()
  File "/code/lib/dt_extensions/extensions_fetcher.py", line 70, in _get_extension_list_from_dynatrace_cluster
    response = await self.dt_session.get(url, headers=headers, params=params,
  File "/usr/local/lib/python3.8/site-packages/aiohttp/client.py", line 682, in _request
    break
  File "/usr/local/lib/python3.8/site-packages/aiohttp/helpers.py", line 735, in __exit__
    raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7d2b720d5e80>
Stream closed EOF for dynatrace/dynatrace-gcp-monitor-d5d48555b-zhk8h (dynatrace-gcp-monitor-metrics)

joaquinfilipic-dynatrace commented 3 weeks ago

Hello.

Regarding the prompt on step 5: it appears because the user added some kind of condition to the role bindings in the project's IAM policy. That is not something we include in the out-of-the-box roles we offer, and it is behavior on GCP's side, not a bug in the integration.
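As a rough illustration of what triggers that prompt, the sketch below scans an IAM policy for bindings that carry a condition. It assumes the policy has the JSON shape returned by `gcloud projects get-iam-policy <project_id> --format=json`. The sample policy, its roles and members, and the helper name `conditional_bindings` are hypothetical; only the condition title and expression are taken from option [3] in the output above.

```python
def conditional_bindings(policy: dict) -> list:
    """Return the bindings of an IAM policy that carry a condition."""
    return [b for b in policy.get("bindings", []) if b.get("condition")]


# Hypothetical policy; roles and members are invented for illustration.
sample_policy = {
    "version": 3,
    "bindings": [
        {"role": "roles/viewer", "members": ["user:someone@example.com"]},
        {
            "role": "roles/cloudbuild.builds.builder",
            "members": ["serviceAccount:builder@example.iam.gserviceaccount.com"],
            "condition": {
                "title": "cloudbuild-connection-setup",
                "expression": 'request.time < timestamp("2024-04-24T14:05:03.964Z")',
            },
        },
    ],
}

# List every binding that would make gcloud ask for a condition.
for binding in conditional_bindings(sample_policy):
    print(binding["role"], "->", binding["condition"]["title"])
```

If the list is non-empty, gcloud asks for a condition whenever a binding is added interactively. `gcloud projects add-iam-policy-binding` accepts `--condition=None` to add an unconditional binding without the prompt; whether deploy-helm.sh passes that flag is not something this thread confirms.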

For the second issue, which is an actual problem: we offer the option to create an Autopilot cluster with the proper configuration on the user's behalf. Otherwise, going with the existing-cluster approach means that users need to take care of the Kubernetes-related settings themselves. I see that the container starts properly and then crashes, so it probably restarts later and does the same thing again. This is a Kubernetes issue, not an integration issue. Some guesses include the cluster not being provisioned with enough resources, incorrectly configured probes for the pods, etc.

In many cases users need to go beyond the default offering because of business requirements, but if they administer their own environment they need to handle the extra steps themselves.
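One more thing that may be worth ruling out on an existing cluster: the traceback above ends in an asyncio TimeoutError raised while the metrics container requests the extension list from the Dynatrace cluster. A minimal sketch for checking, from inside the pod or from a node, whether a TCP connection to the Dynatrace endpoint can be opened at all is shown below. The function name `can_connect` and the default port and timeout are assumptions for illustration; this is not part of the integration's code.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Example call (replace the placeholder with your own environment host):
#   can_connect("<your-environment>.live.dynatrace.com")
```

A False result would point towards network egress from the cluster (firewall rules, NAT, proxy settings) rather than resources or probes. A True result only shows that the TCP handshake works, so it does not rule out slow responses at the HTTP level.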