Xabaril / AspNetCore.Diagnostics.HealthChecks

Enterprise HealthChecks for ASP.NET Core Diagnostics Package
Apache License 2.0

K8S Operator: Possible permissions issue? UI cannot reach system under monitoring #2144

Open SeanKilleen opened 10 months ago

SeanKilleen commented 10 months ago

First off -- @sungam3r thank you for all the work you've put in and continue to put in on maintaining this. I know you're spread thin these days and I don't have enough familiarity to dig in yet as a contributor but I'm keeping it in mind now that I'm becoming an active user. I'll continue to look into this issue actively.


I followed the directions to set up the K8S operator and UI using the HealthCheck CRD. The operator is deployed in cluster mode.

✅ I see in the operator's logs that the system under monitoring is discovered, using its cluster IP. The annotations for port and endpoint are correctly discovered (:8080/healthz). The endpoint for monitoring is pushed correctly to the UI.

Logs from the operator showing success:

```
[18:08:07 INF] [PushService] Namespace observability - Sending Type: Added - Service [RedactedAppName] with uri : http://[RedactedClusterIP]:8080/healthz to ui endpoint: http://100.106.236.107:80
[18:08:07 INF] Start processing HTTP request POST http://100.106.236.107/healthchecks/push?key=8709dabc-ca13-4a61-9367-6b0f0b8958b3
[18:08:07 INF] Sending HTTP request POST http://100.106.236.107/healthchecks/push?key=8709dabc-ca13-4a61-9367-6b0f0b8958b3
[18:08:07 INF] Received HTTP response headers after 4.2877ms - 200
[18:08:07 INF] End processing HTTP request after 4.5692ms - 200
[18:08:07 INF] [PushService] Notification result for [RedactedAppName] - status code: OK
```

✅ When I port-forward the system under monitoring to my local machine, I can reach the health checks at :8080/healthz, so I know the app serves them at that URL.

❌ Despite this, the health check UI fails to retrieve the health check:

```
GetHealthReport threw an exception when trying to get report from http://[RedactedClusterIP]:8080/healthz configured with name [RedactedAppName].
System.Net.Http.HttpRequestException: An error occurred while sending the request.
---> System.IO.IOException: Unable to read data from the transport connection: Connection reset by peer.
```

I'm thinking there may be an issue with the service account's permissions when operating in cluster mode. I'll post my Kubernetes definitions shortly and double-check that they match the docs (I'm using Terraform, so I'll make sure nothing got lost in translation there).

Environment:

SeanKilleen commented 10 months ago

Namespace definition -- appears to match the definition with the exception of the name (it previously existed):

```yaml
kind: Namespace
apiVersion: v1
metadata:
  name: observability
  uid: e71dfd05-928d-45b9-9a50-1beaaad0b4ef
  resourceVersion: '71973636'
  creationTimestamp: '2023-10-19T23:46:54Z'
  labels:
    app.kubernetes.io/part-of: healthchecks-operator
    kubernetes.io/metadata.name: observability
spec:
  finalizers:
    - kubernetes
status:
  phase: Active
```

Service account: matches definition, with the exception of namespace name:

```yaml
kind: ServiceAccount
apiVersion: v1
metadata:
  name: healthchecks-admin
  namespace: observability
  uid: 61d4339f-259f-40b3-93df-ff32ab1f88f0
  resourceVersion: '71811869'
  creationTimestamp: '2024-01-17T14:03:49Z'
automountServiceAccountToken: true
```

Cluster Role -- matches definition:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: healthchecks-admin
  uid: dd38820c-1fd1-45ce-8360-bbfd6adb5497
  resourceVersion: '71811870'
  creationTimestamp: '2024-01-17T14:03:49Z'
rules:
  - verbs:
      - '*'
    apiGroups:
      - ''
    resources:
      - services
      - pods
      - deployments
      - secrets
      - configmaps
  - verbs:
      - '*'
    apiGroups:
      - apps
    resources:
      - deployments
  - verbs:
      - '*'
    apiGroups:
      - aspnetcore.ui
    resources:
      - '*'
```

Same for the Cluster Role Binding (definition):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: healthchecks-admin
  uid: 1b4721ac-f995-4ce6-bb21-06c279249393
  resourceVersion: '71813591'
  creationTimestamp: '2024-01-17T14:06:49Z'
subjects:
  - kind: ServiceAccount
    name: healthchecks-admin
    namespace: observability
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: healthchecks-admin
```

Same with the operator deployment (reference); I omitted managedFields and status for brevity:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: healthchecks-ui-k8s-operator
  namespace: observability
  uid: 7e96d4ee-d6c2-4a1e-b867-20b440c27d3b
  resourceVersion: '71974348'
  generation: 1
  creationTimestamp: '2024-01-17T14:17:34Z'
  annotations:
    deployment.kubernetes.io/revision: '1'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: healthchecks-ui-k8s-operator
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: healthchecks-ui-k8s-operator
    spec:
      containers:
        - name: healthchecks-ui-k8s-operator
          image: xabarilcoding/healthchecksui-k8s-operator:latest
          resources:
            limits:
              cpu: 500m
              memory: 300Mi
            requests:
              cpu: 300m
              memory: 100Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      serviceAccountName: healthchecks-admin
      serviceAccount: healthchecks-admin
      automountServiceAccountToken: true
      shareProcessNamespace: false
      securityContext: {}
      schedulerName: default-scheduler
      enableServiceLinks: true
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
```

I'll check the CRD resource deployment and send another update.

SeanKilleen commented 10 months ago

And here's the HealthCheck resource, which appears to match the docs in spirit:

```yaml
apiVersion: aspnetcore.ui/v1
kind: HealthCheck
metadata:
  creationTimestamp: '2024-01-17T14:55:22Z'
  generation: 1
  name: healthchecks-ui
  namespace: observability
  resourceVersion: '71841262'
  uid: d67be564-06df-4e5e-8e17-ee6f68e78489
spec:
  name: healthchecks-ui
  scope: Cluster
  serviceType: ClusterIP
  servicesLabel: HealthChecks
  stylesheetContent: "        :root {    \r\n        --primaryColor: #2a3950;\r\n        --secondaryColor: #f4f4f4;  \r\n        --bgMenuActive: #e1b015;\r\n        --bgButton: #e1b015;\r\n        --logoImageUrl: url('https://upload.wikimedia.org/wikipedia/commons/thumb/e/eb/WoW_icon.svg/1200px-WoW_icon.svg.png');\r\n        --bgAside: var(--primaryColor);   \r\n      }\r\n"
```

A noticeable difference is that I'm specifying ClusterIP rather than LoadBalancer, but I'd be surprised if this was the issue.

SeanKilleen commented 10 months ago

My hunch at this point is that the issue is here: https://github.com/Xabaril/AspNetCore.Diagnostics.HealthChecks/blob/master/src/HealthChecks.UI.K8s.Operator/Operator/KubernetesAddressFactory.cs#L11C1-L11C46

The CreateAddress() function uses service.Spec.ClusterIP for the address. I'm relatively new to this, but my understanding is that for cross-namespace interaction you'd really want something DNS-compatible, e.g. $"{service.Metadata.Name}.{service.Metadata.Namespace()}.svc.cluster.local".
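A minimal sketch of the DNS-based alternative, assuming the service name, namespace, port and health path have already been read from the service and its annotations. `ServiceDnsAddress` and its `Create` method are hypothetical names, not part of the operator. The real `CreateAddress()` works on the Kubernetes client's service object, while this version takes plain strings so it stays self-contained. It also assumes the default `cluster.local` cluster domain, which some clusters override. Whether switching from the cluster IP to a DNS name resolves the connection reset is unverified.

```csharp
using System;

// Hypothetical helper for illustration only; not the operator's actual code.
public static class ServiceDnsAddress
{
    // Builds http://{name}.{namespace}.svc.{clusterDomain}:{port}/{path}.
    // "cluster.local" is the Kubernetes default cluster domain (an assumption here).
    public static string Create(
        string serviceName,
        string serviceNamespace,
        int port,
        string healthPath,
        string clusterDomain = "cluster.local")
    {
        if (string.IsNullOrWhiteSpace(serviceName))
            throw new ArgumentException("Service name is required.", nameof(serviceName));
        if (string.IsNullOrWhiteSpace(serviceNamespace))
            throw new ArgumentException("Namespace is required.", nameof(serviceNamespace));
        if (port < 1 || port > 65535)
            throw new ArgumentOutOfRangeException(nameof(port));

        // Fully qualified in-cluster DNS name of the service.
        var host = $"{serviceName}.{serviceNamespace}.svc.{clusterDomain}";

        // Accept the path with or without a leading slash.
        var path = (healthPath ?? string.Empty).TrimStart('/');

        return $"http://{host}:{port}/{path}";
    }
}
```

For example, `ServiceDnsAddress.Create("my-app", "observability", 8080, "/healthz")` yields `http://my-app.observability.svc.cluster.local:8080/healthz`, where `my-app` is a placeholder service name.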

I'd be happy to submit a PR for this if you agree. I'm brand new to contributing but would be happy to work through the process of testing it etc.