OpenHistoricalMap / issues

File your issues here, regardless of repo until we get all our repos squared away; we don't want to miss anything.

Service outage 20230717 #571

Closed danrademacher closed 12 months ago

danrademacher commented 1 year ago

Bug description

Several of our OHM services are going down for unknown reasons:

Per https://stats.uptimerobot.com/0BBDoIkXKJ, currently the website, Overpass, and Taginfo are all down: image

I confirmed that these are not loading for me.

The one red light on New Relic is this pod: image

danrademacher commented 1 year ago

Now Taginfo is back up: image

I suspect Overpass and the website will come back online as well, but they have not yet done so.

danrademacher commented 1 year ago

The other services came back at 8:18 pm PT

Rub21 commented 1 year ago

@danrademacher @geohacker @batpad I have been investigating why the services were down for more than two hours, and I have evaluated each of the components involved.

Incidents

| Service | Incident started at (UTC) | Resolved at (UTC) | Duration |
|---|---|---|---|
| OHM Website | 2023-07-18 00:29:15 | 2023-07-18 00:33:08 | 3 minutes and 53 seconds |
| OHM Website | 2023-07-18 00:43:15 | 2023-07-18 03:18:14 | 2 hours and 34 minutes |
| OHM TagInfo | 2023-07-18 00:26:25 | 2023-07-18 01:14:32 | 48 minutes and 7 seconds |
| OHM Overpass API | 2023-07-18 01:03:43 | 2023-07-18 03:18:42 | 2 hours and 14 minutes |
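
The durations above can be re-derived from the start and end timestamps. This is a minimal sketch, assuming GNU `date` (the `-d` flag behaves differently on BSD/macOS); `duration` is a hypothetical helper name, not part of any OHM tooling.

```shell
# Recompute an incident duration from two UTC timestamps.
# Assumes GNU date; `duration` is an illustrative helper name.
duration() {
  start=$(date -u -d "$1 UTC" +%s)
  end=$(date -u -d "$2 UTC" +%s)
  secs=$((end - start))
  # Split the elapsed seconds into hours, minutes and seconds
  printf '%dh %dm %ds\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

duration "2023-07-18 00:29:15" "2023-07-18 00:33:08"   # OHM Website (first incident)
duration "2023-07-18 00:43:15" "2023-07-18 03:18:14"   # OHM Website (second incident)
duration "2023-07-18 00:26:25" "2023-07-18 01:14:32"   # OHM TagInfo
duration "2023-07-18 01:03:43" "2023-07-18 03:18:42"   # OHM Overpass API
```

For the second and fourth incidents this gives 59 seconds more than the table shows, because the table drops the seconds for the longer outages.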

AWS Service health

AWS reported no EC2 interruptions during the outage. The only reported issue was related to CloudFront, and it occurred at a different time than our service downtime.

https://health.aws.amazon.com/health/status

image

osmseed-production cluster

During the service downtime, other services within the cluster were operating normally. Some of them generated errors, but these seem to be related to a container that is not in use, namely 'dashboard-metrics-scraper'.

https://onenr.io/0ERPMPYnvjW https://onenr.io/0nQxP0YY5QV

image

DB Logs

For some reason, the database pod was halted during the service downtime. No logs were recorded for this time period. https://onenr.io/0oQDKkGrDjy

image

Web container Logs

Similarly, the web service did not generate any unusual logs during the downtime.

https://onenr.io/01wZvD3mvw6 image

Taginfo logs

https://onenr.io/0KQXGgP5Eja

Overpass API logs

https://onenr.io/0yw4NqoxLj3

Metrics-server Issue

After investigating, it seems that the issue is related to "metrics-server". The metrics-server is a Kubernetes system component that collects and stores performance metrics from nodes and pods in the cluster.

Screenshot 2023-07-18 at 4 49 16 PM

kubectl get pods --all-namespaces -o wide
kubectl describe pod metrics-server-7668599459-dft95 -n kube-system
kubectl logs metrics-server-7668599459-dft95 -n kube-system

outputs:

Error: failed to get delegated authentication kubeconfig: failed to get delegated authentication kubeconfig: open /var/run/secrets/kubernetes.io/serviceaccount/token: permission denied
.....
panic: failed to get delegated authentication kubeconfig: failed to get delegated authentication kubeconfig: open /var/run/secrets/kubernetes.io/serviceaccount/token: permission denied

goroutine 1 [running]:
main.main()
    /go/src/github.com/kubernetes-incubator/metrics-server/cmd/metrics-server/metrics-server.go:39 +0x13b
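
The panic shows that metrics-server could not read its service account token. A few read-only checks might help narrow this down; the sketch below is only a starting point and does not assume any particular cause. The pod name is the one from the logs above, and `inspect_token_mount` is a hypothetical helper name.

```shell
# Read-only checks on the crashing pod; nothing here modifies the cluster.
# `inspect_token_mount` is an illustrative helper name.
inspect_token_mount() {
  pod="$1"
  ns="${2:-kube-system}"

  # Which service account the pod runs as
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.serviceAccountName}{"\n"}'

  # Pod-level and container-level security contexts (runAsUser, fsGroup, ...)
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.securityContext}{"\n"}'
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[*].securityContext}{"\n"}'

  # Logs from the previous (crashed) container instance
  kubectl logs "$pod" -n "$ns" --previous

  # Events in the namespace, sorted by last timestamp
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
}

# Example: inspect_token_mount metrics-server-7668599459-dft95 kube-system
```

Running the same function against the staging cluster's metrics-server pod would show whether the service account or security context differs between the two environments.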

I still don't understand why this is happening. I checked the staging cluster, and everything is working fine there. I'm not sure whether it's related to a token expiration; I can't pinpoint the cause yet.

Because of the "token: permission denied" error, the web, taginfo, and overpass pods were unable to restart or collect metrics. However, it's strange that the services were restored after two hours. The first step would be to resolve the "open /var/run/secrets/kubernetes.io/serviceaccount/token: permission denied" error. ChatGPT suggested some possible solutions.

@batpad, I would like your suggestion on this issue. From my perspective, the most viable option would be to delete the metrics-server pods and restore them using a metrics template, e.g. components.yml
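
The delete-and-restore option could look roughly like this. It is a sketch under several assumptions: the pods carry the `k8s-app=metrics-server` label used by the upstream manifests (this cluster may differ), the Deployment is named `metrics-server` (inferred from the pod name), and the manifest lives at `components.yml`. `reset_metrics_server` is a hypothetical helper name.

```shell
# Re-apply the metrics-server manifest and recreate its pods.
# Label, Deployment name and manifest path are assumptions; verify them first.
reset_metrics_server() {
  manifest="${1:-components.yml}"
  ns="kube-system"

  # Re-apply the manifest so the cluster objects match the template
  kubectl apply -f "$manifest" || return 1

  # Delete the existing pods; the Deployment recreates them
  kubectl delete pod -n "$ns" -l k8s-app=metrics-server || return 1

  # Wait for the new pods, then confirm that metrics are served again
  kubectl rollout status deployment/metrics-server -n "$ns" --timeout=120s || return 1
  kubectl top nodes
}

# Example: reset_metrics_server components.yml
```

Trying this on the staging cluster first would be the safer order, since staging is known to be healthy and gives a baseline to compare against.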

jeffreyameyer commented 1 year ago

Overpass API appears to be working (i.e., it responds to read requests), but it is out of sync or possibly not syncing at all. Changes I made yesterday have yet to show up in its results.
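
One way to check whether replication has stalled is to compare the data timestamp that Overpass reports with the current time. This is a minimal sketch, assuming the instance exposes the standard Overpass `/api/timestamp` endpoint and that GNU `date` is available. The base URL is a placeholder supplied by the caller, and both function names are hypothetical.

```shell
# Seconds between an Overpass data timestamp and a reference time (default: now).
# Assumes GNU date; function names are illustrative.
overpass_lag_seconds() {
  data_ts="$1"                                     # e.g. 2023-07-18T03:18:42Z
  now_ts="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  echo $(( $(date -u -d "$now_ts" +%s) - $(date -u -d "$data_ts" +%s) ))
}

# Fetch the data timestamp from a live instance; the base URL is supplied by the caller.
overpass_timestamp() {
  curl -fsS "$1/api/timestamp"
}

# Example (URL is a placeholder):
#   overpass_lag_seconds "$(overpass_timestamp https://overpass.example.org)"
```

A lag that keeps growing between two checks would suggest that updates are not being applied at all, rather than just running behind.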