berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu
BSD 3-Clause "New" or "Revised" License

Automatically watch for errors in our infrastructure logs #2693

Closed - yuvipanda closed this issue 2 years ago

yuvipanda commented 3 years ago

https://cloud.google.com/error-reporting can help us watch for exceptions in various parts of our infrastructure (like hub, notebook users, etc). We can stop relying 100% on user reports for noticing errors this way.

yuvipanda commented 3 years ago

It's already set up actually! https://console.cloud.google.com/errors

yuvipanda commented 3 years ago

I don't see the 503s from hub pods there though.

balajialg commented 3 years ago

@yuvipanda This is an awesome improvement! I have the following questions/suggestions:

balajialg commented 3 years ago

#2601

yuvipanda commented 3 years ago

Yeah, I think a lot of the errors here are false positives, and it's not catching most actual errors. A deep dive is needed to understand this product, see if it's actually helpful to us, and figure out how we can use it. I had never seen that page until I opened this issue, so I don't know much about it yet.

@felder do you think this is something you might be able to tackle?

felder commented 3 years ago

@yuvipanda yeah I can take a crack at it

felder commented 3 years ago

@yuvipanda glancing at the documentation for the Google error reporting service, it appears that to get application-level errors we would need code changes to send the data to the error reporting service or to call its API.

https://cloud.google.com/error-reporting/docs/setup/kubernetes-engine
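
For reference, a minimal sketch of what such a code change could look like using the google-cloud-error-reporting Python client; the service name and the failing function below are placeholders, not anything currently in our repos:

```python
# Sketch only: requires GCP credentials for the project and the
# google-cloud-error-reporting package. Service name is a placeholder.
from google.cloud import error_reporting

client = error_reporting.Client(service="datahub-example-service")

def risky_operation():
    # Placeholder for real application code that might fail.
    raise RuntimeError("example failure")

try:
    risky_operation()
except Exception:
    # Captures the active exception and traceback and sends it to
    # Error Reporting for this project.
    client.report_exception()
```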

felder commented 3 years ago

This may prove more fruitful: https://console.cloud.google.com/logs/query?project=ucb-datahub-2018

felder commented 3 years ago

Here's a query that looks for 503s on 9/16 when we last had an issue with #2677

https://cloudlogging.app.goo.gl/yHks4Rcc8FgtzAYu5

Note the peak between 4:30pm and 5:00pm PDT

However, there is a lot of noise in here: for example, each pod gets a 503 for metrics, and requests resulting in a 200 are also logged as errors. If we could tune the logging so that actual errors are reported as type error and the noise isn't, that would help a lot for isolating errors to alert on.
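
For anyone who wants to pull these entries programmatically rather than through the Logs Explorer UI, here is a rough sketch with the google-cloud-logging Python client. The filter is only an illustration (label names are copied from queries elsewhere in this thread; the exact saved query behind the short link above may differ):

```python
# Assumes GCP credentials for ucb-datahub-2018 and the google-cloud-logging
# package. The filter below is illustrative, not the saved query itself.
from google.cloud import logging

client = logging.Client(project="ucb-datahub-2018")

FILTER = '''
labels."k8s-pod/component"="singleuser-server"
textPayload=~"\\] 503 "
textPayload!~"metrics"
timestamp>="2021-09-16T23:00:00Z"
timestamp<="2021-09-17T01:00:00Z"
'''

for entry in client.list_entries(filter_=FILTER, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.payload)
```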

felder commented 3 years ago

https://cloud.google.com/logging/docs/alerting/monitoring-logs

felder commented 3 years ago

Looks like I can define log-based metrics from queries and then set up alerts based on those metrics.
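
Roughly, that flow could look like the sketch below with the google-cloud-logging Python client (the metric name and filter are placeholders, not something we have actually created); the alerting policy itself would then be attached to the metric in Cloud Monitoring:

```python
# Hedged sketch: creates a log-based metric that counts matching entries.
# Metric name and filter are placeholders modeled on the queries in this
# thread.
from google.cloud import logging

client = logging.Client(project="ucb-datahub-2018")

FILTER = (
    'labels."k8s-pod/component"="singleuser-server" '
    'textPayload=~"\\[E " '
    'textPayload!~"metrics"'
)

metric = client.metric(
    "singleuser_error_lines",  # placeholder metric name
    filter_=FILTER,
    description="Count of [E ...] log lines from single-user servers",
)

if not metric.exists():
    metric.create()
```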

felder commented 3 years ago

The following Logs Explorer query seems to be pointing in a promising direction. It shows a large spike for data100-prod during the incident on 9/16.

labels."k8s-pod/release"="data100-prod"
textPayload=~"\[E "
textPayload!~"metrics"
labels."k8s-pod/component"="singleuser-server"

[Screenshot: Logs Explorer results showing the data100-prod spike on 9/16]

felder commented 3 years ago

@balajialg requested I do a similar log query for prob140-prod around Sept. 13.

[Screenshot: Logs Explorer results for prob140-prod around Sept. 13]

felder commented 3 years ago

Similar query again for r-prod for September

[Screenshot: Logs Explorer results for r-prod in September]

felder commented 3 years ago

publichealth-prod for the last week. The first spike matches up with a report on Slack of students having trouble; similarly, there is a spike last night when another issue was reported.

[Screenshot: Logs Explorer results for publichealth-prod over the last week]

balajialg commented 3 years ago

This is brilliant work, @felder! Thanks for spending the time to figure things out.

yuvipanda commented 3 years ago

@felder for alerting, can you explore https://grafana.com/docs/grafana/latest/alerting/ instead? Since all our metrics are already present in grafana...

yuvipanda commented 3 years ago

oooh, github just showed me your other messages. We definitely don't have metrics on 503s from user servers - we tried putting it in prometheus (https://github.com/berkeley-dsep-infra/datahub/issues/1993 and https://github.com/berkeley-dsep-infra/datahub/pull/1973) but prometheus was crushed under the weight of those (https://github.com/berkeley-dsep-infra/datahub/pull/1977). So let's explore that and see how it goes.

Thanks for putting time into this, @felder

yuvipanda commented 3 years ago

With https://github.com/berkeley-dsep-infra/datahub/pull/2792, we're now collecting prometheus metrics on each request that goes through nginx ingress. https://kubernetes.github.io/ingress-nginx/user-guide/monitoring/ has some more info. https://github.com/berkeley-dsep-infra/datahub/issues/2693 means that datahub is not represented here yet.
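
Not the dashboard itself, just a quick hedged sketch of what querying these ingress-nginx metrics directly looks like via the Prometheus HTTP API (the Prometheus URL below is a placeholder; nginx_ingress_controller_requests is the request counter ingress-nginx exposes):

```python
# Sketch: instant query against the Prometheus HTTP API for the rate of
# 5xx responses per ingress over the last 5 minutes. URL is a placeholder.
import requests

PROMETHEUS = "http://prometheus.example.svc.cluster.local:9090"  # placeholder

query = (
    "sum by (ingress) ("
    'rate(nginx_ingress_controller_requests{status=~"5.."}[5m])'
    ")"
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("ingress", "<unknown>"), result["value"][1])
```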

https://grafana.datahub.berkeley.edu/d/IxT315H7z/http-requests-nginx-copy?orgId=1 is a trial dashboard I started playing with that has some of these metrics. I haven't really explored the data. Curious how you think this compares to the Google Cloud dashboard, @felder? Think you have the time to poke around this too?

balajialg commented 3 years ago

@yuvipanda I was looking at this dashboard today. We had 100+ 5xx errors consistently over the past 6 hours (9/24). However, no 5xx issues were raised by our users, which got me wondering whether Prometheus is just surfacing error logs that don't correspond to real user problems.

[Screenshot: Grafana dashboard showing 5xx errors over the past 6 hours]

Or is the assumption that a certain percentage of our users will always hit this kind of error, but it never gets reported to us because it only affects a small fraction of users (200/4000, ~5%)? I also realized that we may not be able to verify the error logs for previous outages (#2791) the way Jonathan did for GCP, as this data is not available in the Grafana error dashboard.

felder commented 3 years ago

@yuvipanda yeah, I can take a look at this, but it'll be after my vacation

yuvipanda commented 3 years ago

There's a lot of spurious errors like this:

[W 2021-10-04 18:56:50.738 JupyterHub base:1349] Failing suspected API request to not-running server: /hub/user/USERNAME/api/metrics/v1
[E 2021-10-04 18:56:50.740 JupyterHub log:189] 503 GET /hub/user/USERNAME/api/metrics/v1 (sanjyot_bakshi@10.0.0.19) 5.68ms

This is when their servers have been culled but their browser tab is still open. I think this should be a 404 instead... We should clean that up.
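
Until that cleanup lands, here is a small hedged sketch of filtering out these known-noisy lines when eyeballing hub logs by hand; the pattern is taken from the sample lines above, so it is only as good as that format:

```python
# Reads hub log lines on stdin and prints 503s, skipping the
# culled-server /api/metrics noise shown in the example above.
# Usage (hypothetical): kubectl logs <hub-pod> | python filter_503s.py
import re
import sys

NOISY_503 = re.compile(r"\[E [^\]]*\] 503 GET /hub/user/[^/]+/api/metrics/")

for line in sys.stdin:
    if "] 503 " in line and not NOISY_503.search(line):
        print(line, end="")
```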

yuvipanda commented 3 years ago

@balajialg I opened https://github.com/jupyterhub/jupyterhub/pull/3636 to get rid of a lot of these spurious errors.

balajialg commented 3 years ago

@yuvipanda Sounds great! Interesting to know that requests which should be 404s can get reported as 503 errors. Thanks for looking into this.

yuvipanda commented 3 years ago

Once we get https://github.com/jupyterhub/jupyterhub/pull/3636 deployed, our 503s will have a much higher signal-to-noise ratio. We can use the logs to quickly investigate where they come from and fix them as we find them.

yuvipanda commented 3 years ago

We can also access data from stackdriver in our grafana - https://grafana.com/grafana/plugins/stackdriver/. That might be helpful?

balajialg commented 3 years ago

@yuvipanda Will Stackdriver surface the data that @felder analyzed via the GCP console? CC'ing @felder so that he can look into this issue and add his thoughts about the value of this data when he is back.

yuvipanda commented 3 years ago

[Screenshot: graph of 503 counts after the change]

Now our background levels of 503 have dropped to 0, and we can treat any spike as something to investigate!!!!!

Note that the main datahub is still not going through this proxy - see https://github.com/berkeley-dsep-infra/datahub/issues/2693. I'll try to fix that up soon.

yuvipanda commented 3 years ago

So grafana supports defining alerts (https://grafana.com/docs/grafana/latest/alerting/), and supports many different notification channels. I've signed us up for pagerduty, and invited @felder and @balajialg. We can define alerts on each graph - I've defined a test one for pods that aren't scheduled (https://grafana.datahub.berkeley.edu/d/R3O4mbg7z/cluster-information?orgId=1&viewPanel=7), and it seems to work!

We can pipe these alerts to slack as well.
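
For the record, a hedged sketch of wiring up a Slack notification channel through Grafana's (legacy) alerting HTTP API, in case we want to script it instead of clicking through the UI; the API key and webhook URL are placeholders:

```python
# Sketch only: uses Grafana's legacy alert-notification API. The API key
# and Slack webhook below are placeholders, not real credentials.
import requests

GRAFANA = "https://grafana.datahub.berkeley.edu"
HEADERS = {"Authorization": "Bearer <grafana-api-key>"}  # placeholder key

channel = {
    "name": "datahub-alerts-slack",  # placeholder channel name
    "type": "slack",
    "settings": {"url": "https://hooks.slack.com/services/<webhook>"},  # placeholder
}

resp = requests.post(
    f"{GRAFANA}/api/alert-notifications", json=channel, headers=HEADERS
)
resp.raise_for_status()
print(resp.json())
```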

balajialg commented 3 years ago

@yuvipanda, this is an awesome update!

Correct me if I understand the workflow right. For example: we can configure Slack (or any other real-time) notifications whenever 500 errors increase beyond a threshold that is not considered usual. Is this an accurate understanding of one of the use cases?

yuvipanda commented 3 years ago

@balajialg yeah that's correct.

balajialg commented 2 years ago

@yuvipanda Now that we have the 5xx dashboard, is this issue done? Is there anything else to be done within the scope of this issue?

balajialg commented 2 years ago

@yuvipanda Considering that we are tracking pagerduty-related alerts through #2988, I am closing this issue. Please feel free to reopen it if you think this is separate, incomplete, or requires further attention.