Closed yuvipanda closed 2 years ago
It's already set up actually! https://console.cloud.google.com/errors
I don't see the 503s from hub pods there though.
@yuvipanda This is an awesome improvement! I have the following questions/suggestions,
Yeah, I think a lot of the errors here are false positives, and it's not catching most actual errors. A deep dive here is needed to understand this product, see if it's actually helpful to us, and how we can use it. I have never seen that page until I opened this issue, so I don't know much about it yet.
@felder do you think this is something you might be able to tackle?
@yuvipanda yeah I can take a crack at it
@yuvipanda glancing at the documentation for the Google Error Reporting service, it appears that getting application-level errors would require code changes to send the data to the Error Reporting service or to call its API.
https://cloud.google.com/error-reporting/docs/setup/kubernetes-engine
This may prove more fruitful: https://console.cloud.google.com/logs/query?project=ucb-datahub-2018
Here's a query that looks for 503s on 9/16 when we last had an issue with #2677
https://cloudlogging.app.goo.gl/yHks4Rcc8FgtzAYu5
Note the peak between 4:30pm and 5:00pm PDT
However, there is a lot of noise in here. For example, each pod gets a 503 for metrics, and requests resulting in a 200 are also logged as errors. If we could tune the logging so that actual errors are reported as type error and the noise isn't, that would help a lot in isolating errors to alert on.
Looks like I can define log-based metrics from queries and then set up alerts based on those metrics.
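If we go that route, the rough shape would be something like the following (the metric name, description, and filter here are illustrative placeholders, not the exact ones we'd use):

```
gcloud logging metrics create hub-503s \
  --description="503s from user servers, excluding metrics-endpoint noise" \
  --log-filter='labels."k8s-pod/component"="singleuser-server" AND textPayload=~"\[E " AND NOT textPayload=~"metrics"'
```

Once the log-based metric exists, Cloud Monitoring alerting policies can be pointed at it.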
The following Logs Explorer query seems to be pointing in a promising direction. It shows a large spike for data100-prod during the incident on 9/16.
labels."k8s-pod/release"="data100-prod"
textPayload=~"\[E "
textPayload!~"metrics"
labels."k8s-pod/component"="singleuser-server"
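For reference, here's a rough Python sketch of the same filtering logic applied client-side; the dict shape is a simplified stand-in for a Cloud Logging entry, not the exact API schema:

```python
import re

def is_real_error(entry, release="data100-prod"):
    """Mirror the Logs Explorer query above: error-level lines from a
    hub release's singleuser servers, excluding metrics-endpoint noise."""
    labels = entry.get("labels", {})
    text = entry.get("textPayload", "")
    return (
        labels.get("k8s-pod/release") == release
        and labels.get("k8s-pod/component") == "singleuser-server"
        and re.search(r"\[E ", text) is not None
        and "metrics" not in text
    )
```

This is just to make the query's semantics explicit; in practice the filtering happens server-side in Logs Explorer.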
@balajialg requested I do a similar log query for prob140-prod around Sept. 13.
Similar query again for r-prod for September
publichealth-prod for the last week: the first spike matches up with a report made on Slack of students having trouble, and similarly there is a spike last night when another issue was reported.
This is brilliant work, @felder! Thanks for spending the time to figure things out.
@felder for alerting, can you explore https://grafana.com/docs/grafana/latest/alerting/ instead? Since all our metrics are already present in grafana...
oooh, github just showed me your other messages. We definitely don't have metrics on 503s from user servers - we tried putting it in prometheus (https://github.com/berkeley-dsep-infra/datahub/issues/1993 and https://github.com/berkeley-dsep-infra/datahub/pull/1973) but prometheus was crushed under the weight of those (https://github.com/berkeley-dsep-infra/datahub/pull/1977). So let's explore that and see how it goes.
Thanks for putting time into this, @felder
With https://github.com/berkeley-dsep-infra/datahub/pull/2792, we're now collecting prometheus metrics on each request that goes through nginx ingress. https://kubernetes.github.io/ingress-nginx/user-guide/monitoring/ has some more info. https://github.com/berkeley-dsep-infra/datahub/issues/2693 means that datahub is not represented here yet.
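Assuming the standard ingress-nginx metric names from that monitoring guide, a PromQL query for the 5xx rate per ingress would look something like this (untested sketch):

```
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress)
```

That's the kind of expression a Grafana panel (or later, an alert) could be built on.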
https://grafana.datahub.berkeley.edu/d/IxT315H7z/http-requests-nginx-copy?orgId=1 is a trial dashboard I started playing with that has some of these metrics. I haven't really explored the data. Curious how you think this compares to the Google Cloud dashboard, @felder? Think you have the time to poke around this too?
@yuvipanda I was looking at this dashboard today. We had 100+ 5xx errors consistently over the past 6 hours (9/24). However, no 5xx issues were raised by our users, which got me wondering whether Prometheus is surfacing errors that users never actually hit.
Or is the assumption that a certain percentage of our users will always see this kind of error, but it never gets reported to us because it only affects a small fraction of users (200/4000, ~5%)? I also realized that we may not be able to verify the error logs for previous outages (#2791) the way Jonathan did for GCP, as this data is not available in the Grafana error dashboard.
@yuvipanda yeah, I can take a look at this, but it'll be after my vacation.
There's a lot of spurious errors like this:
[W 2021-10-04 18:56:50.738 JupyterHub base:1349] Failing suspected API request to not-running server: /hub/user/USERNAME/api/metrics/v1
[E 2021-10-04 18:56:50.740 JupyterHub log:189] 503 GET /hub/user/USERNAME/api/metrics/v1 (USERNAME@10.0.0.19) 5.68ms
This is when their servers have been culled but their browser tab is still open. I think this should be a 404 instead... We should clean that up.
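Until that's cleaned up, one way to eyeball the real 503s is to drop entries hitting the metrics endpoint. A quick sketch matching the log format above (`real_503s` is a hypothetical helper, not part of JupyterHub):

```python
import re

# Matches JupyterHub error log lines like:
#   [E 2021-10-04 18:56:50.740 JupyterHub log:189] 503 GET /hub/user/USERNAME/api/metrics/v1 (...)
ERROR_LINE = re.compile(r"\[E [^\]]+\] (?P<status>\d{3}) (?P<method>\S+) (?P<path>\S+)")

def real_503s(log_lines):
    """Yield 503 log lines that are not culled-server metrics polls."""
    for line in log_lines:
        m = ERROR_LINE.search(line)
        if m and m.group("status") == "503" and not m.group("path").endswith("/api/metrics/v1"):
            yield line
```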
@balajialg I opened https://github.com/jupyterhub/jupyterhub/pull/3636 to get rid of a lot of these spurious errors.
@yuvipanda Sounds great! Interesting to know that requests which should be 404s can surface as 503 errors. Thanks for looking into this.
Once we get https://github.com/jupyterhub/jupyterhub/pull/3636 deployed, our 503s will have a much higher signal-to-noise ratio. We can use the logs to quickly investigate where they might be coming from and fix them as we find them.
We can also access data from stackdriver in our grafana - https://grafana.com/grafana/plugins/stackdriver/. That might be helpful?
@yuvipanda Will Stackdriver surface the data that @felder analyzed via the GCP console? CC'ing @felder so that he can look into this issue and add his thoughts about the value of this data when he is back.
Now our background levels of 503 have dropped to 0, and we can treat any spike as something to investigate!!!!!
Note that the main datahub is still not going through this proxy yet - see https://github.com/berkeley-dsep-infra/datahub/issues/2693. I'll try to fix that up soon.
So grafana supports defining alerts (https://grafana.com/docs/grafana/latest/alerting/), and supports many different notification channels. I've signed us up for pagerduty, and invited @felder and @balajialg. We can define alerts on each graph - I've defined a test one for pods that aren't scheduled (https://grafana.datahub.berkeley.edu/d/R3O4mbg7z/cluster-information?orgId=1&viewPanel=7), and it seems to work!
We can pipe these alerts to slack as well.
@yuvipanda, this is an awesome update!
Correct me if I've understood the workflow right! For example, we can configure Slack (or any other real-time) notifications whenever 500 errors increase beyond a threshold considered unusual. Is this an accurate understanding of one of the use cases?
@balajialg yeah that's correct.
@yuvipanda Considering the 5xx dashboard, is this issue done? Is there anything else to be done within the scope of this issue?
@yuvipanda Considering that we are tracking pagerduty related alerts through #2988, I am closing this issue. Please feel free to reopen this if you think this is separate/incomplete/requires further attention.
https://cloud.google.com/error-reporting can help us watch for exceptions in various parts of our infrastructure (like the hub, user notebook servers, etc.). This way, we can stop relying entirely on user reports to notice errors.