2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

[Solved] Pangeo Hub has several critical issues #815

Closed choldgraf closed 3 years ago

choldgraf commented 3 years ago

Summary

There are a variety of critical issues that have been reported on the Pangeo JupyterHub.

Certificate errors

Some users reported a certificate error when connecting to the hub. Here's an example of the error message:

[image: screenshot of the certificate error]

FreshDesk tickets:

JupyterLab and kernel usability errors

Many users on the Pangeo hub are reporting sluggish behavior. In particular, the following two actions take upwards of 30 seconds to complete, or do not complete at all:

  • Starting Kernels
  • Opening Terminals

FreshDesk tickets:

Grafana is not reachable

I tried to go to grafana.us-central1-b.gcp.pangeo.io but received a "your connection is not private" error. So I haven't been able to look at any dashboards to understand what might be going on.


After-action report

What went wrong

We set up a redirect between two URLs where one was a CNAME for the other. This turned out to be a Very Bad Idea ™️. In PR https://github.com/2i2c-org/infrastructure/pull/812, we replaced the pangeo.2i2c.cloud address with us-central1-b.gcp.pangeo.io, meaning that cert-manager was no longer issuing certificates for pangeo.2i2c.cloud and our load balancer would no longer accept traffic addressed to pangeo.2i2c.cloud. All issues were resolved by undoing the redirect and visiting the us-central1-b.gcp.pangeo.io address instead.

Pangeo was a special case in that it had active users before setup was complete, and I think the switch in URL is what confused people. Normally we only invite users after the DNS has been set, so I don't expect this issue to arise again.

Action items

Documentation improvements

  1. Pull grafana links from hub config files and add them to documentation sites so it's clear which grafana URL goes with which hub: https://github.com/2i2c-org/infrastructure/pull/817

Actions

sgibson91 commented 3 years ago

Grafana is not reachable

I tried to go to grafana.us-central1-b.gcp.pangeo.io but received a "your connection is not private" error. So I haven't been able to look at any dashboards to understand what might be going on.

That is not the correct grafana URL. The one listed in the config, which is reachable, is https://pangeo-grafana.pangeo.2i2c.cloud

sgibson91 commented 3 years ago

I suspect the certificate issues are due to the redirect Ryan asked me to set up last night.

Setup

We have two DNS zones: 2i2c.cloud managed by us through Namecheap, pangeo.io managed by the Pangeo community through Hurricane Electric (though I have access).

In 2i2c.cloud, we have a pangeo.2i2c.cloud A record that points to our LoadBalancer IP address.

In pangeo.io, we have us-central1-b.gcp.pangeo.io that is a CNAME for pangeo.2i2c.cloud.

It is set up this way so that if our LoadBalancer IP changes, we only need to edit the A record in 2i2c.cloud, and pangeo.io will inherit the change through the CNAME.
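As a rough illustration of the record layout described above, here is a minimal Python sketch that follows a CNAME to an A record using an in-memory table. The `RECORDS` table, the `resolve` helper, and both IP addresses are hypothetical placeholders (the addresses come from the reserved documentation range), not the real zone contents.

```python
# Hypothetical in-memory copy of the two zones described above.
# The IP address is a placeholder from the documentation range, not the real one.
RECORDS = {
    "pangeo.2i2c.cloud": ("A", "203.0.113.10"),
    "us-central1-b.gcp.pangeo.io": ("CNAME", "pangeo.2i2c.cloud"),
}


def resolve(name, records=RECORDS, max_hops=8):
    """Follow CNAME records until an A record is reached and return its IP."""
    for _ in range(max_hops):
        if name not in records:
            raise LookupError(f"no record for {name}")
        record_type, value = records[name]
        if record_type == "A":
            return value
        name = value  # CNAME: continue with the target name
    raise LookupError("too many CNAME hops")


# Changing the load balancer IP only touches the A record in 2i2c.cloud.
updated = {**RECORDS, "pangeo.2i2c.cloud": ("A", "203.0.113.20")}

print(resolve("us-central1-b.gcp.pangeo.io"))
print(resolve("us-central1-b.gcp.pangeo.io", updated))
```

The second lookup picks up the new address without any change to the pangeo.io entry, which is the point of the CNAME.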

The Redirect

We only assign one domain name to our hubs to avoid confusion. This means that once the CNAME for us-central1-b.gcp.pangeo.io was set up, pangeo.2i2c.cloud began returning a 404, since ingress-nginx now only accepts traffic for the pangeo.io domain.
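The 404 behavior can be pictured with a small Python sketch of host-based routing. This is a simplified model, not ingress-nginx itself: `CONFIGURED_HOST` and `ingress_status` are hypothetical names, and 200 simply stands in for "a rule matched".

```python
# Assumed for illustration: the hub's ingress is configured with exactly one host.
CONFIGURED_HOST = "us-central1-b.gcp.pangeo.io"


def ingress_status(host_header, configured_host=CONFIGURED_HOST):
    """Return the HTTP status a single host-based rule would give this request."""
    # Both names resolve to the same load balancer IP, so the request arrives
    # either way; only the Host header decides whether the rule matches.
    return 200 if host_header.lower() == configured_host else 404


print(ingress_status("us-central1-b.gcp.pangeo.io"))
print(ingress_status("pangeo.2i2c.cloud"))
```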

Hence Ryan asked me to set up a redirect from pangeo.2i2c.cloud to us-central1-b.gcp.pangeo.io, which I did here: https://github.com/2i2c-org/infrastructure/issues/482#issuecomment-963411959

What I suspect is happening

I don't think the certificates are able to resolve properly because they're trying to get a response from ...pangeo.io which is a CNAME for pangeo.2i2c.cloud which is then redirecting back to ...pangeo.io --> vicious loop of nothing giving a correct response.
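To make the suspected loop concrete, here is a small Python sketch that treats the CNAME and the redirect as hops and checks whether following them ever revisits a name. It only models the suspicion described above; the `HOPS` table and the `trace` helper are hypothetical and do not query real DNS or HTTP.

```python
# Hypothetical model of the suspected loop: each name maps to its next hop.
HOPS = {
    "us-central1-b.gcp.pangeo.io": ("CNAME", "pangeo.2i2c.cloud"),
    "pangeo.2i2c.cloud": ("REDIRECT", "us-central1-b.gcp.pangeo.io"),
}


def trace(start, hops=HOPS):
    """Follow hops from start and return (path, looped)."""
    path = [start]
    seen = {start}
    current = start
    while current in hops:
        _, current = hops[current]
        path.append(current)
        if current in seen:
            return path, True  # we are back at a name we already visited
        seen.add(current)
    return path, False


path, looped = trace("us-central1-b.gcp.pangeo.io")
print(" -> ".join(path), "(loop)" if looped else "(ok)")
```

With the redirect entry removed from the table, the same trace ends at pangeo.2i2c.cloud without looping.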

What I'm going to try

sgibson91 commented 3 years ago

I did the above and logged into the production hub in a private browser. All certificates were present and the connection was private. So the certificates issue is now resolved.

sgibson91 commented 3 years ago

JupyterLab and kernel usability errors

Many users on the Pangeo hub are reporting sluggish behavior. In particular, the following two actions take upwards of 30 seconds to complete, or do not complete at all:

  • Starting Kernels
  • Opening Terminals

I could not replicate this, so I suspect it was all a certificates/traffic problem, but I'm happy to be proven wrong if someone can provide concrete steps to demonstrate the problem.

sgibson91 commented 3 years ago

I think there has been some confusion regarding the certificates on the pangeo.2i2c.cloud URL.

We stopped supporting multiple domains for a single hub to reduce complexity. See these PRs: https://github.com/2i2c-org/infrastructure/pull/460 and https://github.com/2i2c-org/infrastructure/pull/496

Hence when https://github.com/2i2c-org/infrastructure/pull/812 was merged, we stopped issuing certificates for pangeo.2i2c.cloud and the load balancer stopped accepting traffic from there. Instead, we issue certificates for us-central1-b.gcp.pangeo.io and accept traffic from there. As mentioned above, the pangeo.2i2c.cloud address is only used so we can update the IP address of the load balancer if required in the cases where we don't have access to the desired domain.

There are no certificate issues if folks use the us-central1-b.gcp.pangeo.io address, which I mentioned here: https://github.com/2i2c-org/infrastructure/issues/482#issuecomment-963303591. But instead we got waylaid by redirects.
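A browser raises a certificate error when the hostname it connected to is not among the names on the certificate. The Python sketch below shows that check in simplified form; `name_matches`, `covered`, and the `CERT_NAMES` list are hypothetical, and the list assumes the certificate names only the pangeo.io address after PR 812.

```python
def name_matches(hostname, cert_name):
    """Exact match, or a single-label wildcard such as *.example.org."""
    hostname, cert_name = hostname.lower(), cert_name.lower()
    if cert_name.startswith("*."):
        # A wildcard covers exactly one extra leading label.
        _, _, parent = hostname.partition(".")
        return parent == cert_name[2:]
    return hostname == cert_name


def covered(hostname, cert_names):
    """True if any name on the certificate covers the hostname."""
    return any(name_matches(hostname, name) for name in cert_names)


# Assumed for illustration: the certificate lists only the pangeo.io name.
CERT_NAMES = ["us-central1-b.gcp.pangeo.io"]

print(covered("us-central1-b.gcp.pangeo.io", CERT_NAMES))
print(covered("pangeo.2i2c.cloud", CERT_NAMES))
```

Under that assumption, only the pangeo.io address passes the check, which matches the observation that the errors disappear when users visit that address directly.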

I think the only reason we've had this confusion is because the hub had users throughout the setup process. Normally, we would not have users until after this point.

choldgraf commented 3 years ago

Just a note that the following works as expected for me:

Quick thoughts:

sgibson91 commented 3 years ago

Do we anticipate moving pangeo-grafana.pangeo.2i2c.cloud to pangeo-grafana.us-central1-b.gcp.pangeo.io? Or will pangeo.2i2c.cloud only exist in order to serve pangeo-grafana.pangeo.2i2c.cloud?

There's a bit of a name-clash for grafana atm since I wasn't very clever when setting up the COESSING hub.

https://github.com/2i2c-org/infrastructure/blob/6cf0a3ff3c9edf2154791e26c269dbbf681d5234/config/hubs/pangeo-hubs.cluster.yaml#L13-L16

So my plan was:

I had no intention of pointing grafana at grafana.us-central1-b.gcp.pangeo.io unless the community specifically needs it or we consider it best practice.

If we move forward with https://github.com/2i2c-org/infrastructure/issues/427 at some point, I had also considered making these URLs *.pangeo-gcp so we could have *.pangeo-aws in the future if need be.

sgibson91 commented 3 years ago

  • Another hub we could look to for inspiration is utoronto.2i2c.cloud, which redirects to jupyter.utoronto.ca/hub/login?next=%2Fhub%2F. Not sure if that's the same setup or not, but just noting it in case it helps with redirection

Just checked out Namecheap for this. We have an A record utoronto in the 2i2c.cloud domain that points at an IP address (I assume the load balancer), and that's it. The redirect setup must be happening on the utoronto.ca end.

In which case, I wonder if I took the wrong approach by trying to set up the redirect from Namecheap instead of in Hurricane Electric? Update: Had a quick look through Hurricane Electric and it wasn't obvious to me how to do this.

yuvipanda commented 3 years ago

I'm simultaneously proud of and ashamed of how the utoronto redirect works - it works via JS in the homepage! https://github.com/utoronto-2i2c/homepage/blob/master/extra-assets/js/login.js. It only works if you land on the homepage - if you're already logged in it has no effect. Do not recommend.

choldgraf commented 3 years ago

@yuvipanda LOL that is amazing

sgibson91 commented 3 years ago

I've tidied up the top comment. I don't think there's anything actionable left here so I'm going to close this.

damianavila commented 3 years ago

I'm simultaneously proud of and ashamed of how the utoronto redirect works - it works via JS in the homepage! https://github.com/utoronto-2i2c/homepage/blob/master/extra-assets/js/login.js. It only works if you land on the homepage - if you're already logged in it has no effect. Do not recommend.

It is fun to realize I actually thought about something along these lines when I was thinking about possible workarounds 😜

choldgraf commented 3 years ago

Many thanks @sgibson91 for being awesome!