Closed choldgraf closed 3 years ago
Grafana is not reachable
I tried to go to grafana.us-central1-b.gcp.pangeo.io but received a "your connection is not private" error. So I haven't been able to look at any dashboards to understand what might be going on.
That is not the correct Grafana URL. The one listed in the config is https://pangeo-grafana.pangeo.2i2c.cloud, and it is reachable.
I suspect the certificate issues are due to the redirect Ryan asked me to set up last night.
We have two DNS zones: 2i2c.cloud managed by us through Namecheap, pangeo.io managed by the Pangeo community through Hurricane Electric (though I have access).
In 2i2c.cloud, we have a pangeo.2i2c.cloud A record that points to our LoadBalancer IP address.
In pangeo.io, we have us-central1-b.gcp.pangeo.io that is a CNAME for pangeo.2i2c.cloud.
It is set up this way so that if our LoadBalancer IP changes, we only need to edit the A record in 2i2c.cloud and pangeo.io will inherit the change through the CNAME.
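To illustrate the indirection described above, here is a minimal sketch that models the two records as a dictionary and follows the CNAME to the A record. The IP addresses are placeholders from the documentation range, not our real LoadBalancer IP, and the `resolve` helper is hypothetical rather than anything in our infrastructure.

```python
# Toy model of the two DNS zones described above.
# The IPs are placeholder documentation addresses, NOT the real LoadBalancer IP.
RECORDS = {
    "pangeo.2i2c.cloud": ("A", "203.0.113.10"),
    "us-central1-b.gcp.pangeo.io": ("CNAME", "pangeo.2i2c.cloud"),
}


def resolve(name, records, max_hops=8):
    """Follow CNAME records until an A record is found, then return its IP."""
    for _ in range(max_hops):
        record_type, value = records[name]
        if record_type == "A":
            return value
        name = value  # CNAME: keep resolving the target name
    raise RuntimeError("too many CNAME hops")


before = resolve("us-central1-b.gcp.pangeo.io", RECORDS)

# Simulate the LoadBalancer getting a new IP: only the A record is edited.
updated = dict(RECORDS)
updated["pangeo.2i2c.cloud"] = ("A", "203.0.113.20")
after = resolve("us-central1-b.gcp.pangeo.io", updated)

print(before, "->", after)
```

The pangeo.io name picks up the new address without anyone touching the pangeo.io zone, which is the whole point of the CNAME.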
We only assign one domain name to each of our hubs to avoid confusion. This means that once the CNAME for us-central1-b.gcp.pangeo.io was set up, pangeo.2i2c.cloud began returning a 404, since ingress-nginx now only accepts traffic for the pangeo.io domain.
Hence Ryan asked me to set up a redirect from pangeo.2i2c.cloud to us-central1-b.gcp.pangeo.io, which I did here https://github.com/2i2c-org/infrastructure/issues/482#issuecomment-963411959
I don't think the certificates are able to resolve properly because they're trying to get a response from ...pangeo.io which is a CNAME for pangeo.2i2c.cloud which is then redirecting back to ...pangeo.io --> vicious loop of nothing giving a correct response.
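Here is a rough sketch of the loop I suspect, as a toy simulation only. It assumes the redirect is served at whatever pangeo.2i2c.cloud resolves to, so any name that is a CNAME for it lands on the same redirect. The `CNAMES` and `REDIRECTS` tables and the `follow` helper are hypothetical, and I have not verified this is exactly what happened on the wire.

```python
# Toy simulation of the suspected CNAME + redirect loop.
# ASSUMPTION: a request for any name aliased to pangeo.2i2c.cloud is answered
# by the same server that issues the redirect.
CNAMES = {"us-central1-b.gcp.pangeo.io": "pangeo.2i2c.cloud"}
REDIRECTS = {"pangeo.2i2c.cloud": "us-central1-b.gcp.pangeo.io"}


class RedirectLoop(Exception):
    """Raised when following redirects revisits a host."""


def canonical(host, cnames):
    """Follow CNAME aliases to the name that actually answers."""
    while host in cnames:
        host = cnames[host]
    return host


def follow(host, cnames, redirects, max_hops=10):
    """Return the hosts visited, or raise RedirectLoop if one repeats."""
    visited = []
    for _ in range(max_hops):
        if host in visited:
            raise RedirectLoop(" -> ".join(visited + [host]))
        visited.append(host)
        server = canonical(host, cnames)
        if server not in redirects:
            return visited  # a real response would be served here
        host = redirects[server]
    raise RedirectLoop("gave up after too many hops")


try:
    follow("us-central1-b.gcp.pangeo.io", CNAMES, REDIRECTS)
    looped = False
except RedirectLoop as err:
    looped = True
    print("loop:", err)

# With the redirect removed, the same request terminates normally.
without_redirect = follow("us-central1-b.gcp.pangeo.io", CNAMES, {})
```

Under that assumption, nothing requesting the pangeo.io name ever gets a final response, and removing the redirect is enough to break the cycle.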
I did the above and logged into the production hub in a private browser. All certificates were present and the connection was private. So the certificates issue is now resolved.
JupyterLab and kernel usability errors
Many users on the Pangeo hub are reporting sluggish behavior. In particular, the following two actions take upwards of 30 seconds to complete, or do not complete at all:
- Starting Kernels
- Opening Terminals
I could not replicate this so I suspect it was all a certificates/traffic problem, but I'm happy to be proven wrong if someone can provide concrete steps to demonstrate the problem?
I think there has been some confusion regarding the certificates on the pangeo.2i2c.cloud URL.
We stopped supporting multiple domains for a single hub to reduce complexity. See these PRs: https://github.com/2i2c-org/infrastructure/pull/460 and https://github.com/2i2c-org/infrastructure/pull/496
Hence when https://github.com/2i2c-org/infrastructure/pull/812 was merged, we stopped issuing certificates for pangeo.2i2c.cloud and the load balancer stopped accepting traffic from there. Instead, we issue certificates for us-central1-b.gcp.pangeo.io and accept traffic from there. As mentioned above, the pangeo.2i2c.cloud address is only used so we can update the IP address of the load balancer if required in the cases where we don't have access to the desired domain.
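As a toy model of the single-domain behaviour described above: only the one configured domain gets a certificate and a matching ingress rule, and everything else is rejected. The `handle_request` function is hypothetical and is not how ingress-nginx or cert-manager are actually implemented. In practice a browser would stop at the certificate warning before it ever saw the 404.

```python
# Hypothetical model: one configured domain per hub.
HUB_DOMAIN = "us-central1-b.gcp.pangeo.io"


def handle_request(host, configured_domain=HUB_DOMAIN):
    """Return (certificate_valid, http_status) for a request to `host`."""
    if host == configured_domain:
        # A certificate is issued for this name and an ingress rule matches it.
        return True, 200
    # No certificate is issued for other names and no ingress rule matches.
    return False, 404


new_domain = handle_request("us-central1-b.gcp.pangeo.io")
old_domain = handle_request("pangeo.2i2c.cloud")
print("pangeo.io:", new_domain, "2i2c.cloud:", old_domain)
```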
There are no certificate issues if folks use the us-central1-b.gcp.pangeo.io address, which I mentioned here: https://github.com/2i2c-org/infrastructure/issues/482#issuecomment-963303591. But instead we got waylaid by redirects.
I think the only reason we've had this confusion is because the hub had users throughout the setup process. Normally, we would not have users until after this point.
Just a note that the following works as-expected for me:
Quick thoughts:
- Do we anticipate moving pangeo-grafana.pangeo.2i2c.cloud to pangeo-grafana.us-central1-b.gcp.pangeo.io? Or will pangeo.2i2c.cloud only exist in order to serve pangeo-grafana.pangeo.2i2c.cloud?
There's a bit of a name-clash for grafana atm since I wasn't very clever when setting up the COESSING hub.
So my plan was:
- Move pangeo-grafana.pangeo.2i2c.cloud to grafana.pangeo.2i2c.cloud
- Use a wildcard record (*.pangeo) since pangeo-grafana.pangeo.2i2c.cloud, staging.pangeo.2i2c.cloud and pangeo.2i2c.cloud all point to the same IP address. This will be simpler to maintain.
I had no intention of pointing grafana at grafana.us-central1-b.gcp.pangeo.io unless the community specifically needs it or we consider it best practice?
If we move forward with https://github.com/2i2c-org/infrastructure/issues/427 at some point, I had also considered making these URLs *.pangeo-gcp so we could have *.pangeo-aws in the future if need be.
- Another hub we could look to for inspiration is utoronto.2i2c.cloud, which redirects to jupyter.utoronto.ca/hub/login?next=%2Fhub%2F. Not sure if that's the same setup or not, but just noting it in case it helps with redirection.
Just checked out Namecheap for this. We have an A record utoronto in the 2i2c.cloud domain that points at an IP address (I assume the load balancer), and that's it. The redirect setup must be happening on the utoronto.ca end.
In which case, I wonder if I took the wrong approach by trying to set up the redirect from Namecheap instead of in Hurricane Electric? Update: Had a quick look through Hurricane Electric and it wasn't obvious to me how to do this.
I'm simultaneously proud of and ashamed of how the utoronto redirect works - it works via JS in the homepage! https://github.com/utoronto-2i2c/homepage/blob/master/extra-assets/js/login.js. It only works if you land on the homepage - if you're already logged in it has no effect. Do not recommend.
@yuvipanda LOL that is amazing
I've tidied up the top comment. I don't think there's anything actionable left here so I'm going to close this.
I'm simultaneously proud of and ashamed of how the utoronto redirect works - it works via JS in the homepage! https://github.com/utoronto-2i2c/homepage/blob/master/extra-assets/js/login.js. It only works if you land on the homepage - if you're already logged in it has no effect. Do not recommend.
It is fun to realize I actually thought about something along these lines when I was thinking about possible workarounds 😜
Many thanks @sgibson91 for being awesome!
Summary
There are a variety of critical issues that have been reported on the Pangeo JupyterHub.
Certificate errors
Some users reported a certificate error when connecting to the hub. Here's an example of the error message:
FreshDesk tickets:
JupyterLab and kernel usability errors
Many users on the Pangeo hub are reporting sluggish behavior. In particular, the following two actions take upwards of 30 seconds to complete, or do not complete at all:
- Starting Kernels
- Opening Terminals
FreshDesk tickets:
Grafana is not reachable
I tried to go to grafana.us-central1-b.gcp.pangeo.io but received a "your connection is not private" error. So I haven't been able to look at any dashboards to understand what might be going on.
After-action report
What went wrong
We set up a redirect between two URLs where one was a CNAME for the other. This turned out to be a Very Bad Idea ™️. In PR https://github.com/2i2c-org/infrastructure/pull/812, we replaced the pangeo.2i2c.cloud address with us-central1-b.gcp.pangeo.io, meaning that cert-manager was no longer issuing certificates for pangeo.2i2c.cloud and our load balancer would no longer accept traffic from pangeo.2i2c.cloud. All issues were resolved by undoing the redirect and visiting the us-central1-b.gcp.pangeo.io address instead.
Pangeo has been special-cased in that it had active users before setup development was complete, and I think the switch in URL is what confused people. Normally we would only invite users after the DNS has been set, so I don't see the issue arising again.
Action items
Documentation improvements
Actions