Hubs-Foundation / hubs-cloud

Resources for self hosted Hubs Cloud instances
Mozilla Public License 2.0

unable to run cbb.sh most of the time #330

Closed Utopiah closed 5 months ago

Utopiah commented 5 months ago

Describe the bug

I managed to get cbb.sh to work a few times, so I'm fairly convinced that the script and my configuration are fine. Yet it fails relatively often, roughly 9 times out of 10.

To Reproduce

Steps to reproduce the behavior:

  1. start a new k8s cluster and apply
  2. get the lb IP and update 4 lines of the DNS zone
  3. test target URL, works even though certificates are self-signed
  4. modify hcce.yaml to comment out the HAProxy certificate line
  5. deploy with kubectl apply -f hcce.yaml
  6. wait for the update, then run kubectl delete pods --all -n hubsce
  7. once all pods are running, run bash cbb.sh with the matching domain and admin email for certbot
  8. kubectl logs -f certbotbot-http -n hubsce

Expected behavior

It should finish with all 4 domains certified, and kubectl get secrets -n hubsce should list the matching certificate secrets.
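
For anyone who wants to check that expected end state quickly, here is a minimal sketch that reads the output of kubectl get secrets on stdin and prints any certificate secret that is still missing. The function name missing_cert_secrets is hypothetical. It assumes the four hostnames are the root domain plus its assets., stream. and cors. subdomains, and that each secret is named cert-<hostname>, as CERT_NAME=cert-mymatrix.ovh in the log suggests. Adjust it if your setup differs.

```shell
# Print the expected certificate secrets that are absent from the
# `kubectl get secrets` listing supplied on stdin (prints nothing if all exist).
# ASSUMPTION: hostnames are <domain>, assets.<domain>, stream.<domain> and
# cors.<domain>, and each secret is named cert-<hostname>.
missing_cert_secrets() {
  domain=$1
  listing=$(cat)
  for host in "$domain" "assets.$domain" "stream.$domain" "cors.$domain"; do
    name="cert-$host"
    # Look for the secret name in the first column of the listing.
    printf '%s\n' "$listing" \
      | awk -v n="$name" '$1 == n { found = 1 } END { exit !found }' \
      || printf '%s\n' "$name"
  done
}
```

Possible usage: kubectl get secrets -n hubsce | missing_cert_secrets mymatrix.ovh. Empty output would mean all four secrets are present.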

Screenshots

fabien@fabien-CORSAIR-ONE-i160:~/Prototypes/hubsce/hubs-cloud/community-edition$ kubectl logs -f certbotbot-http -n hubsce 
NAMESPACE=hubsce
DOMAIN=mymatrix.ovh
HUB_DOMAIN=
CHALLENGE=http
CERTBOT_EMAIL=
CERT_NAME=cert-mymatrix.ovh
CP_TO_NS=
LETSENCRYPT_ACCOUNT=

 making in-cluster config for kubectl
Cluster "the-cluster" set.
User "pod-token" set.
Context "pod-context" created.
Switched to context "pod-context".

 checking if we need_new_cert
NAME        TYPE                DATA   AGE
cert-hcce   kubernetes.io/tls   2      21m
configs     Opaque              20     21m
Error from server (NotFound): secrets "cert-mymatrix.ovh" not found
-rw-r--r-- 1 root root 0 Jan 23 11:17 tls.crt
unable to load certificate
140547192530240:error:0909006C:PEM routines:get_name:no start line:../crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE
cert sub: 
  bad cert sub ()-- need new cert for mymatrix.ovh
getting new cert

 get_new_cert_http -- requires mymatrix.ovh/.well-known/acme-challenge* routed into this pod
deploy certbot-http ingress and service for http challenge
service/certbotbot-http created
Warning: annotation "kubernetes.io/ingress.class" is deprecated, please use 'spec.ingressClassName' instead
ingress.networking.k8s.io/certbotbot-http created
start nginx and wait 30 sec for ingress to pick up the pod
requesting cert
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator nginx, Installer nginx
Account registered.
Requesting a certificate for mymatrix.ovh
Performing the following challenges:
http-01 challenge for mymatrix.ovh
Waiting for verification...
Challenge failed for domain mymatrix.ovh
http-01 challenge for mymatrix.ovh
Cleaning up challenges
Some challenges have failed.
IMPORTANT NOTES:
 - The following errors were reported by the server:

   Domain: mymatrix.ovh
   Type:   connection
   Detail: 162.19.109.75: Fetching
   https://mymatrix.ovh:443/.well-known/acme-challenge/hyedVaVHxFMYx_KyWxyZcYfKLadhWJO-oKS8iy8lqxQ:
   Error getting validation data

   To fix these errors, please make sure that your domain name was
   entered correctly and the DNS A/AAAA record(s) for that domain
   contain(s) the right IP address. Additionally, please check that
   your computer has a publicly routable IP address and that no
   firewalls are preventing the server from communicating with the
   client. If you're using the webroot plugin, you should also verify
   that you are serving files from the webroot path you provided.
requesting cert -- retrying -- retries left: 9
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator nginx, Installer nginx
Requesting a certificate for mymatrix.ovh
Performing the following challenges:
http-01 challenge for mymatrix.ovh
Waiting for verification...
Challenge failed for domain mymatrix.ovh
http-01 challenge for mymatrix.ovh
Cleaning up challenges
Some challenges have failed.

From https://hubs.mozilla.com/labs/community-edition-case-study-quick-start-on-gcp-w-aws-services/

mikemorran commented 5 months ago

Hey @Utopiah, if I understand correctly, you are commenting out the haproxy line before you apply the yaml file for the first time for your deployment.

My understanding is that there is a strict order of operations to this that may be causing the issue...

  1. Deploy with the default certificate enabled in haproxy
  2. Update your A records in your DNS with your external IP
  3. Run certbotbot to get all 4 certificates
  4. Edit your deployment to remove the default certificate, re-apply your yaml, and delete/respawn all pods

Have you tried things in this order?
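
Since the order matters, a small guard can catch the mistake before certbotbot runs. The sketch below only checks whether the default certificate line is still active in the yaml file. The function name default_cert_enabled is hypothetical, and it assumes the HAProxy line in question contains --default-ssl-certificate. Check your own hcce.yaml for the exact text.

```shell
# Return 0 if FILE still has an uncommented default-certificate line,
# i.e. it should be safe to run cbb.sh; return 1 otherwise.
# ASSUMPTION: the HAProxy line contains "--default-ssl-certificate".
default_cert_enabled() {
  # First grep keeps lines mentioning the flag; second grep succeeds only
  # if at least one of them does not start with a "#" comment marker.
  grep -F -e '--default-ssl-certificate' -- "$1" \
    | grep -Evq '^[[:space:]]*#'
}
```

Possible usage: default_cert_enabled hcce.yaml || echo "default certificate is commented out, re-enable it before running cbb.sh".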

Utopiah commented 5 months ago

Hey @mikemorran, actually I tried to get the certificates with the line commented out! Let me try again while keeping it uncommented, and only remove it AFTER getting the certs! Thanks for the prompt feedback.

Utopiah commented 5 months ago

Damn, it was driving me nuts ;) It works right away, thanks again. Closing the issue, but hopefully if others get stuck there, they'll find it too!

mikemorran commented 5 months ago

@Utopiah It's an unfortunately rigid order of operations at the moment, but glad it worked out!